[ https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958103#comment-14958103 ]
Marcelo Vanzin commented on SPARK-11098:
----------------------------------------

So, while working on another patch in this area, I ran into this issue, and I don't think it's a problem in the RPC layer, but rather a problem in the code calling the RPC layer.

Even if you somehow synchronize things in the RPC env implementation so that RPCs are sent in the order they arrive, multiple threads can be calling {{RpcEndpointRef.send()}} or {{RpcEndpointRef.ask()}} at the same time, and at that point there's no guarantee of any order.

The problem I ran into specifically was the Worker ignoring messages from the Master because it thought the Master was not active. That's because those messages were arriving before the Master had replied to the Worker's registration message.

That's not the fault of the RPC layer; it's the fault of that reply being sent to the Worker as a separate message instead of as an RPC reply to the {{RegisterWorker}} message. {{Worker}} in this case should be using {{ask}} and getting the reply from that ask; that ensures the reply will arrive before any other messages the Master may want to send to the Worker.

If you want to see how to do that properly, see how {{CoarseGrainedExecutorBackend}} does its registration with the scheduler using {{ask}} instead of {{send}}.

Anyway, I have that fixed in my patch; I might take it out as a separate fix and attach it to this bug. But I'm not sure whether other areas of the code suffer from the same problem.

> RPC message ordering is not guaranteed
> --------------------------------------
>
>                 Key: SPARK-11098
>                 URL: https://issues.apache.org/jira/browse/SPARK-11098
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>            Reporter: Reynold Xin
>
> NettyRpcEnv doesn't guarantee message delivery order since there are multiple
> threads sending messages in clientConnectionExecutor thread pool. We should
> fix that.
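
The registration race described in the comment above can be illustrated with a small toy model. This is only a sketch in plain Python and does not use Spark's RPC classes: the {{Worker}} and {{Master}} classes, the {{register_with_send}} and {{register_with_ask}} helpers, and the use of {{LaunchExecutor}} as the "later" message are all assumptions made for illustration. The reordering is simulated with an explicit delivery order rather than real threads.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Toy worker that ignores master messages until it is registered."""
    registered: bool = False
    handled: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def receive(self, msg: str) -> None:
        if msg == "RegisteredWorker":
            self.registered = True
        elif self.registered:
            self.handled.append(msg)
        else:
            # Mirrors the symptom in the comment: the master looks inactive.
            self.dropped.append(msg)


class Master:
    """Toy master (hypothetical): `ask` hands the reply back to the caller."""

    def ask(self, msg: str) -> str:
        assert msg == "RegisterWorker"
        return "RegisteredWorker"


def register_with_send(delivery_order):
    """The reply is an ordinary one-way message, so it can be overtaken."""
    worker = Worker()
    for msg in delivery_order:
        worker.receive(msg)
    return worker


def register_with_ask(later_messages):
    """The reply is the result of the ask, handled before anything else."""
    worker = Worker()
    reply = Master().ask("RegisterWorker")
    worker.receive(reply)
    for msg in later_messages:
        worker.receive(msg)
    return worker


# Simulated reordering: LaunchExecutor overtakes the registration reply.
racy = register_with_send(["LaunchExecutor", "RegisteredWorker"])
safe = register_with_ask(["LaunchExecutor"])
print(racy.dropped, safe.handled)
```

In the send-based variant the worker drops {{LaunchExecutor}} because the registration reply has not been processed yet; in the ask-based variant the reply is consumed as part of the registration call, so the later message is handled.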