[ https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051732#comment-15051732 ]
Marcelo Vanzin commented on SPARK-12267:
----------------------------------------

I think the following would work.

The problem right now is that the Worker listens for incoming connections; and when that happens, the {{senderAddress}} of RPC messages becomes the listening address of the Worker, instead of the address of the socket sending messages to the Master. When the worker disconnects, the Master sees a disconnection from that client socket, but doesn't know that it actually relates to that listening address, so doesn't unregister anything.

I think instead that, in Netty's case, {{RpcCallContext.senderAddress}} should always be the address of the client socket, regardless of whether the sender is listening. That would fix this problem. RpcEndpoints for those listening processes would still have the listen address of the RpcEnv.

There are three places where {{senderAddress}} is used outside of Master:
- MapOutputTrackerMasterEndpoint when handling GetMapOutputStatuses, but that's only logging
- in CoarseGrainedSchedulerBackend when handling {{RegisterExecutor}}, but that already seems to be doing the right thing (since in Netty's case executors are not listening)
- in ReceiverTracker when handling RegisterReceiver, but that's also only logging

So the above suggestion should work as far as I can tell.

> Standalone master keeps references to disassociated workers until they sent no heartbeats
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-12267
>                 URL: https://issues.apache.org/jira/browse/SPARK-12267
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919 on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)