[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051732#comment-15051732
 ] 

Marcelo Vanzin commented on SPARK-12267:
----------------------------------------

I think the following would work. The problem right now is that the Worker 
listens for incoming connections, and when a process is listening, the 
{{senderAddress}} of its RPC messages is its listening address rather than the 
address of the client socket it uses to send messages to the Master. When the 
Worker disconnects, the Master sees a disconnection from that client socket 
but doesn't know that it relates to the listening address, so it doesn't 
unregister anything.
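
To make the mismatch concrete, here is a toy model in Python. It is not Spark's actual Scala code: {{ToyMaster}} and its methods are hypothetical names, and the only assumption is that workers are looked up by the sender address seen at registration. The addresses are the ones from the log in the issue description.

```python
class ToyMaster:
    """Hypothetical stand-in for the Master's worker bookkeeping."""

    def __init__(self):
        # Workers are indexed by the sender address seen at registration.
        self.address_to_worker = {}

    def register_worker(self, sender_address, worker_id):
        self.address_to_worker[sender_address] = worker_id

    def on_disconnected(self, remote_address):
        # Returns the removed worker id, or None if the address is unknown.
        return self.address_to_worker.pop(remote_address, None)


master = ToyMaster()

# Registration arrives with the Worker's *listening* address as the sender.
master.register_worker(
    ("192.168.1.6", 59919), "worker-20151210090708-192.168.1.6-59919"
)

# The disconnect is reported for the *client socket* the Worker used.
removed = master.on_disconnected(("localhost", 59920))

print(removed)                        # None: nothing was unregistered
print(len(master.address_to_worker))  # 1: the stale entry is still there
```

In this model the stale entry would only go away through some other path, which matches the heartbeat-timeout removal seen in the reported log.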

I think instead that, in Netty's case, {{RpcCallContext.senderAddress}} should 
always be the address of the client socket, regardless of whether the sender is 
listening. That would fix this problem. RpcEndpoints for those listening 
processes would still have the listen address of the RpcEnv.
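
A rough sketch of the difference between the two policies, again as a toy Python model rather than the real Netty RPC code. The function {{sender_address}} and its flag are hypothetical and exist only to contrast the behavior described above with the proposed one.

```python
def sender_address(client_socket_addr, listen_addr, always_use_client_socket):
    """Pick the address reported as the sender of an RPC message (toy model)."""
    if always_use_client_socket or listen_addr is None:
        return client_socket_addr
    # Behavior described above: a listening sender reports its listen address.
    return listen_addr


client_socket = ("localhost", 59920)
listen = ("192.168.1.6", 59919)

current = sender_address(client_socket, listen, always_use_client_socket=False)
proposed = sender_address(client_socket, listen, always_use_client_socket=True)

# The Master learns about disconnections by client socket address.
disconnected = client_socket

print(current == disconnected)   # False: the disconnect cannot be matched
print(proposed == disconnected)  # True: the disconnect matches the sender
```

A sender that is not listening (modeled here as {{listen_addr=None}}) reports its client socket address under either policy, so it would be unaffected by the change.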

There are three places where {{senderAddress}} is used outside of Master:

- in MapOutputTrackerMasterEndpoint when handling {{GetMapOutputStatuses}}, but 
that's only logging
- in CoarseGrainedSchedulerBackend when handling {{RegisterExecutor}}, but that 
already seems to be doing the right thing (since in Netty's case executors are 
not listening)
- in ReceiverTracker when handling {{RegisterReceiver}}, but that's also only 
logging

So the above suggestion should work as far as I can tell.

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-12267
>                 URL: https://issues.apache.org/jira/browse/SPARK-12267
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



