[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051683#comment-15051683
 ] 

Marcelo Vanzin commented on SPARK-12267:
----------------------------------------

Pasting Shixiong's comments from GitHub 
(https://github.com/apache/spark/pull/9138):

{quote}
@vanzin I just found an issue with this change. Now, when the Master receives 
RegisterWorker, it no longer uses the workerRef to send the reply, so there is 
no connection from the Master to the server in the Worker. If the Worker is 
killed, the Master only observes that some client was lost, but that address 
is just a client address in the Worker and does not match the Worker's 
address, so the Master cannot remove the dead Worker immediately. The Worker 
is still removed after 60 seconds because of missing heartbeats.
{quote}
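
To make the mismatch concrete, here is a minimal sketch in Python of the behavior described above. It is a simplified model, not Spark's actual Master code: the class, method names, and data structures are hypothetical, and only the addresses, worker id, and 60-second timeout come from the logs in this issue.

```python
from dataclasses import dataclass, field

# Taken from the "no heartbeat in 60 seconds" log line in this issue.
HEARTBEAT_TIMEOUT_S = 60


@dataclass
class Master:
    """Hypothetical, simplified model of how a master might track workers."""

    # Worker *server* address -> worker id.
    address_to_worker: dict = field(default_factory=dict)
    # Worker id -> time of the last heartbeat, in seconds.
    last_heartbeat: dict = field(default_factory=dict)

    def register_worker(self, worker_id, server_address, now):
        self.address_to_worker[server_address] = worker_id
        self.last_heartbeat[worker_id] = now

    def on_disconnected(self, remote_address):
        """Remove the worker registered at remote_address, if there is one.

        Returns the removed worker id, or None when the address is unknown.
        """
        worker_id = self.address_to_worker.pop(remote_address, None)
        if worker_id is not None:
            self.last_heartbeat.pop(worker_id, None)
        return worker_id

    def check_heartbeats(self, now):
        """Remove and return every worker whose heartbeat has timed out."""
        dead = [w for w, t in self.last_heartbeat.items()
                if now - t >= HEARTBEAT_TIMEOUT_S]
        for worker_id in dead:
            del self.last_heartbeat[worker_id]
            for address, wid in list(self.address_to_worker.items()):
                if wid == worker_id:
                    del self.address_to_worker[address]
        return dead


master = Master()
master.register_worker("worker-20151210090708-192.168.1.6-59919",
                       "192.168.1.6:59919", now=0)

# The disconnect event carries the worker's client-side address, which was
# never registered, so the lookup finds nothing to remove.
removed = master.on_disconnected("localhost:59920")

# Only the heartbeat timeout eventually cleans up the dead worker.
timed_out = master.check_heartbeats(now=60)
```

In this model the disconnect event removes nothing because `localhost:59920` was never registered, and the worker disappears only once the heartbeat check runs at the 60-second mark, which matches the order of the log messages in the report.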

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-12267
>                 URL: https://issues.apache.org/jira/browse/SPARK-12267
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
