[ 
https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198736#comment-17198736
 ] 

Valentina Glagoleva edited comment on SPARK-16190 at 9/19/20, 2:20 PM:
-----------------------------------------------------------------------

We faced similar issue and found possible problem. There is a possible race 
condition described in ticket https://issues.apache.org/jira/browse/SPARK-4592:
 (1) Master A fails and Worker attempts to reconnect to all masters
 (2) Master B takes over and notifies Worker
 (3) Worker responds by registering with Master B
 (4) Meanwhile, Worker's previous reconnection attempt reaches Master B, 
causing the same Worker to register with Master B twice

This race condition was partially fixed and worker tries now to reregister only 
with previously known master except for the first reregister try (usually first 
request times out as failover takes more than 10 seconds, but if failover 
happens fast enough then "duplicate worker" exception may happen).

Logs from the driver:
{code:java}
2020-08-04 06:23:57,976 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: x.x.x.22:7077 Disassociated !
 2020-08-04 06:23:57,976 ERROR dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: Connection to master failed! Waiting for 
master to reconnect...
 2020-08-04 06:23:57,977 INFO worker-register-master-threadpool-2 
org.apache.spark.deploy.worker.Worker: Connecting to master x.x.x.21:7077...
 2020-08-04 06:23:57,977 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: x.x.x.22:7077 Disassociated !
 2020-08-04 06:23:57,977 ERROR dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: Connection to master failed! Waiting for 
master to reconnect...
 ...
 2020-08-04 06:23:58,069 WARN worker-register-master-threadpool-3 
org.apache.spark.deploy.worker.Worker: Failed to connect to master 
x.x.x.22:7077...
 2020-08-04 06:23:58,090 INFO worker-register-master-threadpool-2 
org.apache.spark.deploy.worker.Worker: Sending master registration message: 
x.x.x.21:7077
 2020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: Master has changed, new master is at 
spark://x.x.x.21:70772020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: WorkerWebUI is available at 
http://x.x.x.21:8080/proxy/worker-20200804061729-x.x.x.47-407092020-08-04 
06:23:58,125 ERROR dispatcher-event-loop-1 
org.apache.spark.deploy.worker.Worker: Worker registration failed: Duplicate 
worker ID
 2020-08-04 06:23:58,129 INFO shutdown-hook-0 
org.apache.spark.deploy.worker.ExecutorRunner: Killing process!{code}
Logs from the master:
{code:java}
2020-08-04 06:23:57,635 INFO dispatcher-event-loop-1-EventThread 
org.apache.spark.deploy.master.ZooKeeperLeaderElectionAgent: We have gained 
leadership
 2020-08-04 06:23:57,787 INFO dispatcher-event-loop-1 
org.apache.spark.deploy.master.Master: I have been elected leader! New state: 
RECOVERING
 ..
 2020-08-04 06:23:58,119 INFO dispatcher-event-loop-0 
org.apache.spark.deploy.master.Master: Registering worker x.x.x.47:40709 with 4 
cores, 14.7 GB RAM
 2020-08-04 06:23:58,120 INFO dispatcher-event-loop-0 
org.apache.spark.deploy.master.Master: Worker x.x.x.47:40709 is already in the 
list of known worker, sending error 'Duplicate worker ID', id 
worker-20200804061729-x.x.x.47-40709
 2020-08-04 06:23:58,130 INFO dispatcher-event-loop-1 
org.apache.spark.deploy.master.Master: Worker has been re-registered: 
worker-20200804061729-x.x.x.47-40709
 2020-08-04 06:23:58,141 INFO dispatcher-event-loop-1 
org.apache.spark.deploy.master.Master: Recovery complete - resuming 
operations!{code}


was (Author: valentina_gl):
We faced similar issue and found possible problem. There is a possible race 
condition described in ticket https://issues.apache.org/jira/browse/SPARK-4592:
 (1) Master A fails and Worker attempts to reconnect to all masters
 (2) Master B takes over and notifies Worker
 (3) Worker responds by registering with Master B
 (4) Meanwhile, Worker's previous reconnection attempt reaches Master B, 
causing the same Worker to register with Master B twice

This race condition was partially fixed and worker tries now to reregister only 
with previously known master except for the first reregister try (usually first 
request times out as failover takes more than 10 seconds, but if failover 
happens fast enough then "duplicate worker" exception may happen).

Logs from the driver:
{code:java}
2020-08-04 06:23:57,976 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: x.x.x.22:7077 Disassociated !
 2020-08-04 06:23:57,976 ERROR dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: Connection to master failed! Waiting for 
master to reconnect...
 2020-08-04 06:23:57,977 INFO worker-register-master-threadpool-2 
org.apache.spark.deploy.worker.Worker: Connecting to master x.x.x.21:7077...
 2020-08-04 06:23:57,977 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: x.x.x.22:7077 Disassociated !
 2020-08-04 06:23:57,977 ERROR dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: Connection to master failed! Waiting for 
master to reconnect...
 ...
 2020-08-04 06:23:58,069 WARN worker-register-master-threadpool-3 
org.apache.spark.deploy.worker.Worker: Failed to connect to master 
x.x.x.22:7077...
 2020-08-04 06:23:58,090 INFO worker-register-master-threadpool-2 
org.apache.spark.deploy.worker.Worker: Sending master registration message: 
x.x.x.21:7077
 2020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: Master has changed, new master is at 
spark://x.x.x.21:70772020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: WorkerWebUI is available at 
http://x.x.x.21:8080/proxy/worker-20200804061729-x.x.x.47-407092020-08-04 
06:23:58,100 INFO dispatcher-event-loop-2 
 2020-08-04 06:23:58,100 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: Cancelling is done
 2020-08-04 06:23:58,102 INFO dispatcher-event-loop-2 
org.apache.spark.deploy.worker.Worker: MasterChanged: Sending master 
spark://x.x.x.21:7077my id worker-20200804061729-x.x.x.47-407092020-08-04 
06:23:58,125 ERROR dispatcher-event-loop-1 
org.apache.spark.deploy.worker.Worker: Worker registration failed: Duplicate 
worker ID
 2020-08-04 06:23:58,129 INFO shutdown-hook-0 
org.apache.spark.deploy.worker.ExecutorRunner: Killing process!{code}
Logs from the master:
{code:java}
2020-08-04 06:23:57,635 INFO dispatcher-event-loop-1-EventThread 
org.apache.spark.deploy.master.ZooKeeperLeaderElectionAgent: We have gained 
leadership
 2020-08-04 06:23:57,787 INFO dispatcher-event-loop-1 
org.apache.spark.deploy.master.Master: I have been elected leader! New state: 
RECOVERING
 ..
 2020-08-04 06:23:58,119 INFO dispatcher-event-loop-0 
org.apache.spark.deploy.master.Master: Registering worker x.x.x.47:40709 with 4 
cores, 14.7 GB RAM
 2020-08-04 06:23:58,120 INFO dispatcher-event-loop-0 
org.apache.spark.deploy.master.Master: Worker x.x.x.47:40709 is already in the 
list of known worker, sending error 'Duplicate worker ID', id 
worker-20200804061729-x.x.x.47-40709
 2020-08-04 06:23:58,130 INFO dispatcher-event-loop-1 
org.apache.spark.deploy.master.Master: Worker has been re-registered: 
worker-20200804061729-x.x.x.47-40709
 2020-08-04 06:23:58,141 INFO dispatcher-event-loop-1 
org.apache.spark.deploy.master.Master: Recovery complete - resuming 
operations!{code}

> Worker registration failed: Duplicate worker ID
> -----------------------------------------------
>
>                 Key: SPARK-16190
>                 URL: https://issues.apache.org/jira/browse/SPARK-16190
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Thomas Huang
>            Priority: Minor
>         Attachments: 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out
>
>
> Several worker crashed simultaneously due to this error: 
> Worker registration failed: Duplicate worker ID
> This is the worker log on one of those crashed workers:
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor 
> app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application 
> app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application 
> app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor 
> app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
> modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
> "/data/jdk1.7.0_60/bin/java" "-cp" 
> "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
>  "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" 
> "--hostname" "100.65.21.223" "--cores" "5" "--app-id" 
> "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master 
> spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application 
> directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor 
> app-20160624153031-0467/27 for SparkRealtimeRecommender
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
> modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
> "/data/jdk1.7.0_60/bin/java" "-cp" 
> "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
>  "-Xms61440M" "-Xmx61440M" "-Dspark.driver.port=50193" "-XX:MaxPermSize=256m" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@100.65.21.199:50193" "--executor-id" "27" 
> "--hostname" "100.65.21.223" "--cores" "46" "--app-id" 
> "app-20160624153031-0467" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 ERROR Worker: Worker registration failed: Duplicate worker 
> ID
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to