[ https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796772#comment-16796772 ]
zuotingbing edited comment on SPARK-16190 at 3/20/19 3:59 AM:
--------------------------------------------------------------

I faced the same issue. The worker log is as follows:

{code:java}
2019-03-15 20:22:10,474 INFO Worker: Master has changed, new master is at spark://vmax17:7077
2019-03-15 20:22:14,862 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,863 INFO Worker: Connecting to master vmax18:7077...
2019-03-15 20:22:14,863 INFO Worker: Connecting to master vmax17:7077...
2019-03-15 20:22:14,865 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,865 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,868 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,868 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,871 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,871 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,879 ERROR Worker: Worker registration failed: Duplicate worker ID
2019-03-15 20:22:14,891 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,891 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,896 INFO ShutdownHookManager: Shutdown hook called
2019-03-15 20:22:14,898 INFO ShutdownHookManager: Deleting directory /data4/zdh/spark/tmp/spark-c578bf32-6a5e-44a5-843b-c796f44648ee
2019-03-15 20:22:14,908 INFO ShutdownHookManager: Deleting directory /data3/zdh/spark/tmp/spark-7e57e77d-cbb7-47d3-a6dd-737b57788533
2019-03-15 20:22:14,920 INFO ShutdownHookManager: Deleting directory /data2/zdh/spark/tmp/spark-0beebf20-abbd-4d99-a401-3ef0e88e0b05
{code}

> Worker registration failed: Duplicate worker ID
> -----------------------------------------------
>
> Key: SPARK-16190
> URL: https://issues.apache.org/jira/browse/SPARK-16190
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.6.1
> Reporter: Thomas Huang
> Priority: Minor
> Attachments: spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out,
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out,
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out,
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out
>
> Several workers crashed simultaneously due to this error:
> Worker registration failed: Duplicate worker ID
>
> This is the worker log on one of those crashed workers:
>
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mqq); users with modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: "/data/jdk1.7.0_60/bin/java" "-cp" "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" "--hostname" "100.65.21.223" "--cores" "5" "--app-id" "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor app-20160624153031-0467/27 for SparkRealtimeRecommender
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mqq); users with modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: "/data/jdk1.7.0_60/bin/java" "-cp" "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms61440M" "-Xmx61440M" "-Dspark.driver.port=50193" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@100.65.21.199:50193" "--executor-id" "27" "--hostname" "100.65.21.223" "--cores" "46" "--app-id" "app-20160624153031-0467" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 ERROR Worker: Worker registration failed: Duplicate worker ID
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!
> 16/06/24 16:31:18 INFO ExecutorRunner: Killing process!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
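For anyone debugging this: in both logs the worker is asked to reconnect while the master still holds a live entry for it, so the re-registration is rejected. A simplified Python model of that master-side check may help illustrate the sequence (this is illustrative only, not Spark's actual Scala code; the class and method names here are made up):

```python
# Toy model of a standalone master's worker registry. A worker that
# re-registers before its stale entry is removed hits the same
# "Duplicate worker ID" rejection seen in the logs above.

class ToyMaster:
    def __init__(self):
        self.id_to_worker = {}  # worker ID -> (host, port) of live workers

    def register_worker(self, worker_id, host, port):
        """Accept a registration only if the ID is not already tracked."""
        if worker_id in self.id_to_worker:
            # Path hit in the logs: the old entry was never cleaned up
            # before the reconnecting worker tried to register again.
            return "Worker registration failed: Duplicate worker ID"
        self.id_to_worker[worker_id] = (host, port)
        return "Successfully registered"

master = ToyMaster()
wid = "worker-20160624-100.65.21.223-46581"  # hypothetical worker ID
print(master.register_worker(wid, "100.65.21.223", 46581))
# The worker reconnects, but its stale entry is still in the registry:
print(master.register_worker(wid, "100.65.21.223", 46581))
# -> Worker registration failed: Duplicate worker ID
```

The repeated "requested this worker to reconnect" / "Not spawning another attempt" pairs in the logs suggest the reconnect storm races with registration, so the second attempt lands while the first entry is still present.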