zhangzhiyan created SPARK-17468:
-----------------------------------

             Summary: Cluster worker memory exceeded when master network bad more than one minute!
                 Key: SPARK-17468
                 URL: https://issues.apache.org/jira/browse/SPARK-17468
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.1
         Environment: CentOS 6.5, Spark standalone, 15 machines, 15 workers and 2 masters; worker, master, and driver run on the same machine
            Reporter: zhangzhiyan
            Priority: Critical

I work at a commercial company in China. Our production Spark standalone cluster crashed during the 9.9 sales event. The master log is below:

16/09/09 09:49:57 WARN Master: Removing worker-20160814124907-10.205.130.37-16590 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814113016-10.205.130.13-57487 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814134926-10.205.130.39-11430 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814131257-10.205.130.38-32160 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814161444-10.205.136.19-14196 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814141654-10.205.130.42-49707 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814115125-10.205.130.14-38381 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814152146-10.205.136.10-24730 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814122817-10.205.130.36-54348 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814170452-10.205.136.34-9921 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814154744-10.205.136.12-12399 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814150355-10.205.130.44-5792 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814143901-10.205.130.43-2223 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814141654-10.205.130.42-49707. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814115125-10.205.130.14-38381. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814134926-10.205.130.39-11430. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814131257-10.205.130.38-32160. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814150355-10.205.130.44-5792. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814154744-10.205.136.12-12399. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814161444-10.205.136.19-14196. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814113016-10.205.130.13-57487. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814152146-10.205.136.10-24730. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814143901-10.205.130.43-2223. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814122817-10.205.130.36-54348. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.

I think the code here may be wrong. When the master's network is down for longer than WORKER_TIMEOUT_MS, the master removes each worker and its executor information from its memory. When the workers later reconnect to the master, their old information has already been erased, even though they are still running the old executors. The workers then exceed their memory, and this crashes them.
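
To make the sequence described above easier to follow, here is a minimal Python sketch of it. It is a toy model written for this report, not Spark's Master code: the class `ToyMaster`, its methods, and the ids `worker-A`, `exec-1` and `exec-2` are all hypothetical. It also assumes, as the description suggests, that a re-registered worker starts with an empty executor list on the master side. Only the name WORKER_TIMEOUT_MS and the 60-second value come from the report and the log.

```python
WORKER_TIMEOUT_MS = 60_000  # the 60 seconds shown in the master log


class ToyMaster:
    """Toy model of a master's worker bookkeeping (hypothetical, not Spark code)."""

    def __init__(self):
        # worker_id -> {"last_heartbeat": ms, "executors": set of executor ids}
        self.workers = {}

    def register(self, worker_id, now_ms):
        # Assumption: a (re-)registration starts from an empty executor set.
        self.workers[worker_id] = {"last_heartbeat": now_ms, "executors": set()}

    def launch_executor(self, worker_id, executor_id):
        self.workers[worker_id]["executors"].add(executor_id)

    def heartbeat(self, worker_id, now_ms):
        if worker_id not in self.workers:
            # Mirrors "Got heartbeat from unregistered worker ... re-register".
            return "re-register"
        self.workers[worker_id]["last_heartbeat"] = now_ms
        return "ok"

    def time_out_dead_workers(self, now_ms):
        # Mirrors "Removing worker ... because we got no heartbeat in 60 seconds".
        dead = [w for w, info in self.workers.items()
                if now_ms - info["last_heartbeat"] > WORKER_TIMEOUT_MS]
        for w in dead:
            del self.workers[w]  # executor bookkeeping is dropped with the worker
        return dead


def simulate():
    master = ToyMaster()
    worker_id = "worker-A"
    running_on_worker = {"exec-1", "exec-2"}  # what the worker process really runs

    master.register(worker_id, now_ms=0)
    for executor_id in sorted(running_on_worker):
        master.launch_executor(worker_id, executor_id)

    # Network outage: no heartbeat reaches the master for 61 seconds.
    removed = master.time_out_dead_workers(now_ms=61_000)

    # Network recovers: the heartbeat comes from a worker the master has forgotten.
    reply = master.heartbeat(worker_id, now_ms=61_500)
    if reply == "re-register":
        master.register(worker_id, now_ms=61_500)

    known_to_master = master.workers[worker_id]["executors"]
    orphaned = running_on_worker - known_to_master
    return removed, reply, known_to_master, orphaned


if __name__ == "__main__":
    print(simulate())
```

In this model the worker comes back with no executors recorded on the master while `exec-1` and `exec-2` are still running on the machine, so any executors the master schedules next would be placed on top of the old ones. Whether Spark 1.6.1 behaves exactly this way would need to be confirmed against the Master source.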