zhangzhiyan created SPARK-17468:
-----------------------------------

             Summary: Cluster worker memory exceeded when master network is bad for more than one minute
                 Key: SPARK-17468
                 URL: https://issues.apache.org/jira/browse/SPARK-17468
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.1
         Environment: CentOS 6.5, Spark standalone, 15 machines, 15 workers and 
2 masters; workers, masters, and drivers run on the same machines
            Reporter: zhangzhiyan
            Priority: Critical


I work at a commercial company in China. Our production Spark standalone 
cluster crashed during the 9.9 (September 9) sales event. The master log is 
below:

16/09/09 09:49:57 WARN Master: Removing worker-20160814124907-10.205.130.37-16590 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814113016-10.205.130.13-57487 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814134926-10.205.130.39-11430 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814131257-10.205.130.38-32160 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814161444-10.205.136.19-14196 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814141654-10.205.130.42-49707 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814115125-10.205.130.14-38381 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814152146-10.205.136.10-24730 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814122817-10.205.130.36-54348 because we got no heartbeat in 60 seconds
16/09/09 09:49:57 WARN Master: Removing worker-20160814170452-10.205.136.34-9921 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814154744-10.205.136.12-12399 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814150355-10.205.130.44-5792 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Removing worker-20160814143901-10.205.130.43-2223 because we got no heartbeat in 60 seconds
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814141654-10.205.130.42-49707. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814115125-10.205.130.14-38381. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814134926-10.205.130.39-11430. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814131257-10.205.130.38-32160. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814150355-10.205.130.44-5792. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814154744-10.205.136.12-12399. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814161444-10.205.136.19-14196. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814113016-10.205.130.13-57487. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814152146-10.205.136.10-24730. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814143901-10.205.130.43-2223. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814122817-10.205.130.36-54348. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.


I think the code here may be wrong. When the master's network is down for 
longer than WORKER_TIMEOUT_MS, the master removes the worker and executor 
information from its memory. When the workers reconnect to the master, their 
old information has already been erased even though they are still running the 
old executors, and that is what crashed my workers.
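
To make the sequence concrete, here is a rough Python sketch of the bookkeeping 
as I understand it. It is a toy model, not Spark's actual Master code: the 
names ToyMaster, WorkerInfo, register, heartbeat and remove_timed_out are made 
up for illustration, and the memory sizes are invented. Only WORKER_TIMEOUT_MS 
and the 60-second value come from the report and the log.

```python
from dataclasses import dataclass, field

WORKER_TIMEOUT_MS = 60_000  # the 60-second timeout seen in the master log


@dataclass
class WorkerInfo:
    """Master-side record of one worker (hypothetical, simplified)."""
    worker_id: str
    memory_mb: int
    last_heartbeat_ms: int
    executors: dict = field(default_factory=dict)  # executor id -> memory in MB

    def memory_free_mb(self) -> int:
        return self.memory_mb - sum(self.executors.values())


class ToyMaster:
    """Toy model of the master's bookkeeping; not Spark's real Master class."""

    def __init__(self):
        self.workers = {}

    def register(self, worker_id, memory_mb, now_ms):
        # A (re-)registration creates a fresh record with no executors.
        self.workers[worker_id] = WorkerInfo(worker_id, memory_mb, now_ms)

    def heartbeat(self, worker_id, now_ms):
        worker = self.workers.get(worker_id)
        if worker is None:
            return "re-register"  # "Got heartbeat from unregistered worker"
        worker.last_heartbeat_ms = now_ms
        return "ok"

    def remove_timed_out(self, now_ms):
        dead = [w for w in self.workers.values()
                if now_ms - w.last_heartbeat_ms > WORKER_TIMEOUT_MS]
        for w in dead:
            del self.workers[w.worker_id]  # the executor records go with it
        return [w.worker_id for w in dead]


master = ToyMaster()
master.register("worker-1", memory_mb=8192, now_ms=0)
master.workers["worker-1"].executors["exec-0"] = 6144  # running on the worker

# The master's network is down for more than WORKER_TIMEOUT_MS.
removed = master.remove_timed_out(now_ms=61_000)

# Connectivity returns; the worker process is still running exec-0.
reply = master.heartbeat("worker-1", now_ms=61_500)
master.register("worker-1", memory_mb=8192, now_ms=62_000)

free_per_master = master.workers["worker-1"].memory_free_mb()
print(removed, reply, free_per_master)
```

In this model the master believes worker-1 has all 8192 MB free after 
re-registration, while 6144 MB is still held by the old executor. If the 
master then schedules new executors against that figure, the worker is 
oversubscribed, which would match the memory problem described above.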



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)