Worker always times out on heartbeat and is restarted by supervisor

2014-08-20 Thread DashengJu
Hi, all

In our production environment, we have a topology named
logparser_mobile_nginx. It has 50 workers; the spout has 48 executors,
bolt_parser has 1000 executors, and bolt_saver has 50 executors.

The topology runs normally most of the time, but 1~5 workers are restarted
every 1~2 hours. When we look at the supervisor and worker logs, we find
that 1) the worker has no error or exception; 2) the supervisor says the
worker did not heartbeat and timed out.

Because the worker has no log output, I do not know why it stopped
heartbeating. Does anyone have ideas on how to investigate? One way to
gather evidence is sketched after these questions.
0) Did the worker process itself exit?
1) Is it related to a GC problem?
2) Is it related to a memory problem? If so, I would expect the JVM to
report an OutOfMemoryError in the worker log.
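One way to gather evidence for 0)-2) is to enable GC logging in the worker
JVMs, so the next timeout leaves a trace even though the worker log itself
is silent. A minimal sketch, assuming Storm 0.9.x with the backtype.storm
API and a HotSpot JVM; the heap size and log path are placeholders, and
%ID% (replaced by the worker port, so workers sharing a host do not clobber
each other's logs) is substituted only if your Storm version supports it in
childopts:

    import backtype.storm.Config;

    Config conf = new Config();
    // Per-topology override of worker.childopts: applies only to this
    // topology's workers. -Xmx2048m and the log path are placeholders.
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-Xmx2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
        + " -Xloggc:/var/log/storm/gc-worker-%ID%.log");

If the GC log then shows full collections whose pauses approach the
supervisor timeout, that points to 1)/2); if the log simply stops, the
process probably died, which points to 0).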

By the way, some small topologies work well in the same environment.

Below is the supervisor log:
--
2014-08-20 15:51:33 b.s.d.supervisor [INFO] 90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started

2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id logparser_mobile_nginx-259-1408518662, :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901. Process is probably already dead.
2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921. Process is probably already dead.
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de

2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id logparser_nginx-265-1408521077, :executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068. Process is probably already dead.
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1

2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id logparser_mobile_nginx-259-1408518662, :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976. Process is probably already dead.
2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986. Process is probably already dead.
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757

2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id app_upload_urls-218-1408503096, :executors #{[8 8] [40 40] [24 24] [-1 -1]}, :port 9713}
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177. Process is probably already dead.
2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba

-- 
dashengju
+86 13810875910
dashen...@gmail.com


Re: Worker always times out on heartbeat and is restarted by supervisor

2014-08-20 Thread Derek Dagit

1) Is it related to a GC problem?


This is usually the cause.  As the worker consumes more and more of the heap,
GC takes longer and longer each time.  Eventually it takes so long that the
heartbeats to the supervisor do not happen.

There could be a spike or skew in your data such that one or more workers 
cannot handle it with their heap settings.
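Two details fit this picture. First, the arithmetic in the supervisor log
above: the last heartbeat (:time-secs 1408521645) is 31 seconds behind the
supervisor's clock (1408521676), just past the default
supervisor.worker.timeout.secs of 30, which is consistent with one long GC
pause rather than a dead process. Second, if heap pressure from a spike is
the cause, two knobs can help; a minimal sketch, with illustrative values
that are assumptions, not recommendations from this thread:

    import backtype.storm.Config;

    Config conf = new Config();
    // Give this topology's workers a larger heap than the cluster default
    // (placeholder value; size it to your data).
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2048m");
    // Cap in-flight tuples per spout task so a burst cannot pile up on the
    // heap. Takes effect only when tuples are anchored and acked.
    conf.setMaxSpoutPending(5000);

Raising supervisor.worker.timeout.secs in storm.yaml buys headroom while
investigating, but it hides long pauses rather than fixing them.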

--
Derek
