Are you sure the worker is writing heartbeats and the supervisor is reading them?
On Thursday, April 16, 2015 1:36 PM, Grant Overby (groverby) <grove...@cisco.com> wrote: The supervisor is reporting the worker “still hasn’t started” even though the worker is up and appears to be working. After timeout, the supervisor kills and restarts the worker. I set the timeouts really high (10 mins) to see what would happen. I though maybe a GC pause or something. The worker is still killed with these large timeouts, it just takes longer. The supervisor is able to establish a connection to the worker port:[root@twig07 storm]# netstat -pan | grep 6700tcp 0 0 :::6700 :::* LISTEN 6860/java tcp 0 0 ::ffff:10.0.1.7:6700 ::ffff:10.0.1.8:44324 ESTABLISHED 6860/java tcp 0 0 ::ffff:10.0.1.7:36940 ::ffff:10.0.1.8:6700 ESTABLISHED 6860/java tcp 0 256 ::ffff:10.0.1.7:36543 ::ffff:10.0.1.6:6700 ESTABLISHED 6860/java tcp 0 0 ::ffff:10.0.1.7:6700 ::ffff:10.0.1.5:38746 ESTABLISHED 6860/java tcp 0 0 ::ffff:10.0.1.7:59151 ::ffff:10.0.1.5:6700 ESTABLISHED 6860/java tcp 0 0 ::ffff:10.0.1.7:6700 ::ffff:10.0.1.6:59916 ESTABLISHED 6860/java What can I look at to diagnose this behavior? Supervisor log:2015-04-15 18:32:52 b.s.d.supervisor [INFO] 8e836654-931d-46d5-bf3f-cb75e394b113 still hasn't started2015-04-15 18:32:52 b.s.d.supervisor [INFO] 8e836654-931d-46d5-bf3f-cb75e394b113 still hasn't started2015-04-15 18:54:01 b.s.d.supervisor [INFO] Shutting down and clearing state for id 8e836654-931d-46d5-bf3f-cb75e394b113. Current supervisor time: 1429138441. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1429138440, :storm-id "burner-trident-1-1429131611", :executors #{[4 4] [36 36] [68 68] [100 100] [8 8] [40 40] [72 72] [104 104] [12 12] [44 44] [76 76] [108 108] [16 16] [48 48] [80 80] [112 112] [20 20] [52 52] [84 84] [116 116] [24 24] [56 56] [88 88] [120 120] [28 28] [60 60] [92 92] [124 124] [-1 -1] [32 32] [64 64] [96 96] [128 128]}, :port 6700}2015-04-15 18:54:01 b.s.d.supervisor [INFO] Shutting down c60221c8-1c96-41b5-9669-4fa8b74fc022:8e836654-931d-46d5-bf3f-cb75e394b1132015-04-15 18:54:01 b.s.config [INFO] GET worker-user 8e836654-931d-46d5-bf3f-cb75e394b1132015-04-15 18:54:02 b.s.config [INFO] REMOVE worker-user 8e836654-931d-46d5-bf3f-cb75e394b1132015-04-15 18:54:02 b.s.d.supervisor [INFO] Shut down c60221c8-1c96-41b5-9669-4fa8b74fc022:8e836654-931d-46d5-bf3f-cb75e394b113 Worker log for the same time period:2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig06.twigs:6667 for producing2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig08.twigs:6667 for producing2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig05.twigs:6667 for producing2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig07.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig08.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig05.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig06.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig07.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig07.twigs:6667 for producing2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig06.twigs:6667 for producing2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig08.twigs:6667 for producing2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig05.twigs:6667 for producing2015-04-15 18:44:06 s.k.t.ZkBrokerReader [INFO] brokers need refreshing because 60000ms have expired2015-04-15 18:44:06 s.k.DynamicBrokersReader [INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=twig07.twigs:6667, 1=twig08.twigs:6667, 2=twig05.twigs:6667, 3=twig06.twigs:6667}}2015-04-15 18:45:14 s.k.t.ZkBrokerReader [INFO] brokers need refreshing because 60000ms have expired2015-04-15 18:45:14 s.k.DynamicBrokersReader [INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=twig07.twigs:6667, 1=twig08.twigs:6667, 2=twig05.twigs:6667, 3=twig06.twigs:6667}}2015-04-15 18:54:01 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too large (30). Pinning to 292015-04-15 18:54:01 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [30]2015-04-15 18:54:01 b.s.m.n.Client [INFO] New Netty Client, connect to twig07.twigs, 6700, config: , buffer_size: 52428802015-04-15 18:54:01 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-twig07.twigs/10.0.1.7:6700... [0]2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-twig08.twigs/10.0.1.8:67002015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-twig08.twigs/10.0.1.8:6700..., timeout: 600000ms, pendings: 02015-04-15 18:54:01 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-twig07.twigs/10.0.1.7:6700, [id: 0x3a394433, /10.0.1.7:35479 => twig07.twigs/10.0.1.7:6700]2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down worker burner-trident-1-1429131611 c60221c8-1c96-41b5-9669-4fa8b74fc022 67002015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-twig07.twigs/10.0.1.7:67002015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-twig07.twigs/10.0.1.7:6700..., timeout: 600000ms, pendings: 02015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-twig06.twigs/10.0.1.6:67002015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-twig06.twigs/10.0.1.6:6700..., timeout: 600000ms, pendings: 02015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-twig05.twigs/10.0.1.5:67002015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-twig05.twigs/10.0.1.5:6700..., timeout: 600000ms, pendings: 02015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down receive thread2015-04-15 18:54:01 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too large (30). Pinning to 292015-04-15 18:54:01 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [30]2015-04-15 18:54:01 b.s.m.n.Client [INFO] New Netty Client, connect to localhost, 6700, config: , buffer_size: 52428802015-04-15 18:54:01 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-localhost/127.0.0.1:6700... [0]2015-04-15 18:54:01 b.s.m.loader [INFO] Shutting down receiving-thread: [burner-trident-1-1429131611, 6700]2015-04-15 18:54:01 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-localhost/127.0.0.1:6700, [id: 0x9e468dc6, /127.0.0.1:59930 => localhost/127.0.0.1:6700]2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-localhost/127.0.0.1:67002015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-localhost/127.0.0.1:6700..., timeout: 600000ms, pendings: 02015-04-15 18:54:01 b.s.m.loader [INFO] Waiting for receiving-thread:[burner-trident-1-1429131611, 6700] to die2015-04-15 18:54:01 b.s.m.loader [INFO] Shutdown receiving-thread: [burner-trident-1-1429131611, 6700]2015-04-15 18:54:01 b.s.d.worker [INFO] Shut down receive thread2015-04-15 18:54:01 b.s.d.worker [INFO] Terminating messaging context2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down executors2015-04-15 18:54:01 b.s.d.executor [INFO] Shutting down executor __acker:[4 4]2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!2015-04-15 18:54:01 b.s.d.executor [INFO] Shut down executor __acker:[4 4]2015-04-15 18:54:01 b.s.d.executor [INFO] Shutting down executor b-0-To-DC-Sink:[36 36]2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!2015-04-15 18:54:01 c.c.t.b.t.f.FlushingFunction [ERROR] sleeping in flushThread interruptedjava.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) [na:1.7.0_75] at com.cisco.tinderbox.burner.trident.functions.FlushingFunction$1.run(FlushingFunction.java:84) ~[stormjar.jar:na] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75] | | | | Grant Overby Software Engineer Cisco.com grove...@cisco.com Mobile: 865 724 4910 | | | | | Think before you print. | | This email may contain confidential and privileged material for the sole use of the intended recipient. Any review, use, distribution or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive for the recipient), please contact the sender by reply email and delete all copies of this message.Please click here for Company Registration Information. | |