Are you sure the worker is writing heartbeats and the supervisor is reading 
them?
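
If you're not sure, you can check directly. As far as I know the supervisor reads worker heartbeats from local disk (under storm.local.dir), not from ZooKeeper, so watching the heartbeat files for that worker id should tell you which side is failing. The path below is illustrative; substitute your configured storm.local.dir and the worker id from your supervisor log:

    # illustrative: default-style local dir, worker id taken from your supervisor log
    WORKER_ID=8e836654-931d-46d5-bf3f-cb75e394b113
    watch -n 1 "ls -l --time-style=full-iso /var/storm/workers/$WORKER_ID/heartbeats"

If the timestamps stop advancing, the worker isn't writing heartbeats; if they keep advancing while the supervisor still logs "still hasn't started", the supervisor isn't picking them up.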
 


On Thursday, April 16, 2015 1:36 PM, Grant Overby (groverby) <grove...@cisco.com> wrote:

The supervisor reports that the worker "still hasn't started" even though the worker is up and appears to be working. After the timeout expires, the supervisor kills and restarts the worker.
I set the timeouts very high (10 minutes) to see what would happen; I thought it might be a GC pause or something similar. The worker is still killed with these large timeouts; it just takes longer.
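
For reference, these are roughly the settings I raised in storm.yaml (values illustrative; 600s = 10 min). They are the supervisor-side checks for worker start and worker heartbeat timeouts:

    supervisor.worker.start.timeout.secs: 600
    supervisor.worker.timeout.secs: 600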
The supervisor is able to establish a connection to the worker port:

[root@twig07 storm]# netstat -pan | grep 6700
tcp        0      0 :::6700                     :::*                        LISTEN      6860/java
tcp        0      0 ::ffff:10.0.1.7:6700        ::ffff:10.0.1.8:44324       ESTABLISHED 6860/java
tcp        0      0 ::ffff:10.0.1.7:36940       ::ffff:10.0.1.8:6700        ESTABLISHED 6860/java
tcp        0    256 ::ffff:10.0.1.7:36543       ::ffff:10.0.1.6:6700        ESTABLISHED 6860/java
tcp        0      0 ::ffff:10.0.1.7:6700        ::ffff:10.0.1.5:38746       ESTABLISHED 6860/java
tcp        0      0 ::ffff:10.0.1.7:59151       ::ffff:10.0.1.5:6700        ESTABLISHED 6860/java
tcp        0      0 ::ffff:10.0.1.7:6700        ::ffff:10.0.1.6:59916       ESTABLISHED 6860/java


What can I look at to diagnose this behavior?




Supervisor log:

2015-04-15 18:32:52 b.s.d.supervisor [INFO] 8e836654-931d-46d5-bf3f-cb75e394b113 still hasn't started
2015-04-15 18:32:52 b.s.d.supervisor [INFO] 8e836654-931d-46d5-bf3f-cb75e394b113 still hasn't started
2015-04-15 18:54:01 b.s.d.supervisor [INFO] Shutting down and clearing state for id 8e836654-931d-46d5-bf3f-cb75e394b113. Current supervisor time: 1429138441. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1429138440, :storm-id "burner-trident-1-1429131611", :executors #{[4 4] [36 36] [68 68] [100 100] [8 8] [40 40] [72 72] [104 104] [12 12] [44 44] [76 76] [108 108] [16 16] [48 48] [80 80] [112 112] [20 20] [52 52] [84 84] [116 116] [24 24] [56 56] [88 88] [120 120] [28 28] [60 60] [92 92] [124 124] [-1 -1] [32 32] [64 64] [96 96] [128 128]}, :port 6700}
2015-04-15 18:54:01 b.s.d.supervisor [INFO] Shutting down c60221c8-1c96-41b5-9669-4fa8b74fc022:8e836654-931d-46d5-bf3f-cb75e394b113
2015-04-15 18:54:01 b.s.config [INFO] GET worker-user 8e836654-931d-46d5-bf3f-cb75e394b113
2015-04-15 18:54:02 b.s.config [INFO] REMOVE worker-user 8e836654-931d-46d5-bf3f-cb75e394b113
2015-04-15 18:54:02 b.s.d.supervisor [INFO] Shut down c60221c8-1c96-41b5-9669-4fa8b74fc022:8e836654-931d-46d5-bf3f-cb75e394b113


Worker log for the same time period:

2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig06.twigs:6667 for producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig08.twigs:6667 for producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig05.twigs:6667 for producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig07.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig08.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig05.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig06.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig07.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig07.twigs:6667 for producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig06.twigs:6667 for producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig08.twigs:6667 for producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig05.twigs:6667 for producing
2015-04-15 18:44:06 s.k.t.ZkBrokerReader [INFO] brokers need refreshing because 60000ms have expired
2015-04-15 18:44:06 s.k.DynamicBrokersReader [INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=twig07.twigs:6667, 1=twig08.twigs:6667, 2=twig05.twigs:6667, 3=twig06.twigs:6667}}
2015-04-15 18:45:14 s.k.t.ZkBrokerReader [INFO] brokers need refreshing because 60000ms have expired
2015-04-15 18:45:14 s.k.DynamicBrokersReader [INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=twig07.twigs:6667, 1=twig08.twigs:6667, 2=twig05.twigs:6667, 3=twig06.twigs:6667}}
2015-04-15 18:54:01 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too large (30). Pinning to 29
2015-04-15 18:54:01 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [30]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] New Netty Client, connect to twig07.twigs, 6700, config: , buffer_size: 5242880
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-twig07.twigs/10.0.1.7:6700... [0]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-twig08.twigs/10.0.1.8:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-twig08.twigs/10.0.1.8:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-twig07.twigs/10.0.1.7:6700, [id: 0x3a394433, /10.0.1.7:35479 => twig07.twigs/10.0.1.7:6700]
2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down worker burner-trident-1-1429131611 c60221c8-1c96-41b5-9669-4fa8b74fc022 6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-twig07.twigs/10.0.1.7:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-twig07.twigs/10.0.1.7:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-twig06.twigs/10.0.1.6:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-twig06.twigs/10.0.1.6:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-twig05.twigs/10.0.1.5:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-twig05.twigs/10.0.1.5:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down receive thread
2015-04-15 18:54:01 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too large (30). Pinning to 29
2015-04-15 18:54:01 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [30]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] New Netty Client, connect to localhost, 6700, config: , buffer_size: 5242880
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-localhost/127.0.0.1:6700... [0]
2015-04-15 18:54:01 b.s.m.loader [INFO] Shutting down receiving-thread: [burner-trident-1-1429131611, 6700]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-localhost/127.0.0.1:6700, [id: 0x9e468dc6, /127.0.0.1:59930 => localhost/127.0.0.1:6700]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-localhost/127.0.0.1:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-localhost/127.0.0.1:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.m.loader [INFO] Waiting for receiving-thread:[burner-trident-1-1429131611, 6700] to die
2015-04-15 18:54:01 b.s.m.loader [INFO] Shutdown receiving-thread: [burner-trident-1-1429131611, 6700]
2015-04-15 18:54:01 b.s.d.worker [INFO] Shut down receive thread
2015-04-15 18:54:01 b.s.d.worker [INFO] Terminating messaging context
2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down executors
2015-04-15 18:54:01 b.s.d.executor [INFO] Shutting down executor __acker:[4 4]
2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!
2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!
2015-04-15 18:54:01 b.s.d.executor [INFO] Shut down executor __acker:[4 4]
2015-04-15 18:54:01 b.s.d.executor [INFO] Shutting down executor b-0-To-DC-Sink:[36 36]
2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!
2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!
2015-04-15 18:54:01 c.c.t.b.t.f.FlushingFunction [ERROR] sleeping in flushThread interrupted
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method) [na:1.7.0_75]
        at com.cisco.tinderbox.burner.trident.functions.FlushingFunction$1.run(FlushingFunction.java:84) ~[stormjar.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]

Grant Overby
Software Engineer
Cisco.com
grove...@cisco.com
Mobile: 865 724 4910

 |
|  |


|  Think before you print. |
| This email may contain confidential and privileged material for the sole use 
of the intended recipient. Any review, use, distribution or disclosure by 
others is strictly prohibited. If you are not the intended recipient (or 
authorized to receive for the recipient), please contact the sender by reply 
email and delete all copies of this message.Please click here for Company 
Registration Information. |


 |




  

Reply via email to