Hi all,
I am experiencing a problem on a remote cluster which causes a running
topology to be killed after around 15-20 minutes. We are deploying to a
cluster with a nimbus and three worker nodes, with around 8GB RAM each.
This cluster shares a zookeeper cluster with another storm cluster (version
0.9.3) which runs topologies without issue. The storm 0.9.3 cluster has a
different zookeeper root.
Nothing else is running on the cluster. There doesn't appear to be any
error in the logs, the supervisor simply states that it's shutting down the
worker, and then the topology is promptly shut down. The worker's state is
set to disabled according to the supervisor. We have already tried setting
nimbus.task.timeout.secs: 600
nimbus.supervisor.timeout.secs: 600
in the storm.yaml but that did not appear to help. I have enabled gc
logging, but it doesn't appear to be logging anything. It would be very
surprising to me if system load would be causing this, since the topology
does virtually nothing. Can anybody figure out what's going on here?
Logs will be posted below.
*Our storm.yaml:*
nimbus.host: stg1examplestorm2-nimbus01.example.com
nimbus.thrift.port: 6627
nimbus.thrift.max_buffer_size: 32000000
storm.local.dir: /data/storm
storm.zookeeper.servers:
- stg1examplezk-storm01.example.com
- stg1examplezk-storm02.example.com
- stg1examplezk-storm03.example.com
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
- 6704
- 6705
- 6706
- 6707
- 6708
- 6709
- 6710
- 6711
- 6712
- 6713
- 6714
- 6715
- 6716
- 6717
- 6718
- 6719
- 6720
ui.port: 6060
storm.log.dir: /var/log/storm
storm.zookeeper.root: /storm_0_10
*Topology logs:*
[a lot of this and zookeeper responses]
2017-07-17 17:03:00.574 b.s.m.n.Client [DEBUG] writing 1 messages to
channel [id: 0x6ee64433, /10.202.144.23:42166 =>
stg1examplestorm2-node02.example.com/10.202.144.20:6720]
2017-07-17 17:03:00.574 b.s.m.n.Client [DEBUG] writing 1 messages to
channel [id: 0x6ee64433, /10.202.144.23:42166 =>
stg1examplestorm2-node02.example.com/10.202.144.20:6720]
2017-07-17 17:03:00.575 b.s.m.n.Client [DEBUG] sent 1 messages to
Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
2017-07-17 17:03:00.575 b.s.m.n.Client [DEBUG] sent 1 messages to
Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
2017-07-17 17:03:00.613 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got notification
sessionid:0x35d4744d5c10011
2017-07-17 17:03:00.614 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got WatchedEvent
state:SyncConnected type:NodeDeleted
path:/assignments/excla-testcla-1-1500324928 for sessionid 0x35d4744d5c10011
2017-07-17 17:03:00.615 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace:
EventListener - 0 ms
2017-07-17 17:03:00.615 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace:
connection-state-parent-process - 0 ms
2017-07-17 17:03:00.637 b.s.d.worker [INFO] Shutting down worker
excla-testcla-1-1500324928 13751591-813e-4721-ac3d-55084e32768f 6720
2017-07-17 17:03:00.638 b.s.m.n.Client [INFO] closing Netty Client
Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
2017-07-17 17:03:00.638 b.s.m.n.Client [INFO] waiting up to 600000 ms to
send 0 pending messages to
Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
2017-07-17 17:03:00.639 b.s.m.n.Client [DEBUG] channel to
Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720 closed
2017-07-17 17:03:00.639 b.s.m.n.Client [INFO] closing Netty Client
Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720
2017-07-17 17:03:00.639 b.s.m.n.Client [INFO] waiting up to 600000 ms to
send 0 pending messages to
Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720
2017-07-17 17:03:00.639 b.s.m.n.Client [DEBUG] channel to
Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720 closed
2017-07-17 17:03:00.639 b.s.d.worker [INFO] Shutting down receive thread
2017-07-17 17:03:00.640 b.s.m.n.Client [INFO] creating Netty Client,
connecting to stg1examplestorm2-node03.example.com:6720, bufferSize: 5242880
2017-07-17 17:03:00.640 o.a.s.s.o.a.c.r.ExponentialBackoffRetry [WARN]
maxRetries too large (300). Pinning to 29
2017-07-17 17:03:00.640 b.s.u.StormBoundedExponentialBackoffRetry [INFO]
The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [300]
2017-07-17 17:03:00.703 b.s.m.loader [INFO] Shutting down receiving-thread:
[excla-testcla-1-1500324928, 6720]
2017-07-17 17:03:00.703 b.s.m.loader [INFO] Waiting for
receiving-thread:[excla-testcla-1-1500324928, 6720] to die
2017-07-17 17:03:00.704 b.s.m.loader [INFO] Shutdown receiving-thread:
[excla-testcla-1-1500324928, 6720]
2017-07-17 17:03:00.704 b.s.d.worker [INFO] Shut down receive thread
2017-07-17 17:03:00.704 b.s.d.worker [INFO] Terminating messaging context
2017-07-17 17:03:00.705 b.s.d.worker [INFO] Shutting down executors
2017-07-17 17:03:00.705 b.s.d.executor [INFO] Shutting down executor
word:[12 12]
2017-07-17 17:03:00.706 b.s.util [INFO] Async loop interrupted!
2017-07-17 17:03:00.713 b.s.util [INFO] Async loop interrupted!
etc...
*supervisor logs:*
2017-07-17 17:02:58.142 b.s.d.supervisor [DEBUG] Syncing processes
2017-07-17 17:02:58.142 b.s.d.supervisor [DEBUG] Assigned executors: {6720
{:storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [16 16]
[13 13] [10 10] [4 4]]}}
2017-07-17 17:02:58.143 b.s.d.supervisor [DEBUG] Allocated:
{"e96fc1c2-eeab-40f1-8639-ede11233415b" [:valid {:time-secs 1500325377,
:storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [-1 -1] [16
16] [13 13] [10 10] [4 4]], :port 6720}]}
2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got notification
sessionid:0x35d4744d5c10010
2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got WatchedEvent
state:SyncConnected type:NodeDeleted
path:/assignments/excla-testcla-1-1500324928 for sessionid 0x35d4744d5c10010
2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got notification
sessionid:0x35d4744d5c10010
2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got WatchedEvent
state:SyncConnected type:NodeChildrenChanged path:/assignments for
sessionid 0x35d4744d5c10010
2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace:
EventListener - 0 ms
2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace:
connection-state-parent-process - 0 ms
2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace:
EventListener - 0 ms
2017-07-17 17:03:00.744 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace:
connection-state-parent-process - 0 ms
2017-07-17 17:03:00.744 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Reading reply
sessionid:0x35d4744d5c10010, packet:: clientPath:null serverPath:null
finished:false header:: 482,12 replyHeader:: 482,81609600378,0 request::
'/storm_0_10/assignments,T response:: v{},s{34381319$
47,34381319647,1491855399679,1491855399679,0,78,0,0,1,0,81609600378}
2017-07-17 17:03:00.744 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace:
GetChildrenBuilderImpl-Foreground - 1 ms
2017-07-17 17:03:00.745 b.s.d.supervisor [DEBUG] Synchronizing supervisor
2017-07-17 17:03:00.745 b.s.d.supervisor [DEBUG] Storm code map: {}
2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] Downloaded storm ids:
#{"excla-testcla-1-1500324928"}
2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] All assignment:
2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] New assignment: {}
2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] Writing new assignment {}
2017-07-17 17:03:00.748 b.s.d.supervisor [INFO] Removing code for storm id
excla-testcla-1-1500324928
2017-07-17 17:03:00.750 b.s.util [DEBUG] Rmr path
/data/storm/supervisor/stormdist/excla-testcla-1-1500324928
2017-07-17 17:03:00.753 b.s.u.LocalState [DEBUG] New Local State for
/data/storm/workers/e96fc1c2-eeab-40f1-8639-ede11233415b/heartbeats
2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Worker
e96fc1c2-eeab-40f1-8639-ede11233415b is :disallowed: {:time-secs
1500325379, :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1]
[-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720} at supervisor time-se
cs 1500325380
2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Syncing processes
2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Assigned executors: {}
2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Allocated:
{"e96fc1c2-eeab-40f1-8639-ede11233415b" [:disallowed {:time-secs
1500325379, :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1]
[-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720}]}
2017-07-17 17:03:00.755 b.s.d.supervisor [INFO] Shutting down and clearing
state for id e96fc1c2-eeab-40f1-8639-ede11233415b. Current supervisor time:
1500325380. State: :disallowed, Heartbeat: {:time-secs 1500325379,
:storm-id "excla-testcla-1-1500324928", :executors [
[7 7] [1 1] [-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720}
2017-07-17 17:03:00.755 b.s.d.supervisor [INFO] Shutting down
404f7dd0-363c-4161-a19e-c95e56cbe96e:e96fc1c2-eeab-40f1-8639-ede11233415b
2017-07-17 17:03:00.755 b.s.config [INFO] GET worker-user
e96fc1c2-eeab-40f1-8639-ede11233415b