We've recently implemented a Storm topology that pulls data from one DB, performs some simple transformations, and populates another DB. The spout and bolt code is all pretty simple and doesn't retain any data (other than a DB connection pool). This topology is working through about 400 million DB records.
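For context, the wiring is just the standard read-transform-write pattern. A minimal sketch of what we're running (class names here are hypothetical stand-ins for our actual spout and bolts, which page rows out of the source DB and insert into the target):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class DbCopyTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // DbReaderSpout / TransformBolt / DbWriterBolt are hypothetical names
        // standing in for our real spout and bolts.
        builder.setSpout("db-reader", new DbReaderSpout(), 8);
        builder.setBolt("transform", new TransformBolt(), 32)
               .shuffleGrouping("db-reader");
        builder.setBolt("db-writer", new DbWriterBolt(), 32)
               .shuffleGrouping("transform");

        Config conf = new Config();
        conf.setNumWorkers(8); // one worker per VM
        // topology name is passed on the command line
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    }
}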
We have a cluster of 8 VMs running Storm (0.9.3-rc1, per the paths in the logs below) and a single VM running ZooKeeper (3.4.5). The ZooKeeper node never shows much load.
We are periodically seeing behavior where ZooKeeper and the Storm topology seem to "lose track" of each other. The supervisor.log files show a loop of process restarts:
2014-11-17 00:54:58 b.s.util [INFO] Error when trying to kill 5109. Process is probably already dead.
2014-11-17 00:54:59 b.s.util [INFO] Error when trying to kill 5109. Process is probably already dead.
2014-11-17 00:54:59 b.s.d.supervisor [INFO] Shut down de55f9bf-d630-4e11-bc8c-9bbaa9899212:d6d231cc-2f2c-4bad-924e-570c6357d97d
2014-11-17 00:54:59 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "swhite-test-2014-11-14-13-58-21-3-1416002313", :executors ([5 5] [37 37] [13 13] [21 21] [29 29])} for this supervisor de55f9bf-d630-4e11-bc8c-9bbaa9899212 on port 6702 with id 59b3b5d9-bd9c-4da7-82e4-38172f7dc232
2014-11-17 00:54:59 b.s.d.supervisor [INFO] Launching worker with command: 'java' '-server' '-Xmx2048m' '-XX:+UseConcMarkSweepGC' '-XX:+UseParNewGC' '-XX:+UseConcMarkSweepGC' '-XX:NewSize=128m' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:-CMSConcurrentMTEnabled' '-Djava.net.preferIPv4Stack=true' '-Djava.library.path=/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/resources/Linux-amd64:/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/resources:/usr/local/lib:/opt/local/lib:/usr/lib' '-Dlogfile.name=worker-6702.log' '-Dstorm.home=/opt/apache-storm-0.9.3-rc1' '-Dstorm.log.dir=/opt/apache-storm-0.9.3-rc1/logs' '-Dlogback.configurationFile=/opt/apache-storm-0.9.3-rc1/logback/cluster.xml' '-Dstorm.id=swhite-test-2014-11-14-13-58-21-3-1416002313' '-Dworker.id=59b3b5d9-bd9c-4da7-82e4-38172f7dc232' '-Dworker.port=6702' '-cp' '/opt/apache-storm-0.9.3-rc1/lib/math.numeric-tower-0.0.1.jar:/opt/apache-storm-0.9.3-rc1/lib/slf4j-api-1.7.5.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-devel-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/disruptor-3.2.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jetty-6.1.26.jar:/opt/apache-storm-0.9.3-rc1/lib/logback-core-1.0.13.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-jetty-adapter-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/carbonite-1.4.0.jar:/opt/apache-storm-0.9.3-rc1/lib/compojure-1.1.3.jar:/opt/apache-storm-0.9.3-rc1/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-0.9.3-rc1/lib/chill-java-0.3.5.jar:/opt/apache-storm-0.9.3-rc1/lib/logback-classic-1.0.13.jar:/opt/apache-storm-0.9.3-rc1/lib/minlog-1.2.jar:/opt/apache-storm-0.9.3-rc1/lib/kryo-2.21.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.macro-0.1.0.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-exec-1.1.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-io-2.4.jar:/opt/apache-storm-0.9.3-rc1/lib/clojure-1.5.1.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-servlet-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/storm-core-0.9.3-rc1.jar:/opt/apache-storm-0.9.3-rc1/lib/core.incubator-0.1.0.jar:/opt/apache-storm-0.9.3-rc1/lib/asm-4.0.jar:/opt/apache-storm-0.9.3-rc1/lib/jetty-util-6.1.26.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-lang-2.5.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-logging-1.1.3.jar:/opt/apache-storm-0.9.3-rc1/lib/servlet-api-2.5.jar:/opt/apache-storm-0.9.3-rc1/lib/clj-time-0.4.1.jar:/opt/apache-storm-0.9.3-rc1/lib/objenesis-1.2.jar:/opt/apache-storm-0.9.3-rc1/lib/json-simple-1.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jgrapht-core-0.9.0.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-fileupload-1.2.1.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-core-1.1.5.jar:/opt/apache-storm-0.9.3-rc1/lib/snakeyaml-1.11.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-codec-1.6.jar:/opt/apache-storm-0.9.3-rc1/lib/joda-time-2.0.jar:/opt/apache-storm-0.9.3-rc1/lib/clout-1.0.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jline-2.11.jar:/opt/apache-storm-0.9.3-rc1/lib/hiccup-0.3.6.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.cli-0.2.4.jar:/opt/apache-storm-0.9.3-rc1/lib/reflectasm-1.07-shaded.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.logging-0.2.3.jar:/opt/apache-storm-0.9.3-rc1/lib/clj-stacktrace-0.2.2.jar:/opt/apache-storm-0.9.3-rc1/conf:/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/stormjar.jar' 'backtype.storm.daemon.worker' 'swhite-test-2014-11-14-13-58-21-3-1416002313' 'de55f9bf-d630-4e11-bc8c-9bbaa9899212' '6702' '59b3b5d9-bd9c-4da7-82e4-38172f7dc232'
2014-11-17 00:54:59 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:54:59 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:00 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:00 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:01 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:01 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:02 b.s.d.supervisor [INFO] Shutting down and clearing state for id 59b3b5d9-bd9c-4da7-82e4-38172f7dc232. Current supervisor time: 1416185702. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1416185701, :storm-id "swhite-test-2014-11-14-13-58-21-3-1416002313", :executors #{[-1 -1]}, :port 6702}
2014-11-17 00:55:02 b.s.d.supervisor [INFO] Shutting down de55f9bf-d630-4e11-bc8c-9bbaa9899212:59b3b5d9-bd9c-4da7-82e4-38172f7dc232
2014-11-17 00:55:03 b.s.util [INFO] Error when trying to kill 5174. Process is probably already dead.
...
We're also seeing this in the worker logs when things are stalled out:
...
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:20 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [220]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:21 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [221]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:22 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [222]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:23 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [223]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:24 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [224]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:24 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [225]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:25 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [226]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:26 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [227]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:27 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [228]
...
We've tried increasing the storm.zookeeper timeouts to 5 minutes:
storm.zookeeper.session.timeout: 300000
storm.zookeeper.connection.timeout: 300000
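(Those are set in storm.yaml on every node.) One thing we're not sure about: as far as we understand, the ZooKeeper server negotiates the session timeout and caps it at its maxSessionTimeout, which defaults to 20 x tickTime, i.e. 40 seconds, so our 5-minute request may not be fully taking effect. Would we also need something like the following in zoo.cfg? (Hypothetical; we have not set this on our server.)

# zoo.cfg -- raise the cap so clients can actually negotiate a 5-minute session
maxSessionTimeout=300000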
These settings allow the cluster to run much more stably, with good topology performance. But it still occasionally seems to stall out, and a manual rebalance request is required to get it rolling again (the exact command we use is shown below). All VMs have fixed DNS entries, and there don't seem to be any network issues at all.
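For reference, the manual kick is just the stock rebalance from the Storm CLI, along these lines (topology name elided; the wait time is arbitrary):

storm rebalance <topology-name> -w 60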
This behavior is pretty much identical to something reported in August:
http://mail-archives.apache.org/mod_mbox/storm-user/201408.mbox/%3CFF975F8738DEE84AABF1BF95A4248E6B6A8F5393%40mvexe1.integral.com%3E
but that thread never really got resolved.
Question 1: Is there any reason to think we'd need additional ZooKeeper servers? The one server we're using doesn't show any significant load.
Question 2: Is it normal to have to manually rebalance a running topology? It would probably work, but it seems like something's not right.
Question 3: What other clues should I be looking for?
Thanks in advance
-sheldon white-
[email protected]