We've recently implemented a Storm topology that pulls data from one DB, performs some simple transformations, and populates another DB. The spout and bolt code is all pretty simple and doesn't retain any data (other than a DB connection pool). This topology is working through about 400 million DB records.
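For context, the wiring is just the standard read-transform-write pattern. A minimal sketch of what we're running (class names here are hypothetical stand-ins for our actual spout and bolts, which page rows out of the source DB and insert into the target):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class DbCopyTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // DbReaderSpout / TransformBolt / DbWriterBolt are hypothetical names
        // standing in for our real spout and bolts.
        builder.setSpout("db-reader", new DbReaderSpout(), 8);
        builder.setBolt("transform", new TransformBolt(), 32)
               .shuffleGrouping("db-reader");
        builder.setBolt("db-writer", new DbWriterBolt(), 32)
               .shuffleGrouping("transform");

        Config conf = new Config();
        conf.setNumWorkers(8); // one worker per VM
        // topology name is passed on the command line
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    }
}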
We have a cluster of 8 VMs running Storm (0.9.3-rc1, per the paths in the logs below) and a single VM running ZooKeeper (3.4.5). The ZooKeeper node never shows much load.
We are periodically seeing behavior where ZooKeeper and the Storm topology seem to "lose track" of each other. The supervisor.log files show a loop of process restarts:
2014-11-17 00:54:58 b.s.util [INFO] Error when trying to kill 5109. Process is probably already dead.
2014-11-17 00:54:59 b.s.util [INFO] Error when trying to kill 5109. Process is probably already dead.
2014-11-17 00:54:59 b.s.d.supervisor [INFO] Shut down de55f9bf-d630-4e11-bc8c-9bbaa9899212:d6d231cc-2f2c-4bad-924e-570c6357d97d
2014-11-17 00:54:59 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "swhite-test-2014-11-14-13-58-21-3-1416002313", :executors ([5 5] [37 37] [13 13] [21 21] [29 29])} for this supervisor de55f9bf-d630-4e11-bc8c-9bbaa9899212 on port 6702 with id 59b3b5d9-bd9c-4da7-82e4-38172f7dc232
2014-11-17 00:54:59 b.s.d.supervisor [INFO] Launching worker with command: 'java' '-server' '-Xmx2048m' '-XX:+UseConcMarkSweepGC' '-XX:+UseParNewGC' '-XX:+UseConcMarkSweepGC' '-XX:NewSize=128m' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:-CMSConcurrentMTEnabled' '-Djava.net.preferIPv4Stack=true' '-Djava.library.path=/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/resources/Linux-amd64:/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/resources:/usr/local/lib:/opt/local/lib:/usr/lib' '-Dlogfile.name=worker-6702.log' '-Dstorm.home=/opt/apache-storm-0.9.3-rc1' '-Dstorm.log.dir=/opt/apache-storm-0.9.3-rc1/logs' '-Dlogback.configurationFile=/opt/apache-storm-0.9.3-rc1/logback/cluster.xml' '-Dstorm.id=swhite-test-2014-11-14-13-58-21-3-1416002313' '-Dworker.id=59b3b5d9-bd9c-4da7-82e4-38172f7dc232' '-Dworker.port=6702' '-cp' '/opt/apache-storm-0.9.3-rc1/lib/math.numeric-tower-0.0.1.jar:/opt/apache-storm-0.9.3-rc1/lib/slf4j-api-1.7.5.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-devel-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/disruptor-3.2.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jetty-6.1.26.jar:/opt/apache-storm-0.9.3-rc1/lib/logback-core-1.0.13.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-jetty-adapter-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/carbonite-1.4.0.jar:/opt/apache-storm-0.9.3-rc1/lib/compojure-1.1.3.jar:/opt/apache-storm-0.9.3-rc1/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-0.9.3-rc1/lib/chill-java-0.3.5.jar:/opt/apache-storm-0.9.3-rc1/lib/logback-classic-1.0.13.jar:/opt/apache-storm-0.9.3-rc1/lib/minlog-1.2.jar:/opt/apache-storm-0.9.3-rc1/lib/kryo-2.21.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.macro-0.1.0.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-exec-1.1.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-io-2.4.jar:/opt/apache-storm-0.9.3-rc1/lib/clojure-1.5.1.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-servlet-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/storm-core-0.9.3-rc1.jar:/opt/apache-storm-0.9.3-rc1/lib/core.incubator-0.1.0.jar:/opt/apache-storm-0.9.3-rc1/lib/asm-4.0.jar:/opt/apache-storm-0.9.3-rc1/lib/jetty-util-6.1.26.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-lang-2.5.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-logging-1.1.3.jar:/opt/apache-storm-0.9.3-rc1/lib/servlet-api-2.5.jar:/opt/apache-storm-0.9.3-rc1/lib/clj-time-0.4.1.jar:/opt/apache-storm-0.9.3-rc1/lib/objenesis-1.2.jar:/opt/apache-storm-0.9.3-rc1/lib/json-simple-1.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jgrapht-core-0.9.0.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-fileupload-1.2.1.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-core-1.1.5.jar:/opt/apache-storm-0.9.3-rc1/lib/snakeyaml-1.11.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-codec-1.6.jar:/opt/apache-storm-0.9.3-rc1/lib/joda-time-2.0.jar:/opt/apache-storm-0.9.3-rc1/lib/clout-1.0.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jline-2.11.jar:/opt/apache-storm-0.9.3-rc1/lib/hiccup-0.3.6.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.cli-0.2.4.jar:/opt/apache-storm-0.9.3-rc1/lib/reflectasm-1.07-shaded.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.logging-0.2.3.jar:/opt/apache-storm-0.9.3-rc1/lib/clj-stacktrace-0.2.2.jar:/opt/apache-storm-0.9.3-rc1/conf:/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/stormjar.jar' 'backtype.storm.daemon.worker' 'swhite-test-2014-11-14-13-58-21-3-1416002313' 'de55f9bf-d630-4e11-bc8c-9bbaa9899212' '6702' '59b3b5d9-bd9c-4da7-82e4-38172f7dc232'
2014-11-17 00:54:59 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:54:59 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:00 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:00 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:01 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:01 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
2014-11-17 00:55:02 b.s.d.supervisor [INFO] Shutting down and clearing state for id 59b3b5d9-bd9c-4da7-82e4-38172f7dc232. Current supervisor time: 1416185702. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1416185701, :storm-id "swhite-test-2014-11-14-13-58-21-3-1416002313", :executors #{[-1 -1]}, :port 6702}
2014-11-17 00:55:02 b.s.d.supervisor [INFO] Shutting down de55f9bf-d630-4e11-bc8c-9bbaa9899212:59b3b5d9-bd9c-4da7-82e4-38172f7dc232
2014-11-17 00:55:03 b.s.util [INFO] Error when trying to kill 5174. Process is probably already dead.
...
We're also seeing this in the worker logs when things are stalled out:
...
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:20 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [220]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:21 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [221]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:22 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [222]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:23 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [223]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:24 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [224]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:24 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [225]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:25 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [226]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:26 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [227]
/opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:27 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [228]
...
We've tried increasing the storm.zookeeper timeouts to 5 minutes:
storm.zookeeper.session.timeout: 300000
storm.zookeeper.connection.timeout: 300000
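(Those are set in storm.yaml on every node.) One thing we're not sure about: as far as we understand, the ZooKeeper server negotiates the session timeout and caps it at its maxSessionTimeout, which defaults to 20 x tickTime, i.e. 40 seconds, so our 5-minute request may not be fully taking effect. Would we also need something like the following in zoo.cfg? (Hypothetical; we have not set this on our server.)

# zoo.cfg -- raise the cap so clients can actually negotiate a 5-minute session
maxSessionTimeout=300000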
These settings allow the cluster to run much more stably, with good topology performance. But it still occasionally seems to stall out, and a manual rebalance request is required to get it rolling again (the exact command we use is shown below). All VMs have fixed DNS entries, and there don't seem to be any network issues at all.
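For reference, the manual kick is just the stock rebalance from the Storm CLI, along these lines (topology name elided; the wait time is arbitrary):

storm rebalance <topology-name> -w 60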
This behavior is pretty much identical to something reported in August:
http://mail-archives.apache.org/mod_mbox/storm-user/201408.mbox/%3CFF975F8738DEE84AABF1BF95A4248E6B6A8F5393%40mvexe1.integral.com%3E
but that thread never really got resolved.
Question 1: Is there any reason to think we'd need additional ZooKeeper servers? The one server we're using doesn't show any significant load.
Question 2: Is it normal to have to manually rebalance a running topology? It would probably work, but it seems like something's not right.
Question 3: What other clues should I be looking for?
Thanks in advance
-sheldon white-
[email protected]