[ 
https://issues.apache.org/jira/browse/STORM-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059503#comment-15059503
 ] 

Gergely Nagy commented on STORM-130:
------------------------------------

Hi folks. I'm experiencing the same bug in 0.10.0. In my case the supervisor 
survives, but the topology dies and gets rebalanced to another supervisor.

This is what I see in the supervisor log:
{code}
2015-12-16 12:56:27.746 b.s.u.Utils [INFO] Using defaults.yaml from resources
2015-12-16 12:56:27.784 b.s.u.Utils [INFO] Using storm.yaml from resources
2015-12-16 12:56:27.841 b.s.d.supervisor [INFO] Launching worker with command: 
'java' '-cp' 
'/home/fogetti/downloads/apache-storm-0.10.0/lib/kryo-2.21.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-over-slf4j-1.6.6.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-slf4j-impl-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/minlog-1.2.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/asm-4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-core-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/reflectasm-1.07-shaded.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-api-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/hadoop-auth-2.4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/slf4j-api-1.7.7.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/clojure-1.6.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/servlet-api-2.5.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/disruptor-2.10.4.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/storm-core-0.10.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/conf:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/stormjar.jar'
 '-Dlogfile.name=phish-storm-topology-2-1450237054-worker-6700.log' 
'-Dstorm.home=/home/fogetti/downloads/apache-storm-0.10.0' 
'-Dstorm.id=phish-storm-topology-2-1450237054' 
'-Dworker.id=d3819964-7671-447d-8763-827ab5cd6140' '-Dworker.port=6700' 
'-Dstorm.log.dir=/home/fogetti/downloads/apache-storm-0.10.0/logs' 
'-Dlog4j.configurationFile=/home/fogetti/downloads/apache-storm-0.10.0/log4j2/worker.xml'
 'backtype.storm.LogWriter' 'java' '-server' '-Xmx768m' 
'-Djava.library.path=/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/resources/Linux-amd64:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/resources:/usr'
 '-Dlogfile.name=phish-storm-topology-2-1450237054-worker-6700.log' 
'-Dstorm.home=/home/fogetti/downloads/apache-storm-0.10.0' '-Dstorm.conf.file=' 
'-Dstorm.options=' 
'-Dstorm.log.dir=/home/fogetti/downloads/apache-storm-0.10.0/logs' 
'-Dlogging.sensitivity=S3' 
'-Dlog4j.configurationFile=/home/fogetti/downloads/apache-storm-0.10.0/log4j2/worker.xml'
 '-Dstorm.id=phish-storm-topology-2-1450237054' 
'-Dworker.id=d3819964-7671-447d-8763-827ab5cd6140' '-Dworker.port=6700' '-cp' 
'/home/fogetti/downloads/apache-storm-0.10.0/lib/kryo-2.21.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-over-slf4j-1.6.6.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-slf4j-impl-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/minlog-1.2.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/asm-4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-core-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/reflectasm-1.07-shaded.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-api-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/hadoop-auth-2.4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/slf4j-api-1.7.7.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/clojure-1.6.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/servlet-api-2.5.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/disruptor-2.10.4.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/storm-core-0.10.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/conf:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/stormjar.jar'
 'backtype.storm.daemon.worker' 'phish-storm-topology-2-1450237054' 
'6c528751-1a10-4c33-bd54-a1ec9cb26d86' '6700' 
'd3819964-7671-447d-8763-827ab5cd6140'
2015-12-16 12:56:27.888 b.s.config [INFO] SET worker-user 
d3819964-7671-447d-8763-827ab5cd6140 
2015-12-16 12:56:27.948 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:28.169 b.s.d.supervisor [INFO] Removing code for storm id 
phish-storm-topology-2-1450237054
2015-12-16 12:56:28.454 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:28.956 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:29.457 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:29.957 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:30.458 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:30.960 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:31.463 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:31.964 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:32.465 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:32.966 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:33.467 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:33.968 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:34.469 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:34.970 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:35.471 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:35.972 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:36.473 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:36.974 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:37.076 b.s.d.supervisor [INFO] Worker Process 
d3819964-7671-447d-8763-827ab5cd6140 exited with code: 13
2015-12-16 12:56:37.475 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:37.976 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
...
2015-12-16 12:58:23.196 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:23.697 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:24.198 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:24.699 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:25.200 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:25.701 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:26.202 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:26.702 b.s.d.supervisor [INFO] 
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:27.203 b.s.d.supervisor [INFO] Worker 
d3819964-7671-447d-8763-827ab5cd6140 failed to start
2015-12-16 12:58:27.206 b.s.d.supervisor [INFO] Shutting down and clearing 
state for id d3819964-7671-447d-8763-827ab5cd6140. Current supervisor time: 
1450238307. State: :not-started, Heartbeat: nil
2015-12-16 12:58:27.207 b.s.d.supervisor [INFO] Shutting down 
6c528751-1a10-4c33-bd54-a1ec9cb26d86:d3819964-7671-447d-8763-827ab5cd6140
2015-12-16 12:58:27.208 b.s.config [INFO] GET worker-user 
d3819964-7671-447d-8763-827ab5cd6140
2015-12-16 12:58:27.259 b.s.util [INFO] Error when trying to kill 5076. Process 
is probably already dead.
2015-12-16 12:58:27.260 b.s.d.supervisor [INFO] Sleep 1 seconds for execution 
of cleanup threads on worker.
2015-12-16 12:58:28.266 b.s.util [INFO] Error when trying to kill 5076. Process 
is probably already dead.
2015-12-16 12:58:28.275 b.s.config [INFO] REMOVE worker-user 
d3819964-7671-447d-8763-827ab5cd6140
2015-12-16 12:58:28.276 b.s.d.supervisor [INFO] Shut down 
6c528751-1a10-4c33-bd54-a1ec9cb26d86:d3819964-7671-447d-8763-827ab5cd6140
{code}

And this is what I see in the worker log:
{code}
2015-12-16 12:56:36.702 b.s.u.Utils [INFO] Using defaults.yaml from resources
2015-12-16 12:56:36.769 b.s.u.Utils [INFO] Using storm.yaml from resources
2015-12-16 12:56:36.921 b.s.d.worker [INFO] Launching worker for 
phish-storm-topology-2-1450237054 on 6c528751-1a10-4c33-bd54-a1ec9cb26d86:6700 
with id d3819964-7671-447d-8763-827ab5cd6140 and conf 
{"topology.builtin.metrics.bucket.size.secs" 60, "nimbus.childopts" 
"-Xmx1024m", "ui.filter.params" nil, "storm.cluster.mode" "distributed", 
"storm.messaging.netty.client_worker_threads" 1, 
"supervisor.run.worker.as.user" false, "topology.max.task.parallelism" nil, 
"zmq.threads" 1, "storm.group.mapping.service" 
"backtype.storm.security.auth.ShellBasedGroupsMapping", 
"transactional.zookeeper.root" "/transactional", 
"topology.sleep.spout.wait.strategy.time.ms" 1, "drpc.invocations.port" 3773, 
"topology.multilang.serializer" "backtype.storm.multilang.JsonSerializer", 
"storm.messaging.netty.server_worker_threads" 1, 
"topology.max.error.report.per.interval" 5, "storm.thrift.transport" 
"backtype.storm.security.auth.SimpleTransportPlugin", "zmq.hwm" 0, 
"storm.principal.tolocal" 
"backtype.storm.security.auth.DefaultPrincipalToLocal", 
"supervisor.worker.shutdown.sleep.secs" 1, "storm.zookeeper.retry.times" 5, 
"ui.actions.enabled" true, "zmq.linger.millis" 5000, "supervisor.enable" true, 
"topology.stats.sample.rate" 0.05, "storm.messaging.netty.min_wait_ms" 100, 
"storm.zookeeper.port" 2181, "supervisor.heartbeat.frequency.secs" 5, 
"topology.enable.message.timeouts" true, "drpc.worker.threads" 64, 
"drpc.queue.size" 128, "drpc.https.keystore.password" "", "logviewer.port" 
8000, "nimbus.reassign" true, "topology.executor.send.buffer.size" 1024, 
"topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy", 
"ui.host" "0.0.0.0", "storm.nimbus.retry.interval.millis" 2000, 
"nimbus.inbox.jar.expiration.secs" 3600, "dev.zookeeper.path" 
"/tmp/dev-storm-zookeeper", "topology.acker.executors" nil, 
"topology.fall.back.on.java.serialization" true, "storm.zookeeper.servers" 
["dimebag" "petrucci" "hetfield"], "nimbus.thrift.threads" 64, 
"logviewer.cleanup.age.mins" 10080, "topology.worker.childopts" nil, 
"topology.classpath" nil, "supervisor.monitor.frequency.secs" 3, 
"nimbus.credential.renewers.freq.secs" 600, 
"topology.skip.missing.kryo.registrations" false, 
"drpc.authorizer.acl.filename" "drpc-auth-acl.yaml", 
"storm.group.mapping.service.cache.duration.secs" 120, 
"topology.testing.always.try.serialize" false, "nimbus.monitor.freq.secs" 10, 
"supervisor.supervisors" [], "topology.tasks" nil, 
"topology.bolts.outgoing.overflow.buffer.enable" false, 
"storm.messaging.netty.socket.backlog" 500, "topology.workers" 1, 
"storm.local.dir" "/home/fogetti/downloads/apache-storm-0.10.0/storm-local", 
"worker.childopts" "-Xmx768m", "storm.auth.simple-white-list.users" [], 
"topology.message.timeout.secs" 30, 
"topology.state.synchronization.timeout.secs" 60, "topology.tuple.serializer" 
"backtype.storm.serialization.types.ListDelegateSerializer", 
"supervisor.supervisors.commands" [], "logviewer.childopts" "-Xmx128m", 
"topology.environment" nil, "topology.debug" false, 
"storm.messaging.netty.max_retries" 300, "ui.childopts" "-Xmx768m", 
"storm.zookeeper.session.timeout" 20000, "drpc.childopts" "-Xmx768m", 
"drpc.http.creds.plugin" 
"backtype.storm.security.auth.DefaultHttpCredentialsPlugin", 
"storm.zookeeper.connection.timeout" 15000, "storm.zookeeper.auth.user" nil, 
"storm.meta.serialization.delegate" 
"backtype.storm.serialization.GzipThriftSerializationDelegate", 
"topology.max.spout.pending" nil, "nimbus.supervisor.timeout.secs" 60, 
"nimbus.task.timeout.secs" 30, "drpc.port" 3772, 
"storm.zookeeper.retry.intervalceiling.millis" 30000, "nimbus.thrift.port" 
6627, "storm.auth.simple-acl.admins" [], "storm.nimbus.retry.times" 5, 
"supervisor.worker.start.timeout.secs" 120, "storm.zookeeper.retry.interval" 
1000, "logs.users" nil, "transactional.zookeeper.port" nil, 
"drpc.max_buffer_size" 1048576, "task.credentials.poll.secs" 30, 
"drpc.https.keystore.type" "JKS", "topology.worker.receiver.thread.count" 1, 
"supervisor.slots.ports" [6700 6701], "topology.transfer.buffer.size" 1024, 
"topology.worker.shared.thread.pool.size" 4, "drpc.authorizer.acl.strict" 
false, "nimbus.file.copy.expiration.secs" 600, 
"topology.executor.receive.buffer.size" 1024, "nimbus.task.launch.secs" 120, 
"storm.local.mode.zmq" false, "storm.messaging.netty.buffer_size" 5242880, 
"worker.heartbeat.frequency.secs" 1, "ui.http.creds.plugin" 
"backtype.storm.security.auth.DefaultHttpCredentialsPlugin", 
"storm.zookeeper.root" "/storm", "topology.tick.tuple.freq.secs" nil, 
"drpc.https.port" -1, "task.refresh.poll.secs" 10, 
"task.heartbeat.frequency.secs" 3, "storm.messaging.netty.max_wait_ms" 1000, 
"nimbus.impersonation.authorizer" 
"backtype.storm.security.auth.authorizer.ImpersonationAuthorizer", 
"drpc.http.port" 3774, "topology.error.throttle.interval.secs" 10, 
"storm.messaging.transport" "backtype.storm.messaging.netty.Context", 
"storm.messaging.netty.authentication" false, "topology.kryo.factory" 
"backtype.storm.serialization.DefaultKryoFactory", "worker.gc.childopts" "", 
"nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator", 
"nimbus.cleanup.inbox.freq.secs" 600, "ui.users" nil, 
"transactional.zookeeper.servers" nil, "supervisor.worker.timeout.secs" 30, 
"storm.zookeeper.auth.password" nil, "supervisor.childopts" "-Xmx256m", 
"ui.filter" nil, "ui.header.buffer.bytes" 4096, 
"topology.disruptor.wait.timeout.millis" 1000, 
"storm.nimbus.retry.intervalceiling.millis" 60000, 
"topology.trident.batch.emit.interval.millis" 500, 
"topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy", 
"storm.auth.simple-acl.users" [], "drpc.invocations.threads" 64, 
"java.library.path" "/usr", "ui.port" 8080, 
"storm.messaging.netty.transfer.batch.size" 262144, "logviewer.appender.name" 
"A1", "nimbus.thrift.max_buffer_size" 1048576, "nimbus.host" "dimebag", 
"storm.auth.simple-acl.users.commands" [], "drpc.request.timeout.secs" 600}
2015-12-16 12:56:36.947 b.s.util [DEBUG] Touching file at 
/home/fogetti/downloads/apache-storm-0.10.0/storm-local/workers/d3819964-7671-447d-8763-827ab5cd6140/pids/5076
2015-12-16 12:56:36.962 b.s.d.worker [ERROR] Error on initialization of server 
mk-worker
java.io.FileNotFoundException: File 
'/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/stormconf.ser'
 does not exist
        at 
org.apache.storm.shade.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299)
 ~[storm-core-0.10.0.jar:0.10.0]
        at 
org.apache.storm.shade.org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763)
 ~[storm-core-0.10.0.jar:0.10.0]
        at 
backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:222) 
~[storm-core-0.10.0.jar:0.10.0]
        at 
backtype.storm.daemon.worker$fn__7098$exec_fn__1236__auto____7099.invoke(worker.clj:418)
 ~[storm-core-0.10.0.jar:0.10.0]
        at clojure.lang.AFn.applyToHelper(AFn.java:178) ~[clojure-1.6.0.jar:?]
        at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.6.0.jar:?]
        at clojure.core$apply.invoke(core.clj:624) ~[clojure-1.6.0.jar:?]
        at 
backtype.storm.daemon.worker$fn__7098$mk_worker__7175.doInvoke(worker.clj:409) 
[storm-core-0.10.0.jar:0.10.0]
        at clojure.lang.RestFn.invoke(RestFn.java:512) [clojure-1.6.0.jar:?]
        at backtype.storm.daemon.worker$_main.invoke(worker.clj:542) 
[storm-core-0.10.0.jar:0.10.0]
        at clojure.lang.AFn.applyToHelper(AFn.java:165) [clojure-1.6.0.jar:?]
        at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.6.0.jar:?]
        at backtype.storm.daemon.worker.main(Unknown Source) 
[storm-core-0.10.0.jar:0.10.0]
2015-12-16 12:56:36.984 b.s.util [ERROR] Halting process: ("Error on 
initialization")
java.lang.RuntimeException: ("Error on initialization")
        at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:336) 
[storm-core-0.10.0.jar:0.10.0]
        at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.6.0.jar:?]
        at 
backtype.storm.daemon.worker$fn__7098$mk_worker__7175.doInvoke(worker.clj:409) 
[storm-core-0.10.0.jar:0.10.0]
        at clojure.lang.RestFn.invoke(RestFn.java:512) [clojure-1.6.0.jar:?]
        at backtype.storm.daemon.worker$_main.invoke(worker.clj:542) 
[storm-core-0.10.0.jar:0.10.0]
        at clojure.lang.AFn.applyToHelper(AFn.java:165) [clojure-1.6.0.jar:?]
        at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.6.0.jar:?]
        at backtype.storm.daemon.worker.main(Unknown Source) 
[storm-core-0.10.0.jar:0.10.0]
{code}

And this is what I see in the nimbus log:
{code}
2015-12-16 12:52:26.929 b.s.d.nimbus [INFO] Delaying event :do-rebalance for 10 
secs for phish-storm-topology-2-1450237054
2015-12-16 12:52:27.257 b.s.d.nimbus [INFO] [req 29] Access from:  principal: 
op:getTopologyConf
2015-12-16 12:52:27.290 b.s.d.nimbus [INFO] [req 28] Access from:  principal: 
op:getTopologyInfo
2015-12-16 12:52:27.376 b.s.d.nimbus [INFO] [req 34] Access from:  principal: 
op:getTopology
2015-12-16 12:52:27.378 b.s.d.nimbus [INFO] [req 30] Access from:  principal: 
op:getTopologyConf
2015-12-16 12:52:28.505 b.s.d.nimbus [INFO] [req 32] Access from:  principal: 
op:getClusterInfo
2015-12-16 12:52:28.510 b.s.d.nimbus [INFO] [req 31] Access from:  principal: 
op:getClusterInfo
2015-12-16 12:52:28.524 b.s.d.nimbus [INFO] [req 38] Access from:  principal: 
op:getClusterInfo
2015-12-16 12:52:28.529 b.s.d.nimbus [INFO] [req 33] Access from:  principal: 
op:getNimbusConf
2015-12-16 12:52:37.055 b.s.s.EvenScheduler [INFO] Available slots: 
(["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701] 
["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700] ["6c528751-1a10-4c33-bd54-a1
ec9cb26d86" 6701])
2015-12-16 12:52:37.057 b.s.d.nimbus [INFO] Setting new assignment for topology 
id phish-storm-topology-2-1450237054: 
#backtype.storm.daemon.common.Assignment{:master-code-dir 
"/home/fogetti/downloads/apache-storm-0.10.0/storm-local/ni
mbus/stormdist/phish-storm-topology-2-1450237054", :node->host 
{"6c528751-1a10-4c33-bd54-a1ec9cb26d86" "petrucci", 
"2c80eadd-a7bd-4470-9ba5-d17be81c94ae" "hetfield"}, :executor->node+port {[8 8] 
["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 
6700], [7 7] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [5 6] 
["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700], [3 4] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [9 10] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [2 2] 
["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700], [1 1] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700]}, :executor->start-time-secs {[9 
10] 1450237957, [7 7] 1450237957, [3 4] 1450237957, [1 1] 1450237957, [8 8] 
1450237957, [5 6] 1450237957, [2 2] 1450237957}}
...
2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor 
phish-storm-topology-2-1450237054:[8 8] not alive
2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor 
phish-storm-topology-2-1450237054:[2 2] not alive
2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor 
phish-storm-topology-2-1450237054:[5 6] not alive
2015-12-16 12:54:41.457 b.s.s.EvenScheduler [INFO] Available slots: 
(["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701] 
["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6701])
2015-12-16 12:54:41.457 b.s.d.nimbus [INFO] Reassigning 
phish-storm-topology-2-1450237054 to 2 slots
2015-12-16 12:54:41.457 b.s.d.nimbus [INFO] Reassign executors: [[8 8] [5 6] [2 
2]]
2015-12-16 12:54:41.462 b.s.d.nimbus [INFO] Setting new assignment for topology 
id phish-storm-topology-2-1450237054: 
#backtype.storm.daemon.common.Assignment{:master-code-dir 
"/home/fogetti/downloads/apache-storm-0.10.0/storm-local/nimbus/stormdist/phish-storm-topology-2-1450237054",
 :node->host {"2c80eadd-a7bd-4470-9ba5-d17be81c94ae" "hetfield"}, 
:executor->node+port {[8 8] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701], [7 
7] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [5 6] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701], [3 4] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [9 10] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [2 2] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701], [1 1] 
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700]}, :executor->start-time-secs {[7 
7] 1450237957, [1 1] 1450237957, [8 8] 1450238081, [2 2] 1450238081, [9 10] 
1450237957, [3 4] 1450237957, [5 6] 1450238081}}
{code}

Hope it helps.

> [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does 
> not exist
> ------------------------------------------------------------------------------------
>
>                 Key: STORM-130
>                 URL: https://issues.apache.org/jira/browse/STORM-130
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 0.9.2-incubating, 0.9.3, 0.9.4
>            Reporter: James Xu
>            Assignee: Sriharsha Chintalapani
>            Priority: Minor
>             Fix For: 0.10.0, 0.9.5
>
>         Attachments: README.txt, nimbus.log.gz, supervisor.log.2015-08-26, 
> supervisor_logs.tar.gz, worker_logs.tar.gz, 
> worker_logs_of_kafka_traffic.tar.gz, 
> worker_logs_of_zookeeper_traffic_2015-04-11.tar.gz, 
> worker_logs_of_zookeeper_traffic_2015-04-12.tar.gz, 
> worker_logs_of_zookeeper_traffic_2015-04-13.tar.gz, 
> workers_with_stormconf.ser.gz
>
>
> https://github.com/nathanmarz/storm/issues/438
> Hi developers,
> We hit a critical issue deploying a storm topology to our prod cluster.
> After deploying the topology we got this trace on the workers (Storm 
> 0.8.2/zookeeper-3.3.6):
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, 
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181
>  sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server 
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to 
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on 
> server zookeeper1.company.com/10.72.209.112:2181, sessionid = 
> 0x13b3e4b5c780239, negotiated timeout = 20000
> 2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none
> 2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed
> 2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down
> 2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] Starting
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, 
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm
>  sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server 
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to 
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on 
> server zookeeper1.company.com/10.72.209.112:2181, sessionid = 
> 0x13b3e4b5c78023a, negotiated timeout = 20000
> 2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker
> java.io.FileNotFoundException: File 
> '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser'
>  does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
> at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
> at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
> at 
> backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:332)
> at clojure.lang.AFn.applyToHelper(AFn.java:185)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at clojure.core$apply.invoke(core.clj:601)
> at 
> backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
> at clojure.lang.RestFn.invoke(RestFn.java:512)
> at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
> at clojure.lang.AFn.applyToHelper(AFn.java:172)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at backtype.storm.daemon.worker.main(Unknown Source)
> 2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization")
> Supervisor trace:
> 2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 
> still hasn't started
> 2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 
> still hasn't started
> .......
> 2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 
> still hasn't started
> 2013-01-14 10:59:35 supervisor [INFO] Worker 
> d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Worker 
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id 
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175. 
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down 
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is 
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down 
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id 
> d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175. 
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down 
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is 
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down 
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> Thanks!
> Oleg M.
> ----------
> xiaokang: We also encountered this problem, and the DEBUG log showed that 
> storm-code-map was null while new-assignment was not null. It may be that 
> new-assignment is read from ZK after storm-code-map. So we moved the 
> storm-code-map read to after all-assignment and the problem disappeared.
> (defn mk-synchronize-supervisor [supervisor sync-processes event-manager
>                                  processes-event-manager]
>   (fn this []
>     (let [conf (:conf supervisor)
>           storm-cluster-state (:storm-cluster-state supervisor)
>           ^ISupervisor isupervisor (:isupervisor supervisor)
>           ^LocalState local-state (:local-state supervisor)
>           sync-callback (fn [& ignored] (.add event-manager this))
>           storm-code-map (read-storm-code-locations storm-cluster-state
>                                                     sync-callback)
>           downloaded-storm-ids (set (read-downloaded-storm-ids conf))
>           all-assignment (read-assignments
>                            storm-cluster-state
>                            (:supervisor-id supervisor)
>                            sync-callback)
>           new-assignment (->> all-assignment
>                               (filter-key #(.confirmAssigned isupervisor %)))
>           assigned-storm-ids (assigned-storm-ids-from-port-assignments
>                                new-assignment)
>           existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
>       (log-debug "Synchronizing supervisor")
>       (log-debug "Storm code map: " storm-code-map)
>       (log-debug "Downloaded storm ids: " downloaded-storm-ids)
>       (log-debug "All assignment: " all-assignment)
>       (log-debug "New assignment: " new-assignment)
> ----------
> NJtwentyone: I ran into a similar problem once; haven't had it lately. I was 
> going to investigate more before creating a post, but here is what I have so 
> far. There are two schedule-recurring calls of interest in the function 
> (mk-supervisor):
> (schedule-recurring (:timer supervisor) 0 10 (fn ))
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn )))
> So function-1 (synchronize-supervisor) will eventually remove that directory 
> (holding [../stormconf.ser]), and function-2 (sync-processes) will eventually 
> hang trying to create workers, repeatedly logging:
> 2013-01-13 22:13:13 b.s.d.supervisor [INFO] 
> 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
> I'm using the distributed setup (Storm 0.8.2/zookeeper-3.4.5) on my Mac and an 
> HP box, running storm.starter.WordCountTopology.
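> A minimal sketch of that interaction, with plain JDK scheduling standing in 
> for Storm's timer (the names and the file path below are illustrative, not 
> Storm internals): function-1 periodically removes the topology files while 
> function-2 keeps polling for a worker that can no longer start.
> (import '[java.io File]
>         '[java.util.concurrent Executors TimeUnit])
>
> (def stormconf (File. "/tmp/timer-demo/stormconf.ser")) ; illustrative path
> (def pool (Executors/newScheduledThreadPool 2))
>
> ;; function-1 analogue (synchronize-supervisor, every 10s): removes code for
> ;; topologies it believes are no longer assigned to this supervisor.
> (.scheduleAtFixedRate pool #(.delete stormconf) 0 10 TimeUnit/SECONDS)
>
> ;; function-2 analogue (sync-processes, run every
> ;; SUPERVISOR-MONITOR-FREQUENCY-SECS, 3s in the conf dump above): polls for
> ;; a worker that cannot come up once stormconf.ser is gone.
> (.scheduleAtFixedRate pool
>                       #(when-not (.exists stormconf)
>                          (println "worker still hasn't started"))
>                       0 3 TimeUnit/SECONDS)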
> ----------
> miggi: Well, this problem was resolved only after installing a new instance of 
> ZK (3.3.6). We had other issues deploying topologies on the 0.8.1 version, so 
> it was decided to use a new ZK instance. 
> Btw: removing the storm data folders didn't help.
> ----------
> nathanmarz: For those of you hitting the problem, it would be really useful 
> if you could turn on DEBUG logging on the supervisor and show those logs when 
> the problem is hit.
> ----------
> miggi: After a few months of stable operation, we ran into the same issue 
> again.
> Logs ("topology.debug: true"):
> Supervisor logs: http://cl.ly/0p2W3N0A2V3S
> Worker logs: http://cl.ly/2w1j1r0a3X46
> Cleaning/restarting the ZK/Nimbus/Worker components didn't help. 
> We're down now: we haven't tried the solution proposed by xiaokang, but it 
> will probably be our next step.
> Any fresh ideas?
> ----------
> devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing 
> out the Storm state in ZK, wiping out the Storm data directory, and restarting 
> the topology, but it happens again in time.
> Storm 0.9.0-wip16
> Zk 3.3.6
> ----------
> d2r: +1, seen this happen many times, with the same work-around.
> Most recently on Storm 0.8.2-wip22.
> I'll try to collect DEBUG logs from the supervisor and have a look.
> ----------
> miggi: Hey guys, 
> try restarting your cluster and ensure that all previous workers (java 
> processes) were killed by this operation. 
> We've also updated the cluster to storm 0.8.3-wip1, which has a possible fix.
> Hope it helps. 
> --
> Oleg M.
> ----------
> devoncrouse: A simple restart doesn't cut it; we have to totally clear out all 
> Storm state in ZK and the data directory (otherwise the cluster quickly 
> becomes unbalanced again with the same errors in the logs) - and again, we're 
> on 0.9.0-wip16.
> ----------
> xiaokang: We encountered the same problem in a production cluster. The log 
> showed that a worker died and the supervisor and nimbus both noticed it at the 
> same time. Then nimbus assigned the worker to another host, and the 
> supervisor's sync-supervisor deleted the topology's code while sync-processes 
> was trying to restart the dead, timed-out worker.
> Our solution was to change the supervisor's worker timeout from 30 to 5 
> seconds, with nimbus remaining at 30. It works.
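> For reference, the two settings involved would look like this in storm.yaml 
> (the key names match the conf dump in the worker log above; the values follow 
> this work-around):
> supervisor.worker.timeout.secs: 5   # down from the 30-second default
> nimbus.task.timeout.secs: 30        # nimbus side left at its default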
> ----------
> d2r: We have tried the same work-around from @xiaokang, but we still see the 
> issue.
> ----------
> nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let 
> me know if it fixes the issue for you.
> ----------
> d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did 
> resolve the issue for us. Thanks!
> ----------
> devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0) 
> as well?
> d5207b5
> ---------
> viceyang: We encountered the same problem; our version is 0.8.3. Having read 
> the supervisor code, it seems this is not strictly a bug: the key reason is 
> that "synchronize-supervisor", the thread responsible for downloading and 
> removing topology files, and "sync-processes", the thread responsible for 
> starting worker processes, run asynchronously. 
> Consider this case: synchronize-supervisor reads the assignment information 
> from ZK, downloads the necessary files from nimbus, and writes local state. In 
> another thread, the sync-processes function reads that local state to launch 
> the worker process. If, before the worker process has started, 
> synchronize-supervisor runs again and the topology's assignment has changed 
> (caused by a rebalance, a worker timeout, etc.) so that the worker assigned to 
> this supervisor has moved to another supervisor, then synchronize-supervisor 
> removes the now-unnecessary files (the jar file, the ser file, etc.). After 
> that, the worker launched by sync-processes finds the ser file does not exist, 
> and this issue occurs. 
> So although an exception occurs here, the logic seems right; the exception 
> cannot be completely eliminated unless the asynchronous threading is changed. 
> There are ways to reduce how often the exception occurs: @xiaokang mentioned 
> changing the supervisor's worker timeout from 30 to 5 seconds while nimbus 
> remains at 30, which reduces nimbus reassignments of the topology; another way 
> is to change the "synchronize-supervisor" thread's loop time to something 
> longer than the 10-second default, such as 30 seconds.
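> To make the interleaving concrete, here is a minimal demo sketch (hypothetical 
> code with an illustrative path, not Storm's implementation): a "sync-processes" 
> stand-in launches a worker that reads stormconf.ser after a delay, while a 
> "synchronize-supervisor" stand-in removes the topology directory first, so the 
> read fails exactly as in the worker logs above.
> (import '[java.io File])
>
> (def storm-dist (File. "/tmp/race-demo/stormdist/topo-1")) ; illustrative path
>
> (defn prepare! []
>   (.mkdirs storm-dist)
>   (spit (File. storm-dist "stormconf.ser") "serialized-conf"))
>
> (defn launch-worker []
>   ;; stands in for the slow worker start: the conf read happens late
>   (Thread/sleep 100)
>   (slurp (File. storm-dist "stormconf.ser"))) ; FileNotFoundException here
>
> (defn remove-code []
>   ;; stands in for "Removing code for storm id ..." after reassignment;
>   ;; reversing file-seq deletes children before the directory itself
>   (doseq [f (reverse (file-seq storm-dist))]
>     (.delete f)))
>
> (defn -main [& _]
>   (prepare!)
>   (let [worker (future (launch-worker))]
>     (remove-code)   ; the removal wins the race
>     @worker))       ; deref rethrows the worker's exception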


