[ https://issues.apache.org/jira/browse/STORM-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059503#comment-15059503 ]
Gergely Nagy commented on STORM-130: ------------------------------------ Hi folks. I experience the same bug in 0.10.0. In my case the supervisor survives but the topology dies and it's getting rebalanced to another supervisor. This is what I see in the supervisor log: {code} 2015-12-16 12:56:27.746 b.s.u.Utils [INFO] Using defaults.yaml from resources 2015-12-16 12:56:27.784 b.s.u.Utils [INFO] Using storm.yaml from resources 2015-12-16 12:56:27.841 b.s.d.supervisor [INFO] Launching worker with command: 'java' '-cp' '/home/fogetti/downloads/apache-storm-0.10.0/lib/kryo-2.21.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-over-slf4j-1.6.6.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-slf4j-impl-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/minlog-1.2.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/asm-4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-core-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/reflectasm-1.07-shaded.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-api-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/hadoop-auth-2.4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/slf4j-api-1.7.7.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/clojure-1.6.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/servlet-api-2.5.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/disruptor-2.10.4.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/storm-core-0.10.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/conf:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/stormjar.jar' '-Dlogfile.name=phish-storm-topology-2-1450237054-worker-6700.log' '-Dstorm.home=/home/fogetti/downloads/apache-storm-0.10.0' '-Dstorm.id=phish-storm-topology-2-1450237054' '-Dworker.id=d3819964-7671-447d-8763-827ab5cd6140' '-Dworker.port=6700' '-Dstorm.log.dir=/home/fogetti/downloads/apache-storm-0.10.0/logs' 
'-Dlog4j.configurationFile=/home/fogetti/downloads/apache-storm-0.10.0/log4j2/worker.xml' 'backtype.storm.LogWriter' 'java' '-server' '-Xmx768m' '-Djava.library.path=/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/resources/Linux-amd64:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/resources:/usr' '-Dlogfile.name=phish-storm-topology-2-1450237054-worker-6700.log' '-Dstorm.home=/home/fogetti/downloads/apache-storm-0.10.0' '-Dstorm.conf.file=' '-Dstorm.options=' '-Dstorm.log.dir=/home/fogetti/downloads/apache-storm-0.10.0/logs' '-Dlogging.sensitivity=S3' '-Dlog4j.configurationFile=/home/fogetti/downloads/apache-storm-0.10.0/log4j2/worker.xml' '-Dstorm.id=phish-storm-topology-2-1450237054' '-Dworker.id=d3819964-7671-447d-8763-827ab5cd6140' '-Dworker.port=6700' '-cp' '/home/fogetti/downloads/apache-storm-0.10.0/lib/kryo-2.21.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-over-slf4j-1.6.6.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-slf4j-impl-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/minlog-1.2.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/asm-4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-core-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/reflectasm-1.07-shaded.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-api-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/hadoop-auth-2.4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/slf4j-api-1.7.7.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/clojure-1.6.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/servlet-api-2.5.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/disruptor-2.10.4.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/storm-core-0.10.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/conf:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology
-2-1450237054/stormjar.jar' 'backtype.storm.daemon.worker' 'phish-storm-topology-2-1450237054' '6c528751-1a10-4c33-bd54-a1ec9cb26d86' '6700' 'd3819964-7671-447d-8763-827ab5cd6140' 2015-12-16 12:56:27.888 b.s.config [INFO] SET worker-user d3819964-7671-447d-8763-827ab5cd6140 2015-12-16 12:56:27.948 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:28.169 b.s.d.supervisor [INFO] Removing code for storm id phish-storm-topology-2-1450237054 2015-12-16 12:56:28.454 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:28.956 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:29.457 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:29.957 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:30.458 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:30.960 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:31.463 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:31.964 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:32.465 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:32.966 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:33.467 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:33.968 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:34.469 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:34.970 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:35.471 
b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:35.972 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:36.473 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:36.974 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:37.076 b.s.d.supervisor [INFO] Worker Process d3819964-7671-447d-8763-827ab5cd6140 exited with code: 13 2015-12-16 12:56:37.475 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:56:37.976 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started ... 2015-12-16 12:58:23.196 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:58:23.697 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:58:24.198 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:58:24.699 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:58:25.200 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:58:25.701 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:58:26.202 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:58:26.702 b.s.d.supervisor [INFO] d3819964-7671-447d-8763-827ab5cd6140 still hasn't started 2015-12-16 12:58:27.203 b.s.d.supervisor [INFO] Worker d3819964-7671-447d-8763-827ab5cd6140 failed to start 2015-12-16 12:58:27.206 b.s.d.supervisor [INFO] Shutting down and clearing state for id d3819964-7671-447d-8763-827ab5cd6140. Current supervisor time: 1450238307. 
State: :not-started, Heartbeat: nil 2015-12-16 12:58:27.207 b.s.d.supervisor [INFO] Shutting down 6c528751-1a10-4c33-bd54-a1ec9cb26d86:d3819964-7671-447d-8763-827ab5cd6140 2015-12-16 12:58:27.208 b.s.config [INFO] GET worker-user d3819964-7671-447d-8763-827ab5cd6140 2015-12-16 12:58:27.259 b.s.util [INFO] Error when trying to kill 5076. Process is probably already dead. 2015-12-16 12:58:27.260 b.s.d.supervisor [INFO] Sleep 1 seconds for execution of cleanup threads on worker. 2015-12-16 12:58:28.266 b.s.util [INFO] Error when trying to kill 5076. Process is probably already dead. 2015-12-16 12:58:28.275 b.s.config [INFO] REMOVE worker-user d3819964-7671-447d-8763-827ab5cd6140 2015-12-16 12:58:28.276 b.s.d.supervisor [INFO] Shut down 6c528751-1a10-4c33-bd54-a1ec9cb26d86:d3819964-7671-447d-8763-827ab5cd6140 {code} And this is what I see in the worker log: {code} 2015-12-16 12:56:36.702 b.s.u.Utils [INFO] Using defaults.yaml from resources 2015-12-16 12:56:36.769 b.s.u.Utils [INFO] Using storm.yaml from resources 2015-12-16 12:56:36.921 b.s.d.worker [INFO] Launching worker for phish-storm-topology-2-1450237054 on 6c528751-1a10-4c33-bd54-a1ec9cb26d86:6700 with id d3819964-7671-447d-8763-827ab5cd6140 and conf {"topology.builtin.metrics.bucket.size.secs" 60, "nimbus.childopts" "-Xmx1024m", "ui.filter.params" nil, "storm.cluster.mode" "distributed", "storm.messaging.netty.client_worker_threads" 1, "supervisor.run.worker.as.user" false, "topology.max.task.parallelism" nil, "zmq.threads" 1, "storm.group.mapping.service" "backtype.storm.security.auth.ShellBasedGroupsMapping", "transactional.zookeeper.root" "/transactional", "topology.sleep.spout.wait.strategy.time.ms" 1, "drpc.invocations.port" 3773, "topology.multilang.serializer" "backtype.storm.multilang.JsonSerializer", "storm.messaging.netty.server_worker_threads" 1, "topology.max.error.report.per.interval" 5, "storm.thrift.transport" "backtype.storm.security.auth.SimpleTransportPlugin", "zmq.hwm" 0, 
"storm.principal.tolocal" "backtype.storm.security.auth.DefaultPrincipalToLocal", "supervisor.worker.shutdown.sleep.secs" 1, "storm.zookeeper.retry.times" 5, "ui.actions.enabled" true, "zmq.linger.millis" 5000, "supervisor.enable" true, "topology.stats.sample.rate" 0.05, "storm.messaging.netty.min_wait_ms" 100, "storm.zookeeper.port" 2181, "supervisor.heartbeat.frequency.secs" 5, "topology.enable.message.timeouts" true, "drpc.worker.threads" 64, "drpc.queue.size" 128, "drpc.https.keystore.password" "", "logviewer.port" 8000, "nimbus.reassign" true, "topology.executor.send.buffer.size" 1024, "topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy", "ui.host" "0.0.0.0", "storm.nimbus.retry.interval.millis" 2000, "nimbus.inbox.jar.expiration.secs" 3600, "dev.zookeeper.path" "/tmp/dev-storm-zookeeper", "topology.acker.executors" nil, "topology.fall.back.on.java.serialization" true, "storm.zookeeper.servers" ["dimebag" "petrucci" "hetfield"], "nimbus.thrift.threads" 64, "logviewer.cleanup.age.mins" 10080, "topology.worker.childopts" nil, "topology.classpath" nil, "supervisor.monitor.frequency.secs" 3, "nimbus.credential.renewers.freq.secs" 600, "topology.skip.missing.kryo.registrations" false, "drpc.authorizer.acl.filename" "drpc-auth-acl.yaml", "storm.group.mapping.service.cache.duration.secs" 120, "topology.testing.always.try.serialize" false, "nimbus.monitor.freq.secs" 10, "supervisor.supervisors" [], "topology.tasks" nil, "topology.bolts.outgoing.overflow.buffer.enable" false, "storm.messaging.netty.socket.backlog" 500, "topology.workers" 1, "storm.local.dir" "/home/fogetti/downloads/apache-storm-0.10.0/storm-local", "worker.childopts" "-Xmx768m", "storm.auth.simple-white-list.users" [], "topology.message.timeout.secs" 30, "topology.state.synchronization.timeout.secs" 60, "topology.tuple.serializer" "backtype.storm.serialization.types.ListDelegateSerializer", "supervisor.supervisors.commands" [], "logviewer.childopts" "-Xmx128m", 
"topology.environment" nil, "topology.debug" false, "storm.messaging.netty.max_retries" 300, "ui.childopts" "-Xmx768m", "storm.zookeeper.session.timeout" 20000, "drpc.childopts" "-Xmx768m", "drpc.http.creds.plugin" "backtype.storm.security.auth.DefaultHttpCredentialsPlugin", "storm.zookeeper.connection.timeout" 15000, "storm.zookeeper.auth.user" nil, "storm.meta.serialization.delegate" "backtype.storm.serialization.GzipThriftSerializationDelegate", "topology.max.spout.pending" nil, "nimbus.supervisor.timeout.secs" 60, "nimbus.task.timeout.secs" 30, "drpc.port" 3772, "storm.zookeeper.retry.intervalceiling.millis" 30000, "nimbus.thrift.port" 6627, "storm.auth.simple-acl.admins" [], "storm.nimbus.retry.times" 5, "supervisor.worker.start.timeout.secs" 120, "storm.zookeeper.retry.interval" 1000, "logs.users" nil, "transactional.zookeeper.port" nil, "drpc.max_buffer_size" 1048576, "task.credentials.poll.secs" 30, "drpc.https.keystore.type" "JKS", "topology.worker.receiver.thread.count" 1, "supervisor.slots.ports" [6700 6701], "topology.transfer.buffer.size" 1024, "topology.worker.shared.thread.pool.size" 4, "drpc.authorizer.acl.strict" false, "nimbus.file.copy.expiration.secs" 600, "topology.executor.receive.buffer.size" 1024, "nimbus.task.launch.secs" 120, "storm.local.mode.zmq" false, "storm.messaging.netty.buffer_size" 5242880, "worker.heartbeat.frequency.secs" 1, "ui.http.creds.plugin" "backtype.storm.security.auth.DefaultHttpCredentialsPlugin", "storm.zookeeper.root" "/storm", "topology.tick.tuple.freq.secs" nil, "drpc.https.port" -1, "task.refresh.poll.secs" 10, "task.heartbeat.frequency.secs" 3, "storm.messaging.netty.max_wait_ms" 1000, "nimbus.impersonation.authorizer" "backtype.storm.security.auth.authorizer.ImpersonationAuthorizer", "drpc.http.port" 3774, "topology.error.throttle.interval.secs" 10, "storm.messaging.transport" "backtype.storm.messaging.netty.Context", "storm.messaging.netty.authentication" false, "topology.kryo.factory" 
"backtype.storm.serialization.DefaultKryoFactory", "worker.gc.childopts" "", "nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator", "nimbus.cleanup.inbox.freq.secs" 600, "ui.users" nil, "transactional.zookeeper.servers" nil, "supervisor.worker.timeout.secs" 30, "storm.zookeeper.auth.password" nil, "supervisor.childopts" "-Xmx256m", "ui.filter" nil, "ui.header.buffer.bytes" 4096, "topology.disruptor.wait.timeout.millis" 1000, "storm.nimbus.retry.intervalceiling.millis" 60000, "topology.trident.batch.emit.interval.millis" 500, "topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy", "storm.auth.simple-acl.users" [], "drpc.invocations.threads" 64, "java.library.path" "/usr", "ui.port" 8080, "storm.messaging.netty.transfer.batch.size" 262144, "logviewer.appender.name" "A1", "nimbus.thrift.max_buffer_size" 1048576, "nimbus.host" "dimebag", "storm.auth.simple-acl.users.commands" [], "drpc.request.timeout.secs" 600} 2015-12-16 12:56:36.947 b.s.util [DEBUG] Touching file at /home/fogetti/downloads/apache-storm-0.10.0/storm-local/workers/d3819964-7671-447d-8763-827ab5cd6140/pids/5076 2015-12-16 12:56:36.962 b.s.d.worker [ERROR] Error on initialization of server mk-worker java.io.FileNotFoundException: File '/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/stormconf.ser' does not exist at org.apache.storm.shade.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[storm-core-0.10.0.jar:0.10.0] at org.apache.storm.shade.org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[storm-core-0.10.0.jar:0.10.0] at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:222) ~[storm-core-0.10.0.jar:0.10.0] at backtype.storm.daemon.worker$fn__7098$exec_fn__1236__auto____7099.invoke(worker.clj:418) ~[storm-core-0.10.0.jar:0.10.0] at clojure.lang.AFn.applyToHelper(AFn.java:178) ~[clojure-1.6.0.jar:?] 
at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.6.0.jar:?] at clojure.core$apply.invoke(core.clj:624) ~[clojure-1.6.0.jar:?] at backtype.storm.daemon.worker$fn__7098$mk_worker__7175.doInvoke(worker.clj:409) [storm-core-0.10.0.jar:0.10.0] at clojure.lang.RestFn.invoke(RestFn.java:512) [clojure-1.6.0.jar:?] at backtype.storm.daemon.worker$_main.invoke(worker.clj:542) [storm-core-0.10.0.jar:0.10.0] at clojure.lang.AFn.applyToHelper(AFn.java:165) [clojure-1.6.0.jar:?] at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.6.0.jar:?] at backtype.storm.daemon.worker.main(Unknown Source) [storm-core-0.10.0.jar:0.10.0] 2015-12-16 12:56:36.984 b.s.util [ERROR] Halting process: ("Error on initialization") java.lang.RuntimeException: ("Error on initialization") at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:336) [storm-core-0.10.0.jar:0.10.0] at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.6.0.jar:?] at backtype.storm.daemon.worker$fn__7098$mk_worker__7175.doInvoke(worker.clj:409) [storm-core-0.10.0.jar:0.10.0] at clojure.lang.RestFn.invoke(RestFn.java:512) [clojure-1.6.0.jar:?] at backtype.storm.daemon.worker$_main.invoke(worker.clj:542) [storm-core-0.10.0.jar:0.10.0] at clojure.lang.AFn.applyToHelper(AFn.java:165) [clojure-1.6.0.jar:?] at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.6.0.jar:?] 
at backtype.storm.daemon.worker.main(Unknown Source) [storm-core-0.10.0.jar:0.10.0] {code} And this is what I see in the nimbus log: {code} 2015-12-16 12:52:26.929 b.s.d.nimbus [INFO] Delaying event :do-rebalance for 10 secs for phish-storm-topology-2-1450237054 2015-12-16 12:52:27.257 b.s.d.nimbus [INFO] [req 29] Access from: principal: op:getTopologyConf 2015-12-16 12:52:27.290 b.s.d.nimbus [INFO] [req 28] Access from: principal: op:getTopologyInfo 2015-12-16 12:52:27.376 b.s.d.nimbus [INFO] [req 34] Access from: principal: op:getTopology 2015-12-16 12:52:27.378 b.s.d.nimbus [INFO] [req 30] Access from: principal: op:getTopologyConf 2015-12-16 12:52:28.505 b.s.d.nimbus [INFO] [req 32] Access from: principal: op:getClusterInfo 2015-12-16 12:52:28.510 b.s.d.nimbus [INFO] [req 31] Access from: principal: op:getClusterInfo 2015-12-16 12:52:28.524 b.s.d.nimbus [INFO] [req 38] Access from: principal: op:getClusterInfo 2015-12-16 12:52:28.529 b.s.d.nimbus [INFO] [req 33] Access from: principal: op:getNimbusConf 2015-12-16 12:52:37.055 b.s.s.EvenScheduler [INFO] Available slots: (["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701] ["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700] ["6c528751-1a10-4c33-bd54-a1 ec9cb26d86" 6701]) 2015-12-16 12:52:37.057 b.s.d.nimbus [INFO] Setting new assignment for topology id phish-storm-topology-2-1450237054: #backtype.storm.daemon.common.Assignment{:master-code-dir "/home/fogetti/downloads/apache-storm-0.10.0/storm-local/ni mbus/stormdist/phish-storm-topology-2-1450237054", :node->host {"6c528751-1a10-4c33-bd54-a1ec9cb26d86" "petrucci", "2c80eadd-a7bd-4470-9ba5-d17be81c94ae" "hetfield"}, :executor->node+port {[8 8] ["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700], [7 7] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [5 6] ["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700], [3 4] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [9 10] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [2 2] 
["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700], [1 1] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700]}, :executor->start-time-secs {[9 10] 1450237957, [7 7] 1450237957, [3 4] 1450237957, [1 1] 1450237957, [8 8] 1450237957, [5 6] 1450237957, [2 2] 1450237957}} ... 2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor phish-storm-topology-2-1450237054:[8 8] not alive 2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor phish-storm-topology-2-1450237054:[2 2] not alive 2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor phish-storm-topology-2-1450237054:[5 6] not alive 2015-12-16 12:54:41.457 b.s.s.EvenScheduler [INFO] Available slots: (["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701] ["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6701]) 2015-12-16 12:54:41.457 b.s.d.nimbus [INFO] Reassigning phish-storm-topology-2-1450237054 to 2 slots 2015-12-16 12:54:41.457 b.s.d.nimbus [INFO] Reassign executors: [[8 8] [5 6] [2 2]] 2015-12-16 12:54:41.462 b.s.d.nimbus [INFO] Setting new assignment for topology id phish-storm-topology-2-1450237054: #backtype.storm.daemon.common.Assignment{:master-code-dir "/home/fogetti/downloads/apache-storm-0.10.0/storm-local/nimbus/stormdist/phish-storm-topology-2-1450237054", :node->host {"2c80eadd-a7bd-4470-9ba5-d17be81c94ae" "hetfield"}, :executor->node+port {[8 8] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701], [7 7] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [5 6] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701], [3 4] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [9 10] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700], [2 2] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701], [1 1] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700]}, :executor->start-time-secs {[7 7] 1450237957, [1 1] 1450237957, [8 8] 1450238081, [2 2] 1450238081, [9 10] 1450237957, [3 4] 1450237957, [5 6] 1450238081}} {code} Hope it helps. 
> [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does > not exist > ------------------------------------------------------------------------------------ > > Key: STORM-130 > URL: https://issues.apache.org/jira/browse/STORM-130 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Affects Versions: 0.9.2-incubating, 0.9.3, 0.9.4 > Reporter: James Xu > Assignee: Sriharsha Chintalapani > Priority: Minor > Fix For: 0.10.0, 0.9.5 > > Attachments: README.txt, nimbus.log.gz, supervisor.log.2015-08-26, > supervisor_logs.tar.gz, worker_logs.tar.gz, > worker_logs_of_kafka_traffic.tar.gz, > worker_logs_of_zookeeper_traffic_2015-04-11.tar.gz, > worker_logs_of_zookeeper_traffic_2015-04-12.tar.gz, > worker_logs_of_zookeeper_traffic_2015-04-13.tar.gz, > workers_with_stormconf.ser.gz > > > https://github.com/nathanmarz/storm/issues/438 > Hi developers, > We met critical issue with deploying storm topology to our prod cluster. > After deploying topology we got trace on workers (Storm > 0.8.2/zookeeper-3.3.6) : > 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, > connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181 > sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2 > 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server > zookeeper1.company.com/10.72.209.112:2181 > 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to > zookeeper1.company.com/10.72.209.112:2181, initiating session > 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on > server zookeeper1.company.com/10.72.209.112:2181, sessionid = > 0x13b3e4b5c780239, negotiated timeout = 20000 > 2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none > 2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed > 2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down > 2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] 
Starting > 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, > connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm > sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7 > 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server > zookeeper1.company.com/10.72.209.112:2181 > 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to > zookeeper1.company.com/10.72.209.112:2181, initiating session > 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on > server zookeeper1.company.com/10.72.209.112:2181, sessionid = > 0x13b3e4b5c78023a, negotiated timeout = 20000 > 2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker > java.io.FileNotFoundException: File > '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser' > does not exist > at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137) > at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135) > at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138) > at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146) > at > backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:332) > at clojure.lang.AFn.applyToHelper(AFn.java:185) > at clojure.lang.AFn.applyTo(AFn.java:151) > at clojure.core$apply.invoke(core.clj:601) > at > backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323) > at clojure.lang.RestFn.invoke(RestFn.java:512) > at backtype.storm.daemon.worker$_main.invoke(worker.clj:433) > at clojure.lang.AFn.applyToHelper(AFn.java:172) > at clojure.lang.AFn.applyTo(AFn.java:151) > at backtype.storm.daemon.worker.main(Unknown Source) > 2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization") > Supervisor trace: > 2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 > still hasn't started 
> 2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
> .......
> 2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
> 2013-01-14 10:59:35 supervisor [INFO] Worker d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Worker 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175. State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175. State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> Thanks!
> Oleg M.
> ----------
> xiaokang: We also encountered this problem, and the DEBUG log showed that storm-code-map is null while new-assignment is not null. It may be that new-assignment is read from zk after storm-code-map. So we changed storm-code-map to be read after all-assignment, and the problem disappeared.
> (defn mk-synchronize-supervisor [supervisor sync-processes event-manager processes-event-manager]
>   (fn this []
>     (let [conf (:conf supervisor)
>           storm-cluster-state (:storm-cluster-state supervisor)
>           ^ISupervisor isupervisor (:isupervisor supervisor)
>           ^LocalState local-state (:local-state supervisor)
>           sync-callback (fn [& ignored] (.add event-manager this))
>           storm-code-map (read-storm-code-locations storm-cluster-state sync-callback)
>           downloaded-storm-ids (set (read-downloaded-storm-ids conf))
>           all-assignment (read-assignments storm-cluster-state (:supervisor-id supervisor) sync-callback)
>           new-assignment (->> all-assignment (filter-key #(.confirmAssigned isupervisor %)))
>           assigned-storm-ids (assigned-storm-ids-from-port-assignments new-assignment)
>           existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
>       (log-debug "Synchronizing supervisor")
>       (log-debug "Storm code map: " storm-code-map)
>       (log-debug "Downloaded storm ids: " downloaded-storm-ids)
>       (log-debug "All assignment: " all-assignment)
>       (log-debug "New assignment: " new-assignment)
> ----------
> NJtwentyone: I ran into a similar problem once. Haven't had it lately. I was going to investigate more before I created a post, but... There are two schedule-recurring functions of interest in the function (mk-supervisor).
> (schedule-recurring (:timer supervisor) 0 10 (fn ))
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn )))
> So function 1 (synchronize-supervisor) will eventually remove that dir [../stormconf.ser],
> and function 2 (sync-processes) will eventually hang wanting to create workers, saying repeatedly
> 2013-01-13 22:13:13 b.s.d.supervisor [INFO] 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
> I'm using the distributed setup of (Storm 0.8.2/zookeeper-3.4.5) on my mac and an hp, using storm.starter.WordCountTopology
> ----------
> miggi: Well, this problem was resolved only after reinstalling a new instance of ZK (3.3.6). We had other issues with deploying the topology on version 0.8.1, so it was decided to use a new ZK instance.
> Btw: removing the storm data folders didn't help
> ----------
> nathanmarz: For those of you hitting the problem, it would be really useful if you could turn on DEBUG logging on the supervisor and show those logs when the problem is hit.
> ----------
> miggi: After a few months of stable work, we ran into the same issue again.
> Logs ("topology.debug: true")
> Supervisor logs: http://cl.ly/0p2W3N0A2V3S
> Worker logs: http://cl.ly/2w1j1r0a3X46
> Cleaning/restarting the ZK/Nimbus/Worker components didn't help.
> Now we're down: we haven't tried the solution proposed by xiaokang, but that would probably be our next step.
> Any fresh ideas?
> ----------
> devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing out the Storm state in Zk, wiping out the Storm data directory, and restarting the topology, but it happens again in time.
> Storm 0.9.0-wip16
> Zk 3.3.6
> ----------
> d2r: +1, seen this happen many times, same work-around.
> Most recently Storm 0.8.2-wip22.
> I'll try to collect debug logs from the supervisor and have a look.
> ----------
> miggi: Hey guys,
> try restarting your cluster and ensure that all previous workers (java processes) were killed after this operation.
> Also, we've updated the cluster to storm 0.8.3-wip1 with a possible fix.
> Hope it helps
> ----------
> Oleg M.
> ----------
> devoncrouse: A simple restart doesn't cut it; you have to totally clear out all Storm state in Zk and the data directory (otherwise the cluster quickly becomes unbalanced with the same errors in the logs) - and again, we're on 0.9.0-wip16.
> ----------
> xiaokang: We encountered the same problem in a production cluster. The log showed that a worker died and the supervisor and nimbus found that out at the same time. Then nimbus assigned the worker to another host, and the supervisor's sync-supervisor deleted the topology's code while sync-processes tried to restart the dead, timed-out worker.
> Our solution is to change the supervisor's worker timeout from 30 to 5 seconds while nimbus remains at 30. It works.
> ----------
> d2r: We have tried the same work-around from @xiaokang, but we still see the issue.
> ----------
> nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let me know if it fixes the issue for you.
> ----------
> d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did resolve the issue for us. Thanks!
> ----------
> devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0) as well?
> d5207b5
> ----------
> viceyang: We encountered the same problem; our version is 0.8.3. I read the supervisor code, and it seems this is not a bug. The key reason is that the "synchronize-supervisor" thread, which is responsible for downloading and removing files, and the "sync-processes" thread, which is responsible for starting worker processes, are asynchronous.
> Consider this case: synchronize-supervisor reads the assignment information from zk, and the supervisor downloads the necessary files from nimbus and writes the local state. In another thread, the sync-processes function reads the local state to launch the worker process. If, while the worker process has not yet started, the synchronize-supervisor function is called again and the topology's assignment information has changed (caused by a rebalance, a worker timeout, etc.) such that the worker's assignment has moved to another supervisor, then synchronize-supervisor removes the now-unnecessary files (the jar file, ser file, etc.). After this, when the worker launched by sync-processes runs, the ser file does not exist, and this issue occurs.
> Though an exception occurs in this issue, the logic seems right. This exception can't be completely eliminated unless the asynchronous threading is changed.
> There are some ways to reduce how often the exception occurs. @xiaokang mentioned "change supervisor's worker timeout from 30 to 5 seconds and nimbus remain 30"; this method reduces how often nimbus reassigns the topology. Another way is to change the "synchronize-supervisor" thread's loop time to something longer than the 10-second default, such as 30 sec.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
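The interleaving viceyang describes can be modeled with a toy supervisor whose two loops share only the set of downloaded topologies. Every name here is illustrative; this is not Storm's actual API, just a sketch of the race under those assumptions.

```python
class ToySupervisor:
    """Toy model of the race between the two asynchronous supervisor loops."""

    def __init__(self):
        self.downloaded = set()  # topology ids whose code is on local disk

    def synchronize_supervisor(self, zk_assignment):
        # Download code for newly assigned topologies, then delete the code
        # (jar, stormconf.ser, ...) of topologies no longer assigned here.
        self.downloaded |= zk_assignment
        self.downloaded &= zk_assignment

    def sync_processes(self, topology_id):
        # Launching a worker only succeeds if the code is still on disk.
        return topology_id in self.downloaded

s = ToySupervisor()
s.synchronize_supervisor({"topo-1"})     # assigned: code is downloaded
pending_launch = "topo-1"                # worker decided on, not yet started
s.synchronize_supervisor(set())          # reassignment before the launch:
                                         # the code dir is removed again
print(s.sync_processes(pending_launch))  # False -> FileNotFoundException
```

Shortening the supervisor's worker timeout or lengthening the synchronize-supervisor loop interval, as suggested above, only narrows the window between the last two calls; as long as the two loops stay asynchronous, the window cannot be closed entirely.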