[ https://issues.apache.org/jira/browse/STORM-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307781#comment-14307781 ]
ASF GitHub Bot commented on STORM-130:
--------------------------------------

Github user harshach commented on the pull request:

    https://github.com/apache/storm/pull/401#issuecomment-73105790

    @revans2 @ptgoetz opened a PR against 0.9.3-branch https://github.com/apache/storm/pull/418 . Also updated the current PR to handle another case. This issue can also happen if the supervisors are down, a user kills a topology, and the supervisors are then restarted: those supervisors can crash with the same error, because sync-processes can start after the local files of the deleted topology have already been cleaned up. In general, having a try/catch around launch-worker is a good idea, since there is no need to kill the supervisor itself if launch-worker fails. (A rough sketch of such a guard appears at the end of this message.)

> [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does not exist
> ------------------------------------------------------------------------------------
>
>                 Key: STORM-130
>                 URL: https://issues.apache.org/jira/browse/STORM-130
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: James Xu
>            Assignee: Sriharsha Chintalapani
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/438
>
> Hi developers,
> We hit a critical issue while deploying a Storm topology to our production cluster. After deploying the topology we got the following trace on the workers (Storm 0.8.2 / ZooKeeper 3.3.6):
>
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181 sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on server zookeeper1.company.com/10.72.209.112:2181, sessionid = 0x13b3e4b5c780239, negotiated timeout = 20000
> 2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none
> 2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed
> 2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down
> 2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] Starting
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on server zookeeper1.company.com/10.72.209.112:2181, sessionid = 0x13b3e4b5c78023a, negotiated timeout = 20000
> 2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker
> java.io.FileNotFoundException: File '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser' does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
> at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
> at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
> at backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:332)
> at clojure.lang.AFn.applyToHelper(AFn.java:185)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at clojure.core$apply.invoke(core.clj:601)
> at backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
> at clojure.lang.RestFn.invoke(RestFn.java:512)
> at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
> at clojure.lang.AFn.applyToHelper(AFn.java:172)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at backtype.storm.daemon.worker.main(Unknown Source)
> 2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization")
>
> Supervisor trace:
>
> 2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
> 2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
> .......
> 2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
> 2013-01-14 10:59:35 supervisor [INFO] Worker d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Worker 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175. State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175. State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
>
> Thanks!
> Oleg M.
>
> ----------
>
> xiaokang: We also encountered this problem, and the DEBUG log showed that storm-code-map was null while new-assignment was not null. It may be that new-assignment is read from ZK after storm-code-map, so we changed the code to read storm-code-map after all-assignment and the problem disappeared. (A sketch of this reordering appears after the quoted code below.)
> (defn mk-synchronize-supervisor [supervisor sync-processes event-manager processes-event-manager]
>   (fn this []
>     (let [conf (:conf supervisor)
>           storm-cluster-state (:storm-cluster-state supervisor)
>           ^ISupervisor isupervisor (:isupervisor supervisor)
>           ^LocalState local-state (:local-state supervisor)
>           sync-callback (fn [& ignored] (.add event-manager this))
>           storm-code-map (read-storm-code-locations storm-cluster-state sync-callback)
>           downloaded-storm-ids (set (read-downloaded-storm-ids conf))
>           all-assignment (read-assignments
>                            storm-cluster-state
>                            (:supervisor-id supervisor)
>                            sync-callback)
>           new-assignment (->> all-assignment
>                               (filter-key #(.confirmAssigned isupervisor %)))
>           assigned-storm-ids (assigned-storm-ids-from-port-assignments new-assignment)
>           existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
>       (log-debug "Synchronizing supervisor")
>       (log-debug "Storm code map: " storm-code-map)
>       (log-debug "Downloaded storm ids: " downloaded-storm-ids)
>       (log-debug "All assignment: " all-assignment)
>       (log-debug "New assignment: " new-assignment)
>
> ----------
>
> NJtwentyone: I ran into a similar problem once. Haven't had it lately. I was going to investigate more before writing a post, but... There are two schedule-recurring calls of interest in the mk-supervisor function:
>
> (schedule-recurring (:timer supervisor) 0 10 (fn ))
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn )))
>
> So function 1 (synchronize-supervisor) will eventually remove that dir [../stormconf.ser], and function 2 (sync-processes) will eventually hang wanting to create workers, repeatedly saying
>
> 2013-01-13 22:13:13 b.s.d.supervisor [INFO] 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
>
> I'm using the distributed setup (Storm 0.8.2 / ZooKeeper 3.4.5) on my Mac and an HP, running storm.starter.WordCountTopology.
>
> ----------
>
> miggi: Well, this problem was resolved only after installing a new instance of ZK (3.3.6); we had other issues with deploying the topology on the 0.8.1 version, so it was decided to use a new ZK instance.
> Btw: removing the Storm data folders didn't help.
>
> ----------
>
> nathanmarz: For those of you hitting the problem, it would be really useful if you could turn on DEBUG logging on the supervisor and show those logs when the problem is hit.
>
> ----------
>
> miggi: After a few months of stable operation, we ran into the same issue again.
> Logs ("topology.debug: true"):
> Supervisor logs: http://cl.ly/0p2W3N0A2V3S
> Worker logs: http://cl.ly/2w1j1r0a3X46
> Cleaning/restarting the ZK/Nimbus/Worker components didn't help.
> We are down now: we haven't tried the solution proposed by xiaokang, but it would probably be our next step.
> Any fresh ideas?
>
> ----------
>
> devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing out the Storm state in ZK, wiping out the Storm data directory, and restarting the topology, but it happens again in time.
> Storm 0.9.0-wip16
> Zk 3.3.6
>
> ----------
>
> d2r: +1, seen this happen many times, same work-around.
> Most recently Storm 0.8.2-wip22.
> I'll try to collect DEBUG logs from the supervisor and have a look.
>
> ----------
>
> miggi: Hey guys,
> try restarting your cluster and ensure that all previous workers (java processes) were killed after this operation.
> Also, we've updated the cluster to storm 0.8.3-wip1 with a possible fix.
> Hope it helps
> ----------
> Oleg M.
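As a rough illustration of the reordering xiaokang describes above (a sketch only, not the actual patch that shipped in 0.8.3), the change amounts to binding all-assignment before storm-code-map in the let block quoted above, so that the code-location snapshot is taken after the assignment snapshot; the helper names are the ones from the quoted supervisor.clj:

    ;; Illustrative only: read assignments from ZooKeeper first, then the code
    ;; locations, so storm-code-map reflects state at least as new as the
    ;; assignment it is compared against (which, per xiaokang's report, avoided
    ;; the null storm-code-map they observed).
    (let [all-assignment (read-assignments storm-cluster-state
                                           (:supervisor-id supervisor)
                                           sync-callback)
          storm-code-map (read-storm-code-locations storm-cluster-state
                                                    sync-callback)
          new-assignment (->> all-assignment
                              (filter-key #(.confirmAssigned isupervisor %)))]
      ;; ... remainder as in the quoted function ...
      )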
> ----------
>
> devoncrouse: A simple restart doesn't cut it; we have to totally clear out all Storm state in ZK and the data directory (otherwise the cluster quickly becomes unbalanced with the same errors in the logs) - and again, we're on 0.9.0-wip16.
>
> ----------
>
> xiaokang: We encountered the same problem in a production cluster. The log showed that a worker died and the supervisor and nimbus both noticed at the same time. Then nimbus assigned the worker to another host, and the supervisor's sync-supervisor deleted the topology's code while sync-processes tried to restart the dead, timed-out worker.
> Our solution was to change the supervisor's worker timeout from 30 to 5 seconds while nimbus remains at 30. It works.
>
> ----------
>
> d2r: We have tried the same work-around from @xiaokang, but we still see the issue.
>
> ----------
>
> nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let me know if it fixes the issue for you.
>
> ----------
>
> d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did resolve the issue for us. Thanks!
>
> ----------
>
> devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0) as well?
> d5207b5
>
> ----------
>
> viceyang: We encountered the same problem; our version is 0.8.3. I read the supervisor code, and it seems this is not a bug. The key reason is that "synchronize-supervisor", the thread responsible for downloading and removing files, and "sync-processes", the thread responsible for starting worker processes, run asynchronously.
> Consider this case: synchronize-supervisor reads assignment information from ZK, and the supervisor downloads the necessary files from nimbus and writes local state. In another thread, the sync-processes function reads local state to launch the worker process. If, before the worker process has started, synchronize-supervisor is called again and the topology's assignment information has changed (caused by a rebalance, a worker timeout, etc.) so that the worker assigned to this supervisor has moved to another supervisor, then synchronize-supervisor removes the now-unnecessary files (jar file, ser file, etc.). After this, when the worker is launched by sync-processes, the ser file no longer exists, and this issue occurs.
> So although an exception occurs in this case, the logic seems right. This exception can't be completely eliminated unless the asynchronous threading is changed.
> There are ways to reduce how often the exception occurs. @xiaokang mentioned "change supervisor's worker timeout from 30 to 5 seconds and nimbus remain 30"; this method reduces how often nimbus reassigns the topology. Another way is to change the "synchronize-supervisor" thread's loop period to something longer than the default 10 seconds, such as 30 seconds.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
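Regarding harshach's suggestion at the top of this message about putting a try/catch around launch-worker: below is a minimal sketch of what such a guard could look like inside sync-processes, assuming the launch-worker arity used in supervisor.clj at the time (supervisor, storm-id, port, worker-id); the actual patch in the linked PRs may differ:

    ;; Sketch only: if launching one worker fails (for example because
    ;; stormconf.ser was already cleaned up by synchronize-supervisor), log the
    ;; error and move on instead of letting the exception kill the supervisor
    ;; daemon.
    (try
      (launch-worker supervisor storm-id port worker-id)
      (catch Exception e
        (log-error e "Failed to launch worker for storm id " storm-id
                   " on port " port "; supervisor will keep running")))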
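On the timeout work-around mentioned by xiaokang and viceyang: on the supervisor side this corresponds to lowering the supervisor worker timeout in storm.yaml while leaving the nimbus-side timeout at its default; the key name below is an assumption to verify against your Storm version. Note that the 10-second synchronize-supervisor period is hard-coded in mk-supervisor (see the schedule-recurring calls quoted by NJtwentyone), so lengthening it means editing supervisor.clj rather than configuration.

    # storm.yaml on the supervisor nodes (illustrative; verify key names for
    # your Storm version). Per xiaokang: supervisor-side worker timeout
    # lowered to 5s, nimbus-side timeouts left at their 30s defaults.
    supervisor.worker.timeout.secs: 5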