[ https://issues.apache.org/jira/browse/STORM-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307781#comment-14307781 ]

ASF GitHub Bot commented on STORM-130:
--------------------------------------

Github user harshach commented on the pull request:

    https://github.com/apache/storm/pull/401#issuecomment-73105790
  
    @revans2 @ptgoetz opened a PR against the 0.9.3 branch: 
    https://github.com/apache/storm/pull/418. Also updated the current PR to 
    handle another case.
    This issue can happen if supervisors are down, a user kills a topology, 
    and the supervisors are then restarted. These supervisors can crash with 
    the same issue, as sync-processes can start after the locally deleted 
    topology files have been cleaned up. In general, having a try/catch 
    around launch-worker is a good idea, as there is no need to kill the 
    supervisor itself if launch-worker fails.
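
    A minimal sketch of that guard, assuming the launch-worker function and 
    log-error helper from Storm's Clojure supervisor (the exact arguments 
    here are illustrative, not the verbatim signatures):

        ;; If launching one worker fails (e.g. its stormconf.ser was already
        ;; removed by synchronize-supervisor), log the error and skip this
        ;; worker instead of letting the exception halt the whole supervisor.
        (try
          (launch-worker supervisor storm-id port worker-id)
          (catch Exception e
            (log-error e "Failed to launch worker " worker-id
                       " for topology " storm-id)))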


> [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does 
> not exist
> ------------------------------------------------------------------------------------
>
>                 Key: STORM-130
>                 URL: https://issues.apache.org/jira/browse/STORM-130
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: James Xu
>            Assignee: Sriharsha Chintalapani
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/438
> Hi developers,
> We hit a critical issue deploying a storm topology to our prod cluster.
> After deploying the topology, we got this trace on the workers (Storm 
> 0.8.2/zookeeper-3.3.6):
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, 
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181
>  sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server 
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to 
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on 
> server zookeeper1.company.com/10.72.209.112:2181, sessionid = 
> 0x13b3e4b5c780239, negotiated timeout = 20000
> 2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none
> 2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed
> 2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down
> 2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] Starting
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, 
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm
>  sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server 
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to 
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on 
> server zookeeper1.company.com/10.72.209.112:2181, sessionid = 
> 0x13b3e4b5c78023a, negotiated timeout = 20000
> 2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker
> java.io.FileNotFoundException: File 
> '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser'
>  does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
> at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
> at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
> at 
> backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:332)
> at clojure.lang.AFn.applyToHelper(AFn.java:185)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at clojure.core$apply.invoke(core.clj:601)
> at 
> backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
> at clojure.lang.RestFn.invoke(RestFn.java:512)
> at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
> at clojure.lang.AFn.applyToHelper(AFn.java:172)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at backtype.storm.daemon.worker.main(Unknown Source)
> 2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization")
> Supervisor trace:
> 2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 
> still hasn't started
> 2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 
> still hasn't started
> .......
> 2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 
> still hasn't started
> 2013-01-14 10:59:35 supervisor [INFO] Worker 
> d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Worker 
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id 
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175. 
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down 
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is 
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down 
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id 
> d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175. 
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down 
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is 
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down 
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> Thanks!
> Oleg M.
> ----------
> xiaokang: We also encountered this problem, and the DEBUG log showed that 
> storm-code-map is null while new-assignment is not null. It may be that 
> new-assignment is read from ZK after storm-code-map. So we moved the 
> storm-code-map read after all-assignment, and the problem disappeared.
> (defn mk-synchronize-supervisor [supervisor sync-processes event-manager 
>                                  processes-event-manager]
>   (fn this []
>     (let [conf (:conf supervisor)
>           storm-cluster-state (:storm-cluster-state supervisor)
>           ^ISupervisor isupervisor (:isupervisor supervisor)
>           ^LocalState local-state (:local-state supervisor)
>           sync-callback (fn [& ignored] (.add event-manager this))
>           storm-code-map (read-storm-code-locations storm-cluster-state 
>                                                     sync-callback)
>           downloaded-storm-ids (set (read-downloaded-storm-ids conf))
>           all-assignment (read-assignments
>                            storm-cluster-state
>                            (:supervisor-id supervisor)
>                            sync-callback)
>           new-assignment (->> all-assignment
>                               (filter-key #(.confirmAssigned isupervisor %)))
>           assigned-storm-ids (assigned-storm-ids-from-port-assignments 
>                                new-assignment)
>           existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
>       (log-debug "Synchronizing supervisor")
>       (log-debug "Storm code map: " storm-code-map)
>       (log-debug "Downloaded storm ids: " downloaded-storm-ids)
>       (log-debug "All assignment: " all-assignment)
>       (log-debug "New assignment: " new-assignment)
> ----------
> NJtwentyone: I ran into a similar problem once. Haven't had it lately. I 
> was going to investigate more before I created a post, but... There are two 
> schedule-recurring calls of interest in the function (mk-supervisor):
> (schedule-recurring (:timer supervisor) 0 10 (fn ...))
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn ...))
> So function-1 (synchronize-supervisor) will eventually remove that file 
> [../stormconf.ser], and function-2 (sync-processes) will eventually hang 
> wanting to create workers, saying repeatedly:
> 2013-01-13 22:13:13 b.s.d.supervisor [INFO] 
> 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
> I'm using the distributed setup of (Storm 0.8.2/zookeeper-3.4.5) on my Mac 
> and an HP, running storm.starter.WordCountTopology.
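> A hedged reconstruction of those two timers (the fn bodies were elided 
> above; the event-manager wiring here is an assumption based on the 0.8.x 
> supervisor, not a verbatim quote):
> ;; function-1: every 10 seconds (hard-coded), queue synchronize-supervisor,
> ;; which downloads code for new assignments and deletes code for removed ones
> (schedule-recurring (:timer supervisor) 0 10
>                     (fn [] (.add event-manager synchronize-supervisor)))
> ;; function-2: every SUPERVISOR-MONITOR-FREQUENCY-SECS, queue sync-processes,
> ;; which starts/kills local worker processes from the last-written local state
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn [] (.add processes-event-manager sync-processes)))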
> ----------
> miggi: Well, this problem was resolved only after installing a new instance 
> of ZK (3.3.6). We had other issues with deploying topologies on the 0.8.1 
> version, so it was decided to use a new ZK instance.
> Btw: removing the storm data folders didn't help.
> ----------
> nathanmarz: For those of you hitting the problem, it would be really useful 
> if you could turn on DEBUG logging on the supervisor and show those logs when 
> the problem is hit.
> ----------
> miggi: After a few months of stable operation, we ran into the same issue 
> again.
> Logs ("topology.debug: true"):
> Supervisor logs: http://cl.ly/0p2W3N0A2V3S
> Worker logs: http://cl.ly/2w1j1r0a3X46
> Cleaning/restarting the ZK/Nimbus/Worker components didn't help.
> We are down now: we haven't tried the solution proposed by xiaokang, but it 
> would probably be our next step.
> Any fresh ideas?
> ----------
> devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing 
> out the Storm state in ZK, wiping out the Storm data directory, and 
> restarting the topology, but it happens again in time.
> Storm 0.9.0-wip16
> ZK 3.3.6
> ----------
> d2r: +1, seen this happen many times, same work-around.
> Most recently Storm 0.8.2-wip22.
> I'll try to collect debug logs from the supervisor and have a look.
> ----------
> miggi: Hey guys,
> try restarting your cluster and ensure that all previous workers (Java 
> processes) were killed after this operation.
> Also, we've updated the cluster to storm 0.8.3-wip1 with a possible fix.
> Hope it helps
> --------
> Oleg M.
> ----------
> devoncrouse: A simple restart doesn't cut it; we have to totally clear out 
> all Storm state in ZK and the data directory (the cluster otherwise quickly 
> becomes unbalanced, with the same errors in the logs) - and again, we're on 
> 0.9.0-wip16.
> ----------
> xiaokang: We encountered the same problem in a production cluster. The log 
> showed that a worker died and the supervisor and nimbus both noticed at the 
> same time. Then nimbus assigned the worker to another host, and the 
> supervisor's sync-supervisor deleted the topology's code while 
> sync-processes tried to restart the dead, timed-out worker.
> Our solution is to change the supervisor's worker timeout from 30 to 5 
> seconds while nimbus remains at 30. It works.
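> For reference, a sketch of that tuning as storm.yaml entries (key names as 
> in Storm's defaults.yaml; the values follow the description above):
> # supervisor: consider a worker dead after 5 seconds of missed heartbeats,
> # so the local restart wins the race against reassignment by nimbus
> supervisor.worker.timeout.secs: 5
> # nimbus: keep the 30-second default, so nimbus reassigns the worker's
> # tasks later than the supervisor restarts the worker locally
> nimbus.task.timeout.secs: 30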
> ----------
> d2r: We have tried the same work-around from @xiaokang, but we still see the 
> issue.
> ----------
> nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let 
> me know if it fixes the issue for you.
> ----------
> d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did 
> resolve the issue for us. Thanks!
> ----------
> devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0) 
> as well?
> d5207b5
> ---------
> viceyang: We encountered the same problem; our version is 0.8.3. I read the 
> supervisor code, and it seems this is not a bug. The key reason is that 
> "synchronize-supervisor", the thread responsible for downloading and 
> removing files, and "sync-processes", the thread responsible for starting 
> worker processes, run asynchronously.
> Consider this case: synchronize-supervisor reads assignment information 
> from ZK, downloads the necessary files from nimbus, and writes the local 
> state. In another thread, the sync-processes function reads the local state 
> to launch the worker process. If, before the worker process has started, 
> synchronize-supervisor is called again and the topology's assignment 
> information has changed (caused by a rebalance, a worker timeout, etc.) so 
> that the worker assigned to this supervisor has moved to another 
> supervisor, then synchronize-supervisor removes the now-unnecessary files 
> (jar file, ser file, etc.). After this, the worker launched by 
> "sync-processes" finds the ser file does not exist, and this issue occurs.
> Though an exception occurs in this scenario, the logic seems right. This 
> exception can't be completely eliminated without changing the asynchronous 
> threading design.
> There are ways to reduce how often the exception occurs. @xiaokang 
> mentioned "change supervisor's worker timeout from 30 to 5 seconds and 
> nimbus remain 30"; this reduces nimbus reassigning the topology. Another 
> way is to change the "synchronize-supervisor" thread's loop period to 
> something longer than the 10-second default, such as 30 seconds, as 
> sketched below.
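> A minimal sketch of that second mitigation (an illustration, not an 
> upstream patch): in mk-supervisor, lengthen the hard-coded 10-second 
> synchronize-supervisor period, so files are downloaded and removed less 
> frequently relative to worker launches:
> ;; was: (schedule-recurring (:timer supervisor) 0 10 ...)
> (schedule-recurring (:timer supervisor) 0 30
>                     (fn [] (.add event-manager synchronize-supervisor)))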



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
