[ 
https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743763#comment-14743763
 ] 

Erik Weathers commented on STORM-1043:
--------------------------------------

As I replied in the issue that [~ernisv] raised (), the solution is to leverage 
Mesos's ability to put each framework's data into separate sandboxes.  Just *don't* 
set {storm.local.dir} and the cwd of the Mesos Executor will be used as the 
Supervisor's local directory, which will be inside the supervisor-specific sandbox.
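
To illustrate why leaving the setting alone isolates the supervisors, here is a 
minimal sketch. It is not Storm or storm-mesos code: resolve_local_dir, the 
"storm-local" directory name and the /sandbox/... paths are all hypothetical, 
and the only assumption is that a non-absolute local dir is resolved against 
the process's cwd.

```python
import posixpath


def resolve_local_dir(configured, sandbox_cwd):
    """Hypothetical helper: pick the directory a supervisor keeps its state in.

    An absolute path is used as-is, so every supervisor on the host shares it.
    Anything else is resolved under the executor's cwd (its sandbox).
    """
    if configured and posixpath.isabs(configured):
        return configured
    # "storm-local" is a placeholder name for the unset case.
    return posixpath.join(sandbox_cwd, configured or "storm-local")


# An absolute setting: both supervisors end up in the same directory.
shared_a = resolve_local_dir("/var/lib/storm", "/sandbox/a")
shared_b = resolve_local_dir("/var/lib/storm", "/sandbox/b")

# Setting left alone: each supervisor stays inside its own sandbox.
own_a = resolve_local_dir(None, "/sandbox/a")
own_b = resolve_local_dir(None, "/sandbox/b")
```

With the absolute path both supervisors resolve to the same directory; with 
the setting left out they resolve to different ones.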

FYI [~revans2], the ports are taken care of automatically by Mesos's 
scheduler/offer system, as they are considered part of the resources that each 
topology is claiming on the Mesos worker nodes ("mesos-slave" has now been 
renamed as "mesos-agent").

> Concurrent access to state on local FS by multiple supervisors
> --------------------------------------------------------------
>
>                 Key: STORM-1043
>                 URL: https://issues.apache.org/jira/browse/STORM-1043
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.5
>            Reporter: Ernestas Vaiciukevičius
>              Labels: mesosphere
>
> Hi,
> we are running a storm-mesos cluster and occasionally workers die or are "lost" 
> in Mesos. When this happens, it often coincides with errors in the logs related 
> to the supervisor's local state.
> By looking at the Storm code, it seems this might be caused by the way 
> multiple supervisor processes access the local state in the same directory 
> via VersionedStore.
> For example: 
> https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434
> Here every supervisor does this concurrently:
> 1. reads latest state from FS
> 2. possibly updates the state
> 3. writes the new version of the state
> Some updates could be lost if there are 2+ supervisors and they execute the 
> above steps concurrently: only the updates from the last supervisor to write 
> would remain in the latest state version on disk.
> We observed that the local state changes quite often (every few seconds), so 
> the likelihood of this concurrency issue occurring is high.
> Some examples of exceptions:
> ------------------------------------------
> java.lang.RuntimeException: Version already exists or data already exists
> at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.persist(LocalState.java:101) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.put(LocalState.java:82) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.put(LocalState.java:76) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at 
> backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382)
>  ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> ---------------------------------------
> java.io.FileNotFoundException: File 
> '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) 
> ~[commons-io-2.4.jar:2.4]
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) 
> ~[commons-io-2.4.jar:2.4]
> at 
> backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.get(LocalState.java:72) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
> at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
> at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
> at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
> at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
> at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) 
> ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> -----------------------------------------
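
The read / update / write sequence described in the issue can be reproduced 
with a small sketch. This is a simplified stand-in, not Storm's VersionedStore 
or LocalState: state is kept as JSON files named by an integer version, and 
read_latest, write_version, interleaved_updates and the worker/port values are 
hypothetical names chosen for illustration. The error message mimics the one 
in the first trace above.

```python
import json
import os
import tempfile


def read_latest(state_dir):
    """Step 1: return the state stored in the highest-numbered version file."""
    versions = sorted(int(n) for n in os.listdir(state_dir) if n.isdigit())
    if not versions:
        return {}
    with open(os.path.join(state_dir, str(versions[-1]))) as f:
        return json.load(f)


def write_version(state_dir, version, state):
    """Step 3: write a new version file, refusing to overwrite an existing one."""
    path = os.path.join(state_dir, str(version))
    if os.path.exists(path):
        raise RuntimeError("Version already exists or data already exists")
    with open(path, "w") as f:
        json.dump(state, f)


def interleaved_updates(dir_a, dir_b, ver_a, ver_b):
    """Two supervisors both read before either writes (the bad interleaving)."""
    state_a = read_latest(dir_a)               # supervisor A, step 1
    state_b = read_latest(dir_b)               # supervisor B, step 1
    state_a["worker-a"] = "port-31000"         # step 2
    state_b["worker-b"] = "port-31001"
    write_version(dir_a, ver_a, state_a)       # step 3
    write_version(dir_b, ver_b, state_b)


# Shared directory, different versions: A's update is silently lost.
shared = tempfile.mkdtemp()
interleaved_updates(shared, shared, 1, 2)
lost_update = read_latest(shared)

# Shared directory, same version: the second write fails.
collision_dir = tempfile.mkdtemp()
try:
    interleaved_updates(collision_dir, collision_dir, 1, 1)
    collided = False
except RuntimeError:
    collided = True

# Separate directories (one per sandbox): both updates survive.
sandbox_a, sandbox_b = tempfile.mkdtemp(), tempfile.mkdtemp()
interleaved_updates(sandbox_a, sandbox_b, 1, 1)
```

In the shared directory the latest version holds only supervisor B's entry, 
or the second write fails outright when both pick the same version. With one 
directory per supervisor neither problem can occur, because no other process 
reads or writes that directory.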



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
