[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743763#comment-14743763 ]
Erik Weathers commented on STORM-1043:
--------------------------------------

As I replied in the issue that [~ernisv] raised (), the solution is to leverage Mesos's ability to put each framework's data into a separate sandbox. Just *don't* set {storm.local.dir} and the cwd of the Mesos Executor will be used for the Supervisor, which will be in the supervisor-specific sandbox.

FYI [~revans2], the ports are taken care of automatically by Mesos's scheduler/offer system, as they are considered part of the resources that each topology is claiming on the Mesos worker nodes ("mesos-slave" has now been renamed as "mesos-agent").

> Concurrent access to state on local FS by multiple supervisors
> --------------------------------------------------------------
>
>                 Key: STORM-1043
>                 URL: https://issues.apache.org/jira/browse/STORM-1043
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.5
>            Reporter: Ernestas Vaiciukevičius
>              Labels: mesosphere
>
> Hi,
> we are running a storm-mesos cluster and occasionally workers die or are "lost"
> in mesos. When this happens it often coincides with errors in logs related to
> the supervisors' local state.
> By looking at the storm code it seems this might be caused by the way
> multiple supervisor processes access the local state in the same directory
> via VersionedStore.
> For example:
> https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434
> Here every supervisor does this concurrently:
> 1. reads latest state from FS
> 2. possibly updates the state
> 3. writes the new version of the state
> Some updates could be lost if there are 2+ supervisors and they execute the above
> steps concurrently - then only the updates from the last supervisor would remain
> in the last state version on the disk.
> We observed local state changes quite often (every few seconds), so the likelihood of
> this concurrency issue occurring is high.
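The read/update/write steps in the report can be illustrated with a minimal sketch. Everything here is hypothetical: LostUpdateSketch is an in-memory stand-in for a versioned store, not Storm's VersionedStore or LocalState, and the interleaving is forced by hand rather than produced by real concurrent processes. It only models the lost-update case; the "Version already exists" and FileNotFoundException symptoms are not reproduced.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a versioned state directory.
// This is NOT Storm's VersionedStore; names and behavior are assumptions.
class LostUpdateSketch {
    // version number -> full snapshot of the state written at that version
    private final TreeMap<Long, Map<String, String>> versions = new TreeMap<>();

    LostUpdateSketch() {
        versions.put(0L, new HashMap<>()); // start with an empty state
    }

    // Step 1: read the latest snapshot (a copy, as if deserialized from disk).
    Map<String, String> readLatest() {
        return new HashMap<>(versions.lastEntry().getValue());
    }

    // Step 3: write the whole snapshot back as a new version.
    void writeNewVersion(Map<String, String> snapshot) {
        versions.put(versions.lastKey() + 1, new HashMap<>(snapshot));
    }

    // Runs two writers, either against one shared store or one store each,
    // interleaved so that both read before either writes. Returns how many
    // of the two updates are still visible in the latest version(s).
    static int survivingUpdates(boolean sharedStore) {
        LostUpdateSketch storeA = new LostUpdateSketch();
        LostUpdateSketch storeB = sharedStore ? storeA : new LostUpdateSketch();

        Map<String, String> a = storeA.readLatest();  // supervisor A, step 1
        Map<String, String> b = storeB.readLatest();  // supervisor B, step 1
        a.put("supervisor-a", "update");              // A, step 2
        b.put("supervisor-b", "update");              // B, step 2
        storeA.writeNewVersion(a);                    // A, step 3
        storeB.writeNewVersion(b);                    // B, step 3 (overwrites A's view if shared)

        if (sharedStore) {
            return storeA.readLatest().size();
        }
        return storeA.readLatest().size() + storeB.readLatest().size();
    }

    public static void main(String[] args) {
        System.out.println("shared directory: " + survivingUpdates(true));
        System.out.println("separate directories: " + survivingUpdates(false));
    }
}
```

With one shared store, the second write is based on a stale read, so only one of the two updates survives. With one store per writer, which is roughly what a per-supervisor sandbox directory provides, both survive.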
> Some examples of exceptions:
> ------------------------------------------
> java.lang.RuntimeException: Version already exists or data already exists
>     at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at backtype.storm.utils.LocalState.persist(LocalState.java:101)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at backtype.storm.utils.LocalState.put(LocalState.java:82)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at backtype.storm.utils.LocalState.put(LocalState.java:76)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at
>     backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> ---------------------------------------
> java.io.FileNotFoundException: File
> '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
>     at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299)
>     ~[commons-io-2.4.jar:2.4]
>     at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763)
>     ~[commons-io-2.4.jar:2.4]
>     at
>     backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at backtype.storm.utils.LocalState.snapshot(LocalState.java:47)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at backtype.storm.utils.LocalState.get(LocalState.java:72)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
>     at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
>     at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
>     at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
>     at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
>     at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40)
>     ~[storm-core-0.9.5.jar:0.9.5]
>     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> -----------------------------------------

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)