Hi everyone, I found how to fix it. On my cluster we use NFS, and I had not changed my storm.yaml to define *storm.local.dir*. If you don't define it, Storm takes it from the file defaults.yaml, which sets *storm.local.dir: "storm-local"*.
The problem, when you have NFS on your cluster, is that this creates the storm-local directory in your home. With NFS, as you probably know, every node has access to the same home. So when Storm creates its "storm-local" directory, it creates a directory named "supervisor" inside it. But every time you launch a supervisor, it erases the one from the previous launch, so the previous supervisor is stopped. The only solution is to modify your storm.yaml and set a new storm.local.dir. In my case, because I am on an NFS cluster, I used *storm.local.dir: "/tmp/storm-local"*: /tmp is not in your home directory, so it is not shared between your nodes, and every node launching a supervisor gets its own supervisor directory. I sincerely hope this is clear and that it can help someone.

Benjamin.

2014-09-02 19:53 GMT+02:00 Benjamin SOULAS <benjamin.soula...@gmail.com>:

> Hi Harsha,
>
> You're right, I didn't export STORM_HOME ...
>
> I will do it, maybe this is the problem.
>
> Thanks
>
>
> 2014-09-02 18:08 GMT+02:00 Harsha <st...@harsha.io>:
>
>> Hi Benjamin,
>> Correct me if I missed it, but in your config I don't see
>> storm.local.dir defined. If it's not defined in the config, Storm will
>> create one in the storm installation dir, which seems to be
>> /home/bsoulas/incubator-storm-master/storm-dist/binary/target/apache-storm-0.9.3-ben/apache-storm-0.9.3-ben/
>> Are you running the supervisor and nimbus as user "bsoulas"? When you
>> run the "storm nimbus" or "storm supervisor" command, which storm
>> command is it pointing to? Did you export
>> STORM_HOME=/home/bsoulas/incubator-storm-master/storm-dist/binary/target/apache-storm-0.9.3-ben
>> and also add it to PATH? I am checking to see if you had a previous
>> installation of Storm and are invoking the storm command from that
>> previous installation.
>> Can you also check the zookeeper logs.
>> -Harsha
>>
>> On Tue, Sep 2, 2014, at 03:39 AM, Benjamin SOULAS wrote:
>>
>> Hi everyone,
>>
>> I followed your instructions for installing a zookeeper server: I
>> downloaded it from the website, extracted the tar file somewhere on a
>> machine of my cluster, and made these modifications in my zoo.cfg:
>>
>> # The number of milliseconds of each tick
>> tickTime=2000
>> # The number of ticks that the initial
>> # synchronization phase can take
>> initLimit=10
>> # The number of ticks that can pass between
>> # sending a request and getting an acknowledgement
>> syncLimit=5
>> # the directory where the snapshot is stored.
>> # do not use /tmp for storage, /tmp here is just
>> # example sakes.
>> dataDir=/home/bsoulas/zookeeper/zookeeper-3.4.6/data/
>> # the port at which the clients will connect
>> clientPort=2181
>> # the maximum number of client connections.
>> # increase this if you need to handle more clients
>> #maxClientCnxns=60
>> #
>> # Be sure to read the maintenance section of the
>> # administrator guide before turning on autopurge.
>> >> # >> >> # >> http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance >> >> # >> >> # The number of snapshots to retain in dataDir >> >> #autopurge.snapRetainCount=3 >> >> # Purge task interval in hours >> >> # Set to "0" to disable auto purge feature >> >> #autopurge.purgeInterval=1 >> >> >> In the log4j.properties, i uncommented the line for the log file : >> >> >> # Example with rolling log file >> >> log4j.rootLogger=DEBUG, CONSOLE, ROLLINGFILE >> >> >> Then i went to my storm.yaml (located here in my case, because i took the >> source version) : >> >> >> >> /home/bsoulas/incubator-storm-master/storm-dist/binary/target/apache-storm-0.9.3-ben/apache-storm-0.9.3-ben/conf >> >> >> This file contain this configuration : >> >> >> ########### These MUST be filled in for a storm configuration >> >> storm.zookeeper.servers: >> >> - "paradent-4" >> >> # - "paradent-47" >> >> # - "paradent-48" >> >> >> # >> >> nimbus.host: "paradent-4" >> >> # >> >> # >> >> # ##### These may optionally be filled in: >> >> # >> >> ## List of custom serializations >> >> # topology.kryo.register: >> >> # - org.mycompany.MyType >> >> # - org.mycompany.MyType2: org.mycompany.MyType2Serializer >> >> # >> >> ## List of custom kryo decorators >> >> # topology.kryo.decorators: >> >> # - org.mycompany.MyDecorator >> >> # >> >> ## Locations of the drpc servers >> >> # drpc.servers: >> >> # - "server1" >> >> # - "server2" >> >> >> ## Metrics Consumers >> >> # topology.metrics.consumer.register: >> >> # - class: "backtype.storm.metric.LoggingMetricsConsumer" >> >> # parallelism.hint: 1 >> >> # - class: "org.mycompany.MyMetricsConsumer" >> >> # parallelism.hint: 1 >> >> # argument: >> >> # - endpoint: "metrics-collector.mycompany.org" >> >> dev.zookeeper.path: "paradent-4.rennes.grid5000.fr: >> ~/home/bsoulas/zookeeper/zookeeper-3.4.6/" >> >> storm.zookeeper.port: 2181 >> >> To launch storm on the cluster, i launch it thanks to *storm nimbus *(on >> a machine named paradent-4), 
>> then my zookeeper server with *sh zkServer.sh start* (on paradent-4
>> again), which creates a *zookeeper_server.pid* file where the pid of
>> the zookeeper is written (I know, it's obvious ... >_< ).
>>
>> After that I launch *storm ui* to get a visual of my storm app (on
>> paradent-4). Until now, everything works fine. Now, logically, I
>> launch my supervisor on a different machine (here *paradent-39*)
>> with *storm supervisor*; it is launched, but once again, 3 or 4
>> seconds later it's down.
>>
>> So I looked at the supervisor.log, located at:
>>
>> /home/bsoulas/incubator-storm-master/storm-dist/binary/target/apache-storm-0.9.3-ben/apache-storm-0.9.3-ben/logs
>>
>> And here a tricky error appears:
>>
>> 2014-09-02 09:31:37 o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
>> 2014-09-02 09:31:37 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=paradent-4:2181 sessionTimeout=20000 watcher=org.apache.curator.ConnectionState@220df4c8
>> 2014-09-02 09:31:37 o.a.z.ClientCnxn [INFO] Opening socket connection to server paradent-4.rennes.grid5000.fr/172.16.97.4:2181. Will not attempt to authenticate using SASL (unknown error)
>> 2014-09-02 09:31:37 o.a.z.ClientCnxn [INFO] Socket connection established to paradent-4.rennes.grid5000.fr/172.16.97.4:2181, initiating session
>> 2014-09-02 09:31:37 o.a.z.ClientCnxn [INFO] Session establishment complete on server paradent-4.rennes.grid5000.fr/172.16.97.4:2181, sessionid = 0x14835a48ca90004, negotiated timeout = 20000
>> 2014-09-02 09:31:37 o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
>> 2014-09-02 09:31:37 o.a.c.f.s.ConnectionStateManager [WARN] There are no ConnectionStateListeners registered.
>> 2014-09-02 09:31:37 b.s.zookeeper [INFO] Zookeeper state update: :connected:none
>> 2014-09-02 09:31:38 o.a.z.ZooKeeper [INFO] Session: 0x14835a48ca90004 closed
>> 2014-09-02 09:31:38 o.a.z.ClientCnxn [INFO] EventThread shut down
>> 2014-09-02 09:31:38 o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
>> 2014-09-02 09:31:38 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=paradent-4:2181/storm sessionTimeout=20000 watcher=org.apache.curator.ConnectionState@c6d625b
>> 2014-09-02 09:31:38 o.a.z.ClientCnxn [INFO] Opening socket connection to server paradent-4.rennes.grid5000.fr/172.16.97.4:2181. Will not attempt to authenticate using SASL (unknown error)
>> 2014-09-02 09:31:38 o.a.z.ClientCnxn [INFO] Socket connection established to paradent-4.rennes.grid5000.fr/172.16.97.4:2181, initiating session
>> 2014-09-02 09:31:38 o.a.z.ClientCnxn [INFO] Session establishment complete on server paradent-4.rennes.grid5000.fr/172.16.97.4:2181, sessionid = 0x14835a48ca90005, negotiated timeout = 20000
>> 2014-09-02 09:31:38 o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
>> 2014-09-02 09:31:38 o.a.c.f.s.ConnectionStateManager [WARN] There are no ConnectionStateListeners registered.
>> 2014-09-02 09:31:38 b.s.d.supervisor [INFO] Starting supervisor with id 280caffa-d6c5-4fd4-8282-7d8c1dec7e66 at host paradent-39.rennes.grid5000.fr
>> 2014-09-02 09:31:39 b.s.event [ERROR] Error when processing event
>> java.io.FileNotFoundException: File '/home/bsoulas/storm-local/workers/fc350518-ded6-48f4-abf9-da73cbaf7c5c/heartbeats/1409146760275' does not exist
>>     at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
>>     at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
>>     at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) ~[storm-core-0.9.3-ben.jar:0.9.3-ben]
>>     at backtype.storm.utils.LocalState.get(LocalState.java:56) ~[storm-core-0.9.3-ben.jar:0.9.3-ben]
>>     at backtype.storm.daemon.supervisor$read_worker_heartbeat.invoke(supervisor.clj:77) ~[storm-core-0.9.3-ben.jar:0.9.3-ben]
>>     at backtype.storm.daemon.supervisor$read_worker_heartbeats$iter__6381__6385$fn__6386.invoke(supervisor.clj:90) ~[storm-core-0.9.3-ben.jar:0.9.3-ben]
>>     at clojure.lang.LazySeq.sval(LazySeq.java:42) ~[clojure-1.5.1.jar:na]
>>     at clojure.lang.LazySeq.seq(LazySeq.java:60) ~[clojure-1.5.1.jar:na]
>>     at clojure.lang.Cons.next(Cons.java:39) ~[clojure-1.5.1.jar:na]
>>     at clojure.lang.LazySeq.next(LazySeq.java:92) ~[clojure-1.5.1.jar:na]
>>     at clojure.lang.RT.next(RT.java:598) ~[clojure-1.5.1.jar:na]
>>     at clojure.core$next.invoke(core.clj:64) ~[clojure-1.5.1.jar:na]
>>     at clojure.core$dorun.invoke(core.clj:2781) ~[clojure-1.5.1.jar:na]
>>     at clojure.core$doall.invoke(core.clj:2796) ~[clojure-1.5.1.jar:na]
>>     at backtype.storm.daemon.supervisor$read_worker_heartbeats.invoke(supervisor.clj:89) ~[storm-core-0.9.3-ben.jar:0.9.3-ben]
>>     at backtype.storm.daemon.supervisor$read_allocated_workers.invoke(supervisor.clj:106) ~[storm-core-0.9.3-ben.jar:0.9.3-ben]
>>     at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:209) ~[storm-core-0.9.3-ben.jar:0.9.3-ben]
>>     at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
>>     at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
>>     at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
>>     at clojure.core$partial$fn__4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
>>     at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
>>     at backtype.storm.event$event_manager$fn__4687.invoke(event.clj:39) ~[storm-core-0.9.3-ben.jar:0.9.3-ben]
>>     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
>> 2014-09-02 09:31:39 b.s.util [INFO] Halting process: ("Error when processing an event")
>>
>> I understood that a file is missing; my question is "why?????". If
>> I check the permissions with ls -l at this path:
>>
>> /home/bsoulas/storm-local/workers/fc350518-ded6-48f4-abf9-da73cbaf7c5c/
>>
>> I have this:
>>
>> drwxr-xr-x 2 bsoulas users 4096 Aug 27 15:39 heartbeats
>>
>> So for me this is not the problem. Can someone help me? I am really
>> stuck here :S
>>
>> I sincerely hope to be clear and precise enough ...
>>
>> Kind regards.
>>
>>
>> 2014-08-29 16:47 GMT+02:00 Harsha <st...@harsha.io>:
>>
>> Hi Benjamin,
>> A Storm cluster needs a zookeeper quorum to function.
>> ExclamationTopology accepts command line params to deploy on a storm
>> cluster. If you don't pass any arguments it will use LocalCluster (a
>> simulated local cluster) to deploy.
>> I recommend you go through
>> http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html
>> for setting up zookeeper. Here is an excellent write-up on storm
>> cluster setup along with zookeeper:
>> http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/.
>> Hope that helps.
>> -Harsha
>>
>> On Fri, Aug 29, 2014, at 05:34 AM, Benjamin SOULAS wrote:
>>
>> Hello everyone, I have a problem getting storm running on a cluster
>> (Grid 5000, if anyone knows it). I took incubator-storm-master from
>> the github branch with the sources and succeeded in creating my own
>> release (no code modification, just fixes for maven errors that were
>> disturbing ...).
>>
>> It works fine on my own laptop in local mode; I modified the
>> ExclamationTopology by adding 40 more bolts. I also modified this
>> topology to allow 50 workers in the configuration.
>>
>> Now on a cluster, when I try to do the same thing, supervisors go
>> down just 3s after launch. Nimbus is ok, dev-zookeeper too, storm ui
>> too.
>>
>> I read somewhere on the apache website that you need to run a real
>> zookeeper (not the one bundled with storm).
>>
>> Please, does someone know a good tutorial explaining how to run a
>> zookeeper server on a cluster for storm?
>>
>> I hope I am clear ...
>>
>> Kind regards.
>>
>> Benjamin SOULAS
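As a postscript to the fix at the top of this thread: a minimal sketch of the storm.yaml change, plus a check that the chosen directory really is node-local. The STORM_YAML variable, the file paths, and the use of GNU stat are illustrative assumptions, not something from the thread itself.

```shell
# Sketch of the fix: override the default storm.local.dir ("storm-local",
# which ends up under the NFS-shared home) with a path on a node-local
# filesystem such as /tmp. STORM_YAML and the paths are assumptions.
STORM_YAML="${STORM_YAML:-/tmp/storm.yaml.demo}"
LOCAL_DIR="/tmp/storm-local"
mkdir -p "$LOCAL_DIR"
touch "$STORM_YAML"

# Append the override only if storm.local.dir is not already set.
grep -q '^storm.local.dir' "$STORM_YAML" || \
  printf 'storm.local.dir: "%s"\n' "$LOCAL_DIR" >> "$STORM_YAML"

# Sanity check: report the filesystem type backing the chosen directory.
# GNU `stat -f -c %T` prints e.g. "nfs" for NFS mounts, "tmpfs" or
# "ext2/ext3" for local ones.
FSTYPE=$(stat -f -c %T "$LOCAL_DIR")
echo "$LOCAL_DIR is backed by: $FSTYPE"
case "$FSTYPE" in
  nfs*) echo "WARNING: shared filesystem; supervisors would still collide" ;;
  *)    echo "looks node-local" ;;
esac
```

Run this on each supervisor node; if the reported type starts with "nfs", pick another path, since every node would again see the same supervisor directory.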