Glad that you reproduced and fixed the problem. Happy to help!

- Jungtaek Lim (HeartSaVioR)

On Sat, Aug 26, 2017 at 11:13 PM, Alexandre Vermeerbergen <avermeerber...@gmail.com>
wrote:

> Hello Jungtaek,
>
> Your answers were very useful: I was able to reproduce the issue by
> simply deleting the storm.local.dir contents, and I found traces showing that
> the machine that suffered this issue had indeed lost this directory. It
> confirms your diagnosis, thank you very much!
>
> I have filed a JIRA: https://issues.apache.org/jira/browse/STORM-2707
> hoping that this will help make Storm even more robust once it is fixed.
>
> In the meantime, I'm changing our storm.local.dir to a place outside of
> /tmp in order to reduce the risk of this directory being deleted.
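>
> For reference, that's a one-line change in storm.yaml; the path below is
> just an example of a non-temporary location, not necessarily what we'll use:
>
>     storm.local.dir: "/var/storm"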
>
> Best regards,
> Alexandre Vermeerbergen
>
> 2017-08-25 9:44 GMT+02:00 Jungtaek Lim <kabh...@gmail.com>:
>
> > I'm not sure. Topology code can't be restored, so my best bet would be
> > detecting it (periodically, or reacting to the failure) and giving up
> > leadership. If my memory is right, the leader Nimbus doesn't pull blobs from
> > followers, so if it doesn't have any blobs and needs to sync, it just needs
> > to become a follower and let the other Nimbuses take a chance to become
> > leader.
> >
> > That would not help a non-HA Nimbus cluster anyway. What's next? Maybe we
> > could remove the related topology (from ZK), but it feels a bit dangerous
> > to me to do that automatically, so I'd rather lean on a CLI tool (maybe) so
> > that operators can do it manually.
> >
> > Please file an issue for this if you would like a solution.
> >
> > Best regards,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Fri, Aug 25, 2017 at 6:29 AM, Alexandre Vermeerbergen <
> > avermeerber...@gmail.com> wrote:
> >
> > > Hello Jungtaek,
> > >
> > > I confirm that we currently do not have multiple Nimbus nodes.
> > >
> > > I want to clarify that the Nimbus process never crashed: it keeps
> > > printing the following error in its log:
> > >
> > > 2017-08-06 03:44:01.777 o.a.s.t.ProcessFunction pool-14-thread-1 [ERROR] Internal error processing getClusterInfo
> > > org.apache.storm.generated.KeyNotFoundException: null
> > > ... (rest of the stack trace)
> > >
> > > However I forgot to mention that:
> > >
> > > * During this problem our topologies are unaffected.
> > > * We noticed the problem because we have a self-healing check on the
> > > Nimbus-UI process, which consists of calling its web services to get a few
> > > stats on running topologies. When Nimbus-UI is unresponsive, we kill it
> > > and restart it automatically (a sketch of this check follows the log
> > > excerpt below).
> > > * The period of our Nimbus-UI self-healing cron is 1 minute.
> > > * When the getClusterInfo stack trace occurs in the Nimbus log, Nimbus UI
> > > is unable to restart.
> > > * Here's the kind of exceptions we see in ui.log (Nimbus UI's trace):
> > >
> > > 2017-08-24 21:24:21.777 o.a.s.u.core main [INFO] Starting ui server for storm version '1.1.0'
> > > 2017-08-24 21:24:21.788 o.a.s.d.m.MetricsUtils main [INFO] Using statistics reporter plugin:org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter
> > > 2017-08-24 21:24:21.790 o.a.s.d.m.r.JmxPreparableReporter main [INFO] Preparing...
> > > 2017-08-24 21:24:21.801 o.a.s.d.common main [INFO] Started statistics report plugin...
> > > 2017-08-24 21:24:21.878 o.a.s.s.o.e.j.s.Server main [INFO] jetty-7.x.y-SNAPSHOT
> > > 2017-08-24 21:24:21.929 o.a.s.s.o.e.j.s.h.ContextHandler main [INFO] started o.a.s.s.o.e.j.s.ServletContextHandler{/,null}
> > > 2017-08-24 21:24:21.965 o.a.s.s.o.e.j.s.AbstractConnector main [INFO] Started SelectChannelConnector@0.0.0.0:8070
> > > 2017-08-24 21:25:02.446 o.a.s.u.NimbusClient qtp2142536057-19 [WARN] Ignoring exception while trying to get leader nimbus info from ec2-52-48-59-151.eu-west-1.compute.amazonaws.com. will retry with a different seed host.
> > > org.apache.storm.thrift.transport.TTransportException: null
> > >         at org.apache.storm.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.generated.Nimbus$Client.recv_getLeader(Nimbus.java:1193) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.generated.Nimbus$Client.getLeader(Nimbus.java:1181) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:84) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.ui.core$cluster_summary.invoke(core.clj:355) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.ui.core$fn__9556.invoke(core.clj:1113) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.shade.compojure.core$make_route$fn__5976.invoke(core.clj:100) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.shade.compojure.core$if_route$fn__5964.invoke(core.clj:46) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.shade.compojure.core$if_method$fn__5957.invoke(core.clj:31) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.shade.compojure.core$routing$fn__5982.invoke(core.clj:113) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at clojure.core$some.invoke(core.clj:2570) ~[clojure-1.7.0.jar:?]
> > >         at org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:113) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at clojure.lang.RestFn.applyTo(RestFn.java:139) ~[clojure-1.7.0.jar:?]
> > >         at clojure.core$apply.invoke(core.clj:632) ~[clojure-1.7.0.jar:?]
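> > >
> > > For reference, here is a minimal sketch of the self-healing check
> > > described above, in Java; it assumes the UI listens on port 8070 as in the
> > > log, and the class name and timeout values are made up for illustration:
> > >
> > > import java.net.HttpURLConnection;
> > > import java.net.URL;
> > >
> > > public class NimbusUiProbe {
> > >     public static void main(String[] args) throws Exception {
> > >         // Ask the Storm UI REST API for a cluster summary.
> > >         URL url = new URL("http://localhost:8070/api/v1/cluster/summary");
> > >         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
> > >         conn.setConnectTimeout(5000); // fail fast when the UI hangs
> > >         conn.setReadTimeout(5000);
> > >         // A non-200 status (or a timeout exception) makes our cron job
> > >         // kill and restart the Nimbus UI process.
> > >         System.exit(conn.getResponseCode() == 200 ? 0 : 1);
> > >     }
> > > }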
> > >
> > > So our problem is that Nimbus stays in an inconsistent state forever
> > > without crashing. Even if it crashed, that would not help, since we
> > > restart it automatically (we also have self-healing on Nimbus); in such a
> > > case we would still need to manually clean up the contents of the
> > > ZooKeepers to be able to restart Nimbus in a clean state.
> > >
> > > My question is the following: is there a way to make Nimbus more
> > > resilient to such issues?
> > >
> > > Best regards,
> > > Alexandre
> > >
> > > 2017-08-24 23:06 GMT+02:00 Jungtaek Lim <kabh...@gmail.com>:
> > >
> > > > Blob files (meta, data) are in the storm local directory. ZK only has
> > > > the list of blob keys and which live Nimbuses have each file. So if you
> > > > lose the storm local directory, you just can't restore the blobs, unless
> > > > other Nimbuses have them so the current Nimbus can pull them.
> > > > (I guess you have only one Nimbus, since the replication factor was 1 in
> > > > the nimbus log.)
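> > > >
> > > > Roughly, the split looks like this with default settings (the exact
> > > > paths may differ in your setup):
> > > >
> > > >   <storm.local.dir>/blobs/...   <- blob data + meta files, per Nimbus host
> > > >   /storm/blobstore/<key>/...    <- ZK znodes listing which Nimbuses have a key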
> > > >
> > > > If a FileNotFoundException occurs while looking up the meta file, we
> > > > just throw KeyNotFoundException, because it is just the same as the key
> > > > not existing.
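> > > >
> > > > A tiny self-contained model of that behavior; this is not the literal
> > > > LocalFsBlobStore source, and the file naming is made up, it just shows
> > > > the idea:
> > > >
> > > > import java.nio.file.*;
> > > >
> > > > public class BlobMetaLookup {
> > > >     static class KeyNotFoundException extends Exception {
> > > >         KeyNotFoundException(String key) { super(key); }
> > > >     }
> > > >
> > > >     // A meta file lost from storm.local.dir is reported exactly like
> > > >     // a key that never existed.
> > > >     static byte[] getStoredBlobMeta(Path blobsDir, String key)
> > > >             throws KeyNotFoundException {
> > > >         try {
> > > >             return Files.readAllBytes(blobsDir.resolve(key + ".meta"));
> > > >         } catch (java.io.IOException e) { // includes NoSuchFileException
> > > >             throw new KeyNotFoundException(key);
> > > >         }
> > > >     }
> > > >
> > > >     public static void main(String[] args) throws Exception {
> > > >         // With an empty local dir this throws KeyNotFoundException.
> > > >         getStoredBlobMeta(Paths.get("/tmp/storm/blobs"), "some-blob-key");
> > > >     }
> > > > }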
> > > >
> > > > Anyway you have to clear everything up, because there's an
> > > > inconsistency between ZK and the local state, and there's no way to
> > > > recover the missing blob files, so no way to make it consistent.
> > > >
> > > > Btw, in this case my expectation is that after crashing and restarting,
> > > > Nimbus can't become leader and just waits for the missing blobs. If you
> > > > saw Nimbus crash again while relaunching, it could be a bug.
> > > >
> > > > - Jungtaek Lim (HeartSaVioR)
> > > > On Fri, 25 Aug 2017 at 3:35 AM Alexandre Vermeerbergen <
> > > > avermeerber...@gmail.com> wrote:
> > > >
> > > > > Hello Jungtaek,
> > > > >
> > > > > I can do what you suggest (i.e. moving the storm local dir to a place
> > > > > which isn't in /tmp), but since the issue occurs rarely (once per
> > > > > month), I doubt I'll be able to give feedback soon.
> > > > >
> > > > > What is puzzling to me is that in order to recover from such an
> > > > > issue, we have to stop everything, then clean the Zookeepers, then
> > > > > restart all Storm processes, and finally restart our topologies. We
> > > > > don't clean the /tmp directory.
> > > > >
> > > > > Are you sure there's no relationship between the Zookeeper contents
> > > > > and whatever causes the following message to happen?
> > > > > "Internal error processing getClusterInfo"
> > > > >
> > > > > Best regards,
> > > > > Alexandre
> > > > >
> > > > > 2017-08-24 13:41 GMT+02:00 Jungtaek Lim <kabh...@gmail.com>:
> > > > >
> > > > > > Sorry I meant reproduce, not replicate. :)
> > > > > >
> > > > > > On Thu, Aug 24, 2017 at 8:34 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
> > > > > >
> > > > > > > Alexandre,
> > > > > > >
> > > > > > > I found that your storm local dir is set to "/tmp/storm", parts
> > > > > > > or all of which could be removed at any time.
> > > > > > > Could you move the path to a non-temporary place and try to
> > > > > > > replicate?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jungtaek Lim (HeartSaVioR)
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 24, 2017 at 6:40 PM, Alexandre Vermeerbergen <
> > > > > > > avermeerber...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hello Jungtaek,
> > > > > > >>
> > > > > > >> Thank you very much for your answer.
> > > > > > >>
> > > > > > >> Please find attached the full Nimbus log (gzipped) related to
> > > > > > >> this issue.
> > > > > > >>
> > > > > > >> Please note that the last ERROR repeats forever until we
> > > > > > >> "repair" Storm.
> > > > > > >>
> > > > > > >> From the logs, it could be that the issue began close to when a
> > > > > > >> topology was restarted (killed, then started).
> > > > > > >>
> > > > > > >> Maybe this caused a corruption in Zookeeper. Is there anything I
> > > > > > >> can collect from our Zookeeper nodes/logs to help the analysis?
> > > > > > >>
> > > > > > >> Best regards,
> > > > > > >> Alexandre
> > > > > > >>
> > > > > > >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <kabh...@gmail.com>:
> > > > > > >>
> > > > > > >>> Hi Alexandre, I missed this mail since I was on vacation.
> > > > > > >>>
> > > > > > >>> I followed the stack trace, but it's hard to analyze without
> > > > > > >>> context. Do you mind providing the full nimbus log?
> > > > > > >>>
> > > > > > >>> Thanks,
> > > > > > >>> Jungtaek Lim (HeartSaVioR)
> > > > > > >>>
> > > > > > >>> On Wed, Aug 16, 2017 at 4:12 AM, Alexandre Vermeerbergen <
> > > > > > >>> avermeerber...@gmail.com> wrote:
> > > > > > >>>
> > > > > > >>> > Hello,
> > > > > > >>> >
> > > > > > >>> > Tomorrow I will have to restart the cluster on which I have
> > > > > > >>> > this issue with Storm 1.1.0.
> > > > > > >>> > Is anybody interested in my running some commands to get more
> > > > > > >>> > logs before I repair this cluster?
> > > > > > >>> >
> > > > > > >>> > Best regards,
> > > > > > >>> > Alexandre Vermeerbergen
> > > > > > >>> >
> > > > > > >>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <
> > > > > > >>> > avermeerber...@gmail.com>:
> > > > > > >>> >
> > > > > > >>> > > Hello,
> > > > > > >>> > >
> > > > > > >>> > > I think it might be of interest for you Storm developers to
> > > > > > >>> > > learn that I currently have a case of an issue with Storm
> > > > > > >>> > > 1.1.0 which was supposed to be resolved in this release
> > > > > > >>> > > according to https://issues.apache.org/jira/browse/STORM-1977 ;
> > > > > > >>> > > and I can look for any further information you'd need to
> > > > > > >>> > > diagnose why this issue can still happen.
> > > > > > >>> > >
> > > > > > >>> > > Indeed, I have a Storm UI process which can't get any
> > > > > > >>> > > information on its Storm cluster, and I see many occurrences
> > > > > > >>> > > of the following exception in nimbus.log:
> > > > > > >>> > >
> > > > > > >>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
> > > > > > >>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
> > > > > > >>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> > > > > > >>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
> > > > > > >>> > > org.apache.storm.generated.KeyNotFoundException: null
> > > > > > >>> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
> > > > > > >>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
> > > > > > >>> > >         at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
> > > > > > >>> > >         at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
> > > > > > >>> > >         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
> > > > > > >>> > >         at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
> > > > > > >>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
> > > > > > >>> > >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
> > > > > > >>> > >         at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
> > > > > > >>> > >         at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
> > > > > > >>> > >         at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
> > > > > > >>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
> > > > > > >>> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
> > > > > > >>> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
> > > > > > >>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> > > What is amazing is that I got this same issue on two Storm
> > > > > > >>> > > clusters running on different VMs; they just share the same
> > > > > > >>> > > data in their Kafka Broker cluster (one cluster is the
> > > > > > >>> > > production one, which was quickly fixed, and the other one is
> > > > > > >>> > > the "backup" cluster to be used if the production one fails,
> > > > > > >>> > > for a quick return to production).
> > > > > > >>> > >
> > > > > > >>> > > I left one of these clusters with this behavior because I
> > > > > > >>> > > felt it could be interesting for Storm developers to have
> > > > > > >>> > > more information on this issue, if needed to properly
> > > > > > >>> > > diagnose it.
> > > > > > >>> > >
> > > > > > >>> > > I can keep this cluster as is for 2 days at most.
> > > > > > >>> > >
> > > > > > >>> > > Is there anything useful I can collect on it to help Storm
> > > > > > >>> > > developers understand the cause (and hopefully use it to make
> > > > > > >>> > > Storm more robust)?
> > > > > > >>> > >
> > > > > > >>> > > A few details:
> > > > > > >>> > >
> > > > > > >>> > > * Storm 1.1.0 cluster with Nimbus & Nimbus UI running on one
> > > > > > >>> > > VM, plus 4 supervisor VMs and 3 Zookeeper VMs
> > > > > > >>> > >
> > > > > > >>> > > * Running with Java Server JRE 1.8.0_121
> > > > > > >>> > > * Running on AWS EC2 instances
> > > > > > >>> > >
> > > > > > >>> > > * We run about 10 topologies, with automatic self-healing on
> > > > > > >>> > > them (if they stop consuming Kafka items, our self-healer
> > > > > > >>> > > calls "Kill topology", and then eventually restarts the
> > > > > > >>> > > topology)
> > > > > > >>> > >
> > > > > > >>> > > * We have self-healing on Nimbus UI based on calling its
> > > > > > >>> > > REST services: if it's not responding fast enough, we restart
> > > > > > >>> > > Nimbus UI
> > > > > > >>> > > * We figured out the issue because Nimbus UI was being
> > > > > > >>> > > restarted every 2 minutes
> > > > > > >>> > >
> > > > > > >>> > > * To fix our production server, which had the same symptom,
> > > > > > >>> > > we had to stop all Storm processes, then stop all Zookeepers,
> > > > > > >>> > > then remove all data in the Zookeeper "snapshot files", then
> > > > > > >>> > > restart all Zookeepers, then restart all Storm processes,
> > > > > > >>> > > then re-submit all our topologies
> > > > > > >>> > >
> > > > > > >>> > > Please be as clear as possible about which commands we
> > > > > > >>> > > should run to give you more details, if needed.
> > > > > > >>> > >
> > > > > > >>> > > Best regards,
> > > > > > >>> > > Alexandre Vermeerbergen