Hello Jungtaek,

Thank you very much for your answer.

Please find attached the full Nimbus log (gzipped) related to this issue.

Please note that the last ERROR repeats forever until we "repair" Storm.

From the logs, it could be that the issue began close to when a topology was
restarted (killed, then started).

Maybe this caused a corruption in ZooKeeper. Is there anything I can collect
from our ZooKeeper nodes/logs to help with the analysis?
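
In case it helps, below is a minimal sketch of how I could dump the
Storm-related znodes for you, assuming our default storm.zookeeper.root of
/storm; the child paths I list (storms, assignments, blobstore) are my guess
at what is relevant, so please tell me if other paths matter:

    import org.apache.zookeeper.ZooKeeper;

    // Minimal sketch: list the children of the Storm znodes so we can compare
    // what ZooKeeper references against what the local blob store actually holds.
    // Assumptions: default storm.zookeeper.root (/storm) and one reachable ZK node.
    public class DumpStormZnodes {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk-node-1:2181", 15000, event -> { });
            try {
                for (String path : new String[] {"/storm", "/storm/storms",
                                                 "/storm/assignments", "/storm/blobstore"}) {
                    System.out.println(path + " -> " + zk.getChildren(path, false));
                }
            } finally {
                zk.close();
            }
        }
    }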

Best regards,
Alexandre




2017-08-24 9:29 GMT+02:00 Jungtaek Lim <kabh...@gmail.com>:

> Hi Alexandre, I missed this mail since I was on vacation.
>
> I followed the stack trace, but it is hard to analyze without context. Do you
> mind providing the full nimbus log?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Aug 16, 2017 at 4:12 AM (Wed), Alexandre Vermeerbergen <
> avermeerber...@gmail.com> wrote:
>
> > Hello,
> >
> > Tomorrow I will have to restart the cluster on which I have this issue with
> > Storm 1.1.0.
> > Is anybody interested in my running some commands to get more logs before I
> > repair this cluster?
> >
> > Best regards,
> > Alexandre Vermeerbergen
> >
> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <avermeerber...@gmail.com>:
> >
> > > Hello,
> > >
> > > I think it might be of interest to you Storm developers to learn that I
> > > currently have an issue with Storm 1.1.0 which was supposed to be resolved
> > > in this release according to
> > > https://issues.apache.org/jira/browse/STORM-1977 ; I can look for any
> > > further information you would need to diagnose why this issue can still
> > > happen.
> > >
> > > Indeed, I have a Storm UI process which can't get any information on its
> > > Storm cluster, and I see many occurrences of the following exception in
> > > nimbus.log:
> > >
> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
> > > org.apache.storm.generated.KeyNotFoundException: null
> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
> > >         at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
> > >         at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
> > >         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
> > >         at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
> > >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
> > >         at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
> > >         at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
> > >         at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
> > >         at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
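> > >
> > > For context, the Storm UI symptom corresponds to this getClusterInfo call
> > > failing on every attempt; a minimal probe from a Java client looks roughly
> > > like the sketch below (assuming storm.yaml on the classpath points at this
> > > Nimbus, so that Utils.readStormConfig() picks up the right host and port):
> > >
> > >     import java.util.Map;
> > >     import org.apache.storm.generated.ClusterSummary;
> > >     import org.apache.storm.utils.NimbusClient;
> > >     import org.apache.storm.utils.Utils;
> > >
> > >     // Rough sketch of the getClusterInfo call that keeps failing above.
> > >     // Assumes the nimbus host/port come from storm.yaml via readStormConfig().
> > >     public class ClusterInfoProbe {
> > >         public static void main(String[] args) throws Exception {
> > >             Map conf = Utils.readStormConfig();
> > >             NimbusClient client = NimbusClient.getConfiguredClient(conf);
> > >             try {
> > >                 ClusterSummary summary = client.getClient().getClusterInfo();
> > >                 System.out.println("topologies: " + summary.get_topologies_size());
> > >             } finally {
> > >                 client.close();
> > >             }
> > >         }
> > >     }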
> > >
> > >
> > > What is surprising is that I got this same issue on two Storm clusters
> > > running on different VMs; they just share the same data in their Kafka
> > > Broker cluster (one cluster is the production one, which was quickly fixed,
> > > and the other one is the "backup" cluster to be used for a quick return to
> > > production if the production one fails).
> > >
> > > I left one of these clusters with this behavior because I felt that it
> > > could be interesting for Storm developers to have more information on this
> > > issue, if needed to properly diagnose it.
> > >
> > > I can keep this cluster as is for max 2 days.
> > >
> > > Is there anything useful which I can collect on it to help Storm
> > > developers understand the cause (and hopefully make Storm more robust)?
> > >
> > > Few details:
> > >
> > > * Storm 1.1.0 cluster with Nimbus & Nimbus UI running on one VM, plus 4
> > > Supervisor VMs and 3 Zookeeper VMs
> > >
> > > * Running with Java Server JRE 1.8.0_121
> > > * Running on AWS EC2 instances
> > >
> > > * We run about 10 topologies, with automatic self-healing on them (if they
> > > stop consuming Kafka items, our self-healer calls "Kill topology" and then
> > > eventually restarts the topology)
> > >
> > > * We have self-healing on Nimbus UI based on calling its REST services; if
> > > it is not responding fast enough, we restart Nimbus UI (see the sketch after
> > > this list)
> > > * We figured out the issue because Nimbus UI was being restarted every 2
> > > minutes
> > >
> > > * To fix our production cluster, which had the same symptom, we had to stop
> > > all Storm processes, then stop all Zookeepers, then remove all data in the
> > > Zookeeper "snapshot files", then restart all Zookeepers, then restart all
> > > Storm processes, then re-submit all our topologies
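> > >
> > > For reference, the Nimbus UI health check our self-healer performs is
> > > roughly the sketch below (the host name and timeout values are examples
> > > from memory, so treat them as approximate; the endpoint is the standard
> > > /api/v1/cluster/summary REST call):
> > >
> > >     import java.net.HttpURLConnection;
> > >     import java.net.URL;
> > >
> > >     // Approximate sketch of the self-healer's Nimbus UI check: hit the
> > >     // cluster summary REST endpoint and treat a slow or non-200 answer as
> > >     // "unhealthy", which triggers a restart of the UI process.
> > >     public class UiHealthCheck {
> > >         public static void main(String[] args) {
> > >             boolean healthy = false;
> > >             try {
> > >                 URL url = new URL("http://nimbus-ui-host:8080/api/v1/cluster/summary");
> > >                 HttpURLConnection conn = (HttpURLConnection) url.openConnection();
> > >                 conn.setConnectTimeout(5000);   // fail fast if the UI does not answer
> > >                 conn.setReadTimeout(10000);     // "not responding fast enough"
> > >                 healthy = (conn.getResponseCode() == 200);
> > >                 conn.disconnect();
> > >             } catch (Exception e) {
> > >                 healthy = false;
> > >             }
> > >             System.out.println(healthy ? "UI OK" : "UI unhealthy -> restart Nimbus UI");
> > >         }
> > >     }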
> > >
> > > Please be as clear as possible about which commands we should run to give
> > > you more details if needed.
> > >
> > > Best regards,
> > > Alexandre Vermeerbergen
> > >
> > >
> >
>

Attachment: nimbus.log.gz
Description: GNU Zip compressed data
