Sorry I meant reproduce, not replicate. :)

On Thu, Aug 24, 2017 at 8:34 PM, Jungtaek Lim <[email protected]> wrote:

> Alexandre,
>
> I found that your storm local dir is set to "/tmp/storm", parts or all of
> which could be removed at any time.
> Could you move the path to a non-temporary place and try to replicate?
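>
> (For example, you could point storm.local.dir in storm.yaml at a persistent
> path such as "/var/storm"; that path is only an illustration. Keeping the
> local dir out of /tmp keeps the locally stored blobs out of reach of cleaners
> such as tmpwatch or systemd-tmpfiles.)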
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Thu, Aug 24, 2017 at 6:40 PM, Alexandre Vermeerbergen <[email protected]> wrote:
>
>> Hello Jungtaek,
>>
>> Thank you very much for your answer.
>>
>> Please find attached the full Nimbus log (gzipped) related to this issue.
>>
>> Please note that the last ERROR repeats forever until we "repair" Storm.
>>
>> From the logs, it could be that the issue began around the time a topology
>> was restarted (killed, then started).
>>
>> Maybe this caused a corruption in Zookeeper. Is there anything I can
>> collect from our Zookeeper nodes/logs to help the analysis?
>>
>> Best regards,
>> Alexandre
>>
>>
>>
>>
>> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <[email protected]>:
>>
>>> Hi Alexandre, I missed this mail since I was on vacation.
>>>
>>> I followed the stack trace, but it is hard to analyze without context.
>>> Would you mind providing the full nimbus log?
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Wed, Aug 16, 2017 at 4:12 AM, Alexandre Vermeerbergen <[email protected]> wrote:
>>>
>>> > Hello,
>>> >
>>> > Tomorrow I will have to restart the cluster on which I have this issue
>>> > with Storm 1.1.0.
>>> > Is anybody interested in my running some commands to get more logs
>>> > before I repair this cluster?
>>> >
>>> > Best regards,
>>> > Alexandre Vermeerbergen
>>> >
>>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <[email protected]>:
>>> >
>>> > > Hello,
>>> > >
>>> > > I think it might be of interest to you Storm developers to learn that
>>> > > I currently have a case of an issue with Storm 1.1.0 that was supposed
>>> > > to be resolved in this release according to
>>> > > https://issues.apache.org/jira/browse/STORM-1977 ; I can look for any
>>> > > further information you would need to diagnose why this issue can
>>> > > still happen.
>>> > >
>>> > > Indeed, I have a Storm UI process which can't get any information
>>> > > about its Storm cluster, and I see the following exception repeated
>>> > > many times in nimbus.log:
>>> > >
>>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
>>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
>>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
>>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
>>> > > org.apache.storm.generated.KeyNotFoundException: null
>>> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
>>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
>>> > >         at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
>>> > >         at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
>>> > >         at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
>>> > >         at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
>>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
>>> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
>>> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
>>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
>>> > >
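>>> > > (If it helps with the diagnosis, the same RPC can be exercised from a
>>> > > small standalone Java client. The following is only a sketch, with a
>>> > > made-up class name, assuming storm-core 1.1.0 and a readable storm.yaml
>>> > > on the classpath; it issues the same getClusterInfo call that Nimbus
>>> > > logs as failing above:)
>>> > >
>>> > > import java.util.Map;
>>> > > import org.apache.storm.generated.ClusterSummary;
>>> > > import org.apache.storm.generated.Nimbus;
>>> > > import org.apache.storm.utils.NimbusClient;
>>> > > import org.apache.storm.utils.Utils;
>>> > >
>>> > > // Hypothetical standalone check: issues the same getClusterInfo RPC that Storm UI makes.
>>> > > public class ClusterInfoCheck {
>>> > >     public static void main(String[] args) throws Exception {
>>> > >         Map conf = Utils.readStormConfig();  // storm.yaml merged with defaults
>>> > >         NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
>>> > >         try {
>>> > >             Nimbus.Client client = nimbus.getClient();
>>> > >             // On the broken cluster this call errors out while Nimbus logs the
>>> > >             // KeyNotFoundException shown above; on a healthy one it prints a summary.
>>> > >             ClusterSummary summary = client.getClusterInfo();
>>> > >             System.out.println(summary);
>>> > >         } finally {
>>> > >             nimbus.close();
>>> > >         }
>>> > >     }
>>> > > }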
>>> > >
>>> > > What is striking is that I got this same issue on two Storm clusters
>>> > > running on different VMs; they just share the same data in their Kafka
>>> > > broker cluster (one cluster is the production one, which was quickly
>>> > > fixed, and the other one is the "backup" cluster to be used for a quick
>>> > > "back to production" if the production one fails).
>>> > >
>>> > > I left one of these clusters in this state because I felt it could be
>>> > > interesting for Storm developers to have more information on this
>>> > > issue, if needed to properly diagnose it.
>>> > >
>>> > > I can keep this cluster as is for at most 2 days.
>>> > >
>>> > > Is there anything useful I can collect on it to help Storm developers
>>> > > understand the cause (and hopefully use that to make Storm more
>>> > > robust)?
>>> > >
>>> > > A few details:
>>> > >
>>> > > * Storm 1.1.0 cluster with Nimbus & Nimbus UI running on one VM, plus 4
>>> > > Supervisor VMs + 3 Zookeeper VMs
>>> > >
>>> > > * Running with Java Server JRE 1.8.0_121
>>> > > * Running on AWS EC2 instances
>>> > >
>>> > > * We run about 10 topologies, with automatic self-healing on them (if
>>> > > they stop consuming Kafka items, our self-healer calls "kill topology"
>>> > > and then eventually restarts the topology); a sketch of such a kill
>>> > > call follows this list
>>> > >
>>> > > * We have self-healing on Nimbus UI based on calling its REST services:
>>> > > if it does not respond fast enough, we restart Nimbus UI (a sketch of
>>> > > such a probe also follows this list)
>>> > > * We noticed the issue because Nimbus UI was being restarted every 2
>>> > > minutes
>>> > >
>>> > > * To fix our production server, which had the same symptom, we had to
>>> > > stop all Storm processes, then stop all Zookeepers, then remove all
>>> > > data in the Zookeeper "snapshot files", then restart all Zookeepers,
>>> > > then restart all Storm processes, then re-submit all our topologies
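>>> > >
>>> > > Roughly, the kill step of our topology self-healer boils down to
>>> > > something like this sketch (the class name and the 30-second wait are
>>> > > illustrative, not our actual code):
>>> > >
>>> > > import org.apache.storm.generated.KillOptions;
>>> > > import org.apache.storm.utils.NimbusClient;
>>> > > import org.apache.storm.utils.Utils;
>>> > >
>>> > > // Hypothetical self-healer action: kill a stuck topology so it can be resubmitted later.
>>> > > public class KillStuckTopology {
>>> > >     public static void main(String[] args) throws Exception {
>>> > >         String topologyName = args[0];  // name of the topology that stopped consuming Kafka
>>> > >         NimbusClient nimbus = NimbusClient.getConfiguredClient(Utils.readStormConfig());
>>> > >         try {
>>> > >             KillOptions opts = new KillOptions();
>>> > >             opts.set_wait_secs(30);  // let in-flight tuples drain before the workers go away
>>> > >             nimbus.getClient().killTopologyWithOpts(topologyName, opts);
>>> > >         } finally {
>>> > >             nimbus.close();
>>> > >         }
>>> > >     }
>>> > > }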
>>> > >
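>>> > > The UI health probe is conceptually the sketch below (the class name,
>>> > > port 8080 and the 5-second threshold are assumptions, not our exact
>>> > > values); a wrapper script restarts the UI when it exits non-zero:
>>> > >
>>> > > import java.net.HttpURLConnection;
>>> > > import java.net.URL;
>>> > >
>>> > > // Hypothetical liveness probe against the UI REST API.
>>> > > public class UiHealthCheck {
>>> > >     public static void main(String[] args) throws Exception {
>>> > >         URL url = new URL("http://localhost:8080/api/v1/cluster/summary");  // default ui.port
>>> > >         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>> > >         conn.setConnectTimeout(5000);  // "fast enough" threshold, in milliseconds
>>> > >         conn.setReadTimeout(5000);
>>> > >         int status = conn.getResponseCode();
>>> > >         System.exit(status == 200 ? 0 : 1);
>>> > >     }
>>> > > }
>>> > >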
>>> > > Please be as clear as possible about which commands we should run to
>>> > > give you more details, if needed.
>>> > >
>>> > > Best regards,
>>> > > Alexandre Vermeerbergen
>>> > >
>>> > >
>>> >
>>>
>>
>>
