Blob files (meta and data) live in the storm local directory. ZK only holds the list of blob keys and which live nimbuses have each blob. So if you lose the storm local directory, the blobs cannot be restored, unless other nimbuses also have them so the current nimbus can pull them back. (I guess you have only one nimbus, since the replication factor was 1 in the nimbus log.)
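To make that split concrete, here is a rough, hypothetical sketch (plain Java, not storm-core code, and it needs the ZooKeeper client on the classpath; the local path /var/storm/blobs, the ZK path /storm/blobstore and the connect string zk1:2181 are assumptions you would adapt to your storm.yaml). It lists the blob keys ZooKeeper knows about and checks whether anything for them is still present under the storm local dir:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Stream;
    import org.apache.zookeeper.ZooKeeper;

    // Rough check: ZooKeeper only knows the blob *keys* (and which nimbuses claim
    // to hold them); the actual meta/data bytes only exist in the storm local dir.
    public class BlobConsistencyCheck {
        public static void main(String[] args) throws Exception {
            Path blobDir = Paths.get("/var/storm/blobs");              // assumed storm.local.dir layout
            ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, e -> { }); // assumed ZK connect string
            try {
                List<String> keys = zk.getChildren("/storm/blobstore", false); // assumed ZK path
                for (String key : keys) {
                    // Exact on-disk naming (buckets, .meta/.data suffixes) is an
                    // implementation detail; any file mentioning the key is enough here.
                    boolean present;
                    try (Stream<Path> files = Files.walk(blobDir)) {
                        present = files.anyMatch(p -> p.getFileName().toString().contains(key));
                    }
                    System.out.printf("%s -> %s%n", key, present ? "present locally" : "MISSING locally");
                }
            } finally {
                zk.close();
            }
        }
    }

Keys that ZK lists but that have no local files are exactly the blobs a lone nimbus cannot recover by itself: with blobstore replication factor 1 there is no other nimbus to pull a MISSING key from, which is why cleaning up is the only way out once the local files are gone.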
If a FileNotFoundException occurs while looking up the meta file, we simply throw KeyNotFoundException, because a missing meta file is treated the same as a key that never existed. Either way you have to clean everything up: ZK and the local directory are now inconsistent, and there is no way to recover the missing blob files, so there is no way to make them consistent again. (A minimal sketch of this meta-file lookup follows the quoted thread at the end of this mail.) By the way, my expectation in this case is that after crashing and restarting, nimbus cannot become leader and just waits for the missing blobs. If you saw Nimbus crash again while relaunching, it could be a bug.

- Jungtaek Lim (HeartSaVioR)

On Fri, 25 Aug 2017 at 3:35 AM Alexandre Vermeerbergen <[email protected]> wrote:

> Hello Jungtaek,
>
> I can do what you suggest (ie moving storm local dir to a place which isn't in /tmp), but since the issue occurs rarely (once per month), I doubt I'll be able to give feedback soon.
>
> What is puzzling to me is that in order to recover from such an issue, we have to stop everything, then clean the Zookeepers, then restart all Storm processes and finally restart our topologies. We don't clean the /tmp directory.
>
> Are you sure there's no relationship between Zookeeper contents and what is causing the following message to happen?
> "Internal error processing getClusterInfo"
> Best regards,
> Alexandre
>
> 2017-08-24 13:41 GMT+02:00 Jungtaek Lim <[email protected]>:
>
> > Sorry I meant reproduce, not replicate. :)
> >
> > On Thu, 24 Aug 2017 at 8:34 PM, Jungtaek Lim <[email protected]> wrote:
> >
> > > Alexandre,
> > >
> > > I found that your storm local dir is placed at "/tmp/storm", parts or all of which could be removed at any time.
> > > Could you move the path to a non-temporary place and try to replicate?
> > >
> > > Thanks,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > > On Thu, 24 Aug 2017 at 6:40 PM, Alexandre Vermeerbergen <[email protected]> wrote:
> > >
> > >> Hello Jungtaek,
> > >>
> > >> Thank you very much for your answer.
> > >>
> > >> Please find attached the full Nimbus log (gzipped) related to this issue.
> > >>
> > >> Please note that the last ERROR repeats forever until we "repair" Storm.
> > >>
> > >> From the logs, it could be that the issue began close to when a topology was restarted (killed, then started).
> > >>
> > >> Maybe this caused a corruption in Zookeeper. Is there anything which I can collect in our Zookeeper nodes/logs to help analysis?
> > >>
> > >> Best regards,
> > >> Alexandre
> > >>
> > >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <[email protected]>:
> > >>
> > >>> Hi Alexandre, I missed this mail since I was on vacation.
> > >>>
> > >>> I followed the stack trace, but it is hard to analyze without context. Do you mind providing the full nimbus log?
> > >>>
> > >>> Thanks,
> > >>> Jungtaek Lim (HeartSaVioR)
> > >>>
> > >>> On Wed, 16 Aug 2017 at 4:12 AM, Alexandre Vermeerbergen <[email protected]> wrote:
> > >>>
> > >>> > Hello,
> > >>> >
> > >>> > Tomorrow I will have to restart the cluster on which I have this issue with Storm 1.1.0.
> > >>> > Is there anybody interested in my running some commands to get more logs before I repair this cluster?
> > >>> >
> > >>> > Best regards,
> > >>> > Alexandre Vermeerbergen
> > >>> >
> > >>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <[email protected]>:
> > >>> >
> > >>> > > Hello,
> > >>> > >
> > >>> > > I think it might be of interest for you Storm developers to learn that I currently have a case of an issue with Storm 1.1.0 which was supposed to be resolved in this release according to https://issues.apache.org/jira/browse/STORM-1977 ; and I can look for any more information you'd need to diagnose why this issue can still happen.
> > >>> > >
> > >>> > > Indeed, I have a Storm UI process which can't get any information on its Storm cluster, and I see many of the following exceptions in nimbus.log:
> > >>> > >
> > >>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
> > >>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
> > >>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> > >>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
> > >>> > > org.apache.storm.generated.KeyNotFoundException: null
> > >>> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
> > >>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
> > >>> > >         at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
> > >>> > >         at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
> > >>> > >         at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
> > >>> > >         at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
> > >>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
> > >>> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
> > >>> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
> > >>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
> > >>> > >
> > >>> > > What is amazing is that I got this same issue on two Storm clusters running on different VMs; they just share the same data in their Kafka Broker cluster (one cluster is the production one, which was quickly fixed, and the other one is the "backup" cluster to be used if the production one fails, for a quick "back to production").
> > >>> > >
> > >>> > > I left one of these clusters with this behavior because I felt that it could be interesting for Storm developers to have more information on this issue, if needed to properly diagnose it.
> > >>> > >
> > >>> > > I can keep this cluster as is for max 2 days.
> > >>> > >
> > >>> > > Is there anything useful which I can collect on it to help Storm developers understand the cause (and hopefully use it to make Storm more robust)?
> > >>> > >
> > >>> > > Few details:
> > >>> > >
> > >>> > > * Storm 1.1.0 cluster with Nimbus & Nimbus UI running on a VM, and 4 Supervisor VMs + 3 Zookeeper VMs
> > >>> > > * Running with Java Server JRE 1.8.0_121
> > >>> > > * Running on AWS EC2 instances
> > >>> > > * We run about 10 topologies, with automatic self-healing on them (if they stop consuming Kafka items, our self-healer calls "Kill topology", and then eventually restarts the topology)
> > >>> > > * We have self-healing on Nimbus UI based on calling its REST services.
> > >>> > > If it's not responding fast enough, we restart Nimbus UI
> > >>> > > * We figured out the issue because Nimbus UI was restarted every 2 minutes
> > >>> > >
> > >>> > > * To fix our production server which had the same symptom, we had to stop all Storm processes, then stop all Zookeepers, then remove all data in Zookeeper "snapshot files", then restart all Zookeeper, then restart all Storm process, then re-submit all our topologies
> > >>> > >
> > >>> > > Please be as clear as possible about which commands we should run to give you more details if needed
> > >>> > >
> > >>> > > Best regards,
> > >>> > > Alexandre Vermeerbergen
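As mentioned above, here is a minimal sketch of the meta-file lookup behaviour. It is not the storm-core implementation: the class, the ".meta" naming and the sample key below are placeholders. It only illustrates the mapping that matters here, a meta file that has vanished from the local dir is surfaced as KeyNotFoundException, which is what getClusterInfo in the quoted stack trace keeps hitting:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.NoSuchFileException;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Placeholder standing in for org.apache.storm.generated.KeyNotFoundException.
    class KeyNotFoundException extends Exception {
        KeyNotFoundException(String key) { super(key); }
    }

    public class BlobMetaLookupSketch {
        private final Path blobRoot;

        BlobMetaLookupSketch(Path blobRoot) { this.blobRoot = blobRoot; }

        // A meta file missing from the local dir is indistinguishable from a key that
        // never existed, so it is reported as "key not found" rather than as a file error.
        byte[] readMeta(String key) throws KeyNotFoundException, IOException {
            Path metaFile = blobRoot.resolve(key + ".meta"); // illustrative naming
            try {
                return Files.readAllBytes(metaFile);
            } catch (NoSuchFileException missing) {
                throw new KeyNotFoundException(key);
            }
        }

        public static void main(String[] args) throws Exception {
            BlobMetaLookupSketch lookup = new BlobMetaLookupSketch(Paths.get("/var/storm/blobs"));
            try {
                lookup.readMeta("example-topology-1-1501650673-stormcode.ser"); // placeholder key
            } catch (KeyNotFoundException e) {
                // This is the state nimbus is stuck in once local blob files are gone
                // while ZK still lists the key.
                System.out.println("blob key reported as not found: " + e.getMessage());
            }
        }
    }

Treating a missing file as "key not found" is fine for a key that was never uploaded, but while ZK still lists the key every getClusterInfo call runs into it again, which lines up with the ERROR repeating forever in your nimbus.log.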
