Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-13 Thread Alexandre Vermeerbergen
Hello, I think it might be of interest to you Storm developers to learn that I currently have an issue with Storm 1.1.0 which was supposed to be resolved in this release according to https://issues.apache.org/jira/browse/STORM-1977 ; and I can look for any more information which you'd need to

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-15 Thread Alexandre Vermeerbergen
Hello, Tomorrow I will have to restart the cluster on which I have this issue with Storm 1.1.0. Is anybody interested in my running some commands to get more logs before I repair this cluster? Best regards, Alexandre Vermeerbergen

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-24 Thread Jungtaek Lim
Hi Alexandre, I missed this mail since I was on vacation. I followed the stack trace, but it's hard to analyze without context. Do you mind providing the full nimbus log? Thanks, Jungtaek Lim (HeartSaVioR)

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-24 Thread Alexandre Vermeerbergen
Hello Jungtaek, Thank you very much for your answer. Please find attached the full Nimbus log (gzipped) related to this issue. Please note that the last ERROR repeats forever until we "repair" Storm. From the logs, it could be that the issue began close to when a topology was restarted (killed

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-24 Thread Jungtaek Lim
Alexandre, I found that your storm local dir is placed at "/tmp/storm", parts or all of which could be removed at any time. Could you move the path to a non-temporary place and try to replicate? Thanks, Jungtaek Lim (HeartSaVioR)
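
(A quick way to see which local dir a node is actually using is to read the effective configuration programmatically. This is a minimal sketch assuming Storm 1.x's org.apache.storm.utils.Utils and org.apache.storm.Config are on the classpath; the class name LocalDirCheck is a hypothetical example, not something from the thread.)

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.utils.Utils;

    public class LocalDirCheck {
        public static void main(String[] args) {
            // Effective config: defaults.yaml overridden by storm.yaml found on the classpath.
            Map conf = Utils.readStormConfig();
            String localDir = (String) conf.get(Config.STORM_LOCAL_DIR);
            System.out.println("storm.local.dir = " + localDir);
            // /tmp is routinely wiped by tmp cleaners and reboots, which also wipes Nimbus' blobs.
            if (localDir != null && localDir.startsWith("/tmp")) {
                System.err.println("WARNING: storm.local.dir is under /tmp; move it to a persistent path");
            }
        }
    }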

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-24 Thread Jungtaek Lim
Sorry, I meant reproduce, not replicate. :)

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-24 Thread Alexandre Vermeerbergen
Hello Jungtaek, I can do what you suggest (i.e. moving the storm local dir to a place which isn't in /tmp), but since the issue occurs rarely (about once per month), I doubt I'll be able to give feedback soon. What is puzzling to me is that in order to recover from such an issue, we have to stop everything, then clea

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-24 Thread Jungtaek Lim
Blob files (meta, data) are in the storm local directory. ZK only has the list of blob keys and which alive nimbuses have each file. So if you lose the storm local directory, you just can't restore the blobs, unless other nimbuses have these blobs so the current nimbus can pull them. (I guess you have only one nimbus, si
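
(To see which blob keys Nimbus is actually able to serve from its blob store, one can list them through the client-side blob store API. A minimal sketch assuming Storm 1.x's Utils.getClientBlobStore and ClientBlobStore.listKeys(); BlobKeysDump is a hypothetical class name.)

    import java.util.Iterator;
    import java.util.Map;
    import org.apache.storm.blobstore.ClientBlobStore;
    import org.apache.storm.utils.Utils;

    public class BlobKeysDump {
        public static void main(String[] args) throws Exception {
            Map conf = Utils.readStormConfig();
            ClientBlobStore blobStore = Utils.getClientBlobStore(conf);
            try {
                // A topology's keys normally include <topology-id>-stormjar.jar, -stormconf.ser
                // and -stormcode.ser; if these are gone, workers cannot be (re)scheduled.
                Iterator<String> keys = blobStore.listKeys();
                while (keys.hasNext()) {
                    System.out.println(keys.next());
                }
            } finally {
                blobStore.shutdown();
            }
        }
    }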

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-24 Thread Alexandre Vermeerbergen
Hello Jungtaek, I confirm that we currently do not have multiple Nimbus nodes. I want to clarify that the Nimbus process never crashed: it keeps printing this error in its log: 2017-08-06 03:44:01.777 o.a.s.t.ProcessFunction pool-14-thread-1 [ERROR] Internal error processing getClusterInfo org.apache
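
(For context, getClusterInfo is the Nimbus Thrift call that the Storm UI and commands such as "storm list" make, so every such client request hits the failing code path and produces the ERROR above. Below is a minimal sketch of that call, assuming Storm 1.x's NimbusClient; ClusterInfoProbe is a hypothetical class name.)

    import java.util.Map;
    import org.apache.storm.generated.ClusterSummary;
    import org.apache.storm.generated.TopologySummary;
    import org.apache.storm.utils.NimbusClient;
    import org.apache.storm.utils.Utils;

    public class ClusterInfoProbe {
        public static void main(String[] args) throws Exception {
            Map conf = Utils.readStormConfig();
            // Connects to the nimbus host/port taken from storm.yaml (nimbus.seeds / nimbus.thrift.port).
            NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
            try {
                ClusterSummary summary = nimbus.getClient().getClusterInfo();
                for (TopologySummary topology : summary.get_topologies()) {
                    System.out.println(topology.get_name() + " : " + topology.get_status());
                }
            } finally {
                nimbus.close();
            }
        }
    }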

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-25 Thread Jungtaek Lim
I'm not sure. Topology code can't be restored, so my best bet would be detecting it (periodically, or reacting on failure) and giving up leadership. If my memory is right, the leader Nimbus doesn't pull blobs from followers, so if it doesn't have any blobs and needs to sync, it just needs to become a follower
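
(The idea sketched above, detect the loss and step down instead of continuing to serve errors, could look roughly like the following. This is purely illustrative: localBlobsLookHealthy() and relinquishLeadership() are hypothetical placeholders, not Nimbus APIs.)

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class BlobHealthWatchdog {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        public void start() {
            // Periodically verify that the blobs this nimbus is supposed to serve are still on disk.
            scheduler.scheduleAtFixedRate(() -> {
                if (!localBlobsLookHealthy()) {
                    // Step down so a healthy nimbus (or an operator) can take over, instead of
                    // answering every getClusterInfo call with an internal error.
                    relinquishLeadership();
                }
            }, 1, 1, TimeUnit.MINUTES);
        }

        // Hypothetical: e.g. check that every key known to ZooKeeper has a readable file under storm.local.dir.
        private boolean localBlobsLookHealthy() { return true; }

        // Hypothetical: in Storm this would mean leaving the leader lock so another nimbus can acquire it.
        private void relinquishLeadership() { }
    }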

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-26 Thread Alexandre Vermeerbergen
Hello Jungtaek, Your answers were very useful, because I was able to reproduce the issue by simply deleting the storm.local.dir contents, and I found traces showing that the machine that suffered this issue had indeed lost this directory: it confirms your diagnosis, thank you very much for this! I have fil
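
(The reproduction step amounts to wiping the contents of storm.local.dir while Nimbus is running. Here is a minimal sketch using only the JDK, meant for a disposable test cluster only; the default path below is an assumption and must match the node's storm.yaml.)

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.Comparator;
    import java.util.stream.Stream;

    public class WipeStormLocalDir {
        public static void main(String[] args) throws IOException {
            // Path assumed from the thread's configuration; pass the actual storm.local.dir as an argument.
            Path localDir = Paths.get(args.length > 0 ? args[0] : "/tmp/storm");
            try (Stream<Path> paths = Files.walk(localDir)) {
                paths.sorted(Comparator.reverseOrder())       // delete children before their parents
                     .filter(p -> !p.equals(localDir))        // keep the top-level directory itself
                     .forEach(p -> p.toFile().delete());
            }
        }
    }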

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-26 Thread Jungtaek Lim
Glad that you reproduced and fixed the problem. Happy to help! - Jungtaek Lim (HeartSaVioR)

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-29 Thread Alexandre Vermeerbergen
Hello, I'm afraid we had a new occurrence of this issue, where the storm.local.dir directory was still globally available but maybe some file was missing or corrupted. The issue starts again just after the Nimbus log says it's "Cleaning inbox": 2017-08-28 02:51:21.163 o.a.s.d.nimbus pool-14-thread-30 [I

Re: Still getting "Internal error processing getClusterInfo" with Storm 1.1.0, isn't STORM-1977 supposed to be closed?

2017-08-30 Thread Jungtaek Lim
I'm not sure. If you are sure about this behavior, please file an issue with the affected version. If you can provide a way to reproduce the issue (I guess if it comes from a race condition it would be hard to), that would be great. Even better if you don't mind spending time to give fixing it a try: that's ho