Yes, retry would be the most logical way to work around this.  The code is kind
of odd in that there are two separate locations in ZK for this information.
Leader election simply stores who the leader is at `/leader-lock`, but the
information about all of the nimbus instances that are alive is stored under
`/nimbuses`.  What you have run into is a case where the two are not in sync
with each other: the leader lock said nimbus-A was the leader, and `/nimbuses`
had no knowledge of nimbus-A at all.  If nimbus-A was crashing during this
period of time, then it is a race and we need to fix it with retry (I'll file a
JIRA for this anyway, as we should have this in no matter what).  If nimbus-A
was not crashing, then ZK somehow messed up or we somehow messed up.  The only
way that could happen on our end is if for some reason we have two different
connections to ZK, one for leader election and another for writing to
`/nimbuses`.  If that is not the case, and this is reproducible, then yes, the
first thing to do is to turn on debug logging and try to grab the
snapshot/edit logs for your ZK cluster right after this happens.  I am really
hopeful that it was nimbus crashing.
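
For illustration, the retry idea could look roughly like the sketch below.  It
is not Storm's actual code: `read_leader_summary`, `get_leader_id`,
`get_summary`, and the attempt/delay values are all hypothetical, and the two
callables only stand in for reads of `/leader-lock` and `/nimbuses`.

```python
import time


class LeaderSummaryMissing(Exception):
    """Raised when the leader's summary never shows up under /nimbuses."""


def read_leader_summary(get_leader_id, get_summary,
                        attempts=5, delay_s=0.2, sleep=time.sleep):
    """Look up the elected leader's summary, retrying while the views disagree.

    get_leader_id() and get_summary(nimbus_id) are hypothetical readers for
    /leader-lock and /nimbuses/<id>; each returns None when nothing is found.
    """
    for attempt in range(1, attempts + 1):
        # Re-read the leader on every attempt: it may have changed meanwhile.
        leader_id = get_leader_id()
        if leader_id is not None:
            summary = get_summary(leader_id)
            if summary is not None:
                return summary
        if attempt < attempts:
            sleep(delay_s)
    # Fail with a descriptive error instead of letting a None propagate.
    raise LeaderSummaryMissing(
        f"no summary found for the leader after {attempts} attempts")


# Simulated race: the summary only becomes visible on the third lookup.
calls = {"n": 0}


def fake_get_summary(nimbus_id):
    calls["n"] += 1
    if calls["n"] >= 3:
        return {"host": nimbus_id, "is_leader": True}
    return None


summary = read_leader_summary(lambda: "nimbus-A", fake_get_summary,
                              sleep=lambda s: None)
print(summary)
```

The point of the explicit exception is that a caller gets a clear error (or
can log a warning) rather than a NullPointerException when the two locations
stay out of sync.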

- Bobby

On Sunday, May 14, 2017, 4:03:22 PM CDT, S G <[email protected]> wrote:

Thanks Bobby,

This looks like a serious issue to me. Any ideas on how I can provide more
information (for example, by enabling some logs) to gain more insight into
this problem?

Would it be a good idea to add some retry or waiting logic on the node that
comes up empty-handed, so that it handles the error gracefully rather than
crashing with a NullPointerException?

Also, the leader election is supposed to happen through ZooKeeper, right?
Doesn't the new leader become the leader only after saving its state in
ZooKeeper? If so, the other nodes should not come up empty-handed.
If not, then it seems like a bug, and the leader should persist its state in
ZooKeeper before becoming the leader.
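
To make the ordering I am asking about concrete, here is a tiny in-memory
sketch. Everything in it is hypothetical (`FakeZk`, `become_leader`,
`get_leader`); it only models the two locations as plain Python fields and is
not how Storm or ZooKeeper actually implement this.

```python
class FakeZk:
    """In-memory stand-in for the two ZK locations discussed in this thread."""

    def __init__(self):
        self.nimbuses = {}       # models /nimbuses/<id> -> summary
        self.leader_lock = None  # models /leader-lock -> leader id


def become_leader(zk, nimbus_id, summary):
    """Publish the summary first, then try to claim the leader lock."""
    # Step 1: make the summary visible before anyone can see us as leader.
    zk.nimbuses[nimbus_id] = summary
    # Step 2: claim leadership only if nobody holds the lock.
    # (Single-threaded model; a real lock would need an atomic operation.)
    if zk.leader_lock is None:
        zk.leader_lock = nimbus_id
        return True
    return False


def get_leader(zk):
    """Return the leader's summary, or None if there is no usable leader."""
    leader_id = zk.leader_lock
    if leader_id is None:
        return None
    return zk.nimbuses.get(leader_id)


zk = FakeZk()
won = become_leader(zk, "nimbus-A", {"host": "nimbus-A"})
print(won, get_leader(zk))
```

With this ordering, a reader that sees the lock should also find the summary.
It presumably would not cover a crash, where the two entries could disappear
at different times, so a reader would still need to handle a missing summary.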


> looks like it is caused by trying to read a NimbusSummary for the leader
> but not being able to find it

Instead of crashing, this should trigger a new leader election IMO, with
some good warning messages in the logs.


Disclaimer: I have not seen the actual code that does the nimbus leader
election. Above are just some suggestions based on my limited knowledge. So
please forgive any outrageous/obvious ideas :)



On Tue, May 9, 2017 at 1:58 PM, Bobby Evans <[email protected]>
wrote:

> This looks like something odd is happening with leader election.  The
> exception looks like it is caused by trying to read a NimbusSummary for the
> leader but not being able to find it.  So it could mean that a leader is
> elected and is then crashing quickly enough that the other node when it
> tries to read this loses the race and comes up empty handed.  But if you
> only have a single nimbus configured then this is not the case and
> something else worse is happening.
>
>
> - Bobby
>
> On Monday, May 8, 2017, 4:41:13 PM CDT, S G <[email protected]> wrote:
>
> Hi,
>
> I am trying to upgrade from 1.0.2 to 1.1.0 version of Storm.
> And I see the below exception happening randomly on the Nimbus node.
> When it happens, Nimbus is unable to accept any new topologies.
>
>
> java.lang.NullPointerException: null
>        at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:301) ~[clojure-1.7.0.jar:?]
>        at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getLeader(nimbus.clj:2383) ~[storm-core-1.1.0.jar:1.1.0]
>        at org.apache.storm.generated.Nimbus$Processor$getLeader.getResult(Nimbus.java:3944) ~[storm-core-1.1.0.jar:1.1.0]
>        at org.apache.storm.generated.Nimbus$Processor$getLeader.getResult(Nimbus.java:3928) ~[storm-core-1.1.0.jar:1.1.0]
>        at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>        at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>        at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
>        at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
>        at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_51]
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_51]
>        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_51]
>
>
> I have not been able to isolate what causes this exception.
> Any help would be appreciated.
>
> Thanks
> SG
>
