Hi developers,

We have been users of Apache Storm for a number of years. Earlier this year we
tried to upgrade from Storm 1.2.1 to Storm 2.

While validating this upgrade we noticed a high level of randomness in the
duration of our chaos monkey test. This randomness is currently blocking our
upgrade to Storm 2.

Test context

In our test we have 3 ZooKeeper instances, 2 Storm nimbus instances, 2 Storm
supervisors and 1 Storm UI instance. These instances are distributed across
three separate VMs.

The test execution can be seen below.

1. Start 1 worker
2. Insert events and verify they have been processed
3. Begin inserting more events
4. Kill zookeeper 1, nimbus 1, supervisor 1
5. Restart zookeeper 1, nimbus 1, supervisor 1
6. Kill the worker
7. Kill zookeeper 2, nimbus 2, supervisor 2
8. Restart zookeeper 2, nimbus 2, supervisor 2
9. Repeat steps 4-8 until all the events inserted in step 3 are processed
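
The loop in steps 4-9 can be sketched roughly as follows. This is an
illustration only: ChaosCluster, ChaosLoop and their methods are hypothetical
names, not part of Storm or of the actual test code, and the sketch assumes
that completion is checked once per round.

```java
import java.time.Duration;

/** Hypothetical control surface for the test cluster (not a Storm API). */
interface ChaosCluster {
    void killNode(int n);       // kill zookeeper n, nimbus n and supervisor n
    void restartNode(int n);    // restart zookeeper n, nimbus n and supervisor n
    void killWorker();          // kill the single running worker
    boolean allEventsProcessed();
}

class ChaosLoop {
    /**
     * Repeats steps 4-8 until every event is processed or the timeout elapses.
     * Returns the number of completed rounds, or -1 if the timeout was hit.
     */
    static int run(ChaosCluster cluster, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        int rounds = 0;
        while (!cluster.allEventsProcessed()) {
            if (System.nanoTime() - deadline >= 0) {
                return -1; // timed out before all events were processed
            }
            cluster.killNode(1);     // step 4
            cluster.restartNode(1);  // step 5
            cluster.killWorker();    // step 6
            cluster.killNode(2);     // step 7
            cluster.restartNode(2);  // step 8
            rounds++;
        }
        return rounds;
    }
}
```

A call such as ChaosLoop.run(cluster, Duration.ofMinutes(30)) would correspond
to a run with a 30 minute timeout.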

The randomness we are seeing occurs in step 9. This test case usually takes
~6 minutes, but in some instances it takes up to ~20 minutes. When this occurs
the test times out after 15 minutes. After raising the timeout to 30 minutes
the test passes consistently.

We have done a considerable amount of analysis to try to understand this
slowness but have not found the root cause, and would appreciate any advice you
can offer. The observations from our analysis are below. I can provide specific
logs if that will help.

Analysis

The issue seems to be that when the nimbi and supervisors are being repeatedly
killed and restarted, something happens that causes the supervisors to fail to
find a nimbus leader. This brings all processing to a stop until a nimbus
leader is found. Eventually the nimbus leader is found and the workers resume
processing.

During the wait period, the following exceptions are repeatedly logged in the
supervisor logs:

org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
o.a.s.l.AsyncLocalizer AsyncLocalizer Task Executor - 1 [ERROR] AsyncLocalizer cleanup failure
org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [dell998srv.fr.murex.com, mx28860vm.fr.murex.com]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?
o.a.s.u.NimbusClient AsyncLocalizer Task Executor - 1 [WARN] Ignoring exception while trying to get leader nimbus info from dell998srv.fr.murex.com. will retry with a different seed host.
o.a.s.u.NimbusClient timer [WARN] Ignoring exception while trying to get leader nimbus info from dell998srv.fr.murex.com. will retry with a different seed host.
o.a.s.d.s.t.ReportWorkerHeartbeats timer [ERROR] Send worker heartbeats to master exception
o.a.s.d.s.t.SynchronizeAssignments Thread-3 [ERROR] Get assignments from master exception
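
Since the first exception is a plain "Connection refused", one way to narrow
this down might be to check, during the wait period, whether each seed host is
accepting TCP connections on the nimbus Thrift port at all. The sketch below
does only that, using the standard java.net.Socket API. The class name
NimbusSeedProbe is made up for illustration, and port 6627 assumes the default
nimbus.thrift.port is in use. A successful connect does not prove that the
nimbus is the leader, only that something is listening.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Checks whether each nimbus seed host accepts TCP connections on a port. */
class NimbusSeedProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers "Connection refused", connect timeouts and unresolved hosts.
            return false;
        }
    }

    /** Probes every seed in order and returns host -> reachable. */
    static Map<String, Boolean> probeAll(List<String> seeds, int port, int timeoutMs) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String seed : seeds) {
            result.put(seed, isReachable(seed, port, timeoutMs));
        }
        return result;
    }

    public static void main(String[] args) {
        // Seed hosts as they appear in the NimbusLeaderNotFoundException above.
        List<String> seeds = List.of("dell998srv.fr.murex.com", "mx28860vm.fr.murex.com");
        int thriftPort = 6627; // assumption: default nimbus.thrift.port
        probeAll(seeds, thriftPort, 2000).forEach((host, up) ->
                System.out.println(host + ":" + thriftPort + " -> "
                        + (up ? "accepting connections" : "unreachable")));
    }
}
```

If the host of the current leader shows as unreachable from the supervisor VM
while the nimbus logs say it holds leadership, that would point at the network
or the Thrift server rather than at leader election.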

The issue occurs at least 50% of the time (3 out of every 5 or 6 runs).

From the nimbus logs, we see that the nimbus leadership switches from the
first leader to the second nimbus when the leader dies. So there's always a
nimbus leader even during the period that the supervisor is waiting for a
leader.

Probably the most important observation is that the supervisor seems to find
the nimbus leader when leadership returns to the original leader. For example,
suppose Nimbus-1 gains leadership first, then gets killed, and Nimbus-2 gains
leadership. The supervisors are not able to find the nimbus leader while
Nimbus-2 is the leader, and are able to find it once Nimbus-2 dies and Nimbus-1
regains leadership.

Kind regards,
Conor