Hi developers,

We have been users of Apache Storm for a number of years. Earlier this year we tried to upgrade from Storm 1.2.1 to Storm 2.
While validating this upgrade we noticed a high level of randomness in our chaos monkey test. This randomness is currently blocking our upgrade to Storm 2.

Test context

In our test we have 3 ZooKeeper instances, 2 Storm Nimbus instances, 2 Storm supervisors and 1 Storm UI instance. These instances are distributed across three separate VMs. The test executes the following steps:

1. Start 1 worker
2. Insert events and verify they have been processed
3. Begin inserting more events
4. Kill zookeeper 1, nimbus 1, supervisor 1
5. Restart zookeeper 1, nimbus 1, supervisor 1
6. Kill the worker
7. Kill zookeeper 2, nimbus 2, supervisor 2
8. Restart zookeeper 2, nimbus 2, supervisor 2
9. Repeat steps 4-8 until all the events inserted in step 3 are processed

The randomness occurs in step 9. This test case usually takes ~6 minutes, but in some instances it takes up to ~20 minutes. When that happens the test hits its 15-minute timeout; after we raised the timeout to 30 minutes, the test passes consistently.

We have done a considerable amount of analysis to try to understand this slowness but have not found the root cause, and would appreciate any advice you can offer. Our observations are below; I can provide specific logs if that will help.

Analysis

The issue seems to be that when the nimbi and supervisors are repeatedly killed and restarted, something happens that causes the supervisors to fail to find a nimbus leader. This brings all processing to a stop until a nimbus leader is found. Eventually the nimbus leader is found and the workers resume processing.
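To check whether these stalls line up with one of the Nimbus hosts not accepting connections, a probe along the following lines could be run from a supervisor VM while the test is in progress. This is only a minimal sketch and is not taken from Storm itself: the function names (probe, probe_seeds, watch) are our own, the port assumes the default nimbus.thrift.port of 6627, and the hosts passed in would be whatever is listed under nimbus.seeds. It tests TCP reachability only; it does not ask Nimbus which node is the leader.

```python
import socket
import time
from datetime import datetime, timezone

# Assumption: the default nimbus.thrift.port; change it to match storm.yaml.
NIMBUS_THRIFT_PORT = 6627


def probe(host, port=NIMBUS_THRIFT_PORT, timeout=2.0):
    """Return 'open' if a TCP connection succeeds, else a short failure reason."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening on that port (e.g. Nimbus is down or restarting).
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable network, and similar errors.
        return f"error: {exc.__class__.__name__}"


def probe_seeds(seeds, port=NIMBUS_THRIFT_PORT, timeout=2.0):
    """Probe every seed host once and return {host: status}."""
    return {host: probe(host, port, timeout) for host in seeds}


def watch(seeds, interval=5.0, rounds=1, port=NIMBUS_THRIFT_PORT):
    """Print one timestamped status line per round.

    rounds=None keeps looping until the process is interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        statuses = probe_seeds(seeds, port)
        print(stamp, " ".join(f"{h}={s}" for h, s in statuses.items()))
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

For example, watch(["nimbus-1.example.com", "nimbus-2.example.com"], rounds=None) (hypothetical hostnames) would print a line every 5 seconds until interrupted, and the timestamps could then be compared against the supervisor and nimbus logs.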
During the wait period, the following exceptions are repeatedly logged in the supervisor logs:

org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)

o.a.s.l.AsyncLocalizer AsyncLocalizer Task Executor - 1 [ERROR] AsyncLocalizer cleanup failure
org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [dell998srv.fr.murex.com, mx28860vm.fr.murex.com]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?

o.a.s.u.NimbusClient AsyncLocalizer Task Executor - 1 [WARN] Ignoring exception while trying to get leader nimbus info from dell998srv.fr.murex.com. will retry with a different seed host.

o.a.s.u.NimbusClient timer [WARN] Ignoring exception while trying to get leader nimbus info from dell998srv.fr.murex.com. will retry with a different seed host.

o.a.s.d.s.t.ReportWorkerHeartbeats timer [ERROR] Send worker heartbeats to master exception

o.a.s.d.s.t.SynchronizeAssignments Thread-3 [ERROR] Get assignments from master exception

The issue occurs at least 50% of the time (3 out of every 5 or 6 runs).

From the nimbus logs, we see that the nimbus leadership switches from the first leader to the second nimbus when the leader dies. So there is always a nimbus leader, even during the period in which the supervisor is waiting for one.

Probably the most important observation is that the supervisor seems to find the nimbus leader only when leadership returns to the original leader. For example, suppose Nimbus-1 gains leadership first, then gets killed, and Nimbus-2 gains leadership. The supervisors are not able to find the nimbus leader while Nimbus-2 is the leader, and are able to find it again once Nimbus-2 dies and Nimbus-1 regains leadership.

Kind regards,
Conor