kishorvpatil opened a new pull request #3297: URL: https://github.com/apache/storm/pull/3297
## What is the purpose of the change

Currently, the Storm version of the topology is used to determine whether RPC heartbeats are used. On large clusters with beefier machines, each supervisor can have hundreds of workers, and multiple supervisor daemons going down can put a lot of load on Nimbus.

Currently,

* With 2.x topologies, the RPC heartbeats ignore Pacemaker availability.
* The _sendSupervisorWorkerHeartbeat_ call only checks that the supervisor is up.
* The worker should kill itself if its assignment has changed. (regression)
* Supervisor timer threads are not named.
* Nimbus should check whether Pacemaker is in use before expecting heartbeat calls.

With this change, if Pacemaker is used, the behavior is:

1. The worker does not call the supervisor.
2. The worker sends heartbeats to Pacemaker periodically.
3. The supervisor does not send worker heartbeats to Nimbus.
4. Nimbus checks whether heartbeats should be expected from RPC calls or not.
5. If the supervisor is down, the worker kills itself on reassignment, so the worker does not hang around without checking for reassignments.
6. The worker should restart itself if its assignments have changed. (Typically the supervisor should notice the change in assignment and restart the worker, but if the supervisor is down, this is a good backup.)

## How was the change tested

Set up a cluster with Pacemaker and validate that:

1. The worker sends heartbeats to Pacemaker instead of calling _sendSupervisorWorkerHeartbeat_.
2. After stopping the supervisor and rebalancing the topology, the worker dies because its assignment has changed, and a message about the change in assignment is logged in worker.log.
3. The supervisor does not send executor heartbeats to Nimbus.
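
The Pacemaker-dependent behavior listed under the purpose of the change can be summarized as a small decision table. The sketch below is illustrative only: `HeartbeatRoutingSketch`, `Target`, and every method name are hypothetical and are not Storm classes or APIs. The non-Pacemaker branch is modeled simply as the complement of the Pacemaker branch, which is an assumption, and assignments are represented as plain strings for brevity.

```java
import java.util.Objects;

// Hypothetical sketch of the decisions described in this PR; not Storm code.
final class HeartbeatRoutingSketch {

    // Where a worker sends its heartbeat (hypothetical names).
    enum Target { PACEMAKER, SUPERVISOR_RPC }

    // Items 1 and 2: with Pacemaker, the worker skips the supervisor RPC
    // and heartbeats to Pacemaker instead.
    static Target workerHeartbeatTarget(boolean pacemakerInUse) {
        return pacemakerInUse ? Target.PACEMAKER : Target.SUPERVISOR_RPC;
    }

    // Item 3: with Pacemaker, the supervisor does not forward worker
    // heartbeats to Nimbus. (Assumption: it does otherwise.)
    static boolean supervisorForwardsWorkerHeartbeats(boolean pacemakerInUse) {
        return !pacemakerInUse;
    }

    // Item 4: Nimbus expects RPC heartbeats only when Pacemaker is not used.
    static boolean nimbusExpectsRpcHeartbeats(boolean pacemakerInUse) {
        return !pacemakerInUse;
    }

    // Items 5 and 6: the worker exits when its current assignment differs
    // from the one it was launched with, even if the supervisor is down.
    static boolean workerShouldExit(String assignmentAtLaunch, String currentAssignment) {
        return !Objects.equals(assignmentAtLaunch, currentAssignment);
    }

    public static void main(String[] args) {
        for (boolean pacemaker : new boolean[] {true, false}) {
            System.out.println("pacemakerInUse=" + pacemaker
                + " workerTarget=" + workerHeartbeatTarget(pacemaker)
                + " supervisorForwards=" + supervisorForwardsWorkerHeartbeats(pacemaker)
                + " nimbusExpectsRpc=" + nimbusExpectsRpcHeartbeats(pacemaker));
        }
        System.out.println("exitOnChange=" + workerShouldExit("node1:6700", "node2:6700"));
    }
}
```

Keeping the assignment check independent of the heartbeat path reflects the backup role described above: the worker can detect a reassignment on its own when the supervisor is not running.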