GitHub user abhishekshivanna opened a pull request:
https://github.com/apache/samza/pull/375
SAMZA-1506: Fix for robust ContainerHeartbeatMonitor exception handling.
The fix includes the following changes:
- Catch all exceptions inside the heartbeat thread, not just
IOException.
- Add a time-based force kill when the heartbeat is invalid.
This makes the monitor immune to threads that may keep the
container stuck in the shutdown sequence. When the timeout
occurs, System.exit(1) is called.
- Increase the number of retries for failed heartbeats from 3 to 6.
This prevents short intermittent network failures from causing the
containers to be invalidated.
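The three changes listed above can be sketched roughly as follows. This is an illustration only, not the code in this pull request: the class name HeartbeatMonitorSketch, its constructor parameters, and the runOnce/checkWithRetries methods are all hypothetical, and the sketch assumes that "6 retries" means up to 6 attempts per check. The force-kill action is injected as a Runnable so the sketch can be exercised without actually exiting the JVM.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; not the actual Samza ContainerHeartbeatMonitor.
class HeartbeatMonitorSketch {
  // Assumption: the "6 retries" in the description is modeled as 6 attempts.
  static final int MAX_RETRIES = 6;

  private final Callable<Boolean> heartbeat;  // assumed: true means the heartbeat is valid
  private final Runnable gracefulShutdown;    // assumed: starts the normal shutdown sequence
  private final Runnable forceKill;           // in a real process: () -> System.exit(1)
  private final long forceKillTimeoutMs;

  HeartbeatMonitorSketch(Callable<Boolean> heartbeat, Runnable gracefulShutdown,
                         Runnable forceKill, long forceKillTimeoutMs) {
    this.heartbeat = heartbeat;
    this.gracefulShutdown = gracefulShutdown;
    this.forceKill = forceKill;
    this.forceKillTimeoutMs = forceKillTimeoutMs;
  }

  // Returns the heartbeat result; a request that fails on every attempt
  // is treated as an invalid heartbeat.
  boolean checkWithRetries() {
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
      try {
        return heartbeat.call();
      } catch (Exception e) {
        // Catch every exception, not only IOException, so an unexpected
        // failure cannot silently end the monitoring thread.
        System.err.println("Heartbeat attempt " + attempt + " failed: " + e);
      }
    }
    return false;
  }

  // One monitoring pass. On an invalid heartbeat, arm the force kill on a
  // separate daemon thread first, then start the graceful shutdown.
  void runOnce() {
    if (checkWithRetries()) {
      return;
    }
    Thread killer = new Thread(() -> {
      try {
        Thread.sleep(forceKillTimeoutMs);
        forceKill.run();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "force-kill-sketch");
    killer.setDaemon(true);
    killer.start();
    try {
      gracefulShutdown.run();
    } catch (RuntimeException e) {
      System.err.println("Graceful shutdown failed: " + e);
    }
  }
}
```

In this sketch the force kill runs on its own thread so that a shutdown callback that blocks cannot prevent it from firing. A caller would typically invoke runOnce() from a periodic scheduler and pass () -> System.exit(1) as the forceKill argument.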
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/abhishekshivanna/samza container-heartbeat
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/samza/pull/375.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #375
----
commit 55145366b0a2e15b30665e88cead5f6bfd75ee2e
Author: Abhishek Shivanna <[email protected]>
Date: 2017-11-30T20:09:10Z
SAMZA-1506: Fix for robust ContainerHeartbeatMonitor exception handling.
The fix includes the following changes:
- Catch all exceptions inside the heartbeat thread, not just
IOException.
- Add a time-based force kill when the heartbeat is invalid.
This makes the monitor immune to threads that may keep the
container stuck in the shutdown sequence. When the timeout
occurs, System.exit(1) is called.
- Increase the number of retries for failed heartbeats from 3 to 6.
This prevents short intermittent network failures from causing the
containers to be invalidated.
----