Hi all,
I'm using Apache Storm 2.2.1 and have been seeing frequent Nimbus log entries
with the message:
"Exception when getting heartbeat timeout"[1]
After some investigation, I found that applying STORM-4022 [2] suppresses this
log.
However, that does not resolve the underlying issue: the log is simply
skipped when a `NotAliveException` occurs.[3]
I wanted to understand why these heartbeat timeout exceptions were happening.
So, I tracked down the supervisor that was sending the related
`reportWorkerHeartbeats` request.
Here’s what I found:
The supervisor reads worker heartbeats from
`${storm.local.dir}/${workerId}/heartbeats` using
`SupervisorUtils.readWorkerHeartbeats`.[4]
However, on the problematic supervisor node, I noticed that some worker
directories from previously terminated workers had not been cleaned up properly.
As a result, the supervisor ended up reading heartbeat information for workers
that no longer exist and reporting them to Nimbus.
An example of the heartbeat from a non-existent worker (`LSWorkerHeartbeat`)
looks like this:
"LSWorkerHeartbeat(time_secs:0000, topology_id:0000,
executors:[ExecutorInfo(task_start:-1, task_end:-1)], port:0000)"
Apart from the system executor ID, there is no meaningful information.
To prevent unnecessary `NotAliveException` errors in Nimbus, I think the
supervisor should either:
1. Filter out invalid or expired heartbeat data before calling
`reportWorkerHeartbeats`[5], or
2. Ensure that the stale worker directories (`${storm.local.dir}/${workerId}`)
are cleaned up appropriately.
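
To make option 1 concrete, here is a minimal sketch of an age-based filter. It
does not use the real Storm classes: `Heartbeat` is a hypothetical stand-in for
the `topology_id` and `time_secs` fields shown in the `LSWorkerHeartbeat`
example above, and `maxAgeSecs` is an assumed threshold, not an existing Storm
setting. An actual change would presumably sit between
`SupervisorUtils.readWorkerHeartbeats` and the `reportWorkerHeartbeats` call.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatFilterSketch {

    /** Hypothetical stand-in for the fields of a locally stored worker heartbeat. */
    static final class Heartbeat {
        final String topologyId;
        final int timeSecs;

        Heartbeat(String topologyId, int timeSecs) {
            this.topologyId = topologyId;
            this.timeSecs = timeSecs;
        }
    }

    /** A heartbeat is kept only if it names a topology and is recent enough. */
    static boolean isFresh(Heartbeat hb, int nowSecs, int maxAgeSecs) {
        return hb != null
            && hb.topologyId != null
            && !hb.topologyId.isEmpty()
            && nowSecs - hb.timeSecs <= maxAgeSecs;
    }

    /** Returns only the heartbeats that pass isFresh; the input list is not modified. */
    static List<Heartbeat> filterFresh(List<Heartbeat> all, int nowSecs, int maxAgeSecs) {
        List<Heartbeat> fresh = new ArrayList<>();
        for (Heartbeat hb : all) {
            if (isFresh(hb, nowSecs, maxAgeSecs)) {
                fresh.add(hb);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        int now = 1_000;
        List<Heartbeat> all = new ArrayList<>();
        all.add(new Heartbeat("topo-1", 995)); // 5 seconds old: kept
        all.add(new Heartbeat("topo-2", 100)); // 900 seconds old: dropped
        System.out.println(filterFresh(all, now, 30).size()); // prints 1
    }
}
```

Filtering by age alone would hide the symptom on the Nimbus side but would
leave the stale directories on disk, so it seems complementary to option 2
rather than a replacement for it.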
Currently, there are no related errors in the worker or supervisor logs, so I’m
unsure why the worker directories weren't cleaned up as expected.
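
For anyone who wants to check their own supervisors, this is roughly the kind
of standalone scan I have in mind for spotting leftover worker directories. It
is only a sketch: the class name, the `maxAge` threshold, and the idea of using
the modification time of the `heartbeats` subdirectory as a liveness signal are
my assumptions, not Storm behavior I have verified. The root directory that
holds the per-worker directories is passed in as an argument.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class StaleWorkerDirScan {

    /**
     * Lists worker directories under workersRoot whose heartbeats subdirectory
     * (or the worker directory itself, if there is none) was last modified
     * more than maxAge before now.
     */
    static List<Path> findStale(Path workersRoot, Duration maxAge, Instant now)
            throws IOException {
        List<Path> stale = new ArrayList<>();
        try (DirectoryStream<Path> dirs =
                 Files.newDirectoryStream(workersRoot, Files::isDirectory)) {
            for (Path workerDir : dirs) {
                Path hbDir = workerDir.resolve("heartbeats");
                Path probe = Files.isDirectory(hbDir) ? hbDir : workerDir;
                Instant modified = Files.getLastModifiedTime(probe).toInstant();
                if (Duration.between(modified, now).compareTo(maxAge) > 0) {
                    stale.add(workerDir);
                }
            }
        }
        return stale;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: directory containing the per-worker directories
        // args[1]: maximum age in seconds before a directory counts as stale
        Path root = Paths.get(args[0]);
        Duration maxAge = Duration.ofSeconds(Long.parseLong(args[1]));
        for (Path p : findStale(root, maxAge, Instant.now())) {
            System.out.println(p);
        }
    }
}
```

The scan only reports candidates and deletes nothing, since I don't yet know
whether removing these directories by hand is safe.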
My questions are:
- When and how should `${storm.local.dir}/${workerId}` directories be cleaned
up in normal operation?
- Would it be safe to filter out stale or malformed heartbeats before sending
them to Nimbus in `reportWorkerHeartbeats`[5]?
Any insights or recommendations would be greatly appreciated.
Thanks
[1]:
https://github.com/apache/storm/blob/887d2af761accdf7a1a07357a80be28f9964940b/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2604
[2]: https://issues.apache.org/jira/browse/STORM-4022
[3]:
https://github.com/apache/storm/blob/758e2f3c46c4e2a92ec00d9cd5bf5ad7c3d64156/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2707-L2710
[4]:
https://github.com/apache/storm/blob/887d2af761accdf7a1a07357a80be28f9964940b/storm-server/src/main/java/org/apache/storm/daemon/supervisor/SupervisorUtils.java#L157-L163
[5]:
https://github.com/apache/storm/blob/887d2af761accdf7a1a07357a80be28f9964940b/storm-server/src/main/java/org/apache/storm/daemon/supervisor/timer/ReportWorkerHeartbeats.java#L48