Hi all,

I'm using Apache Storm 2.2.1 and have been seeing frequent Nimbus log entries 
with the message:

"Exception when getting heartbeat timeout"[1]

After some investigation, I found that applying STORM-4022 [2] suppresses this 
log.
However, that does not resolve the underlying issue: the patch simply skips 
the log when a `NotAliveException` occurs.[3]

I wanted to understand why these heartbeat timeout exceptions were happening.
So, I tracked down the supervisor that was sending the related 
`reportWorkerHeartbeats` request.

Here’s what I found:

The supervisor reads worker heartbeats from 
`${storm.local.dir}/${workerId}/heartbeats` using 
`SupervisorUtils.readWorkerHeartbeats`.[4]
However, on the problematic supervisor node, I noticed that some worker 
directories from previously terminated workers had not been cleaned up properly.
As a result, the supervisor ended up reading heartbeat information for workers 
that no longer exist and reporting them to Nimbus.
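
To spot such leftover directories on a node, something like the following Java sketch may help. It is only a heuristic and not part of Storm: the class name `StaleWorkerDirs`, the `maxAge` threshold, and the idea of using the directory's last-modified time are all my assumptions, and the root directory is passed in as a parameter so you can point it at wherever the worker directories live on your setup.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Apache Storm.
final class StaleWorkerDirs {

    /**
     * Returns the worker directories under workersRoot whose "heartbeats"
     * subdirectory (or the worker directory itself, if that subdirectory is
     * missing) was last modified more than maxAge before now.
     */
    static List<Path> find(Path workersRoot, Duration maxAge, Instant now) throws IOException {
        List<Path> stale = new ArrayList<>();
        Instant cutoff = now.minus(maxAge);
        // Only look at subdirectories; each one is assumed to be named after a worker ID.
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(workersRoot, Files::isDirectory)) {
            for (Path workerDir : dirs) {
                Path heartbeats = workerDir.resolve("heartbeats");
                Path probe = Files.isDirectory(heartbeats) ? heartbeats : workerDir;
                // Assumption: a live worker keeps rewriting files here, which updates the mtime.
                if (Files.getLastModifiedTime(probe).toInstant().isBefore(cutoff)) {
                    stale.add(workerDir);
                }
            }
        }
        return stale;
    }
}
```

Calling `StaleWorkerDirs.find(root, Duration.ofMinutes(10), Instant.now())` would list the candidates. A stale modification time only suggests that a worker is gone, so the result should be cross-checked against the running worker processes before anything is deleted.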

An example of the heartbeat from a non-existent worker (`LSWorkerHeartbeat`) 
looks like this:

"LSWorkerHeartbeat(time_secs:0000, topology_id:0000, 
executors:[ExecutorInfo(task_start:-1, task_end:-1)], port:0000)"

Apart from the system executor ID (`task_start:-1, task_end:-1`), it carries no 
meaningful information.

To prevent unnecessary `NotAliveException` errors in Nimbus, I think the 
supervisor should either:

1. Filter out invalid or expired heartbeat data before calling 
`reportWorkerHeartbeats`[5], or
2. Ensure that the stale worker directories (`${storm.local.dir}/${workerId}`) 
are cleaned up appropriately.
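
As a rough illustration of option 1, here is a minimal Java sketch of the kind of filter I have in mind. It is not Storm code: `Heartbeat` is a hypothetical stand-in for `LSWorkerHeartbeat` that carries only the fields shown in the example above, and `maxAgeSecs` is an assumed threshold rather than an existing Storm setting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the names here are illustrative and not part of Apache Storm.
final class HeartbeatFilter {

    /** Simplified stand-in for LSWorkerHeartbeat. */
    static final class Heartbeat {
        final String topologyId;
        final int timeSecs;
        final int port;

        Heartbeat(String topologyId, int timeSecs, int port) {
            this.topologyId = topologyId;
            this.timeSecs = timeSecs;
            this.port = port;
        }
    }

    /**
     * Keeps only heartbeats that name a topology and are at most maxAgeSecs old
     * relative to nowSecs. Everything else is treated as stale or malformed.
     */
    static List<Heartbeat> dropStale(List<Heartbeat> beats, int nowSecs, int maxAgeSecs) {
        List<Heartbeat> fresh = new ArrayList<>();
        for (Heartbeat hb : beats) {
            boolean hasTopology = hb.topologyId != null && !hb.topologyId.isEmpty();
            boolean recent = nowSecs - hb.timeSecs <= maxAgeSecs;
            if (hasTopology && recent) {
                fresh.add(hb);
            }
        }
        return fresh;
    }
}
```

The supervisor would apply something like `dropStale` to the heartbeats it has read from disk before building the request. Whether dropping them silently is acceptable, or whether they should at least be logged at debug level, is part of what I am asking below.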

Currently, there are no related errors in the worker or supervisor logs, so I’m 
unsure why the worker directories weren't cleaned up as expected.

My questions are:

- When and how should `${storm.local.dir}/${workerId}` directories be cleaned 
up in normal operation?
- Would it be safe to filter out stale or malformed heartbeats before sending 
them to Nimbus in `reportWorkerHeartbeats`[5]?

Any insights or recommendations would be greatly appreciated.

Thanks

[1]: 
https://github.com/apache/storm/blob/887d2af761accdf7a1a07357a80be28f9964940b/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2604
[2]: https://issues.apache.org/jira/browse/STORM-4022
[3]: 
https://github.com/apache/storm/blob/758e2f3c46c4e2a92ec00d9cd5bf5ad7c3d64156/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2707-L2710
[4]: 
https://github.com/apache/storm/blob/887d2af761accdf7a1a07357a80be28f9964940b/storm-server/src/main/java/org/apache/storm/daemon/supervisor/SupervisorUtils.java#L157-L163
[5]: 
https://github.com/apache/storm/blob/887d2af761accdf7a1a07357a80be28f9964940b/storm-server/src/main/java/org/apache/storm/daemon/supervisor/timer/ReportWorkerHeartbeats.java#L48
