prashantwason opened a new issue, #18903:
URL: https://github.com/apache/hudi/issues/18903
**Describe the problem you faced**
`HoodieHeartbeatClient` can permanently stop generating heartbeats for an
in-flight instant, so a later commit aborts with
`org.apache.hudi.exception.HoodieException: Heartbeat for instant <t> has
expired` even though the writer is still alive and making progress. The instant
is left inflight and has to be rolled back by a subsequent run.
There are two independent causes, both in
`HoodieHeartbeatClient.updateHeartbeat()`:
1. **Heartbeat write runs on the timer thread with no timeout.** The
heartbeat file is written synchronously (`storage.create(...)`) inside the
`TimerTask`. The task is scheduled with `Timer.scheduleAtFixedRate`, and a
`java.util.Timer` executes all of its tasks sequentially on one background
thread, so a slow or hung storage write (e.g. a transient cloud-storage
latency spike) blocks that thread and freezes **all** subsequent heartbeats
for the instant.
2. **Self-interrupt permanently kills the timer.** When a refresh is
detected as delayed past the tolerable interval, `updateHeartbeat()` calls
`Thread.currentThread().interrupt()` on the timer thread. This terminates the
`Timer` thread (its internal wait throws `InterruptedException`), turning a
*transient* delay — a GC pause, a driver stall, or a single slow write — into a
*permanent* heartbeat blackout. The commit then fails at
`HeartbeatUtils.abortIfHeartbeatExpired()`.
Either path produces the same outcome: heartbeats stop, and a still-healthy
writer's commit is aborted.
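
The first cause can be observed outside Hudi with a plain `java.util.Timer`. The following is a minimal sketch, not Hudi code: the class and method names are hypothetical, and a latch stands in for the hung `storage.create(...)` call. It shows that while one run of a fixed-rate task is blocked, no further runs start, even though many periods elapse.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of Hudi.
final class TimerBlockingDemo {

  /**
   * Schedules a fixed-rate task whose first run blocks on a latch (standing in
   * for a hung heartbeat write), waits observeMs, and returns how many runs
   * had started by then.
   */
  static int runsStartedWhileFirstIsBlocked(long periodMs, long observeMs) {
    AtomicInteger started = new AtomicInteger();
    CountDownLatch firstRunStarted = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    Timer timer = new Timer("heartbeat-demo", true); // daemon timer thread

    timer.scheduleAtFixedRate(new TimerTask() {
      @Override
      public void run() {
        if (started.incrementAndGet() == 1) {
          firstRunStarted.countDown();
          try {
            release.await(); // simulated hung storage write
          } catch (InterruptedException e) {
            // Nothing interrupts the timer thread in this demo.
          }
        }
      }
    }, 0, periodMs);

    try {
      firstRunStarted.await();
      Thread.sleep(observeMs); // many periods elapse while the first run is stuck
      return started.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("demo interrupted", e);
    } finally {
      release.countDown(); // unblock the task so the timer thread can exit
      timer.cancel();
    }
  }

  public static void main(String[] args) {
    // 20 ms period observed for 300 ms: roughly 15 runs were due, only one started.
    System.out.println("runs started: " + runsStartedWhileFirstIsBlocked(20, 300));
  }
}
```

With a 20 ms period and a 300 ms observation window the method returns 1: every later run is queued behind the blocked one on the single timer thread.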
**To Reproduce**
Steps to reproduce the behavior:
1. Run a long-running write with `hoodie.failed.writes.cleaner.policy=LAZY`
(so the heartbeat mechanism is active).
2. Induce either (a) a slow/hung storage write for the heartbeat file, or
(b) a driver pause longer than `hoodie.client.heartbeat.interval_in_ms *
hoodie.client.heartbeat.tolerable.misses`.
3. The heartbeat timer stops permanently; when the commit runs it fails with
"Heartbeat ... has expired".
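
For step 2(b), the pause has to exceed the product of the two settings. A small worked sketch of that arithmetic follows; the helper class is hypothetical, and the values 60000 ms and 2 misses are illustrative assumptions rather than confirmed Hudi defaults.

```java
// Hypothetical helper; the values in main are illustrative, not confirmed Hudi defaults.
final class HeartbeatPauseThreshold {

  /** Tolerable interval in ms: heartbeat interval multiplied by tolerable misses. */
  static long tolerablePauseMs(long intervalMs, int tolerableMisses) {
    return Math.multiplyExact(intervalMs, (long) tolerableMisses);
  }

  /** True when a pause of pauseMs is longer than the tolerable interval. */
  static boolean exceedsTolerableInterval(long pauseMs, long intervalMs, int tolerableMisses) {
    return pauseMs > tolerablePauseMs(intervalMs, tolerableMisses);
  }

  public static void main(String[] args) {
    long intervalMs = 60_000; // assumed value for hoodie.client.heartbeat.interval_in_ms
    int tolerableMisses = 2;  // assumed value for hoodie.client.heartbeat.tolerable.misses
    System.out.println(tolerablePauseMs(intervalMs, tolerableMisses));                  // 120000
    System.out.println(exceedsTolerableInterval(150_000, intervalMs, tolerableMisses)); // true
  }
}
```

Under these assumed values, a driver pause of 150 seconds would be long enough to trigger path (b), while a pause of exactly 120 seconds would not.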
**Expected behavior**
A transient storage-latency spike or driver pause should not permanently
stop heartbeats. The heartbeat timer should keep retrying; enforcement of
staleness should remain solely at commit time
(`HeartbeatUtils.abortIfHeartbeatExpired()`).
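
One possible shape for the expected behavior is sketched below. It is an illustration under stated assumptions, not the Hudi implementation or a proposed patch: `ResilientHeartbeatSketch`, its methods, and the `heartbeatWrite` callable are all hypothetical. The idea is to run each write on a separate worker with a bounded wait, count a timed-out or failed write as a miss, and never interrupt or kill the scheduling thread. Staleness enforcement is deliberately absent here, since it would stay at commit time.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the Hudi implementation.
final class ResilientHeartbeatSketch implements AutoCloseable {
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final ExecutorService writer = Executors.newCachedThreadPool();
  private final AtomicInteger succeeded = new AtomicInteger();
  private final AtomicInteger failed = new AtomicInteger();

  /** Runs heartbeatWrite every intervalMs, waiting at most writeTimeoutMs per attempt. */
  void start(Callable<Void> heartbeatWrite, long intervalMs, long writeTimeoutMs) {
    scheduler.scheduleWithFixedDelay(() -> {
      Future<Void> pending = writer.submit(heartbeatWrite);
      try {
        pending.get(writeTimeoutMs, TimeUnit.MILLISECONDS);
        succeeded.incrementAndGet();
      } catch (TimeoutException | ExecutionException e) {
        pending.cancel(true);     // abandon the slow or failed write
        failed.incrementAndGet(); // record a miss; the next tick retries
      } catch (InterruptedException e) {
        pending.cancel(true);     // only expected during close()
        Thread.currentThread().interrupt();
      }
    }, 0, intervalMs, TimeUnit.MILLISECONDS);
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
    writer.shutdownNow();
  }

  /** Demo: the first write hangs, later writes succeed. Returns {succeeded, failed}. */
  static int[] demo() {
    AtomicInteger calls = new AtomicInteger();
    try (ResilientHeartbeatSketch hb = new ResilientHeartbeatSketch()) {
      hb.start(() -> {
        if (calls.incrementAndGet() == 1) {
          Thread.sleep(10_000); // simulated hung write, abandoned after the timeout
        }
        return null;
      }, 20, 50);
      long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
      while (hb.succeeded.get() < 2 && System.nanoTime() < deadline) {
        Thread.sleep(10);
      }
      return new int[] {hb.succeeded.get(), hb.failed.get()};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("demo interrupted", e);
    }
  }

  public static void main(String[] args) {
    int[] result = demo();
    System.out.println("succeeded=" + result[0] + ", failed=" + result[1]);
  }
}
```

In the demo the first write hangs and is counted as a miss, yet later ticks still succeed. The sketch uses `scheduleWithFixedDelay` so that a long stall is not followed by a burst of catch-up runs; whether fixed-rate or fixed-delay suits Hudi better is a design choice this issue does not settle.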
**Environment Description**
* Hudi version : master (affects current code)
**Stacktrace**
```
org.apache.hudi.exception.HoodieException: Heartbeat for instant <instant> has expired, last heartbeat <ts>
    at org.apache.hudi.client.heartbeat.HeartbeatUtils.abortIfHeartbeatExpired(HeartbeatUtils.java:95)
    at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java)
    ...
```