voonhous commented on issue #17714:
URL: https://github.com/apache/hudi/issues/17714#issuecomment-4701910310
## Investigation + CI verification
Root-caused and reproduced on macOS; **verified not reproducible on
CI/Linux**.
### Root cause
`CleanActionExecutor.deleteFileAndGetResult` calls
`storage.getPathInfo(deletePath)` (to pick file-vs-directory delete), which
routes through `HoodieWrapperFileSystem.getFileStatus`:
```java
public FileStatus getFileStatus(Path f) throws IOException {
  return executeFuncWithTimeMetrics(MetricName.getFileStatus.name(), f, () -> {
    try {
      // full ~25.2s backoff
      consistencyGuard.waitTillFileAppears(convertToDefaultStoragePath(f));
    } catch (TimeoutException e) {
      // pass
    }
    return fileSystem.getFileStatus(convertToDefaultPath(f));
  });
}
```
On macOS local `file://`, `waitTillFileAppears` checks
`convertToDefaultStoragePath(f)`, never sees the file, and runs the full
exponential backoff (`0.4 + 0.8 + 1.6 + 3.2 + 6.4 + 12.8 = ~25.2s`) per deleted
file before swallowing the `TimeoutException` (`// pass`). The file is still
deleted correctly; the wait is purely wasted wall-clock time.
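To make the arithmetic concrete, here is a minimal, self-contained sketch of a doubling backoff whose timeout is swallowed by the caller. It is not Hudi's guard implementation: `BackoffSketch`, `waitTillVisible`, and `stallMs` are hypothetical names, the 400 ms initial interval and 6 checks are assumptions read off the series above, and sleeps are accumulated into a counter rather than actually slept so the example runs instantly.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a consistency guard; not Hudi's actual code. */
final class BackoffSketch {

  /**
   * Polls isVisible up to maxChecks times. After each failed check it records a
   * simulated sleep in sleptMs[0] and doubles the interval. Throws once all
   * checks have failed.
   */
  static void waitTillVisible(BooleanSupplier isVisible, long initialMs,
                              int maxChecks, long[] sleptMs) throws TimeoutException {
    long interval = initialMs;
    for (int attempt = 0; attempt < maxChecks; attempt++) {
      if (isVisible.getAsBoolean()) {
        return; // path resolved, no further waiting
      }
      sleptMs[0] += interval; // a real guard would Thread.sleep(interval) here
      interval *= 2;
    }
    throw new TimeoutException("path never became visible");
  }

  /** Returns the simulated milliseconds spent waiting, swallowing the timeout. */
  static long stallMs(BooleanSupplier isVisible, long initialMs, int maxChecks) {
    long[] sleptMs = {0};
    try {
      waitTillVisible(isVisible, initialMs, maxChecks, sleptMs);
    } catch (TimeoutException e) {
      // swallowed, mirroring the "// pass" in the snippet above
    }
    return sleptMs[0];
  }

  public static void main(String[] args) {
    // Path never seen (the macOS case described above): prints 25200
    System.out.println(stallMs(() -> false, 400, 6));
    // Path seen on the first check: prints 0
    System.out.println(stallMs(() -> true, 400, 6));
  }
}
```

Under these assumptions, a path that never becomes visible costs 25,200 ms per call, while a path that resolves on the first check costs nothing.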
### Local (macOS) reproduction
- Full `TestCleanerInsertAndCleanByVersions` (4 methods): ~12 min, ~183
tasks each stalling ~25.2s.
- A single method (`testInsertAndCleanByVersions`): ~1 min, ~6 stalls.
- JDK 11 and JDK 17 behave identically (the backoff is a `Thread.sleep`, so it
is JDK-independent); the per-method stall count simply scales with how many
files the method cleans.
### CI/Linux: not reproducible (verified)
A run with verbose logging (`-Pwarn-log` dropped,
`-Dsurefire.useFile=false`, so per-task timings are visible):
- 7,109 `Finished task ... in N ms` lines emitted (logging confirmed
working) and 1,454 cleaner/delete lines (the delete path was genuinely
exercised).
- Tasks at ~25s: **0**. Tasks at 10s+: **0**. **Slowest task: 2.87s.** Class
finished in ~84-120s.
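A check of this kind can be scripted. The sketch below is one possible way to count slow tasks from captured log lines, not the tooling used for the run above: `TaskTimingScan` and its methods are hypothetical names, the regex only assumes the `Finished task ... in N ms` shape quoted above, and the sample lines in `main` are made up for illustration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for scanning "Finished task ... in N ms" log lines. */
final class TaskTimingScan {

  // Lazily skips whatever sits between "Finished task" and the final "in N ms".
  private static final Pattern FINISHED =
      Pattern.compile("Finished task .*? in (\\d+) ms");

  /** Counts finished-task lines whose duration is at least thresholdMs. */
  static long countAtLeast(List<String> lines, long thresholdMs) {
    long count = 0;
    for (String line : lines) {
      Matcher m = FINISHED.matcher(line);
      if (m.find() && Long.parseLong(m.group(1)) >= thresholdMs) {
        count++;
      }
    }
    return count;
  }

  /** Returns the largest task duration in ms, or -1 if no line matches. */
  static long slowestMs(List<String> lines) {
    long slowest = -1;
    for (String line : lines) {
      Matcher m = FINISHED.matcher(line);
      if (m.find()) {
        slowest = Math.max(slowest, Long.parseLong(m.group(1)));
      }
    }
    return slowest;
  }

  public static void main(String[] args) {
    // Made-up sample lines; real log lines may carry different surrounding text.
    List<String> sample = List.of(
        "INFO Finished task 0.0 in stage 1.0 (TID 1) in 312 ms",
        "INFO Finished task 1.0 in stage 1.0 (TID 2) in 2870 ms",
        "INFO some unrelated cleaner line");
    System.out.println("tasks >= 10s: " + countAtLeast(sample, 10_000));
    System.out.println("slowest ms: " + slowestMs(sample));
  }
}
```

A threshold of 10,000 ms separates normal tasks from ones that hit the backoff, since a single stalled delete alone adds roughly 25 s.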
So on CI's Linux filesystem the path resolves immediately and the backoff
never fires.
### Conclusion
This is specific to **macOS local `file://` path translation**, not a
general code defect, and is invisible on CI. The gotcha is now documented
inline next to `withConsistencyCheckEnabled(true)` in the test so the next
person hitting a slow local run knows why.