The GitHub Actions job "Release Auditing" on 
texera.git/gh-readonly-queue/main/pr-5853-a5d8602b44f8297a15cf7800dde468d7d784b235
 has succeeded.
Run started by GitHub user xuang7 (triggered by xuang7).

Head commit for run:
18e4e67c70fcc4d4c739338d9c30ce552db3decc / Matthew B. <[email protected]>
fix(file-service): retry S3 bucket creation on slow startup (#5853)

### What changes were proposed in this PR?
- Add `awaitDependency` to `FileService`, an exponential-backoff retry
(6 attempts from 200ms, ~6s total) with an injectable sleep, mirroring
`LakeFSStorageClient.retryWithBackoff`.
- Wrap the two `S3StorageClient.createBucketIfNotExist` calls in
`FileService.run` with it, so a slow-to-start MinIO/S3 no longer aborts
file-service startup.
- Handle `InterruptedException` consistently: an interrupt arriving
during the backoff `sleep` (not just during the bucket operation) now
restores the thread's interrupt status and fails fast, instead of
escaping as a raw `InterruptedException` with the interrupt flag lost.
- Leave `LakeFSStorageClient.healthCheck()` on its existing inner retry
(unchanged).
- Add `FileServiceSpec` (8 tests) covering immediate success,
default-argument success, retry-then-success, the full backoff
progression to give-up, give-up preserving the cause,
`maxAttempts == 1`, and interrupt-fails-fast for both interrupt points.
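
For illustration, the retry behaviour described in the list above could look roughly like the sketch below. It is not the code from the PR: the object name, parameter names, signature, and exception messages are assumptions, and only the behaviour (6 attempts, doubling from 200 ms, injectable sleep, interrupt flag restored on both interrupt points, original exception kept as the cause) is taken from the description.

```scala
object AwaitDependencySketch {

  // Hypothetical signature; the real FileService.awaitDependency may differ.
  def awaitDependency[T](
      name: String,
      maxAttempts: Int = 6,
      initialDelayMs: Long = 200L,
      sleep: Long => Unit = ms => Thread.sleep(ms) // injectable so tests never really sleep
  )(operation: => T): T = {

    // Restore the interrupt flag and fail fast instead of leaking a raw InterruptedException.
    def failInterrupted(e: InterruptedException): Nothing = {
      Thread.currentThread().interrupt()
      throw new RuntimeException(s"Interrupted while waiting for $name", e)
    }

    def attempt(n: Int, delayMs: Long): T = {
      val outcome: Either[Throwable, T] =
        try Right(operation)
        catch {
          case e: InterruptedException => failInterrupted(e) // interrupt during the operation
          case e: Exception            => Left(e)
        }
      outcome match {
        case Right(value) => value
        case Left(e) if n >= maxAttempts =>
          // Give up: keep the original failure as the cause and name the dependency.
          throw new RuntimeException(s"$name not ready after $maxAttempts attempts", e)
        case Left(_) =>
          try sleep(delayMs)
          catch { case e: InterruptedException => failInterrupted(e) } // interrupt during backoff
          attempt(n + 1, delayMs * 2) // exponential backoff: double the delay each time
      }
    }

    attempt(1, initialDelayMs)
  }
}
```

A call site would then wrap the bucket creation, for example `awaitDependency("S3") { S3StorageClient.createBucketIfNotExist(bucketName) }`, where `bucketName` is a placeholder and the argument list of `createBucketIfNotExist` is assumed rather than taken from the source.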

### Any related issues, documentation, discussions?
Closes: #5852

Note: `awaitDependency` is a near-duplicate of
`LakeFSStorageClient.retryWithBackoff` in `common/workflow-core`.
Extracting a single shared helper that both delegate to is the cleaner
end state, but it would refactor a stable, separately-tested class in
another module, so it is deferred to a follow-up rather than widening
the scope of this startup-race fix.

### How was this PR tested?
- Run `sbt "FileService/testOnly org.apache.texera.service.FileServiceSpec"`
  and expect 8 passing tests:
  - immediate success runs the operation once and never sleeps;
  - default-argument success returns on the first try without invoking
    the default `Thread.sleep` backoff;
  - retry-then-success records delays `List(200, 400)` before succeeding
    on the 3rd try;
  - exhausting all 6 attempts records the full progression
    `List(200, 400, 800, 1600, 3200)` before giving up;
  - give-up rethrows after `maxAttempts` with the original exception as
    `getCause` and the dependency name in the message;
  - `maxAttempts == 1` gives up after a single attempt without sleeping;
  - an interrupt while running the operation restores the interrupt flag
    and fails fast;
  - an interrupt while sleeping between attempts likewise restores the
    interrupt flag and fails fast.
- This environment hits a pre-existing JaCoCo instrumentation error
(`Unsupported class file major version 69`) because JaCoCo 0.8.11 cannot
instrument JDK 25 class files; this is unrelated to the change. The spec
was verified locally against a JDK 17 toolchain (`sbt -java-home
<jdk17>`, 8/8 pass) and relies on CI's JDK/JaCoCo combo for the standard
instrumented run. `scalafmtCheck` is clean for both main and test
sources.
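
The delay lists expected by the tests above follow from doubling a 200 ms initial delay and sleeping only between attempts, never after the last one. A small sketch of that arithmetic, using a hypothetical helper that is not part of the PR:

```scala
object BackoffProgression {
  // Hypothetical helper: the delays slept between attempts.
  // The final attempt is not followed by a sleep, so there are maxAttempts - 1 delays.
  def delays(maxAttempts: Int, initialDelayMs: Long): List[Long] =
    Iterator.iterate(initialDelayMs)(_ * 2).take(math.max(maxAttempts - 1, 0)).toList
}
```

With 6 attempts and a 200 ms start this gives `List(200, 400, 800, 1600, 3200)`, which sums to 6200 ms and matches the "~6s total" figure; with `maxAttempts == 1` the list is empty, so no sleep occurs.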

### Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF

Report URL: https://github.com/apache/texera/actions/runs/27970596546

With regards,
GitHub Actions via GitBox
