The GitHub Actions job "Release Auditing" on texera.git/gh-readonly-queue/main/pr-5853-a5d8602b44f8297a15cf7800dde468d7d784b235 has succeeded.
Run started by GitHub user xuang7 (triggered by xuang7).

Head commit for run:
18e4e67c70fcc4d4c739338d9c30ce552db3decc / Matthew B. <[email protected]>
fix(file-service): retry S3 bucket creation on slow startup (#5853)

### What changes were proposed in this PR?

- Add `awaitDependency` to `FileService`, an exponential-backoff retry (6 attempts from 200ms, ~6s total) with an injectable sleep, mirroring `LakeFSStorageClient.retryWithBackoff`.
- Wrap the two `S3StorageClient.createBucketIfNotExist` calls in `FileService.run` with it, so a slow-to-start MinIO/S3 no longer aborts file-service startup.
- Handle `InterruptedException` consistently: an interrupt arriving during the backoff `sleep` (not just during the bucket operation) now restores the thread's interrupt status and fails fast, instead of escaping as a raw `InterruptedException` with the interrupt flag lost.
- Leave `LakeFSStorageClient.healthCheck()` on its existing inner retry (unchanged).
- Add `FileServiceSpec` (8 tests) covering immediate success, default-argument success, retry-then-success, the full backoff progression to give-up, give-up preserving the cause, `maxAttempts == 1`, and interrupt-fails-fast for both interrupt points.

### Any related issues, documentation, discussions?

Closes: #5852

Note: `awaitDependency` is a near-duplicate of `LakeFSStorageClient.retryWithBackoff` in `common/workflow-core`. Extracting a single shared helper that both delegate to is the cleaner end state, but it would refactor a stable, separately-tested class in another module, so it is deferred to a follow-up rather than widening the scope of this startup-race fix.

### How was this PR tested?

- Run `sbt "FileService/testOnly org.apache.texera.service.FileServiceSpec"` and expect 8 passing tests:
  - immediate success runs the operation once and never sleeps;
  - default-argument success returns on the first try without invoking the default `Thread.sleep` backoff;
  - retry-then-success records delays `List(200, 400)` before succeeding on the 3rd try;
  - exhausting all 6 attempts records the full progression `List(200, 400, 800, 1600, 3200)` before giving up;
  - give-up rethrows after `maxAttempts` with the original exception as `getCause` and the dependency name in the message;
  - `maxAttempts == 1` gives up after a single attempt without sleeping;
  - an interrupt while running the operation restores the interrupt flag and fails fast;
  - an interrupt while sleeping between attempts likewise restores the interrupt flag and fails fast.
- This environment hits a pre-existing JaCoCo instrumentation error (`Unsupported class file major version 69`) because JaCoCo 0.8.11 cannot instrument JDK 25 class files; this is unrelated to the change. The spec was verified locally against a JDK 17 toolchain (`sbt -java-home <jdk17>`, 8/8 pass) and relies on CI's JDK/JaCoCo combo for the standard instrumented run. `scalafmtCheck` is clean for both main and test sources.

### Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

Report URL: https://github.com/apache/texera/actions/runs/27970596546

With regards,
GitHub Actions via GitBox
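
For readers who want to see the retry behavior described in the commit message in concrete form, here is a minimal sketch in Scala. Only the name `awaitDependency`, the 6-attempt / 200 ms doubling schedule, the injectable sleep, and the interrupt handling come from the message. The signature, the parameter names (`name`, `initialDelayMs`, `sleep`), the enclosing `BackoffSketch` object, and the use of `RuntimeException` as the wrapper are assumptions made for illustration and may not match the actual Texera code.

```scala
import scala.annotation.tailrec
import scala.util.control.NonFatal

object BackoffSketch {

  // Hypothetical stand-in for the PR's `awaitDependency`; the real signature may differ.
  def awaitDependency[T](
      name: String,
      maxAttempts: Int = 6,
      initialDelayMs: Long = 200L,
      sleep: Long => Unit = (ms: Long) => Thread.sleep(ms) // injectable so tests can record delays
  )(operation: => T): T = {
    require(maxAttempts >= 1, "maxAttempts must be at least 1")

    // Restore the interrupt flag and fail fast rather than leaking a raw InterruptedException.
    def failInterrupted(e: InterruptedException): Nothing = {
      Thread.currentThread().interrupt()
      throw new RuntimeException(s"Interrupted while waiting for $name", e)
    }

    @tailrec
    def attempt(n: Int, delayMs: Long): T = {
      val outcome: Either[Throwable, T] =
        try Right(operation)
        catch {
          case e: InterruptedException => failInterrupted(e)
          case NonFatal(e)             => Left(e)
        }

      outcome match {
        case Right(value) => value
        case Left(e) if n >= maxAttempts =>
          // Give up: keep the original failure as the cause and name the dependency.
          throw new RuntimeException(s"$name not ready after $maxAttempts attempts", e)
        case Left(_) =>
          // An interrupt during the backoff sleep is handled the same way as one in the operation.
          try sleep(delayMs)
          catch { case e: InterruptedException => failInterrupted(e) }
          attempt(n + 1, delayMs * 2) // double the delay: 200, 400, 800, 1600, 3200
      }
    }

    attempt(1, initialDelayMs)
  }
}
```

With the defaults, a dependency that never becomes ready is tried 6 times with 5 sleeps in between, totalling 200 + 400 + 800 + 1600 + 3200 = 6200 ms, which is where the "~6s total" figure comes from. Passing a recording function as `sleep` lets a test assert on that progression without actually waiting.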
