The GitHub Actions job "License Binary Checker" on texera.git/main has failed. Run started by GitHub user bobbai00 (triggered by bobbai00).
Head commit for run: 5e569568606a204070040b1521210cc9d853bc10 / Meng Wang <[email protected]> fix: close CloseableIterable owners in Iceberg read paths (#5149) ### What changes were proposed in this PR? Fixes a resource leak in `IcebergUtil.readDataFileAsIterator` and five sibling sites in `IcebergDocument` that share the same anti-pattern: ```scala closeableIterable.iterator().asScala ``` The bare `Iterator` returned to callers held no reference to its parent `CloseableIterable`, so the parent could never be closed. Under `S3FileIO`: 1. Every read leaked one `S3InputStream` (kept open until GC because nothing in the call graph could close it). 2. The leaked stream had already borrowed one slot from the AWS SDK's `ApacheHttpClient` connection pool (default **50**; texera did not override). 3. After ~50 leaked reads the pool may saturate; new S3 reads then block on `acquireConnection` until JVM restart. This PR: - Introduces `CloseableScalaIterator[T]` (`Iterator[T] with AutoCloseable`, idempotent `close()`) in `IcebergUtil`, which wraps a `CloseableIterable[T]` and propagates `close()` to the parent. - Changes `IcebergUtil.readDataFileAsIterator` to return `CloseableScalaIterator[Record]` instead of bare `Iterator[Record]`. Callers must now close it (e.g. via `Using.resource`). - Updates the single caller in `IcebergDocument`'s read iterator to track the close handle in a sibling `AutoCloseable` field (`currentRecordIteratorCloser`) and close it on file-switch, on exhaustion, and on caller-imposed `until` cap. The sibling field is necessary because `Iterator.drop(n)` returns a bare iterator that loses the wrapper type. - Wraps the four eagerly-consumed `planFiles()` call sites — `getCount`, `seekToUsableFile`, `getTableStatistics`, `asInputStream` — in `Using.resource` so the metadata-side `CloseableIterable<FileScanTask>` is closed promptly. **Known limitation (out of scope here):** if a caller of `IcebergDocument.get()` / `getRange()` / `getAfter()` stops iterating before `hasNext` returns `false` (e.g. throws mid-loop, or calls `.take(n)` and then drops the result), the LAST file's `CloseableScalaIterator` will leak until JVM GC. Fixing this requires changing the public `Iterator[T]` return type on `VirtualDocument` to `Iterator[T] with AutoCloseable` and updating all callers — best done as a separate refactor. ### Any related issues, documentation, discussions? Closes #5143. ### How was this PR tested? - Added `IcebergUtilLeakSpec` (2 cases): validates that `CloseableScalaIterator` (a) closes its parent `CloseableIterable` when used inside `Using.resource`, and (b) is idempotent under repeated `close()` calls. - All existing iceberg specs still pass: - `IcebergUtilSpec`: 14/14 - `IcebergUtilLeakSpec`: 2/2 (new) - `IcebergDocumentSpec`: 18/18 (exercises the modified read iterator's close-on-reassign / close-on-exhaustion paths against real Iceberg infrastructure) - `IcebergTableStatsSpec`: 12/12 (exercises `getTableStatistics` with the new `Using.resource` wrap) - `IcebergDocumentConsoleMessagesSpec`: 1/1 Run locally: ``` sbt "WorkflowCore/testOnly org.apache.texera.amber.util.IcebergUtilSpec org.apache.texera.amber.util.IcebergUtilLeakSpec org.apache.texera.amber.storage.result.iceberg.*" ``` Result: `47 succeeded, 0 failed`. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (claude-opus-4-7) Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> Report URL: https://github.com/apache/texera/actions/runs/26331395326 With regards, GitHub Actions via GitBox
