kunwp1 opened a new pull request, #5569: URL: https://github.com/apache/texera/pull/5569
### What changes were proposed in this PR? `S3StorageClient.deleteDirectory(bucketName, directoryPrefix)` listed objects under the prefix with a single `listObjectsV2` call and issued one `deleteObjects` batch, with no continuation-token loop. `listObjectsV2` returns at most 1000 keys per page, so only the first ≤1000 objects under a prefix were ever deleted — any beyond that were silently orphaned (a storage leak). AWS `DeleteObjects` also accepts at most 1000 keys per request, so once listing is paginated the deletions must be chunked too. This affects real cleanup paths: `DatasetResource` (deleting a dataset repository) and `LargeBinaryManager` (per-execution cleanup), either of which can exceed 1000 objects under a single prefix. Fix: - Follow the `listObjectsV2` continuation token until the listing is no longer truncated. - Delete each page in batches of at most 1000 keys (`MAX_KEYS_PER_DELETE_REQUEST`), keeping memory bounded to a single page rather than accumulating every key. - Remove dead code: an MD5 digest computed over the request's `toString` that was never attached to the request (and its now-unused import). The prefix normalization and the public method signature are unchanged. ``` Before: deleteDirectory(prefix) -> list <=1000 -> delete <=1000 -> remaining objects leak After: deleteDirectory(prefix) -> paginate all pages -> delete each page in <=1000 batches ``` ### Any related issues, documentation, discussions? Closes #5281 ### How was this PR tested? Added a `deleteDirectory` test section to `S3StorageClientSpec` (runs against a MinIO testcontainer), covering positive, boundary, and negative cases: - deletes all objects under a prefix (5 objects); - deletes more than 1000 objects under a prefix (1001 → exercises both the continuation-token pagination loop and the ≤1000 delete batching); - does not throw for a prefix with no objects. The 1001-object test fails against the pre-fix code (one object is left orphaned) and passes after the fix. ``` sbt 'WorkflowCore / testOnly org.apache.texera.service.util.S3StorageClientSpec' # -> Tests: succeeded 20, failed 0 sbt 'WorkflowCore / testOnly *LargeBinaryOutputStreamSpec *LargeBinaryInputStreamSpec' # -> Tests: succeeded 43, failed 0 (LargeBinaryManager is a consumer of deleteDirectory; no regression) ``` Requires Docker (testcontainers MinIO). ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (Claude Opus 4.8) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
