kunwp1 opened a new issue, #5393: URL: https://github.com/apache/texera/issues/5393
### Task Summary

`S3StorageClient.deleteDirectory` lists objects with a single `listObjectsV2` call (no continuation token) and deletes them with a single `deleteObjects` call. Both are capped at **1000 keys**, so deleting a prefix that holds more than 1000 objects silently removes only the first page and leaves the rest behind.

This surfaced during review of #5280 (large-binary cleanup scoped by execution id; closes #4123): `LargeBinaryManager.deleteByExecution` delegates to `deleteDirectory`, so an execution that produced more than 1000 large binaries would leak the remainder.

The cap applies to **every** caller of `deleteDirectory` (result / console-message / runtime-stats cleanup as well), so this is a shared-infra defect, not large-binary-specific.

**Root cause:** the listing is not paginated and the delete is not chunked.

```
Before: deleteDirectory(prefix, 2500 objects) -> ~1000 deleted, ~1500 silently leaked
After:  deleteDirectory(prefix, 2500 objects) -> all 2500 deleted
```

**Proposed fix**

- Loop `listObjectsV2` on the continuation token until `isTruncated` is false.
- Delete in batches of <= 1000 keys.
- Add a regression test that creates > 1000 objects under one prefix (MinIO test container) and asserts the prefix is empty after `deleteDirectory`.

**Location:** `common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala`, method `deleteDirectory`.

### Task Type

- [ ] Refactor / Cleanup
- [ ] DevOps / Deployment / CI
- [x] Testing / QA
- [ ] Documentation
- [ ] Performance
- [x] Other (correctness: silent incomplete deletion / storage leak)
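
**Sketch of the paginate-and-chunk loop**

The proposed fix (follow the continuation token while listing, delete in batches of at most 1000 keys) could look roughly like the sketch below. It is not the actual `S3StorageClient` code and does not use the AWS SDK types: `ObjectStore`, `ListPage`, `PaginatedDelete` and `InMemoryStore` are hypothetical names introduced only for illustration. `listPage` stands in for `listObjectsV2`, and `deleteBatch` stands in for `deleteObjects`. The in-memory store assumes a token that marks the last key returned, which may differ from how S3 encodes its continuation token.

```scala
import scala.collection.mutable

// Hypothetical stand-in for one page of a listing: keys plus an optional token for the next page.
final case class ListPage(keys: Vector[String], nextToken: Option[String])

// Hypothetical minimal view of the two capped S3 operations.
trait ObjectStore {
  def listPage(prefix: String, token: Option[String]): ListPage // at most 1000 keys per call
  def deleteBatch(keys: Seq[String]): Unit                      // at most 1000 keys per call
}

object PaginatedDelete {
  val MaxKeysPerRequest = 1000

  // Deletes every key under `prefix` and returns how many keys were deleted.
  def deleteDirectory(store: ObjectStore, prefix: String): Int = {
    @annotation.tailrec
    def loop(token: Option[String], deleted: Int): Int = {
      val page = store.listPage(prefix, token)
      // Chunk defensively, even though a single page should already fit in one request.
      page.keys.grouped(MaxKeysPerRequest).foreach(batch => store.deleteBatch(batch))
      val total = deleted + page.keys.size
      page.nextToken match {
        case Some(next) => loop(Some(next), total) // listing was truncated: fetch the next page
        case None       => total                   // last page reached
      }
    }
    loop(None, 0)
  }
}

// In-memory fake that enforces the same 1000-key caps, for exercising the loop without S3.
final class InMemoryStore(initial: Iterable[String]) extends ObjectStore {
  private val keys = mutable.TreeSet.from(initial)

  def listPage(prefix: String, token: Option[String]): ListPage = {
    // Assumption: the token is the last key of the previous page, so listing resumes after it.
    val candidates = keys.iterator
      .filter(k => k.startsWith(prefix) && token.forall(t => k.compareTo(t) > 0))
      .take(PaginatedDelete.MaxKeysPerRequest + 1)
      .toVector
    val page = candidates.take(PaginatedDelete.MaxKeysPerRequest)
    val next = if (candidates.size > page.size) page.lastOption else None
    ListPage(page, next)
  }

  def deleteBatch(batch: Seq[String]): Unit = {
    require(batch.size <= PaginatedDelete.MaxKeysPerRequest, "too many keys in one delete request")
    keys --= batch
  }

  def count(prefix: String): Int = keys.count(_.startsWith(prefix))
}
```

Seeding `InMemoryStore` with 2500 keys under one prefix and calling `PaginatedDelete.deleteDirectory` should leave that prefix empty while keys under other prefixes stay untouched. The regression test proposed above would check the same property against the MinIO test container.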
