kunwp1 opened a new issue, #5393:
URL: https://github.com/apache/texera/issues/5393

   ### Task Summary
   
   `S3StorageClient.deleteDirectory` lists objects with a single 
`listObjectsV2` call (no continuation token) and deletes them with a single 
`deleteObjects` call. S3 returns at most **1000 keys** per `listObjectsV2` 
response and accepts at most 1000 keys per `deleteObjects` request, so deleting 
a prefix that holds more than 1000 objects silently removes only the first page 
and leaves the rest behind.
   
   This surfaced during review of #5280 (large-binary cleanup scoped by 
execution id; closes #4123): `LargeBinaryManager.deleteByExecution` delegates 
to `deleteDirectory`, so an execution that produced more than 1000 large 
binaries would leak the remainder. The cap applies to **every** caller of 
`deleteDirectory` (result / console-message / runtime-stats cleanup as well), 
so this is a shared-infra defect, not large-binary-specific.
   
   **Root cause:** the listing is not paginated and the delete is not chunked.
   
   ```
   Before:  deleteDirectory(prefix, 2500 objects)  ->  ~1000 deleted, ~1500 silently leaked
   After:   deleteDirectory(prefix, 2500 objects)  ->  all 2500 deleted
   ```
   
   **Proposed fix**
   - Loop `listObjectsV2` on the continuation token until `isTruncated` is 
false.
   - Delete in batches of <= 1000 keys.
   - Add a regression test that creates > 1000 objects under one prefix (MinIO 
test container) and asserts the prefix is empty after `deleteDirectory`.
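
The loop described above could look roughly like the sketch below. It is not the actual `S3StorageClient` code: `FakeStore`, `Page`, and `PageLimit` are hypothetical stand-ins so the example runs without AWS or MinIO. In the real client, `list` would presumably map to `listObjectsV2` with a continuation token and `delete` to `deleteObjects`.

```scala
import scala.annotation.tailrec
import scala.collection.mutable

object PaginatedDeleteSketch {
  // S3 cap per listObjectsV2 response and per deleteObjects request.
  val PageLimit = 1000

  // Hypothetical view of one listing page; nextToken is set only when the listing is truncated.
  final case class Page(keys: Seq[String], nextToken: Option[String])

  // Hypothetical in-memory stand-in for the S3 client (assumption, not the Texera API).
  final class FakeStore(initial: Seq[String]) {
    private val objects = mutable.TreeSet(initial: _*)

    // Returns at most PageLimit keys under the prefix, starting after the token key.
    def list(prefix: String, token: Option[String]): Page = {
      val remaining = objects.iterator
        .filter(k => k.startsWith(prefix) && token.forall(t => k.compareTo(t) > 0))
        .take(PageLimit + 1)
        .toVector
      val page = remaining.take(PageLimit)
      Page(page, if (remaining.size > PageLimit) page.lastOption else None)
    }

    // Rejects oversized batches, mirroring the per-request cap.
    def delete(keys: Seq[String]): Unit = {
      require(keys.size <= PageLimit, s"at most $PageLimit keys per delete request")
      objects --= keys
    }

    def count(prefix: String): Int = objects.count(_.startsWith(prefix))
  }

  // Lists page by page and deletes each page in batches of at most PageLimit keys.
  // Returns the number of keys deleted.
  def deleteDirectory(store: FakeStore, prefix: String): Int = {
    @tailrec
    def loop(token: Option[String], deleted: Int): Int = {
      val page = store.list(prefix, token)
      page.keys.grouped(PageLimit).foreach(store.delete)
      val total = deleted + page.keys.size
      page.nextToken match {
        case Some(next) => loop(Some(next), total)
        case None       => total
      }
    }
    loop(None, 0)
  }

  def main(args: Array[String]): Unit = {
    val keys = (1 to 2500).map(i => f"exec-1/obj-$i%05d") ++ Seq("exec-2/keep")
    val store = new FakeStore(keys)
    val deleted = deleteDirectory(store, "exec-1/")
    assert(deleted == 2500)
    assert(store.count("exec-1/") == 0) // the prefix is empty afterwards
    assert(store.count("exec-2/") == 1) // other prefixes are untouched
    println(s"deleted $deleted objects")
  }
}
```

The `main` method doubles as a miniature version of the proposed regression test: it creates more than 1000 objects under one prefix and checks that none remain. Deleting page by page keeps memory bounded. Collecting all keys first and then calling `grouped(1000)` would work equally well for moderately sized prefixes.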
   
   **Location:** `common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala`, method `deleteDirectory`.
   
   ### Task Type
   - [ ] Refactor / Cleanup
   - [ ] DevOps / Deployment / CI
   - [x] Testing / QA
   - [ ] Documentation
   - [ ] Performance
   - [x] Other (correctness — silent incomplete deletion / storage leak)
   

