kunwp1 opened a new pull request, #5569:
URL: https://github.com/apache/texera/pull/5569

   ### What changes were proposed in this PR?
   
   `S3StorageClient.deleteDirectory(bucketName, directoryPrefix)` listed 
objects under the prefix with a single `listObjectsV2` call and issued one 
`deleteObjects` batch, with no continuation-token loop. `listObjectsV2` returns 
at most 1000 keys per page, so only the first ≤1000 objects under a prefix were 
ever deleted — any beyond that were silently orphaned (a storage leak). AWS 
`DeleteObjects` also accepts at most 1000 keys per request, so once listing is 
paginated the deletions must be chunked too.
   
   This affects real cleanup paths: `DatasetResource` (deleting a dataset 
repository) and `LargeBinaryManager` (per-execution cleanup), either of which 
can exceed 1000 objects under a single prefix.
   
   Fix:
   - Follow the `listObjectsV2` continuation token until the listing is no 
longer truncated.
   - Delete each page in batches of at most 1000 keys 
(`MAX_KEYS_PER_DELETE_REQUEST`), keeping memory bounded to a single page rather 
than accumulating every key.
   - Remove dead code: an MD5 digest computed over the request's `toString` 
that was never attached to the request (and its now-unused import).
   
   The prefix normalization and the public method signature are unchanged.
   
   ```
   Before:  deleteDirectory(prefix) -> list <=1000 -> delete <=1000 -> 
remaining objects leak
   After:   deleteDirectory(prefix) -> paginate all pages -> delete each page 
in <=1000 batches
   ```
   
   ### Any related issues, documentation, discussions?
   
   Closes #5281
   
   ### How was this PR tested?
   
   Added a `deleteDirectory` test section to `S3StorageClientSpec` (runs 
against a MinIO testcontainer), covering positive, boundary, and negative cases:
   - deletes all objects under a prefix (5 objects);
   - deletes more than 1000 objects under a prefix (1001 → exercises both the 
continuation-token pagination loop and the ≤1000 delete batching);
   - does not throw for a prefix with no objects.
   
   The 1001-object test fails against the pre-fix code (one object is left 
orphaned) and passes after the fix.
   
   ```
   sbt 'WorkflowCore / testOnly 
org.apache.texera.service.util.S3StorageClientSpec'
   # -> Tests: succeeded 20, failed 0
   
   sbt 'WorkflowCore / testOnly *LargeBinaryOutputStreamSpec 
*LargeBinaryInputStreamSpec'
   # -> Tests: succeeded 43, failed 0  (LargeBinaryManager is a consumer of 
deleteDirectory; no regression)
   ```
   
   Requires Docker (testcontainers MinIO).
   
   ### Was this PR authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Claude Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to