Re: [PR] HADOOP-18679. Add API for bulk/paged delete of files [hadoop]

via GitHub Mon, 24 Jun 2024 03:27:23 -0700


steveloughran commented on PR #6726:
URL: https://github.com/apache/hadoop/pull/6726#issuecomment-2186215000

An AWS SDK delete with a version ID actually requires more IAM permissions than
an unversioned delete, which always removes the HEAD object, because granting that
permission allows the caller to delete backups. Deployments where apps can
delete HEAD but not versions are not unusual for this reason.

This is why S3A doesn't use it even in simple listing -> delete calls where
the status is known.

You might also need to issue getFileStatus/list calls to obtain the version IDs,
which would massively increase the cost if the process didn't have those values already.

A bulk delete with a tuple of (path, version) for each entry could work, if
the store could be configured to use that version ID/type. For S3A we would
leave it off by default. The tuple would be a `Map.Entry` to be reflection friendly.

If you do think version/etag support would be a blocker to use, well, things
haven't shipped yet, though @mukund-thakur is preparing a 3.4.1 alpha release.

You (and it would be you, sorry) will need to modify the API with
* `Collection<Map.Entry<Path, version>>[]`
* S3A impl to not use the version by default, with an option to turn it on, and
parameterized testing for this if a versioned bucket is the test bucket.
* WrappedIO changed to match
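
A minimal sketch of what the (path, version) entries and the off-by-default switch could look like. Everything here is hypothetical: the class name, the `useVersions` flag and the `toRequests` method are made up for illustration and are not part of Hadoop or of this PR. Paths are plain strings only to keep the example self-contained; a real API would take `org.apache.hadoop.fs.Path`.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; none of these names exist in Hadoop. */
class VersionedBulkDeleteSketch {

  /** Hypothetical store option; false mirrors "off by default" for S3A. */
  private final boolean useVersions;

  VersionedBulkDeleteSketch(boolean useVersions) {
    this.useVersions = useVersions;
  }

  /**
   * Turn caller-supplied (path, version) entries into the (key, version)
   * pairs a store would put in its delete request. With versions disabled
   * the version is dropped (null), so the delete is an unversioned one
   * and needs no extra permissions.
   */
  List<Map.Entry<String, String>> toRequests(
      Collection<Map.Entry<String, String>> entries) {
    List<Map.Entry<String, String>> requests = new ArrayList<>(entries.size());
    for (Map.Entry<String, String> e : entries) {
      String version = useVersions ? e.getValue() : null;
      // SimpleImmutableEntry accepts a null value; Map.entry() would throw.
      requests.add(new SimpleImmutableEntry<>(e.getKey(), version));
    }
    return requests;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> entries = List.of(
        Map.entry("s3a://bucket/table/part-0001", "v1"),
        Map.entry("s3a://bucket/table/part-0002", "v7"));
    // Default behaviour: versions ignored.
    System.out.println(new VersionedBulkDeleteSketch(false).toRequests(entries));
    // Opt-in behaviour: versions passed through.
    System.out.println(new VersionedBulkDeleteSketch(true).toRequests(entries));
  }
}
```

One detail worth noting if `Map.Entry` is the tuple type: `Map.entry()` rejects null keys and values, so callers with no known version for a path would need a different `Map.Entry` implementation or an agreed placeholder.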

This isn't that useful for table compaction, as the engines tend to use
randomness in their names to spread the S3 store load across shards. But it
could have other uses.

For example, here's some work to do version printing, recovery and copy
within the same bucket; it lets you pull out the layers underneath a directory tree:

https://github.com/steveloughran/cloudstore/tree/main/src/main/extra/org/apache/hadoop/fs/s3a/extra

The question is: do we want to make something that complex part of a broader API
with tests, specification, commitments to maintain etc., or do we just say call
`S3A.getInternals().getClient()` and then sort it out yourself?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

