[ https://issues.apache.org/jira/browse/ARROW-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597211#comment-17597211 ]
Rusty Conover commented on ARROW-17544:
---------------------------------------

S3 splits "listing" into two API methods: one that is bucket-versioning aware and one that is not.

- List Objects V2 - [https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html] This does not include the versions of each S3 key.
- List Object Versions - [https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectVersions.html] This lists every version of every S3 key.

Since there can be hundreds of versions of a single key, listing all versions of the keys in a bucket can take much longer than simply listing the bucket.

> [C++/Python] Add support for S3 Bucket Versioning
> -------------------------------------------------
>
>                 Key: ARROW-17544
>                 URL: https://issues.apache.org/jira/browse/ARROW-17544
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>    Affects Versions: 9.0.0
>            Reporter: Rusty Conover
>            Assignee: Rusty Conover
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Arrow offers a reasonably capable S3 interface, but it lacks support for S3 buckets that have versioning enabled. For information about S3 bucket versioning, see:
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]
> If Arrow is interacting with a bucket where versioning is enabled, there can be S3 keys that store multiple versions of content under the same key name. At present, Arrow cannot:
> # Access older versions of an S3 key rather than just the latest version; there is no way to specify the VersionId parameter of S3's GetObject API.
> # Report the VersionId created when a new S3 key is uploaded to a bucket.
> Along with S3, GCS also supports versioned buckets.
> [https://cloud.google.com/storage/docs/object-versioning]
> There are a few shortcomings of the FileSystem interface for supporting remote file systems with versioning:
> 1. The parameters of open_input_stream() and open_input_file() do not easily lend themselves to an additional "version" parameter, because that parameter would be passed to all other implemented filesystems, and most other filesystems do not actually support versioning.
> 2. Upon completion of an S3 multipart upload (i.e., close() on an S3FileSystem output stream), there is currently no way for the user to determine the VersionId or ETag of the S3 key that was created. This matters because, when there are multiple concurrent writers to S3, it should be possible to identify the S3 key that was written.
> Proposed solutions to enable S3 bucket versioning:
> 1. To allow library callers to read specific versions of an S3 key, extend only the S3FileSystem interface with two new API calls:
> {{open_input_stream_with_version()}}
> {{open_input_file_with_version()}}
> Both behave like their namesakes from the normal FileSystem interface, but take an additional "version" parameter: a string representation of the VersionId returned by S3 when the S3 key is created. If these functions are called with an empty string for the version, the latest version of the S3 key is returned.
> I'm a bit reluctant to create these specialized functions only on the S3FileSystem interface, but I also don't think it is appropriate to change the parameter lists of open_input_stream() and open_input_file() for all filesystems just for functionality implemented by only a small number of filesystems.
> 2. Allow callers to call ReadMetadata() on an S3FileSystem output stream, after the stream has been closed, to retrieve the metadata about the S3 key that was written. The metadata will likely include both a VersionId and a value for ETag.
-- This message was sent by Atlassian Jira (v8.20.10#820010)