[ 
https://issues.apache.org/jira/browse/ARROW-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597211#comment-17597211
 ] 

Rusty Conover commented on ARROW-17544:
---------------------------------------

S3 splits "listing" into two API methods: one that is aware of bucket 
versioning and one that is not.

- List Objects V2 - 
[https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html]

ListObjectsV2 does not include the "versions" of each S3 key.

- List Object Versions - 
[https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectVersions.html]

ListObjectVersions lists every version of every S3 key.

Since there can be hundreds of versions of a single key, listing all versions 
of keys in a bucket can take much longer than just listing the bucket.
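
To make the difference concrete, here is a minimal sketch in Python. It assumes a boto3-style S3 client (for example, one created with boto3.client("s3")) is passed in. The helper names iter_latest_keys and iter_all_versions are made up for illustration and are not part of Arrow or boto3.

```python
from typing import Any, Iterator, Tuple


def iter_latest_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield each key once using ListObjectsV2 (no version information)."""
    # Assumption: `client` behaves like a boto3 S3 client.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no matching keys.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def iter_all_versions(
    client: Any, bucket: str, prefix: str = ""
) -> Iterator[Tuple[str, str, bool]]:
    """Yield (key, version_id, is_latest) using ListObjectVersions."""
    paginator = client.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Delete markers arrive under a separate "DeleteMarkers" entry
        # and are deliberately skipped in this sketch.
        for version in page.get("Versions", []):
            yield version["Key"], version["VersionId"], version["IsLatest"]
```

The second helper yields one tuple per stored version rather than one entry per key, which is why it can need far more requests on a bucket with a long version history.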

> [C++/Python] Add support for S3 Bucket Versioning
> -------------------------------------------------
>
>                 Key: ARROW-17544
>                 URL: https://issues.apache.org/jira/browse/ARROW-17544
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>    Affects Versions: 9.0.0
>            Reporter: Rusty Conover
>            Assignee: Rusty Conover
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Arrow offers a reasonably capable S3 interface, but it lacks support for S3 
> Buckets that have versioning enabled.  For information about what S3 bucket 
> versioning is, see:
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]
> If Arrow is interacting with a bucket where versioning is enabled, there can 
> be S3 keys that have multiple versions of content stored under the same 
> key name.  Currently, Arrow cannot:
>  # Access specific versions of an S3 key rather than only the latest version 
> of that key.  There is no way to specify the VersionId parameter of S3's 
> GetObject API.
>  # Report the VersionId created when a new S3 key is uploaded to a bucket.
> Along with S3, GCS also supports versioned buckets.
> [https://cloud.google.com/storage/docs/object-versioning]
> The FileSystem interface has a few shortcomings when it comes to remote 
> file systems that support versioning:
> 1. The parameters for open_input_stream() and open_input_file() do not easily 
> lend themselves to an additional "version" parameter, because it would have 
> to be passed to all other implemented filesystems.  Most other file 
> systems don't actually support versioning.
> 2. Upon completion of an S3 multipart upload (i.e., close() on an 
> S3FileSystem output stream), there is currently no way for the user to 
> determine the VersionId or ETag of the S3 key that was created.  This 
> matters because, when there are multiple concurrent writers to S3, it 
> should be possible to identify the S3 key that was written.
> Proposed solutions to enable S3 Bucket versioning:
> 1. To allow library callers to read specific versions of an S3 key, extend 
> only the S3FileSystem interface with two new API calls:
> {{open_input_stream_with_version()}}
> {{open_input_file_with_version()}}
> Both are like their namesakes from the normal FileSystem interface but take 
> an additional parameter of a "version," which is a string representation of 
> the VersionId returned by S3 when the S3 Key is created.  If these functions 
> are called with an empty string for the specified version, the latest version 
> of the S3 key will be returned.
> I'm a bit reluctant to create these specialized functions just on the 
> S3FileSystem interface, but I also don't think it is appropriate to change 
> open_input_stream() and open_input_file()'s parameter list for all 
> filesystems just for functionality that is only implemented by a small number 
> of filesystems.
> 2. Allow callers to call ReadMetadata() on an S3FileSystem output stream to 
> retrieve the metadata about the S3 key that has been written after the stream 
> has been closed.  The metadata will likely include both a VersionId and a 
> value for ETag.
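
The read and write behaviour proposed in the quoted issue can be sketched against the underlying S3 calls. This is only an illustration in Python, assuming a boto3-style S3 client is passed in. The helper names read_object_version and write_object are hypothetical and are not the proposed Arrow functions; a single PutObject stands in for the multipart upload that Arrow's output stream performs.

```python
from typing import Any, Dict, Optional


def read_object_version(
    client: Any, bucket: str, key: str, version: str = ""
) -> bytes:
    """Read one version of a key; an empty version means the latest."""
    params = {"Bucket": bucket, "Key": key}
    if version:
        # GetObject takes an optional VersionId parameter.
        params["VersionId"] = version
    response = client.get_object(**params)
    return response["Body"].read()


def write_object(
    client: Any, bucket: str, key: str, data: bytes
) -> Dict[str, Optional[str]]:
    """Write a key and report the identifiers S3 assigned to it."""
    response = client.put_object(Bucket=bucket, Key=key, Body=data)
    # VersionId is only present when versioning is enabled on the bucket.
    return {
        "VersionId": response.get("VersionId"),
        "ETag": response.get("ETag"),
    }
```

Passing the VersionId returned by write_object back into read_object_version lets a writer read exactly what it wrote, even if another writer has since replaced the latest version of the same key.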



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
