[ https://issues.apache.org/jira/browse/ARROW-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436543#comment-17436543 ]

Antoine Pitrou edited comment on ARROW-14523 at 10/31/21, 7:59 PM:
-------------------------------------------------------------------

The filesystem API doesn't know up front what the total file size will be, 
which is why it always uses multipart upload. Perhaps we could add a one-shot 
file-write API that uses a different S3 API for shorter uploads. That said, 
that API probably wouldn't be used by the Parquet writer...
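
For context, a minimal sketch of the write path being discussed (bucket, key, 
and region are hypothetical; per the above, pyarrow's S3 output stream always 
goes through multipart upload because the final size is unknown when the 
stream is opened):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

table = pa.table({"col": [1, 2, 3]})
fs = S3FileSystem(region="us-east-1")

# open_output_stream() starts a multipart upload; CompleteMultipartUpload
# is issued when the stream is closed at the end of the with-block.
with fs.open_output_stream("my-bucket/my-key-SNAPPY.parquet") as sink:
    pq.write_table(table, sink)
{code}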



> [Python] S3FileSystem write_table can lose data
> -----------------------------------------------
>
>                 Key: ARROW-14523
>                 URL: https://issues.apache.org/jira/browse/ARROW-14523
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 5.0.0
>            Reporter: Mark Seitter
>            Priority: Critical
>              Labels: AWS
>
> We have seen odd behavior on very rare occasions when writing a parquet table 
> to S3 using the S3FileSystem (from pyarrow.fs import S3FileSystem). Even 
> though the application returns without errors, data would be missing from the 
> bucket. Internally it performs an S3 multipart upload, but it does not handle 
> a special error condition and treats the request as a success. Per the 
> [AWS docs|https://aws.amazon.com/premiumsupport/knowledge-center/s3-resolve-200-internalerror/], 
> CompleteMultipartUpload (which is being called) can return a 200 response 
> with an InternalError payload, which must be treated as a 5XX and retried. 
> This isn't happening with pyarrow; the call is instead treated as a success, 
> causing the caller to *think* their data was uploaded when it actually 
> wasn't. 
> Running an S3 ListParts call for the <upload-id> of the InternalError request 
> shows the parts are still there and the upload was never completed.
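> As an illustration of that check, a minimal boto3 sketch (the bucket name is 
> hypothetical; <my-key> and <upload-id> are the same placeholders as above):
> {code:python}
> import boto3
> 
> s3 = boto3.client("s3")
> 
> # If CompleteMultipartUpload had truly succeeded, the upload id would be gone
> # and this call would fail with NoSuchUpload. In the InternalError case the
> # parts are still listed, i.e. the upload was never completed.
> resp = s3.list_parts(
>     Bucket="my-bucket",
>     Key="<my-key>-SNAPPY.parquet",
>     UploadId="<upload-id>",
> )
> for part in resp.get("Parts", []):
>     print(part["PartNumber"], part["Size"], part["ETag"])
> {code}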
> From our S3 access logs, with <my-key> and <upload-id> sanitized for security:
> ||operation||key||requesturi_operation||requesturi_key||requesturi_httpprotoversion||httpstatus||errorcode||
> |REST.PUT.PART|<my-key>-SNAPPY.parquet|PUT|/<my-key>-SNAPPY.parquet?partNumber=1&uploadId=<upload-id>|HTTP/1.1|200|-|
> |REST.POST.UPLOAD|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploadId=<upload-id>|HTTP/1.1|200|InternalError|
> |REST.POST.UPLOADS|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploads|HTTP/1.1|200|-|
>  
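> Until this is handled in Arrow, one possible caller-side mitigation is to 
> verify the object after writing (a sketch under the assumption that a missing 
> object indicates the failed completion described above; the path is 
> hypothetical):
> {code:python}
> from pyarrow.fs import S3FileSystem, FileType
> 
> fs = S3FileSystem(region="us-east-1")
> path = "my-bucket/my-key-SNAPPY.parquet"
> 
> # ... write the table as usual, then check that the object actually landed.
> info = fs.get_file_info(path)
> if info.type == FileType.NotFound:
>     raise IOError(f"upload of {path} silently failed; retry the write")
> {code}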


