Mark Seitter created ARROW-14523:
------------------------------------

             Summary: [Python] S3FileSystem write_table can lose data
                 Key: ARROW-14523
                 URL: https://issues.apache.org/jira/browse/ARROW-14523
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 5.0.0
            Reporter: Mark Seitter


We have seen odd behavior on very rare occasions when writing a Parquet table 
to S3 using the S3FileSystem.  Even though the application returns without 
errors, data is missing from the bucket.  Internally the filesystem performs an 
S3 multipart upload, but it does not handle a special error condition where S3 
returns a 200 status with an error payload. Per the [AWS 
docs|https://aws.amazon.com/premiumsupport/knowledge-center/s3-resolve-200-internalerror/], 
CompleteMultipartUpload (which is being called) can return a 200 response with 
an InternalError payload, and that response needs to be treated as a 5XX. It 
appears this isn't happening in pyarrow; the response is treated as a success, 
which causes the caller to *think* their data was uploaded when it actually was not.

Doing an S3 list-parts call for the <upload-id> from the InternalError request 
shows the parts are still present and the upload was never completed.
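That check can be reproduced with boto3 (the equivalent of the aws s3api list-parts CLI call); the bucket is a placeholder and <my-key>/<upload-id> are the sanitized values from the access logs below:

{code:python}
import boto3

s3 = boto3.client("s3")

# <my-key> and <upload-id> are the sanitized values from the access logs.
resp = s3.list_parts(
    Bucket="my-bucket",
    Key="<my-key>-SNAPPY.parquet",
    UploadId="<upload-id>",
)

# The parts are still listed, i.e. the multipart upload was never
# successfully completed despite the 200 response.
for part in resp.get("Parts", []):
    print(part["PartNumber"], part["Size"], part["ETag"])
{code}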

From our S3 access logs, with <my-key> and <upload-id> sanitized for security:
||operation||key||requesturi_operation||requesturi_key||requesturi_httpprotoversion||httpstatus||errorcode||
|REST.PUT.PART|<my-key>-SNAPPY.parquet|PUT|/<my-key>-SNAPPY.parquet?partNumber=1&uploadId=<upload-id>|HTTP/1.1|200|-|
|REST.POST.UPLOAD|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploadId=<upload-id>|HTTP/1.1|200|InternalError|
|REST.POST.UPLOADS|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploads|HTTP/1.1|200|-|
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
