[ https://issues.apache.org/jira/browse/ARROW-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436543#comment-17436543 ]
Antoine Pitrou edited comment on ARROW-14523 at 10/31/21, 7:59 PM:
-------------------------------------------------------------------

The filesystem API doesn't know up front what the total file size will be, which is why it always uses multipart upload. Perhaps we could have a one-shot file write API that would use another API for shorter uploads. That said, that API probably wouldn't be used by the Parquet writer...

was (Author: pitrou):
The filesystem API doesn't know up front what the total file size will be, which is why it always uses multipart upload. Perhaps we could have a one-shot file write API that would use another API for shorter uploads.

> [Python] S3FileSystem write_table can lose data
> -----------------------------------------------
>
>                 Key: ARROW-14523
>                 URL: https://issues.apache.org/jira/browse/ARROW-14523
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 5.0.0
>            Reporter: Mark Seitter
>            Priority: Critical
>              Labels: AWS
>
> We have seen odd behavior on very rare occasions when writing a Parquet table
> to S3 using the S3FileSystem (from pyarrow.fs import S3FileSystem). Even
> though the application returns without errors, data can be missing from the
> bucket. Internally it performs an S3 multipart upload, but it is not handling
> a special error condition in which S3 returns a 200. Per the [AWS
> docs|https://aws.amazon.com/premiumsupport/knowledge-center/s3-resolve-200-internalerror/],
> CompleteMultipartUpload (which is being called) can return a 200 response
> with an InternalError payload, and that response needs to be treated as a
> 5XX. This isn't happening with pyarrow; instead it is treated as a success,
> causing the caller to *think* their data was uploaded when it actually was
> not.
> Doing an s3 list-parts call for the <upload-id> of the InternalError request
> shows the parts are still there and not completed.
> From our S3 access logs, with <my-key> and <upload-id> sanitized for security:
>
> |operation|key|requesturi_operation|requesturi_key|requesturi_httpprotoversion|httpstatus|errorcode|
> |REST.PUT.PART|<my-key>-SNAPPY.parquet|PUT|/<my-key>-SNAPPY.parquet?partNumber=1&uploadId=<upload-id>|HTTP/1.1|200|-|
> |REST.POST.UPLOAD|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploadId=<upload-id>|HTTP/1.1|200|InternalError|
> |REST.POST.UPLOADS|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploads|HTTP/1.1|200|-|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
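The failure mode shown in the access-log table can be sketched as a response check: CompleteMultipartUpload may answer HTTP 200 with an XML body that is an <Error> document (e.g. <Code>InternalError</Code>) rather than a <CompleteMultipartUploadResult>, and AWS advises treating such a response as a retryable 5XX. A minimal illustrative sketch, assuming the raw status code and response body are available; the helper name `complete_multipart_succeeded` is hypothetical and is not pyarrow's or the AWS SDK's actual code:

```python
def complete_multipart_succeeded(http_status: int, body_xml: str) -> bool:
    """Decide whether a CompleteMultipartUpload response really succeeded.

    Per AWS guidance, a 200 response whose body is an <Error> document
    must be treated like a 5XX (i.e. retried), not reported as success.
    """
    if http_status != 200:
        return False
    # A genuine success carries a <CompleteMultipartUploadResult> document;
    # an <Error> document inside a 200 means the completion actually failed.
    return ("<Error>" not in body_xml
            and "<CompleteMultipartUploadResult>" in body_xml)
```

In the logged case above, the REST.POST.UPLOAD row (httpstatus 200, errorcode InternalError) would fail this check, so the client would retry or surface an error instead of silently reporting success.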