[ https://issues.apache.org/jira/browse/ARROW-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436571#comment-17436571 ]
Mark Seitter commented on ARROW-14523:
--------------------------------------

Ahh ok, that makes sense if you don't know the file size. Agreed, I think the only thing you could do is add a flag/option that lets the caller tell the filesystem API to use a single-part upload, which a user could pass in if they know the file is smaller than 5 GB (the maximum a single PUT can handle). That would let someone like us, with a maximum file size of 30 MB, reduce our cost and improve performance by skipping the multipart calls (and bypass this current bug too :) ). But ultimately I'm very surprised AWS hasn't fixed this bug in their SDK yet.

> [Python] S3FileSystem write_table can lose data
> -----------------------------------------------
>
>                 Key: ARROW-14523
>                 URL: https://issues.apache.org/jira/browse/ARROW-14523
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 5.0.0
>            Reporter: Mark Seitter
>            Priority: Critical
>              Labels: AWS
>
> We have seen odd behavior on very rare occasions when writing a parquet
> table to S3 using the S3FileSystem (from pyarrow.fs import S3FileSystem).
> Even though the application returns without errors, data can be missing
> from the bucket. Internally the write is performed as an S3 multipart
> upload, but a special error condition is not being handled. Per the
> [AWS docs|https://aws.amazon.com/premiumsupport/knowledge-center/s3-resolve-200-internalerror/],
> CompleteMultipartUpload (which is being called) can return a 200 response
> with an InternalError payload, and that response needs to be treated as a
> 5XX. It appears this isn't happening with pyarrow; the call is treated as
> a success instead, which causes the caller to *think* their data was
> uploaded when in fact it was not.
>
> Doing an S3 list-parts call for the <upload-id> of the InternalError
> request shows the parts are still there and the upload was never completed.
>
> From our S3 access logs, with <my-key> and <upload-id> sanitized for security:
> |operation|key|requesturi_operation|requesturi_key|requesturi_httpprotoversion|httpstatus|errorcode|
> |REST.PUT.PART|<my-key>-SNAPPY.parquet|PUT|/<my-key>-SNAPPY.parquet?partNumber=1&uploadId=<upload-id>|HTTP/1.1|200|-|
> |REST.POST.UPLOAD|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploadId=<upload-id>|HTTP/1.1|200|InternalError|
> |REST.POST.UPLOADS|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploads|HTTP/1.1|200|-|
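For reference, a minimal sketch of the write pattern the issue describes (not taken from the reporter's application); the bucket, region, key and table contents below are made-up placeholders:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Placeholder table; the real payloads in this report are parquet files up to ~30 MB.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# write_table through S3FileSystem streams the object as a multipart upload.
# Per the issue, if CompleteMultipartUpload returns 200 with an InternalError
# body, this call still appears to succeed even though the object is never created.
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(table, "my-bucket/path/to/my-key-SNAPPY.parquet", filesystem=s3)
{code}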
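And a hedged sketch of the single-request upload the comment above suggests, done outside the filesystem API since pyarrow 5.0.0 exposes no such flag; it assumes boto3 is available, uses placeholder bucket/key names, and only applies to objects well under the 5 GB single-PUT limit (e.g. the ~30 MB files mentioned above):

{code:python}
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Serialize the parquet file into an in-memory buffer first...
sink = pa.BufferOutputStream()
pq.write_table(table, sink, compression="snappy")
body = sink.getvalue().to_pybytes()

# ...then upload it with a single PutObject call, so no CompleteMultipartUpload
# (and therefore no 200-with-InternalError response) is ever involved.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="path/to/my-key-SNAPPY.parquet", Body=body)
{code}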