[ 
https://issues.apache.org/jira/browse/ARROW-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077312#comment-17077312
 ] 

Antoine Pitrou commented on ARROW-8365:
---------------------------------------

Thanks for the thorough report and diagnosis!

> [C++] Error when writing files to S3 larger than 5 GB
> -----------------------------------------------------
>
>                 Key: ARROW-8365
>                 URL: https://issues.apache.org/jira/browse/ARROW-8365
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.16.0
>            Reporter: Juan Galvez
>            Assignee: Antoine Pitrou
>            Priority: Major
>             Fix For: 0.17.0
>
>
> When using the arrow-cpp library directly, I get the following error when 
> writing a large Arrow table to S3 (resulting in a file larger than 5 GB):
> {{../src/arrow/io/interfaces.cc:219: Error ignored when destroying file of 
> type N5arrow2fs12_GLOBAL__N_118ObjectOutputStreamE: IOError: When uploading 
> part for key 'test01.parquet/part-00.parquet' in bucket 'test': AWS Error 
> [code 100]: Unable to parse ExceptionName: EntityTooLarge Message: Your 
> proposed upload exceeds the maximum allowed size with address : 
> 52.219.100.32}}
> I have diagnosed the problem by reading and modifying the code in 
> *{{s3fs.cc}}*. The code does a multipart upload, using 5 MB parts for the 
> first 100 parts. After the first 100 parts have been submitted, it is 
> supposed to increase the part size to 10 MB (the part upload threshold, 
> {{part_upload_threshold_}}). The issue is that the threshold is increased 
> inside {{DoWrite}}, and {{DoWrite}} can be called multiple times before the 
> current part is uploaded. As a result the threshold keeps getting increased 
> indefinitely, and the last part ends up exceeding the 5 GB per-part upload 
> limit of AWS/S3.
> I'm fairly sure this issue, where the last part is much larger than it 
> should be, can happen every time a multipart upload exceeds 100 parts, but 
> the error is only thrown when the last part is larger than 5 GB. Therefore 
> it is only observed with very large uploads.
> I can confirm that the bug does not happen if I move this:
> {code:cpp}
> if (part_number_ % 100 == 0) {
>   part_upload_threshold_ += kMinimumPartUpload;
> }
> {code}
> into a different method, right before the line that does {{++part_number_}}.
>  
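
For illustration only (a standalone, simplified sketch, not the actual 
arrow::fs code from {{s3fs.cc}}, and not part of the original report): the 
program below models the buffering described above and compares both 
placements of the threshold bump, the buggy one inside {{DoWrite}} and the 
proposed one next to {{++part_number_}}. The names {{Uploader}}, 
{{UploadPart}}, {{Finish}}, {{buffered_}}, {{largest_part_}} and 
{{bump_in_do_write}} are hypothetical; only {{DoWrite}}, {{part_number_}}, 
{{part_upload_threshold_}} and {{kMinimumPartUpload}} come from the report.

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <initializer_list>
#include <iostream>

constexpr int64_t kMiB = 1024 * 1024;
constexpr int64_t kMinimumPartUpload = 5 * kMiB;  // 5 MiB, as in the report

// Hypothetical uploader: buffers writes and "uploads" a part once the buffer
// reaches part_upload_threshold_. The flag selects where the threshold bump
// lives: inside DoWrite (the buggy placement) or next to ++part_number_
// (the proposed fix).
struct Uploader {
  bool bump_in_do_write;
  int64_t part_number_ = 1;
  int64_t part_upload_threshold_ = kMinimumPartUpload;
  int64_t buffered_ = 0;
  int64_t largest_part_ = 0;

  void DoWrite(int64_t nbytes) {
    buffered_ += nbytes;
    if (bump_in_do_write && part_number_ % 100 == 0) {
      // Buggy placement: this runs on *every* DoWrite call made while
      // part_number_ is a multiple of 100, so the threshold can grow faster
      // than the buffer fills and the part is not flushed until Finish().
      part_upload_threshold_ += kMinimumPartUpload;
    }
    if (buffered_ >= part_upload_threshold_) {
      UploadPart();
    }
  }

  void UploadPart() {
    largest_part_ = std::max(largest_part_, buffered_);
    buffered_ = 0;
    if (!bump_in_do_write && part_number_ % 100 == 0) {
      // Fixed placement: the threshold grows exactly once per 100 uploaded
      // parts, right before the part number is advanced.
      part_upload_threshold_ += kMinimumPartUpload;
    }
    ++part_number_;
  }

  void Finish() {
    if (buffered_ > 0) UploadPart();  // flush the final part on close
  }
};

int main() {
  for (bool buggy : {true, false}) {
    Uploader u{buggy};
    // Write ~20 GiB in 1 MiB chunks, mimicking many small DoWrite calls.
    for (int64_t i = 0; i < 20 * 1024; ++i) {
      u.DoWrite(kMiB);
    }
    u.Finish();
    std::cout << (buggy ? "buggy" : "fixed")
              << ": parts uploaded = " << u.part_number_ - 1
              << ", largest part = " << u.largest_part_ / kMiB << " MiB\n";
  }
  return 0;
}
{code}

Run as-is, the buggy variant stops flushing after part 100 and ends up 
uploading a single final part of roughly 19.5 GiB (well over S3's 5 GiB 
per-part limit), while the fixed variant's largest part stays in the tens of 
MiB.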



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
