[ https://issues.apache.org/jira/browse/ARROW-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077335#comment-17077335 ]
Juan Galvez commented on ARROW-8365:
------------------------------------

Done. Thanks!

> [C++] Error when writing files to S3 larger than 5 GB
> -----------------------------------------------------
>
>                 Key: ARROW-8365
>                 URL: https://issues.apache.org/jira/browse/ARROW-8365
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.16.0
>            Reporter: Juan Galvez
>            Assignee: Antoine Pitrou
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.17.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When using the arrow-cpp library directly to write to S3, I get the following error when writing a large Arrow table (resulting in a file larger than 5 GB):
>
> {{../src/arrow/io/interfaces.cc:219: Error ignored when destroying file of type N5arrow2fs12_GLOBAL__N_118ObjectOutputStreamE: IOError: When uploading part for key 'test01.parquet/part-00.parquet' in bucket 'test': AWS Error [code 100]: Unable to parse ExceptionName: EntityTooLarge Message: Your proposed upload exceeds the maximum allowed size with address : 52.219.100.32}}
>
> I diagnosed the problem by reading and modifying the code in *{{s3fs.cc}}*. The code uses multipart upload, with 5 MB chunks for the first 100 parts. After submitting the first 100 parts, it is supposed to increase the chunk size to 10 MB (the part upload threshold, {{part_upload_threshold_}}). The problem is that the threshold is increased inside {{DoWrite}}, and {{DoWrite}} can be called multiple times before the current part is uploaded. This ultimately causes the threshold to keep growing indefinitely, and the last part ends up exceeding the 5 GB per-part limit of AWS S3.
>
> I'm fairly sure this issue, where the last part is much larger than it should be, can occur every time a multipart upload exceeds 100 parts, but the error is only raised when the last part exceeds 5 GB, so it is only observed with very large uploads.
>
> I can confirm that the bug does not happen if I move this:
>
> {{if (part_number_ % 100 == 0) {
>   part_upload_threshold_ += kMinimumPartUpload;
> }}}
>
> into a different method, right before the line that does:
>
> {{++part_number_}}
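To make the failure mode concrete, here is a minimal standalone sketch of the bookkeeping described above. It is not the actual {{s3fs.cc}} code: the struct, the write sizes, and the {{UploadPart}}/{{DoWrite}} bodies are invented for illustration, and only the names {{part_number_}}, {{part_upload_threshold_}} and {{kMinimumPartUpload}} come from the report. It contrasts the buggy placement of the threshold bump (inside the write path) with the reporter's suggested placement (once per completed part, just before {{++part_number_}}).

{code:cpp}
#include <cstdint>
#include <iostream>

constexpr int64_t kMinimumPartUpload = 5 * 1024 * 1024;  // 5 MiB

// Sketch only: byte counts are tracked as plain counters, no real S3 I/O.
struct UploadSketch {
  int64_t part_upload_threshold_ = kMinimumPartUpload;
  int64_t current_part_size_ = 0;
  int64_t part_number_ = 1;
  bool fixed_;  // choose buggy vs. fixed placement of the threshold bump

  explicit UploadSketch(bool fixed) : fixed_(fixed) {}

  void UploadPart() {
    // (the real code issues an S3 UploadPart request here)
    current_part_size_ = 0;
    if (fixed_ && part_number_ % 100 == 0) {
      // Fixed placement: bump exactly once per completed part,
      // right before ++part_number_, as suggested in the report.
      part_upload_threshold_ += kMinimumPartUpload;
    }
    ++part_number_;
  }

  void DoWrite(int64_t nbytes) {
    if (!fixed_ && part_number_ % 100 == 0) {
      // Buggy placement: this runs on *every* write while part_number_
      // sits at a multiple of 100, so the threshold outruns the buffered
      // bytes and the part is only flushed, oversized, when the stream
      // is closed.
      part_upload_threshold_ += kMinimumPartUpload;
    }
    current_part_size_ += nbytes;
    if (current_part_size_ >= part_upload_threshold_) {
      UploadPart();
    }
  }
};

int main() {
  for (bool fixed : {false, true}) {
    UploadSketch s(fixed);
    // Write 6 GiB in 1 MiB chunks.
    for (int64_t i = 0; i < 6 * 1024; ++i) s.DoWrite(1024 * 1024);
    std::cout << (fixed ? "fixed: " : "buggy: ")
              << "threshold=" << s.part_upload_threshold_ / (1024 * 1024)
              << " MiB, buffered=" << s.current_part_size_ / (1024 * 1024)
              << " MiB, next part=" << s.part_number_ << "\n";
  }
  return 0;
}
{code}

In the buggy run, every 1 MiB write at part 100 adds 5 MiB to the threshold, so the buffered part can never catch up: after 6 GiB written, over 5.5 GiB sits unflushed in a single part, which is exactly the {{EntityTooLarge}} condition. In the fixed run the threshold grows by 5 MiB only once per 100 completed parts, matching the intended chunk-size schedule.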