[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395851#comment-17395851 ]
Ruben Laguna commented on ARROW-10439:
--------------------------------------

My current workaround is to use ParquetWriter and start a new file once roughly x bytes of input have been written. This indirectly limits the output file size, but the actual size still varies with the compression ratio of the particular input data in each split. https://stackoverflow.com/a/68679635/90580

{code:python}
import pyarrow.parquet as pq

bytes_written = 0
index = 0
writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
for i in range(300000):
    writer.write_table(table)
    # table.nbytes is the uncompressed size, so the on-disk size still
    # depends on how well this particular data compresses
    bytes_written = bytes_written + table.nbytes
    if bytes_written >= 500000000:  # ~500 MB of input: start a new file
        writer.close()
        index = index + 1
        writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
        bytes_written = 0
writer.close()
{code}

> [C++][Dataset] Add max file size as a dataset writing option
> ------------------------------------------------------------
>
>                 Key: ARROW-10439
>                 URL: https://issues.apache.org/jira/browse/ARROW-10439
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: beginner, dataset
>             Fix For: 6.0.0
>
>
> This should be specified as a row limit.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
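For comparison with the workaround above, a minimal sketch of the row-limit approach this ticket describes, assuming pyarrow >= 6.0.0 (the Fix Version), where the dataset writer accepts a `max_rows_per_file` option; the 100,000-row cap and the temporary output directory here are illustrative choices, not values from the ticket:

{code:python}
# Sketch assuming pyarrow >= 6.0.0: cap output files by row count
# rather than by (compression-dependent) byte size.
import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": list(range(1_000_000))})
out_dir = tempfile.mkdtemp()

ds.write_dataset(
    table,
    out_dir,
    format="parquet",
    max_rows_per_file=100_000,   # start a new file after 100k rows
    max_rows_per_group=100_000,  # row-group size must not exceed the file cap
)

files = [f for f in os.listdir(out_dir) if f.endswith(".parquet")]
print(len(files))  # 1,000,000 rows at <= 100,000 rows/file -> at least 10 files
{code}

Unlike the byte-counting loop, the number of rows per file is deterministic here, though the byte size of each file still depends on the data and codec.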