[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395851#comment-17395851 ]

Ruben Laguna commented on ARROW-10439:
--------------------------------------

My current workaround is to use ParquetWriter directly and start a new file once the 
input data written so far exceeds a given number of bytes. That indirectly limits the 
output file size, but the actual size on disk still varies with the compression ratio 
of the particular input data in each split. 


https://stackoverflow.com/a/68679635/90580
{code:python}
import pyarrow.parquet as pq

# `table` is an existing pyarrow.Table that gets appended repeatedly.
bytes_written = 0
index = 0
writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)

for i in range(300000):
    writer.write_table(table)
    bytes_written += table.nbytes  # uncompressed size, not size on disk
    if bytes_written >= 500_000_000:  # ~500 MB of input, start a new file
        writer.close()
        index += 1
        writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
        bytes_written = 0

writer.close()
{code}
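
For comparison with what this issue tracks (Fix For 6.0.0, specified as a row limit rather than a byte limit), here is a minimal sketch of how the dataset writer's row-count cap could replace the manual byte tracking above. It assumes pyarrow.dataset.write_dataset with its max_rows_per_file / max_rows_per_group options and an existing pyarrow.Table named table; the limits shown are illustrative values, not recommendations.
{code:python}
import pyarrow.dataset as ds

# Sketch: let the dataset writer split output files by row count instead of
# tracking input bytes by hand. `table` is assumed to be an existing pyarrow.Table.
ds.write_dataset(
    table,
    "output_dir",
    format="parquet",
    max_rows_per_file=1_000_000,   # start a new file after this many rows
    max_rows_per_group=100_000,    # row-group cap; must not exceed max_rows_per_file
)
{code}
Note that a row limit only bounds file size indirectly as well: the resulting byte size still depends on the row width and compression ratio of the data.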


> [C++][Dataset] Add max file size as a dataset writing option
> ------------------------------------------------------------
>
>                 Key: ARROW-10439
>                 URL: https://issues.apache.org/jira/browse/ARROW-10439
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: beginner, dataset
>             Fix For: 6.0.0
>
>
> This should be specified as a row limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
