[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422446#comment-17422446 ]

David Li commented on ARROW-10439:
----------------------------------

The approach used for Flight would really only work for IPC, unfortunately. It optimistically assumes batches are below the limit and hooks into the low-level IPC writer implementation so that it gets passed the already-serialized batches (that way it doesn't waste work computing the actual serialized size, which is expensive). If a batch is over the size limit, it rejects it, and the caller is then expected to try again. I suppose you could generalize this to CSV (by serializing rows to a buffer before writing them out), though that would be an expensive/invasive refactor (and I have no clue about Parquet).

I'll note that even "in-memory size" can be difficult to compute if you have slices. The GetRecordBatchSize function actually serializes the batch under the hood and counts the bytes written.

> [C++][Dataset] Add max file size as a dataset writing option
>
>                 Key: ARROW-10439
>                 URL: https://issues.apache.org/jira/browse/ARROW-10439
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Minor
>              Labels: beginner, dataset, query-engine
>             Fix For: 6.0.0
>
> This should be specified as a row limit.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422441#comment-17422441 ]

Weston Pace commented on ARROW-10439:
-------------------------------------

So the challenge with a bytes limit is that we need to know how many bytes are going to be written before the potentially blocking write call. The way the file writers are currently structured, that is not easy. Options available:

* Modify the file writers to be truly asynchronous and return "{ bytes_queued: int64_t, write_future: Future<> }" (or they could return a Future<> and have a method to query how many total bytes have been queued to be written to the file).
* Use the in-memory size of the data (the downside is that this can be quite different from the written size if compression is used, which is often the case).
* Enforce a best-effort limit which checks the current file size when determining whether a new file should be opened. The problem in this case is that we will queue some number of batches more than we should, so the limit will be a soft limit that we will typically overshoot by some amount.

Does anyone have any other ideas or suggestions, or a preference among the available options? [~lidavidm] what was the approach used for flight?
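The third option above (a best-effort soft limit) can be sketched generically. Everything here is illustrative scaffolding, not Arrow API; `io.BytesIO` stands in for a file sink:

```python
import io

def write_with_soft_limit(payloads, max_bytes, open_sink):
    """Start a new sink once the current one has reached max_bytes.

    The size check happens *before* each write, so the last payload written
    to a file can push it past the limit: a soft limit, as described above.
    """
    sinks = [open_sink()]
    for payload in payloads:
        if sinks[-1].tell() >= max_bytes:
            sinks.append(open_sink())
        sinks[-1].write(payload)
    return sinks

# Ten 40-byte payloads against a 100-byte limit: each file absorbs three
# writes (0 -> 40 -> 80 -> 120) before rotation, overshooting by 20 bytes.
sinks = write_with_soft_limit([b"x" * 40] * 10, 100, io.BytesIO)
```

With real Arrow writers the overshoot would be larger still, since batches already queued for the file cannot be recalled once the size check has passed.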
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415285#comment-17415285 ]

Weston Pace commented on ARROW-10439:
-------------------------------------

I'm going to leave this open to track adding a `max_bytes_per_file` limit.
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403455#comment-17403455 ]

Weston Pace commented on ARROW-10439:
-------------------------------------

https://github.com/apache/arrow/pull/10955 (as part of ARROW-13650) adds a `max_rows_per_file` option. Max bytes is a little trickier (table.nbytes is the in-memory size, and I assume one would want the on-disk size), although it is doable (the file writers should be able to keep track of how many bytes they've written, but they don't do this today). I'd prefer to avoid max bytes unless someone has a need for it, though.
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401132#comment-17401132 ]

Alessandro Molina commented on ARROW-10439:
-------------------------------------------

It would probably make sense to see if we can come up with an API that allows something more flexible. For example, I can see cases where someone might want to "start a new file" every 1M rows instead of every 100MB.
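A hypothetical sketch of such a flexible API, where the rollover trigger is either (or both) of a row limit and a byte limit; the names here are illustrative, not an actual Arrow interface:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RolloverPolicy:
    """Decide when a dataset writer should start a new file."""
    max_rows: Optional[int] = None
    max_bytes: Optional[int] = None

    def should_roll(self, rows_written: int, bytes_written: int) -> bool:
        # Roll over as soon as any configured limit is reached.
        if self.max_rows is not None and rows_written >= self.max_rows:
            return True
        if self.max_bytes is not None and bytes_written >= self.max_bytes:
            return True
        return False

# "Every 1M rows" and "every 100MB" become two configurations of one policy:
by_rows = RolloverPolicy(max_rows=1_000_000)
by_bytes = RolloverPolicy(max_bytes=100 * 1024 * 1024)
```

The writer would consult the policy before each write, which keeps the row case exact while the byte case stays best-effort, as discussed elsewhere in the thread.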
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395851#comment-17395851 ]

Ruben Laguna commented on ARROW-10439:
--------------------------------------

My current workaround is to use ParquetWriter and create a new file when the input data is bigger than x amount of bytes. So that indirectly limits the output file size, but the output file size still varies depending on the compression ratio for the particular input data in the split. https://stackoverflow.com/a/68679635/90580

{code:python}
bytes_written = 0
index = 0
writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
for i in range(30):
    writer.write_table(table)
    bytes_written = bytes_written + table.nbytes  # in-memory size, not on-disk
    if bytes_written >= 500 * 1024 * 1024:  # 500MB: start a new file
        writer.close()
        index = index + 1
        writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
        bytes_written = 0
writer.close()
{code}
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380812#comment-17380812 ]

Weston Pace commented on ARROW-10439:
-------------------------------------

[~motibz] I'm sorry that I did not see this earlier. One possible workaround is to split the input into multiple pieces manually and then call `write_dataset` multiple times, using a different `basename_template` for each write, but it is not a great workaround.
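The manual split described above is mostly bookkeeping, which can be sketched without pyarrow; the actual write step (one `write_dataset` call per piece, each with its own `basename_template`) is stubbed out, and the helper name is ours:

```python
def plan_split_writes(num_rows, rows_per_piece):
    """Plan one write_dataset call per fixed-size slice of the input.

    Each piece gets a distinct basename_template so successive writes don't
    overwrite each other's files ({i} is filled in by the writer itself).
    """
    plans = []
    for piece, start in enumerate(range(0, num_rows, rows_per_piece)):
        length = min(rows_per_piece, num_rows - start)
        template = f"piece-{piece}-part-{{i}}.parquet"
        plans.append((start, length, template))
    return plans

# 250 rows in pieces of 100 -> three writes: rows 0-99, 100-199, 200-249.
plans = plan_split_writes(250, 100)
```

Each planned (start, length, template) tuple would then drive `table.slice(start, length)` plus a `write_dataset` call using that template.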
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338511#comment-17338511 ]

Mordechai Ben Zechariah commented on ARROW-10439:
-------------------------------------------------

Hi [~westonpace], does anybody have a good workaround for this request? I'm trying to use `write_dataset` and I want to create multiple files in each partition.
[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223874#comment-17223874 ]

Ben Kietzman commented on ARROW-10439:
--------------------------------------

[~jorisvandenbossche]