[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422446#comment-17422446 ]
David Li commented on ARROW-10439:
----------------------------------

The approach used for Flight would really only work for IPC, unfortunately. It optimistically assumes batches are below the limit and hooks into the low-level IPC writer implementation so that it gets passed the already-serialized batches; that way, it doesn't waste work computing the actual serialized size (which is expensive). If a batch is over the size limit, it rejects it, and the caller is then expected to split the batch and try again. I suppose you could generalize this to CSV (by serializing rows to a buffer before writing them out), though that would be an expensive/invasive refactor (and I have no clue about Parquet).

I'll note that even the "in-memory size" can be difficult to compute if you have slices. The GetRecordBatchSize function actually serializes the batch under the hood and reads the number of bytes written.

> [C++][Dataset] Add max file size as a dataset writing option
> ------------------------------------------------------------
>
>                 Key: ARROW-10439
>                 URL: https://issues.apache.org/jira/browse/ARROW-10439
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Minor
>              Labels: beginner, dataset, query-engine
>             Fix For: 6.0.0
>
>
> This should be specified as a row limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)