[ https://issues.apache.org/jira/browse/ARROW-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525893#comment-17525893 ]
Micah Kornfield commented on ARROW-16118:
-----------------------------------------

Also, we should be careful about how this is enabled: if someone is actually consuming the stream in real time, there would need to be some sort of coordination to ensure bytes aren't sent prematurely.

> [C++] Reduce memory usage when writing to IPC
> ---------------------------------------------
>
>                 Key: ARROW-16118
>                 URL: https://issues.apache.org/jira/browse/ARROW-16118
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jorge Leitão
>            Priority: Major
>
> Writing a record batch to IPC ([header][buffers]) currently requires O(N*B) memory,
> where N is the average size of a buffer and B the number of buffers in the
> record batch.
> This is because we need the buffer locations and the total number of bytes to
> write the header of the record batch, and these are only known after, e.g.,
> knowing by how much the buffers were compressed.
> When the writer supports seeking, this memory usage can be reduced to O(N),
> where N is the average size of a primitive buffer over all fields. This is
> done using the following pseudo-code implementation:
> {code:java}
> start = writer.seek(current);
> empty_locations = create_empty_header(schema)
> write_header(writer, empty_locations)
> locations = write_buffers(writer, batch)
> writer.seek(start)
> write_header(writer, locations)
> {code}
> This has a significantly lower memory footprint: O(N) vs. O(N*B).
> It could be interesting for the C++ implementation to support this.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
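
For illustration, the seek-based pseudo-code in the issue description can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the Arrow IPC format or any Arrow API: the header layout (a buffer count followed by one offset/length pair per buffer), the use of zlib, and the names `write_header`, `write_batch`, and `read_batch` are all hypothetical. It relies on the header size being known up front from the number of buffers, so a placeholder can be written first and overwritten later.

```python
import io
import struct
import zlib

# Toy header layout (an assumption, NOT the Arrow IPC format):
#   uint32 buffer count, then one (uint64 offset, uint64 length) pair per buffer.
HEADER_ENTRY = struct.Struct("<QQ")


def write_header(writer, locations):
    """Write the fixed-size toy header at the writer's current position."""
    writer.write(struct.pack("<I", len(locations)))
    for offset, length in locations:
        writer.write(HEADER_ENTRY.pack(offset, length))


def write_batch(writer, buffers):
    """Write [header][buffers] to a seekable writer, one buffer at a time."""
    start = writer.tell()
    # Reserve space with a placeholder header; its size depends only on
    # the number of buffers, not on their (compressed) sizes.
    write_header(writer, [(0, 0)] * len(buffers))

    locations = []
    for buf in buffers:
        # Only one compressed buffer is held in memory at any time.
        compressed = zlib.compress(buf)
        locations.append((writer.tell(), len(compressed)))
        writer.write(compressed)

    end = writer.tell()
    # Go back and overwrite the placeholder with the real locations.
    writer.seek(start)
    write_header(writer, locations)
    writer.seek(end)  # leave the writer positioned after the batch
    return locations


def read_batch(reader, start=0):
    """Read a batch written by write_batch back into a list of bytes."""
    reader.seek(start)
    (count,) = struct.unpack("<I", reader.read(4))
    locations = [
        HEADER_ENTRY.unpack(reader.read(HEADER_ENTRY.size)) for _ in range(count)
    ]
    out = []
    for offset, length in locations:
        reader.seek(offset)
        out.append(zlib.decompress(reader.read(length)))
    return out


if __name__ == "__main__":
    sink = io.BytesIO()
    data = [b"a" * 1000, b"hello", bytes(range(256))]
    write_batch(sink, data)
    assert read_batch(sink) == data
```

Note that between the placeholder write and the final header write, the bytes at `start` are not yet valid. That is the window the comment above warns about: a consumer reading the stream in real time would see the placeholder header unless the writer and reader coordinate.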