assignUser commented on issue #43929: URL: https://github.com/apache/arrow/issues/43929#issuecomment-2325604847
`new_file` and `new_stream` differ in the IPC format they write; which one you want depends on your use case:

> - Streaming format: for sending an arbitrary-length sequence of record batches. The format must be processed from start to end, and does not support random access.
>
> - File or Random Access format: for serializing a fixed number of record batches. Supports random access, and thus is very useful when used with memory maps.

So it depends on what exactly you want to do: if you want to write a stream of an unknown number of record batches, use the stream APIs. If you just want to save a fixed number of batches (like the table you are using in your code) in one go and make it available with minimal allocation for a consumer, use the IPC file format, which can be efficiently mmap'ed on the consumer side (for writing, Arrow handles that internally).

Could you try this?

```python
import os

import pyarrow as pa

def save_data():
    # `prefix_stream`, `sink`, and `table` come from your existing code
    file_path = os.path.join(prefix_stream, sink)
    # File (random access) format: fixed number of batches, mmap-friendly
    with pa.ipc.new_file(file_path, table.schema) as writer:
        writer.write_table(table, max_chunksize=1000)
```
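For the consumer side, here is a minimal sketch of reading the file back with a memory map, which is where the file format shines (the function name `load_data` is just illustrative; `pa.memory_map` and `pa.ipc.open_file` are the relevant pyarrow calls):

```python
import pyarrow as pa

def load_data(file_path):
    # Memory-map the file so batches are read with minimal allocation/copying
    with pa.memory_map(file_path, "r") as source:
        reader = pa.ipc.open_file(source)
        # Random access is available via reader.get_batch(i);
        # read_all() materializes everything as a Table
        return reader.read_all()
```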
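And for completeness, if you do end up needing the streaming format (an unknown number of batches, processed start to end), a sketch along the same lines (again, the function name and parameters here are illustrative):

```python
import pyarrow as pa

def stream_data(file_path, batches, schema):
    # Streaming format: arbitrary-length sequence, no random access
    with pa.ipc.new_stream(file_path, schema) as writer:
        for batch in batches:
            writer.write_batch(batch)
```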