Hi Micah,

Thank you for your reply. That is also my understanding: not possible in streaming IPC, possible in file IPC with random access. The pseudo-code could be something like:

    start = writer.seek_current()
    empty_locations = create_empty_header(schema)
    write_header(writer, empty_locations)
    locations = write_buffers(writer, batch)
    end_buffers_position = writer.seek_current()
    writer.seek(start)
    write_header(writer, locations)
    writer.seek(end_buffers_position)

As far as I understand, this would make writing to IPC require O(N) of intermediate memory, where N is the average size of a buffer, as opposed to O(N*B), where B is the number of buffers. That is, the difference is still a sizeable multiplicative factor (B).

I filed https://issues.apache.org/jira/browse/ARROW-16118 with the idea.

Best,
Jorge

On Mon, Apr 4, 2022 at 6:09 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Jorge,
> I don't think any implementation does this but I think it is technically
> possible, although it might be complicated to actually do. It also
> requires random access files (the output can't be purely streaming).
>
> I think the approach you would need to take is to pre-write the header
> information with the values zeroed out at first. After you've
> compressed and written the physical bytes you would need to update the
> values in place, once you know them. Since Flatbuffers doesn't do any
> variable length encoding, you don't need to worry about possibly corrupting
> the data. The challenging part is determining the exact locations that
> need to be overwritten.
>
> -Micah
>
> On Mon, Apr 4, 2022 at 7:40 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Motivated by [1], I wonder if it is possible to write to IPC without
> > writing the data to an intermediary buffer.
> >
> > The challenge is that the header of an IPC message [header][data]
> > requires:
> >
> > * the positions of the buffers
> > * the total length of the body
> >
> > For uncompressed data, we could compute these beforehand at `O(C)`,
> > where C is the number of columns. However, I am unable to find a way of
> > computing these ahead of writing for compressed buffers: we need to
> > compress the data to know its compressed size (and thus the buffer
> > positions).
> >
> > Is this understanding correct?
> >
> > Best,
> > Jorge
> >
> > [1] https://github.com/pola-rs/polars/issues/2639
> >
>
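
For anyone who wants to experiment, here is a minimal runnable Python sketch of the seek-back pseudo-code at the top of this message. It is only an illustration under loud assumptions: the header is a toy fixed-width layout (a count followed by offset/length pairs), not the real Arrow IPC / Flatbuffers message, and `write_header`, `write_batch` and `read_batch` are hypothetical names, not functions from any Arrow implementation. zlib stands in for whichever codec is actually used.

```python
import io
import struct
import zlib

# Toy fixed-width header: buffer count, then (offset, length) per buffer.
# ASSUMPTION: this is NOT the Arrow IPC / Flatbuffers layout, only a stand-in
# that shares the relevant property (no variable-length encoding).
COUNT = struct.Struct("<I")
ENTRY = struct.Struct("<qq")


def write_header(writer, locations):
    """Write the buffer count followed by one fixed-size entry per buffer."""
    writer.write(COUNT.pack(len(locations)))
    for offset, length in locations:
        writer.write(ENTRY.pack(offset, length))


def write_batch(writer, buffers):
    """Compress and write `buffers` one at a time, then patch the header."""
    start = writer.tell()
    # 1. Reserve space with a zeroed header of the final size.
    write_header(writer, [(0, 0)] * len(buffers))
    body_start = writer.tell()

    # 2. Stream each compressed buffer; only one is held in memory at a time.
    locations = []
    for buf in buffers:
        compressed = zlib.compress(buf)
        locations.append((writer.tell() - body_start, len(compressed)))
        writer.write(compressed)
    end = writer.tell()

    # 3. Seek back, overwrite the placeholder, and restore the position.
    writer.seek(start)
    write_header(writer, locations)
    writer.seek(end)
    return locations


def read_batch(reader):
    """Read back a batch written by `write_batch`."""
    (count,) = COUNT.unpack(reader.read(COUNT.size))
    entries = [ENTRY.unpack(reader.read(ENTRY.size)) for _ in range(count)]
    body_start = reader.tell()
    out = []
    for offset, length in entries:
        reader.seek(body_start + offset)
        out.append(zlib.decompress(reader.read(length)))
    return out
```

Because every header field has a fixed width, the rewritten header occupies exactly the bytes reserved by the placeholder, so the overwrite cannot shift or corrupt the body. The writer must be seekable (a file or `io.BytesIO`), which is why this does not carry over to the streaming format.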