Hi Micah,

Thank you for your reply. That is also my understanding - not possible in
streaming IPC, possible in file IPC with random access. The pseudo-code
could be something like:

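start = writer.seek_current()             # where the header begins
empty_locations = create_empty_header(schema)
write_header(writer, empty_locations)     # placeholder, zeroed values
locations = write_buffers(writer, batch)  # compress, write, record locations
end_buffers_position = writer.seek_current()
writer.seek(start)                        # go back to the header
write_header(writer, locations)           # overwrite with the real values
writer.seek(end_buffers_position)         # resume after the body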

As far as I understand, this would make writing to IPC require O(N)
memory, where N is the average size of a buffer, as opposed to O(N*B),
where B is the number of buffers. I.e. there is still quite a
multiplicative factor involved.
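
To make the idea concrete, here is a rough Python sketch of the
pseudo-code above. It is only an illustration under assumptions: the
header layout (one fixed-width (offset, length) pair per buffer) is a toy
format, not the actual Arrow IPC flatbuffers metadata, zlib stands in for
the real codecs, and all function names are hypothetical.

```python
import io
import struct
import zlib

# Toy header entry: fixed-width (offset, length), little-endian u64 each.
# Fixed width matters: the zeroed and the final header have the same size.
ENTRY = struct.Struct("<QQ")


def write_header(writer, locations):
    for offset, length in locations:
        writer.write(ENTRY.pack(offset, length))


def write_buffers(writer, buffers):
    # Compress and write one buffer at a time, so only a single
    # compressed buffer is held in memory at any point.
    locations = []
    body_start = writer.tell()
    for buf in buffers:
        compressed = zlib.compress(buf)
        locations.append((writer.tell() - body_start, len(compressed)))
        writer.write(compressed)
    return locations


def write_batch(writer, buffers):
    start = writer.tell()
    write_header(writer, [(0, 0)] * len(buffers))  # placeholder header
    locations = write_buffers(writer, buffers)
    end_buffers_position = writer.tell()
    writer.seek(start)
    write_header(writer, locations)  # overwrite in place, same byte size
    writer.seek(end_buffers_position)
    return locations


def read_batch(reader, start, n_buffers):
    # Read the header back and decompress each buffer (for checking only).
    reader.seek(start)
    locations = [ENTRY.unpack(reader.read(ENTRY.size)) for _ in range(n_buffers)]
    body_start = reader.tell()
    out = []
    for offset, length in locations:
        reader.seek(body_start + offset)
        out.append(zlib.decompress(reader.read(length)))
    return out
```

The writer must be seekable (a file or io.BytesIO), which is why this
does not apply to the streaming format.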

I filed https://issues.apache.org/jira/browse/ARROW-16118 with the idea.

Best,
Jorge



On Mon, Apr 4, 2022 at 6:09 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Jorge,
> I don't think any implementation does this but I think it is technically
> possible, although it might be complicated to actually do.  It also
> requires random access files (the output can't be purely streaming).
>
> I think the approach you would need to take is to pre-write the header
> information with the values zeroed out at first. After you've
> compressed and written the physical bytes you would need to update the
> values in place, after you know them.  Since Flatbuffers doesn't do any
> variable length encoding, you don't need to worry about possibly corrupting
> the data.   The challenging part is determining the exact locations that
> need to be overwritten.
>
> -Micah
>
> On Mon, Apr 4, 2022 at 7:40 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Motivated by [1], I wonder if it is possible to write to IPC without
> > writing the data to an intermediary buffer.
> >
> > The challenge is that the header of an IPC message [header][data]
> > requires:
> >
> > * the positions of the buffers
> > * the total length of the body
> >
> > For uncompressed data, we could compute these before-hand at `O(C)`
> > where C is the number of columns. However, I am unable to find a way
> > of computing these ahead of writing for compressed buffers: we need
> > to compress the data to know its compressed (and thus buffers) size.
> >
> > Is this understanding correct?
> >
> > Best,
> > Jorge
> >
> > [1] https://github.com/pola-rs/polars/issues/2639
> >
>
