> AFAI can understand, this would cause writing to IPC to require O(N) where
> N is the average size of the buffers, as opposed to O(N*B) where N is the
> average size of the buffer and B the number of buffers. I.e. It is still
> quite a multiplicative factor involved.


Small nit, but this could theoretically need only O(1) memory, depending
on the compression library: if it can compress incrementally, the same
seeking behavior could be used to go back and store the necessary byte
lengths after the data has been compressed and written.

Unfortunately the solution doesn't work if the data is actually being
consumed as a stream without more coordination between producer and
consumer.
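
To make the seek-back idea concrete, here is a rough sketch in Python. It
is not how any Arrow implementation writes IPC: the header layout (a
fixed-width body length followed by fixed-width offset/length pairs) is a
made-up stand-in for the Flatbuffers message, zlib stands in for the
LZ4/ZSTD codecs, and write_message, write_compressed and read_message are
hypothetical names. It only shows that, with a seekable writer and an
incremental compressor, the sizes can be patched in after the fact while
holding one chunk in memory at a time.

```python
import io
import struct
import zlib

# Assumed toy header layout (NOT the Arrow Flatbuffers message):
# one int64 body length, then one (offset, length) int64 pair per buffer.
BODY_LEN = struct.Struct("<q")
ENTRY = struct.Struct("<qq")


def write_compressed(writer, buf, chunk_size=64 * 1024):
    """Compress `buf` chunk by chunk, writing output as it is produced."""
    comp = zlib.compressobj()
    view = memoryview(buf)
    written = 0
    for i in range(0, len(view), chunk_size):
        out = comp.compress(view[i:i + chunk_size])
        writer.write(out)
        written += len(out)
    out = comp.flush()
    writer.write(out)
    return written + len(out)


def write_message(writer, buffers):
    """Write [header][body], back-patching the header after compression."""
    start = writer.tell()
    # 1. Reserve the header with zeroed placeholders.
    writer.write(b"\x00" * (BODY_LEN.size + ENTRY.size * len(buffers)))
    body_start = writer.tell()
    # 2. Write the compressed buffers, recording where each one landed.
    locations = []
    for buf in buffers:
        offset = writer.tell() - body_start
        locations.append((offset, write_compressed(writer, buf)))
    end = writer.tell()
    # 3. Seek back and overwrite the placeholders with the real values.
    writer.seek(start)
    writer.write(BODY_LEN.pack(end - body_start))
    for offset, length in locations:
        writer.write(ENTRY.pack(offset, length))
    # 4. Restore the position so the next message lands after the body.
    writer.seek(end)
    return locations


def read_message(reader, n_buffers):
    """Read one message back, using the patched header to slice the body."""
    (body_len,) = BODY_LEN.unpack(reader.read(BODY_LEN.size))
    entries = [ENTRY.unpack(reader.read(ENTRY.size)) for _ in range(n_buffers)]
    body = reader.read(body_len)
    return [zlib.decompress(body[o:o + n]) for o, n in entries]
```

Because the placeholders and the final values have the same fixed width,
overwriting them in place cannot shift any other bytes. A consumer reading
the bytes as they are produced would still see the zeroed header first,
which is the streaming limitation mentioned above.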

On Tue, Apr 5, 2022 at 2:50 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi Micah,
>
> Thank you for your reply. That is also my understanding - not possible in
> streaming IPC, possible in file IPC with random access. The pseudo-code
> could be something like:
>
> start = writer.seek_current();
> empty_locations = create_empty_header(schema)
> write_header(writer, empty_locations)
> locations = write_buffers(writer, batch)
> end_buffers_position = writer.seek_current()
> writer.seek(start)
> write_header(writer, locations)
> writer.seek(end_buffers_position)
>
> AFAI can understand, this would cause writing to IPC to require O(N) where
> N is the average size of the buffers, as opposed to O(N*B) where N is the
> average size of the buffer and B the number of buffers. I.e. It is still
> quite a multiplicative factor involved.
>
> I filed https://issues.apache.org/jira/browse/ARROW-16118 with the idea.
>
> Best,
> Jorge
>
>
>
> On Mon, Apr 4, 2022 at 6:09 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Jorge,
>> I don't think any implementation does this but I think it is technically
>> possible, although it might be complicated to actually do.  It also
>> requires random access files (the output can't be purely streaming).
>>
>> I think the approach you would need to take is to pre-write the header
>> information with the values zeroed out at first. After you've
>> compressed and written the physical bytes, you would need to update the
>> values in place, once you know them.  Since Flatbuffers doesn't do any
>> variable length encoding, you don't need to worry about possibly
>> corrupting the data.  The challenging part is determining the exact
>> locations that need to be overwritten.
>>
>> -Micah
>>
>> On Mon, Apr 4, 2022 at 7:40 AM Jorge Cardoso Leitão <
>> jorgecarlei...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > Motivated by [1], I wonder if it is possible to write to IPC without
>> > writing the data to an intermediary buffer.
>> >
>> > The challenge is that the header of an IPC message [header][data]
>> > requires:
>> >
>> > * the positions of the buffers
>> > * the total length of the body
>> >
>> > For uncompressed data, we could compute these before-hand at `O(C)`
>> > where C is the number of columns. However, I am unable to find a way of
>> > computing these ahead of writing for compressed buffers: we need to
>> > compress the data to know its compressed (and thus buffers) size.
>> >
>> > Is this understanding correct?
>> >
>> > Best,
>> > Jorge
>> >
>> > [1] https://github.com/pola-rs/polars/issues/2639
>> >
>>
>
