Hi Antoine, thanks for raising this.

There is a test [1] in the Go implementation that validates this behavior.
It writes an IPC file, then ensures the same data is read back using a
stream reader starting 8 bytes past the start of the buffer.

I have personally seen and written code that assumes the embedded stream
can be safely read to consume the contents of the IPC file. There is also
an open PR [2] adding integration tests for related behavior, but it has
not been merged yet.

IMHO the flexibility to consume an IPC file as a stream improves its value
compared to alternatives. Combined with existing usage relying on this
assumption, my preference would be toward formalizing this as an explicit
requirement.

[1] https://github.com/apache/arrow-go/blob/8d81fc39254b7a51daf1fe1a272c24169a059878/arrow/ipc/file_test.go#L83
[2] https://github.com/apache/arrow/pull/43834

Thanks,
Joel

On Tue, Feb 17, 2026 at 1:20 PM Antoine Pitrou <[email protected]> wrote:

>
> Hello,
>
> The IPC file format is defined as the IPC stream format, preceded by a
> header (the Arrow magic bytes) and followed by a footer (a catalog of
> record batches, and the Arrow magic bytes). Thus, reading and writing
> IPC files can reuse the same basic building blocks as for IPC streams
> (this is almost trivial for writing, which is usually done sequentially).
>
> As a consequence, IPC files practically result in valid identical IPC
> streams (ignoring the 8 header bytes) that read as the same logical
> contents.
>
> However, there is no theoretical guarantee that this is always the case.
> Consider an IPC file writer that would write record batches in reverse
> order in the footer, compared to their sequential order in the
> underlying stream. Or, more generally, an IPC file footer that would
> repeat or skip some batches in the stream.
>
> So theoretically, we cannot assume that reading an IPC file as an IPC
> stream (after skipping the 8 header bytes) returns the intended contents.
>
> However, it seems that it could be useful to be able to make such an
> assumption. Hence these questions:
> 1. Do all current IPC file writers uphold this assumption?
> 2. Do we want to make it a more explicit requirement of the IPC file
> format?
>
>
> Context: I've submitted a PR
> (https://github.com/apache/arrow/pull/49312) to enable differential
> fuzzing in the C++ IPC file fuzzer, where I'm comparing the results of
> the IPC file and stream readers on the fuzzing payload.
>
> Regards
>
> Antoine.
>
>
