OK, after I enabled (*) such a check in the IPC file fuzzer, we got one OSS-Fuzz report.

It turns out that the Schema is duplicated in the IPC file footer for faster access, but that copy may differ from the one at the head of the IPC stream. Validating that the two are identical would be too costly in normal IPC file reader operation, so I might have to introduce special treatment in the IPC file fuzzer.

Regards

Antoine.


(*) https://github.com/apache/arrow/pull/49312


On 18/02/2026 at 00:15, Dewey Dunnington wrote:
Thanks for raising this!

I agree with Joel and I think it's quite useful. As a concrete example of
where this is used, because nanoarrow doesn't officially support the file
format (only to the extent needed for integration testing), it has allowed
Arrow files to be read by DuckDB's arrow extension (by skipping the first 8
bytes and pretending it's a stream). This is great for sources where random
access is harder to support but where there is some advantage to supplying
the footer for clients that can take advantage (e.g., statically hosting a
file via http).

Cheers,

-dewey

On Tue, Feb 17, 2026 at 1:37 PM Joel Lubinitsky <[email protected]> wrote:

Hi Antoine, thanks for raising this.

There is a test [1] in the Go implementation that validates this behavior.
It writes an IPC file, then ensures the same data is read back using a
stream reader starting 8 bytes past the start of the buffer.

I have personally seen and written code that assumes the embedded stream
can be safely read to consume the contents of the IPC file. There is also
an open PR [2] adding integration tests for related behavior, but it was
never merged.

IMHO the flexibility to consume an IPC file as a stream improves its value
compared to alternatives. Combined with existing usage relying on this
assumption, my preference would be toward formalizing this as an explicit
requirement.

[1] https://github.com/apache/arrow-go/blob/8d81fc39254b7a51daf1fe1a272c24169a059878/arrow/ipc/file_test.go#L83
[2] https://github.com/apache/arrow/pull/43834

Thanks,
Joel

On Tue, Feb 17, 2026 at 1:20 PM Antoine Pitrou <[email protected]> wrote:


Hello,

The IPC file format is defined as the IPC stream format, preceded by a
header (the Arrow magic bytes) and followed by a footer (a catalog of
record batches, and the Arrow magic bytes). Thus, reading and writing
IPC files can reuse the same basic building blocks as for IPC streams
(this is almost trivial for writing, which is usually done sequentially).
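To make the layout above concrete, here is a minimal sketch that splits a raw IPC file payload into its embedded stream and its footer. It assumes the layout from the Arrow columnar format specification: 6 magic bytes plus 2 padding bytes at the start, and a little-endian int32 footer size just before the trailing magic bytes. The name split_ipc_file is hypothetical, and the sketch does not parse the footer contents.

```python
import struct

MAGIC = b"ARROW1"
HEADER_SIZE = 8  # 6 magic bytes + 2 padding bytes
TRAILER_SIZE = 4 + len(MAGIC)  # int32 footer size + trailing magic bytes


def split_ipc_file(data: bytes):
    """Return (stream_bytes, footer_bytes) for a raw IPC file payload.

    Hypothetical helper: it only checks the framing and does not
    decode the Flatbuffers footer or any stream message.
    """
    if len(data) < HEADER_SIZE + TRAILER_SIZE:
        raise ValueError("payload too short to be an IPC file")
    if data[:len(MAGIC)] != MAGIC or data[-len(MAGIC):] != MAGIC:
        raise ValueError("missing Arrow magic bytes")
    # The footer size sits right before the trailing magic bytes
    # (assumed little-endian here).
    (footer_size,) = struct.unpack("<i", data[-TRAILER_SIZE:-len(MAGIC)])
    footer_start = len(data) - TRAILER_SIZE - footer_size
    if footer_size < 0 or footer_start < HEADER_SIZE:
        raise ValueError("invalid footer size")
    stream = data[HEADER_SIZE:footer_start]
    footer = data[footer_start:-TRAILER_SIZE]
    return stream, footer
```

A stream reader does not need the footer boundary at all: it can start 8 bytes in and stop at the end-of-stream marker. The split is only useful when the two parts are to be inspected separately.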

As a consequence, in practice an IPC file contains a valid IPC stream
(after skipping the 8 header bytes) that reads as the same logical
contents.

However, there is no theoretical guarantee that this is always the case.
Consider an IPC file writer that would write record batches in reverse
order in the footer, compared to their sequential order in the
underlying stream. Or, more generally, an IPC file footer that would
repeat or skip some batches in the stream.
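The divergences described above can be illustrated with a toy model in which each record batch is identified only by its byte offset in the file. This is a sketch under that assumption, not how any Arrow reader works: the function name and the result labels are hypothetical, and a real check would have to decode the footer's block list.

```python
def compare_footer_to_stream(stream_offsets, footer_offsets):
    """Classify how a footer's batch list relates to the stream order.

    Both arguments are sequences of batch byte offsets: the first in
    the order the batches appear in the stream, the second in the
    order the footer lists them. Hypothetical toy model.
    """
    stream_offsets = list(stream_offsets)
    footer_offsets = list(footer_offsets)
    if footer_offsets == stream_offsets:
        return "consistent"
    if len(set(footer_offsets)) < len(footer_offsets):
        return "repeated"  # some batch is listed more than once
    if set(footer_offsets) != set(stream_offsets):
        return "skipped"  # the footer and stream cover different batches
    return "reordered"  # same batches, different order
```

Only the "consistent" case guarantees that reading the file as a stream yields the same batches, in the same order, as reading it through the footer.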

So theoretically, we cannot assume that reading an IPC file as an IPC
stream (after skipping the 8 header bytes) returns the intended contents.

However, it seems that it could be useful to be able to make such an
assumption. Hence these questions:
1. Do all current IPC file writers uphold this assumption?
2. Do we want to make it a more explicit requirement of the IPC file
format?


Context: I've submitted a PR
(https://github.com/apache/arrow/pull/49312) to enable differential
fuzzing in the C++ IPC file fuzzer, where I'm comparing the results of
the IPC file and stream readers on the fuzzing payload.

Regards

Antoine.




