Re: Fuzzing Parquet C++

Micah Kornfield Sun, 08 Feb 2026 12:08:23 -0800

>
> I am also toying with the idea of an encoding/decoding fuzzer that
> roundtrips data (see "function/inverse pairs" in
> https://blog.regehr.org/archives/856). The question becomes in which
> format the fuzzer would accept input data for the encoding step (as
> Parquet files, which would mean a decoding/encoding/decoding roundtrip?
> as Arrow IPC files, which are a simpler format?).



Sorry for the late reply. Could it also be the IPC JSON testing format?
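
For illustration, here is a minimal sketch of the "function/inverse pairs" idea as a libFuzzer-style target. It is not Parquet or Arrow code: ToyEncode, ToyDecode and RoundTrips are hypothetical names, and the run-length scheme is only a toy stand-in for a real encoding. The raw fuzz bytes are used directly as the input values, which sidesteps the input-format question rather than answering it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Toy run-length encoder: emits (run length, value) byte pairs.
// Hypothetical stand-in for a real encoder; runs are capped at 255.
std::vector<uint8_t> ToyEncode(const std::vector<uint8_t>& values) {
  std::vector<uint8_t> out;
  for (size_t i = 0; i < values.size();) {
    size_t run = 1;
    while (i + run < values.size() && values[i + run] == values[i] &&
           run < 255) {
      ++run;
    }
    assert(run >= 1 && run <= 255);  // internal invariant of the encoder
    out.push_back(static_cast<uint8_t>(run));
    out.push_back(values[i]);
    i += run;
  }
  return out;
}

// Inverse of ToyEncode. Returns false on malformed input instead of crashing.
bool ToyDecode(const std::vector<uint8_t>& encoded, std::vector<uint8_t>* out) {
  out->clear();
  if (encoded.size() % 2 != 0) return false;  // dangling run length
  for (size_t i = 0; i < encoded.size(); i += 2) {
    if (encoded[i] == 0) return false;  // zero-length runs are never emitted
    out->insert(out->end(), static_cast<size_t>(encoded[i]), encoded[i + 1]);
  }
  return true;
}

// The property under test: decode(encode(x)) == x for every input x.
bool RoundTrips(const std::vector<uint8_t>& values) {
  std::vector<uint8_t> decoded;
  return ToyDecode(ToyEncode(values), &decoded) && decoded == values;
}

// Standard libFuzzer entry point: abort on any roundtrip mismatch so the
// fuzzer reports the offending input.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  std::vector<uint8_t> values(data, data + size);
  if (!RoundTrips(values)) std::abort();
  return 0;
}
```

A real target would swap the toy pair for an actual writer and reader, and would need the payload to carry type information as well as values.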

On Mon, Dec 15, 2025 at 12:59 AM Antoine Pitrou <[email protected]> wrote:

>
> Hi Micah,
>
> Thanks for the suggestions.
>
> 1. Fuzz seeds for single columns: yes, this was added in
> https://github.com/apache/arrow/pull/47892
>
> 2. Have more encodings in the seed corpus: that's in the plans, but not
> done yet (by the way, see the overarching issue tracking this at
> https://github.com/apache/arrow/issues/43709)
>
> 3. A page-level fuzzer: that's quite an intriguing idea. Unfortunately,
> the Parquet C++ APIs don't allow reading from a single page without
> giving external metadata (such as the column descriptor), and the fuzz
> target wouldn't have anywhere to get this metadata from. Basically, all
> the data-specific info required by a fuzzer should be encoded in the
> fuzz payload (see the discussion I had in
> https://github.com/google/oss-fuzz/issues/14437).
>
>
> I am also toying with the idea of an encoding/decoding fuzzer that
> roundtrips data (see "function/inverse pairs" in
> https://blog.regehr.org/archives/856). The question becomes in which
> format the fuzzer would accept input data for the encoding step (as
> Parquet files, which would mean a decoding/encoding/decoding roundtrip?
> as Arrow IPC files, which are a simpler format?).
>
> Regards
>
> Antoine.
>
>
> On 14/12/2025 at 07:11, Micah Kornfield wrote:
> > Hi Antoine,
> > I haven't looked at the seed corpus in a while, but one idea could be to
> > make sure we have seeds that each fuzz columns of a single type +
> > encoding, to shrink the search space the fuzzer has to explore to find
> > issues with any specific encoding (another approach would be to
> > potentially have a page-level fuzzer).
> >
> > Cheers,
> > Micah
> >
> > On Mon, Dec 8, 2025 at 1:10 AM Antoine Pitrou <[email protected]> wrote:
> >
> >>
> >> Hello,
> >>
> >> We have been fuzzing the C++ Parquet reader for years as part of fuzzing
> >> Arrow C++ on OSS-Fuzz (1). This has helped us find dozens of issues and
> >> make the Parquet reader more robust against fringe cases, corrupt
> >> or invalid files.
> >>
> >> However, the fuzzing setup had remained largely unchanged, despite the
> >> Parquet reader accruing additional features and complexity.
> >>
> >> Recently, my employer QuantStack secured some funding from the Sovereign
> >> Tech Fund for various initiatives on the Arrow project (2). One of them
> >> is to improve the fuzzing setup, and part of that is to improve the
> >> Parquet fuzz target.
> >>
> >> The work has already started and we have integrated a number of changes
> >> to test more features and variations, and expand our seed corpus. For
> >> example, we will now be able to fuzz the reading of Parquet encrypted
> >> files (3).
> >>
> >> We welcome any suggestions for further improvements on Parquet fuzzing.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> (1) https://arrow.apache.org/docs/developers/cpp/fuzzing.html
> >>
> >> (2) https://medium.com/@QuantStack/sovereign-tech-agency-invests-in-apache-arrows-future-with-quantstack-d2f84c21c2cc
> >>
> >> (3) https://github.com/apache/arrow/pull/48336
> >>
> >>
> >>
> >
>
>
>
