Re: Fuzzing Parquet C++

Steve Loughran Tue, 10 Feb 2026 10:48:49 -0800

Thanks for your insight. libxml is so ubiquitous as a parser that someone
clearly felt motivated.



On Tue, 10 Feb 2026 at 15:03, Antoine Pitrou <[email protected]> wrote:

>
> Hello Steve,
>
> Le 09/02/2026 à 15:59, Steve Loughran a écrit :
> > I saw an interesting video on this topic - was anyone at the conference?
> >
> > https://youtu.be/h3UcecN5fvQ?si=PlhrwMIv8s_wxAF1
> >
> > Antoine, given you clearly understand the topic, what exactly does the
> > content at 25:30 mean (especially in terms of parquet)?
>
> Thanks for the pointer, I have just watched this and it was interesting.
>
> I'm honestly not sure what to make of the content at 25:30. Apparently,
> structuring the input space differently allowed the Sokoban fuzzer to
> converge to a solution much more quickly. But why precisely? The speaker
> doesn't answer that question. My assumption is that it may reduce the
> effective search space, but I'm not a Sokoban player.
>
> In terms of Parquet, I'm even less sure what to make of it. Parquet is
> made of disjoint building blocks: magic numbers, Thrift-encoded
> structures, and assorted ranges of binary data in different formats
> (such as definition levels, etc.). Not to mention some of it can be
> compressed and/or encrypted.
>
> Another approach would be the libxml2 approach also mentioned in that
> talk, i.e. a custom bytecode format encoding calls to XML generation
> APIs (*) (or, in our case, Parquet writing APIs). But, again, Parquet is
> massively more complex than XML, and it's still growing while I presume
> the XML spec is stable. Designing (and writing an interpreter for) such
> a custom bytecode format would be quite an investment.
>
> (*) https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/241
>
> > FYI, the ASF Community over Code conference in Glasgow will have its CfP
> > announced before long, and I think some talks on code security would be
> > good. I've got a working title of one "Open Source and CVEs: the forever
> > war"...
> > Something on fuzzing would be really good too
>
> I had no idea, thanks. I'll try to think about it.
>
> Best regards
>
> Antoine.
>
>
>
>
> >
> > On Mon, 9 Feb 2026 at 09:23, Antoine Pitrou <[email protected]> wrote:
> >
> >>
> >> Hi Micah,
> >>
> >> Le 08/02/2026 à 21:08, Micah Kornfield a écrit :
> >>>>
> >>>> I am also toying with the idea of an encoding/decoding fuzzer that
> >>>> roundtrips data (see "function/inverse pairs" in
> >>>> https://blog.regehr.org/archives/856). The question becomes in which
> >>>> format the fuzzer would accept input data for the encoding step (as
> >>>> Parquet files, which would mean a decoding/encoding/decoding
> >>>> roundtrip?
> >>>> as Arrow IPC files, which are a simpler format?).
> >>>
> >>> Sorry for the late reply. It could also be the IPC JSON testing
> >>> format?
> >>
> >> It could, but that introduces more overhead. The current Parquet full
> >> file fuzzer runs at around 100 iterations/second. Ideally a low-level
> >> Parquet encoding fuzzer should run at least 1-2 orders of magnitude
> >> faster so as to explore the search space more quickly.
> >>
> >> So my current inclination is to go with a custom fixed-size struct
> >> header indicating the physical type, encoding type and perhaps a couple
> >> other pieces of information.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>
> >
>
>
>
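For illustration only, the two ideas quoted above (a fixed-size struct
header selecting the physical type and encoding, and a roundtrip check on
a function/inverse pair) might be combined roughly as in the sketch below.
Everything in it is an assumption rather than Arrow or Parquet C++ code:
FuzzHeader, ParseHeader, Encode and Decode are hypothetical names, the
header layout is invented, and the run-length codec is a toy stand-in, not
Parquet's RLE encoding. Only LLVMFuzzerTestOneInput is the standard
libFuzzer entry point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical fixed-size header at the start of every fuzz input.
// Field meanings are illustrative assumptions, not a real format.
struct FuzzHeader {
  uint8_t physical_type;  // assumed: 0 = INT32 (the only type handled here)
  uint8_t encoding;       // assumed: 0 = plain, 1 = toy run-length
  uint16_t reserved;      // room for other pieces of information
};
static_assert(sizeof(FuzzHeader) == 4, "header must stay fixed-size");

// Reject inputs that are too short or name an unsupported combination.
std::optional<FuzzHeader> ParseHeader(const uint8_t* data, size_t size) {
  if (size < sizeof(FuzzHeader)) return std::nullopt;
  FuzzHeader h;
  std::memcpy(&h, data, sizeof(h));
  if (h.physical_type != 0 || h.encoding > 1) return std::nullopt;
  return h;
}

// Toy encoder: plain copies the values; run-length emits (count, value).
std::vector<uint8_t> Encode(uint8_t encoding,
                            const std::vector<int32_t>& values) {
  std::vector<uint8_t> out;
  auto put = [&out](int32_t v) {
    uint8_t raw[sizeof(v)];
    std::memcpy(raw, &v, sizeof(v));
    out.insert(out.end(), raw, raw + sizeof(v));
  };
  if (encoding == 0) {
    for (int32_t v : values) put(v);
    return out;
  }
  for (size_t i = 0; i < values.size();) {
    size_t run = 1;
    while (i + run < values.size() && values[i + run] == values[i] &&
           run < 255) {
      ++run;
    }
    assert(run >= 1 && run <= 255);  // the count must fit in one byte
    out.push_back(static_cast<uint8_t>(run));
    put(values[i]);
    i += run;
  }
  return out;
}

// Toy decoder, the inverse of Encode for the same encoding id.
std::vector<int32_t> Decode(uint8_t encoding,
                            const std::vector<uint8_t>& bytes) {
  std::vector<int32_t> out;
  const size_t step = (encoding == 0) ? 4 : 5;  // bytes per record
  for (size_t pos = 0; pos + step <= bytes.size(); pos += step) {
    const size_t run = (encoding == 0) ? 1 : bytes[pos];
    int32_t v;
    std::memcpy(&v, bytes.data() + pos + (step - 4), sizeof(v));
    out.insert(out.end(), run, v);
  }
  return out;
}

// libFuzzer entry point: header, then raw INT32 values, then roundtrip.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  const auto header = ParseHeader(data, size);
  if (!header) return 0;  // uninteresting input, not a failure
  const size_t n = (size - sizeof(FuzzHeader)) / sizeof(int32_t);
  std::vector<int32_t> values(n);
  if (n > 0) {
    std::memcpy(values.data(), data + sizeof(FuzzHeader), n * sizeof(int32_t));
  }
  // Function/inverse pair: decoding the encoded values must give them back.
  if (Decode(header->encoding, Encode(header->encoding, values)) != values) {
    std::abort();
  }
  return 0;
}
```

Because the header is fixed-size and the payload is taken as raw values,
the harness does no file parsing before it reaches the codec, which is the
property that would let such a fuzzer iterate faster than a full-file one.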
