Re: Fuzzing Parquet C++

Antoine Pitrou Tue, 10 Feb 2026 07:03:45 -0800


Hello Steve,

On 09/02/2026 at 15:59, Steve Loughran wrote:

I saw an interesting video on this topic - was anyone at the conference?

https://youtu.be/h3UcecN5fvQ?si=PlhrwMIv8s_wxAF1

Antoine, given you clearly understand the topic, what exactly does the
content at 25:30 mean (especially in terms of parquet)?


Thanks for the pointer, I have just watched this and it was interesting.

I'm honestly not sure what to make of the content at 25:30. Apparently,
structuring the input space differently allowed the Sokoban fuzzer to
converge to a solution much more quickly. But why precisely? The speaker
doesn't answer that question. My assumption is that it may reduce the
effective search space, but I'm not a Sokoban player.

In terms of Parquet, I'm even less sure what to make of it. Parquet is
made of disjoint building blocks: magic numbers, Thrift-encoded
structures, and assorted ranges of binary data in different formats
(such as definition levels, etc.). Not to mention some of it can be
compressed and/or encrypted.

Another approach would be the libxml2 approach also mentioned in that
talk, i.e. a custom bytecode format encoding calls to XML generation
APIs (*) (or, in our case, Parquet writing APIs). But, again, Parquet is
massively more complex than XML, and it's still growing while I presume
the XML spec is stable. Designing (and writing an interpreter for) such
a custom bytecode format would be quite an investment.


(*) https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/241
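To make the bytecode idea above more concrete, here is a minimal sketch
in Python (chosen only for brevity; a real harness would presumably be
C++). It assumes a made-up three-opcode bytecode and a hypothetical
RecordingWriter class standing in for a Parquet writing API. None of
these names come from libxml2 or Parquet C++.

```python
import struct


class RecordingWriter:
    """Hypothetical stand-in for a writing API: it only records calls."""

    def __init__(self):
        self.calls = []

    def new_row_group(self):
        self.calls.append(("new_row_group",))

    def write_int32(self, value):
        self.calls.append(("write_int32", value))

    def write_null(self):
        self.calls.append(("write_null",))

    def close(self):
        self.calls.append(("close",))


# Made-up opcodes, for illustration only.
OP_NEW_ROW_GROUP, OP_WRITE_INT32, OP_WRITE_NULL = range(3)
NUM_OPCODES = 3


def run_program(program: bytes, writer) -> None:
    """Interpret fuzzer-provided bytes as a sequence of writer API calls."""
    pos = 0
    while pos < len(program):
        # Fold every byte onto a valid opcode so no input is wasted.
        op = program[pos] % NUM_OPCODES
        pos += 1
        if op == OP_NEW_ROW_GROUP:
            writer.new_row_group()
        elif op == OP_WRITE_INT32:
            if pos + 4 > len(program):
                break  # truncated operand: stop interpreting
            (value,) = struct.unpack_from("<i", program, pos)
            pos += 4
            writer.write_int32(value)
        else:
            writer.write_null()
    # Always finish with a well-formed call sequence.
    writer.close()
```

Whatever bytes the fuzzer produces, the interpreter turns them into a
valid sequence of API calls, which is the point of the approach. The
cost is that each new writer feature needs a new opcode.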

FYI, the ASF Community over Code conference in Glasgow will have its CfP
announced before long, and I think some talks on code security would be
good. I've got a working title of one "Open Source and CVEs: the forever
war"...
Something on fuzzing would be really good too


I had no idea, thanks. I'll try to think about it.

Best regards

Antoine.


On Mon, 9 Feb 2026 at 09:23, Antoine Pitrou <[email protected]> wrote:


Hi Micah,

On 08/02/2026 at 21:08, Micah Kornfield wrote:


I am also toying with the idea of an encoding/decoding fuzzer that
roundtrips data (see "function/inverse pairs" in
https://blog.regehr.org/archives/856). The question becomes in which
format the fuzzer would accept input data for the encoding step (as
Parquet files, which would mean a decoding/encoding/decoding roundtrip?
as Arrow IPC files, which are a simpler format?).
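As an illustration of the "function/inverse pair" idea, here is a
minimal Python sketch. It uses a toy zigzag + varint integer codec as a
stand-in for a real Parquet encoder/decoder pair. The function names and
the choice to read the fuzzer input as little-endian int32 values are
assumptions made for this example only.

```python
import struct


def encode(values):
    """Toy encoder: zigzag-map each signed int, then emit it as a varint."""
    out = bytearray()
    for v in values:
        z = (v << 1) ^ (v >> 63)  # zigzag: small magnitudes -> small codes
        while z >= 0x80:
            out.append((z & 0x7F) | 0x80)  # low 7 bits + continuation bit
            z >>= 7
        out.append(z)
    return bytes(out)


def decode(data):
    """Inverse of encode(); raises ValueError on a truncated varint."""
    values = []
    z = shift = 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append((z >> 1) ^ -(z & 1))  # undo zigzag
            z = shift = 0
    if shift:
        raise ValueError("truncated varint")
    return values


def fuzz_one_input(data: bytes) -> None:
    """Roundtrip check: decode(encode(x)) must give back x."""
    n = len(data) // 4
    values = list(struct.unpack_from(f"<{n}i", data))
    assert decode(encode(values)) == values
```

The fuzzer never needs to know what the correct encoded bytes are. It
only checks that decoding undoes encoding, so any mismatch or crash is
a bug in one of the two functions.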


Sorry for the late reply. It could also be the IPC JSON testing format?


It could, but that introduces more overhead. The current Parquet full
file fuzzer runs at around 100 iterations/second. Ideally a low-level
Parquet encoding fuzzer should run at least 1-2 orders of magnitude
faster so as to explore the search space more quickly.

So my current inclination is to go with a custom fixed-size struct
header indicating the physical type, encoding type and perhaps a couple
other pieces of information.
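
A rough sketch of what such a fixed-size header could look like, in
Python for brevity. The field layout (8 bytes: physical type, encoding,
type length, value count), the names and the enum numbering are all
assumptions for illustration, not a settled design.

```python
import struct
from enum import IntEnum
from typing import NamedTuple, Optional, Tuple


class PhysicalType(IntEnum):  # numbering is illustrative only
    BOOLEAN = 0
    INT32 = 1
    INT64 = 2
    FLOAT = 4
    DOUBLE = 5
    BYTE_ARRAY = 6


class Encoding(IntEnum):  # numbering is illustrative only
    PLAIN = 0
    RLE = 3
    DELTA_BINARY_PACKED = 5
    BYTE_STREAM_SPLIT = 9


class FuzzHeader(NamedTuple):
    physical_type: PhysicalType
    encoding: Encoding
    type_length: int  # hypothetical: only meaningful for fixed-length types
    num_values: int


# Little-endian, no padding: u8, u8, u16, u32 = 8 bytes.
HEADER = struct.Struct("<BBHI")


def parse_fuzz_input(data: bytes) -> Optional[Tuple[FuzzHeader, bytes]]:
    """Split fuzzer input into (header, payload); None if it is unusable."""
    if len(data) < HEADER.size:
        return None
    ptype, enc, type_length, num_values = HEADER.unpack_from(data)
    try:
        header = FuzzHeader(PhysicalType(ptype), Encoding(enc),
                            type_length, num_values)
    except ValueError:
        return None  # unknown enum value: reject cheaply
    return header, data[HEADER.size:]
```

Parsing a fixed-size header costs almost nothing per iteration, which
matters for the throughput goal above. The remaining bytes are handed
directly to the decoder selected by the header.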

Regards

Antoine.
