Hello Steve,
On 09/02/2026 at 15:59, Steve Loughran wrote:
I saw an interesting video on this topic. Was anyone at the conference?
https://youtu.be/h3UcecN5fvQ?si=PlhrwMIv8s_wxAF1
Antoine, given you clearly understand the topic, what exactly does the
content at 25:30 mean (especially in terms of parquet)?
Thanks for the pointer, I have just watched this and it was interesting.
I'm honestly not sure what to make of the content at 25:30. Apparently,
structuring the input space differently allowed the Sokoban fuzzer to
converge to a solution much more quickly. But why precisely? The speaker
doesn't answer that question. My assumption is that it may reduce the
effective search space, but I'm not a Sokoban player.
In terms of Parquet, I'm even less sure what to make of it. Parquet is
made of disjoint building blocks: magic numbers, Thrift-encoded
structures, and assorted ranges of binary data in different formats
(definition levels, etc.). Not to mention that some of it can be
compressed and/or encrypted.
Another approach would be the libxml2 approach also mentioned in that
talk, i.e. a custom bytecode format encoding calls to XML generation
APIs (*) (or, in our case, Parquet writing APIs). But, again, Parquet is
massively more complex than XML, and it's still growing while I presume
the XML spec is stable. Designing (and writing an interpreter for) such
a custom bytecode format would be quite an investment.
(*) https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/241
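To make the bytecode idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three opcodes, the operand layout and the RecordingWriter class are hypothetical and do not correspond to libxml2's format or to any real Parquet writer API. It only shows the general shape: fuzzer bytes are decoded as a sequence of calls, and every input decodes to some valid call sequence.

```python
class RecordingWriter:
    """Hypothetical stand-in for a real writer API; it only records calls."""

    def __init__(self):
        self.calls = []

    def start_row_group(self):
        self.calls.append(("start_row_group",))

    def write_int32(self, value):
        self.calls.append(("write_int32", value))

    def write_null(self):
        self.calls.append(("write_null",))


def interpret(program: bytes, writer: RecordingWriter) -> None:
    """Decode fuzzer-provided bytes as opcodes driving the writer."""
    pos = 0
    while pos < len(program):
        # Reduce modulo the opcode count so every byte is a valid opcode.
        opcode = program[pos] % 3
        pos += 1
        if opcode == 0:
            writer.start_row_group()
        elif opcode == 1:
            # Operand: 4 bytes, little-endian, signed. Stop if truncated.
            if pos + 4 > len(program):
                break
            operand = program[pos:pos + 4]
            writer.write_int32(int.from_bytes(operand, "little", signed=True))
            pos += 4
        else:
            writer.write_null()
```

Even this toy version needs one opcode per API entry point, which hints at how much work a full Parquet equivalent would be.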
FYI, the ASF Community over Code conference in Glasgow will have its CfP
announced before long, and I think some talks on code security would be
good. I've got a working title for one: "Open Source and CVEs: the forever
war"...
Something on fuzzing would be really good too.
I had no idea, thanks. I'll try to think about it.
Best regards
Antoine.
On Mon, 9 Feb 2026 at 09:23, Antoine Pitrou wrote:
Hi Micah,
On 08/02/2026 at 21:08, Micah Kornfield wrote:
I am also toying with the idea of an encoding/decoding fuzzer that
roundtrips data (see "function/inverse pairs" in
https://blog.regehr.org/archives/856). The question becomes in which
format the fuzzer would accept input data for the encoding step (as
Parquet files, which would mean a decoding/encoding/decoding roundtrip?
as Arrow IPC files, which are a simpler format?).
Sorry for the late reply. It could also be the IPC JSON testing format?
It could, but that introduces more overhead. The current Parquet full
file fuzzer runs at around 100 iterations/second. Ideally a low-level
Parquet encoding fuzzer should run at least 1-2 orders of magnitude
faster so as to explore the search space more quickly.
So my current inclination is to go with a custom fixed-size struct
header indicating the physical type, encoding type and perhaps a couple
of other pieces of information.
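As a rough illustration of what such a header could look like, here is a minimal sketch in Python. The 8-byte layout, the flags and num_values fields, and the numeric codes are all assumptions made up for this example; only the type and encoding names come from Parquet. The remaining bytes after the header would be the raw values fed to the encoder.

```python
import struct
from typing import NamedTuple, Optional

# Hypothetical fixed-size header (layout is an assumption, not a spec):
#   <  little-endian, no padding
#   B  physical type code
#   B  encoding code
#   H  flags (reserved)
#   I  number of values
HEADER = struct.Struct("<BBHI")

# Names exist in Parquet; their positions here are arbitrary.
PHYSICAL_TYPES = ("BOOLEAN", "INT32", "INT64", "FLOAT", "DOUBLE", "BYTE_ARRAY")
ENCODINGS = ("PLAIN", "RLE", "DELTA_BINARY_PACKED", "BYTE_STREAM_SPLIT")


class FuzzInput(NamedTuple):
    physical_type: str
    encoding: str
    flags: int
    num_values: int
    payload: bytes


def parse_fuzz_input(data: bytes) -> Optional[FuzzInput]:
    """Split fuzzer input into a decoded header and a raw payload."""
    if len(data) < HEADER.size:
        return None  # too short to hold a header, reject early
    type_code, enc_code, flags, num_values = HEADER.unpack_from(data)
    # Reduce codes modulo the table size so that no input is rejected
    # merely for carrying an out-of-range code.
    physical_type = PHYSICAL_TYPES[type_code % len(PHYSICAL_TYPES)]
    encoding = ENCODINGS[enc_code % len(ENCODINGS)]
    return FuzzInput(physical_type, encoding, flags, num_values,
                     data[HEADER.size:])
```

Parsing this costs a single unpack call, so the header itself should not get in the way of the iteration rate mentioned above.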
Regards
Antoine.