Thanks for your insight. libxml is so ubiquitous as a parser that someone clearly felt motivated.

On Tue, 10 Feb 2026 at 15:03, Antoine Pitrou wrote:
>
> Hello Steve,
>
> On 09/02/2026 at 15:59, Steve Loughran wrote:
> > I saw an interesting video on this topic - was anyone at the conference?
> >
> > https://youtu.be/h3UcecN5fvQ?si=PlhrwMIv8s_wxAF1
> >
> > Antoine, given you clearly understand the topic, what exactly does the
> > content at 25:30 mean (especially in terms of parquet)?
>
> Thanks for the pointer, I have just watched this and it was interesting.
>
> I'm honestly not sure what to make of the content at 25:30. Apparently,
> structuring the input space differently allowed the Sokoban fuzzer to
> converge to a solution much more quickly. But why precisely? The speaker
> doesn't answer that question. My assumption is that it may reduce the
> effective search space, but I'm not a Sokoban player.
>
> In terms of Parquet, I'm even less sure what to make of it. Parquet is
> made of disjoint building blocks: magic numbers, Thrift-encoded
> structures, and assorted ranges of binary data in different formats
> (such as definition levels, etc.). Not to mention some of it can be
> compressed and/or encrypted.
>
> Another approach would be the libxml2 approach also mentioned in that
> talk, i.e. a custom bytecode format encoding calls to XML generation
> APIs (*) (or, in our case, Parquet writing APIs). But, again, Parquet is
> massively more complex than XML, and it's still growing while I presume
> the XML spec is stable. Designing (and writing an interpreter for) such
> a custom bytecode format would be quite an investment.
>
> (*) https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/241
>
> > FYI, the ASF Community over Code conference in Glasgow will have its CfP
> > announced before long, and I think some talks on code security would be
> > good. I've got a working title of one "Open Source and CVEs: the forever
> > war"...
> > Something on fuzzing would be really good too
>
> I had no idea, thanks. I'll try to think about it.
>
> Best regards
>
> Antoine.
>
> >
> > On Mon, 9 Feb 2026 at 09:23, Antoine Pitrou wrote:
> >
> >>
> >> Hi Micah,
> >>
> >> On 08/02/2026 at 21:08, Micah Kornfield wrote:
> >>>>
> >>>> I am also toying with the idea of an encoding/decoding fuzzer that
> >>>> roundtrips data (see "function/inverse pairs" in
> >>>> https://blog.regehr.org/archives/856). The question becomes in which
> >>>> format the fuzzer would accept input data for the encoding step (as
> >>>> Parquet files, which would mean a decoding/encoding/decoding roundtrip?
> >>>> as Arrow IPC files, which are a simpler format?).
> >>>
> >>> Sorry for the late reply. It could also be the IPC JSON testing format?
> >>
> >> It could, but that introduces more overhead. The current Parquet full
> >> file fuzzer runs at around 100 iterations/second. Ideally a low-level
> >> Parquet encoding fuzzer should run at least 1-2 orders of magnitude
> >> faster so as to explore the search space more quickly.
> >>
> >> So my current inclination is to go with a custom fixed-size struct
> >> header indicating the physical type, encoding type and perhaps a couple
> >> other pieces of information.
> >>
> >> Regards
> >>
> >> Antoine.
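
For anyone trying to picture the fixed-size header idea quoted above, here is a rough Python sketch of how such a roundtrip fuzz target might be laid out. Everything in it is an assumption made for illustration: the 4-byte header layout, the numeric type and encoding codes, and the toy run-length encoding are hypothetical and are not Parquet's real enums, encodings or APIs. The point is only the shape: a tiny header selects the physical type and encoding, the remaining bytes are read as raw values, and the target checks that decode(encode(values)) gives the values back.

```python
import struct
from itertools import groupby

# Hypothetical 4-byte header: physical type, encoding, value count.
HEADER = struct.Struct("<BBH")
# Illustrative codes only, not Parquet's real enum values.
PHYSICAL_TYPES = {0: ("<i", 4), 1: ("<q", 8)}  # INT32, INT64


def encode_plain(values, fmt):
    return b"".join(struct.pack(fmt, v) for v in values)


def decode_plain(data, fmt, size):
    return [struct.unpack_from(fmt, data, i)[0] for i in range(0, len(data), size)]


def encode_rle(values, fmt):
    # Toy run-length encoding: (uint16 run length, value) pairs.
    out = bytearray()
    for value, run in groupby(values):
        out += struct.pack("<H", len(list(run))) + struct.pack(fmt, value)
    return bytes(out)


def decode_rle(data, fmt, size):
    values, pos = [], 0
    while pos < len(data):
        (run,) = struct.unpack_from("<H", data, pos)
        (value,) = struct.unpack_from(fmt, data, pos + 2)
        values += [value] * run
        pos += 2 + size
    return values


ENCODINGS = {0: (encode_plain, decode_plain), 1: (encode_rle, decode_rle)}


def parse_input(data):
    """Split fuzzer bytes into (fmt, size, encoding, values), or None if unusable."""
    if len(data) < HEADER.size:
        return None
    ptype, enc, count = HEADER.unpack_from(data)
    if ptype not in PHYSICAL_TYPES or enc not in ENCODINGS:
        return None
    fmt, size = PHYSICAL_TYPES[ptype]
    payload = data[HEADER.size:HEADER.size + count * size]
    usable = len(payload) - len(payload) % size  # drop a trailing partial value
    return fmt, size, enc, decode_plain(payload[:usable], fmt, size)


def roundtrip(data):
    """Function/inverse check: decoding the encoded values must return them."""
    parsed = parse_input(data)
    if parsed is None:
        return None  # input rejected cheaply, nothing to test
    fmt, size, enc, values = parsed
    encode, decode = ENCODINGS[enc]
    decoded = decode(encode(values, fmt), fmt, size)
    assert decoded == values, "encode/decode roundtrip mismatch"
    return decoded
```

In a real harness, something like roundtrip() would be what the fuzzer entry point calls on each input. Because the header is fixed-size, invalid inputs are rejected after reading a few bytes, which is one way the per-iteration cost could stay well below that of a full-file fuzzer.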
