> I am also toying with the idea of an encoding/decoding fuzzer that
> roundtrips data (see "function/inverse pairs" in
> https://blog.regehr.org/archives/856). The question becomes in which
> format the fuzzer would accept input data for the encoding step (as
> Parquet files, which would mean a decoding/encoding/decoding roundtrip?
> as Arrow IPC files, which are a simpler format?).

Sorry for the late reply. It could also be the IPC JSON testing format?

On Mon, Dec 15, 2025 at 12:59 AM Antoine Pitrou <[email protected]> wrote:

>
> Hi Micah,
>
> Thanks for the suggestions.
>
> 1. Fuzz seeds for single columns: yes, this was added in
> https://github.com/apache/arrow/pull/47892
>
> 2. Have more encodings in the seed corpus: that's in the plans, but not
> done yet (by the way, see the overarching issue tracking this at
> https://github.com/apache/arrow/issues/43709)
>
> 3. A page-level fuzzer: that's quite an intriguing idea. Unfortunately,
> the Parquet C++ APIs don't allow reading from a single page without
> giving external metadata (such as the column descriptor), and the fuzz
> target wouldn't have anywhere to get this metadata from. Basically, all
> the data-specific info required by a fuzzer should be encoded in the
> fuzz payload (see the discussion I had in
> https://github.com/google/oss-fuzz/issues/14437).
>
> I am also toying with the idea of an encoding/decoding fuzzer that
> roundtrips data (see "function/inverse pairs" in
> https://blog.regehr.org/archives/856). The question becomes in which
> format the fuzzer would accept input data for the encoding step (as
> Parquet files, which would mean a decoding/encoding/decoding roundtrip?
> as Arrow IPC files, which are a simpler format?).
>
> Regards
>
> Antoine.
>
>
> On 14/12/2025 at 07:11, Micah Kornfield wrote:
> > Hi Antoine,
> > I haven't looked at the seed corpus in a while, but one idea could be to
> > make sure we have seeds that fuzz columns of a single type +
> > encoding, to lower the search space the fuzzer would need to explore to
> > find issues with any specific encoding (another approach would be to
> > potentially have a page-level fuzzer).
> >
> > Cheers,
> > Micah
> >
> > On Mon, Dec 8, 2025 at 1:10 AM Antoine Pitrou <[email protected]> wrote:
> >
> >>
> >> Hello,
> >>
> >> We have been fuzzing the C++ Parquet reader for years as part of fuzzing
> >> Arrow C++ on OSS-Fuzz (1). This has helped us find dozens of issues and
> >> make the Parquet reader more robust against fringe cases and corrupt
> >> or invalid files.
> >>
> >> However, the fuzzing setup had remained largely the same, despite the
> >> Parquet reader accruing additional features and complexity.
> >>
> >> Recently, my employer QuantStack secured some funding from the Sovereign
> >> Tech Fund for various initiatives on the Arrow project (2). One of them
> >> is to improve the fuzzing setup, and part of that is to improve the
> >> Parquet fuzz target.
> >>
> >> The work has already started and we have integrated a number of changes
> >> to test more features and variations, and expand our seed corpus. For
> >> example, we will now be able to fuzz the reading of Parquet encrypted
> >> files (3).
> >>
> >> We welcome any suggestions for further improvements on Parquet fuzzing.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> (1) https://arrow.apache.org/docs/developers/cpp/fuzzing.html
> >>
> >> (2)
> >> https://medium.com/@QuantStack/sovereign-tech-agency-invests-in-apache-arrows-future-with-quantstack-d2f84c21c2cc
> >>
> >> (3) https://github.com/apache/arrow/pull/48336
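
For readers unfamiliar with the "function/inverse pairs" idea quoted above, here is a minimal sketch of what a roundtrip fuzz target checks. It is only an illustration: the run-length codec is a toy stand-in for a real Parquet encoding, and `rle_encode`, `rle_decode` and `fuzz_one_input` are hypothetical names, not Arrow or Parquet APIs. The fuzz payload is taken as the raw data to encode, which sidesteps the input-format question discussed in the thread rather than answering it.

```python
def rle_encode(data: bytes) -> bytes:
    """Toy run-length encoder: emits (run_length, value) byte pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Cap each run at 255 so the length fits in a single byte.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode; rejects malformed input instead of guessing."""
    if len(encoded) % 2 != 0:
        raise ValueError("truncated (run_length, value) pair")
    out = bytearray()
    for i in range(0, len(encoded), 2):
        run, value = encoded[i], encoded[i + 1]
        if run == 0:
            raise ValueError("zero-length run")
        out += bytes([value]) * run
    return bytes(out)


def fuzz_one_input(payload: bytes) -> None:
    """Hypothetical fuzz entry point: encode, decode, compare with the input."""
    decoded = rle_decode(rle_encode(payload))
    if decoded != payload:
        # A real fuzz target would crash here so the fuzzer records the input.
        raise AssertionError("roundtrip mismatch")
```

The property being tested is simply decode(encode(x)) == x for every payload x, so the fuzzer needs no oracle beyond the input itself.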
