Hi Antoine,

I haven't looked at the seed corpus in a while, but one idea could be to make sure we have seeds that contain columns of a single type + encoding, to reduce the search space the fuzzer would need to explore to find issues with any specific encoding (another approach would be to potentially have a page-level fuzzer).
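
To make the single type + encoding idea a bit more concrete, here is a rough Python sketch of how such a seed plan could be enumerated. Everything in it is hypothetical: the `ENCODINGS_BY_TYPE` table is my assumption about which encodings apply to which physical types and should be checked against the Parquet format specification and the writer used, and the file naming scheme is only for illustration. It does not write any files; it only lists which single-column seeds one might want.

```python
# Hypothetical sketch: plan one single-column seed file per
# (physical type, encoding) pair. The mapping below is an assumption,
# not taken from the Arrow code base; verify it against the Parquet
# format specification before relying on it.
ENCODINGS_BY_TYPE = {
    "BOOLEAN": ["PLAIN", "RLE"],
    "INT32": ["PLAIN", "RLE_DICTIONARY", "DELTA_BINARY_PACKED",
              "BYTE_STREAM_SPLIT"],
    "INT64": ["PLAIN", "RLE_DICTIONARY", "DELTA_BINARY_PACKED",
              "BYTE_STREAM_SPLIT"],
    "FLOAT": ["PLAIN", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT"],
    "DOUBLE": ["PLAIN", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT"],
    "BYTE_ARRAY": ["PLAIN", "RLE_DICTIONARY", "DELTA_LENGTH_BYTE_ARRAY",
                   "DELTA_BYTE_ARRAY"],
    "FIXED_LEN_BYTE_ARRAY": ["PLAIN", "RLE_DICTIONARY", "DELTA_BYTE_ARRAY",
                             "BYTE_STREAM_SPLIT"],
}


def seed_plan():
    """Return (physical_type, encoding, file_name) for each planned seed."""
    plan = []
    for physical_type, encodings in ENCODINGS_BY_TYPE.items():
        for encoding in encodings:
            # Illustrative naming scheme only.
            name = f"seed_{physical_type.lower()}_{encoding.lower()}.parquet"
            plan.append((physical_type, encoding, name))
    return plan


if __name__ == "__main__":
    for physical_type, encoding, name in seed_plan():
        print(f"{name}: one {physical_type} column, {encoding} encoding")
```

Each entry would then be turned into a small file holding exactly one column, so that a mutation of that seed mostly exercises a single decoder.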

Cheers,
Micah

On Mon, Dec 8, 2025 at 1:10 AM Antoine Pitrou <[email protected]> wrote:

> Hello,
>
> We have been fuzzing the C++ Parquet reader for years as part of fuzzing
> Arrow C++ on OSS-Fuzz (1). This has helped us find dozens of issues and
> make the Parquet reader more robust against fringe cases, corrupt
> or invalid files.
>
> However, the fuzzing setup had remained relatively the same, despite the
> Parquet reader accruing additional features and complexity.
>
> Recently, my employer QuantStack secured some funding from the Sovereign
> Tech Fund for various initiatives on the Arrow project (2). One of them
> is to improve the fuzzing setup, and part of that is to improve the
> Parquet fuzz target.
>
> The work has already started and we have integrated a number of changes
> to test more features and variations, and expand our seed corpus. For
> example, we will now be able to fuzz the reading of Parquet encrypted
> files (3).
>
> We welcome any suggestions for further improvements on Parquet fuzzing.
>
> Regards
>
> Antoine.
>
>
> (1) https://arrow.apache.org/docs/developers/cpp/fuzzing.html
> (2) https://medium.com/@QuantStack/sovereign-tech-agency-invests-in-apache-arrows-future-with-quantstack-d2f84c21c2cc
> (3) https://github.com/apache/arrow/pull/48336
