Hi Antoine,
I haven't looked at the seed corpus in a while, but one idea could be to
make sure we have seeds that exercise columns of a single type +
encoding, to reduce the search space the fuzzer would need to explore to
find issues with any specific encoding (another approach would be to
potentially have a page-level fuzzer).
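
A minimal sketch of what such single-type, single-encoding seeds could
look like, assuming pyarrow is available. The seed matrix, sample values,
file naming and the `write_seeds` helper are all hypothetical and are not
taken from the existing Arrow seed corpus tooling:

```python
import os

# Hypothetical seed matrix: one (Arrow type, Parquet encoding) pair per file.
# Illustrative only, not an exhaustive list of valid combinations.
SEED_MATRIX = [
    ("int32", "PLAIN"),
    ("int32", "DELTA_BINARY_PACKED"),
    ("int64", "PLAIN"),
    ("int64", "DELTA_BINARY_PACKED"),
    ("float32", "PLAIN"),
    ("float32", "BYTE_STREAM_SPLIT"),
    ("float64", "PLAIN"),
    ("float64", "BYTE_STREAM_SPLIT"),
    ("binary", "PLAIN"),
    ("binary", "DELTA_LENGTH_BYTE_ARRAY"),
    ("binary", "DELTA_BYTE_ARRAY"),
]

# Small sample values per type, including a null and some boundary values.
SAMPLE_VALUES = {
    "int32": [0, 1, -1, 2**31 - 1, None],
    "int64": [0, 1, -1, 2**63 - 1, None],
    "float32": [0.0, 1.5, -1.5, None],
    "float64": [0.0, 1.5, -1.5, None],
    "binary": [b"", b"a", b"abc", None],
}


def seed_name(type_name, encoding):
    """Build a file name that records the type and encoding of the seed."""
    return f"single_{type_name}_{encoding.lower()}.parquet"


def write_seeds(out_dir):
    """Write one single-column Parquet file per (type, encoding) pair."""
    # Imported lazily so the seed matrix can be inspected without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for type_name, encoding in SEED_MATRIX:
        arrow_type = getattr(pa, type_name)()
        column = pa.array(SAMPLE_VALUES[type_name], type=arrow_type)
        table = pa.table({"col": column})
        path = os.path.join(out_dir, seed_name(type_name, encoding))
        # Dictionary encoding is disabled so the requested encoding is used.
        pq.write_table(
            table,
            path,
            use_dictionary=False,
            column_encoding={"col": encoding},
        )
        paths.append(path)
    return paths
```

Calling `write_seeds("seed_corpus")` would produce one small file per
pair. Dictionary-encoded seeds would need a separate call with
`use_dictionary=True` and no `column_encoding`.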

Cheers,
Micah

On Mon, Dec 8, 2025 at 1:10 AM Antoine Pitrou <[email protected]> wrote:

>
> Hello,
>
> We have been fuzzing the C++ Parquet reader for years as part of fuzzing
> Arrow C++ on OSS-Fuzz (1). This has helped us find dozens of issues and
> make the Parquet reader more robust against fringe cases, corrupt
> or invalid files.
>
> However, the fuzzing setup had remained relatively the same, despite the
> Parquet reader accruing additional features and complexity.
>
> Recently, my employer QuantStack secured some funding from the Sovereign
> Tech Fund for various initiatives on the Arrow project (2). One of them
> is to improve the fuzzing setup, and part of that is to improve the
> Parquet fuzz target.
>
> The work has already started and we have integrated a number of changes
> to test more features and variations, and expand our seed corpus. For
> example, we will now be able to fuzz the reading of Parquet encrypted
> files (3).
>
> We welcome any suggestions for further improvements on Parquet fuzzing.
>
> Regards
>
> Antoine.
>
>
> (1) https://arrow.apache.org/docs/developers/cpp/fuzzing.html
>
> (2) https://medium.com/@QuantStack/sovereign-tech-agency-invests-in-apache-arrows-future-with-quantstack-d2f84c21c2cc
>
> (3) https://github.com/apache/arrow/pull/48336
>
>
>
