Hi Micah,

Thanks for the suggestions.

1. Fuzz seeds for single columns: yes, this was added in https://github.com/apache/arrow/pull/47892

2. Have more encodings in the seed corpus: that's in the plans, but not done yet (by the way, see the overarching issue tracking this at https://github.com/apache/arrow/issues/43709)

3. A page-level fuzzer: that's quite an intriguing idea. Unfortunately, the Parquet C++ APIs don't allow reading from a single page without giving external metadata (such as the column descriptor), and the fuzz target wouldn't have anywhere to get this metadata from. Basically, all the data-specific info required by a fuzzer should be encoded in the fuzz payload (see the discussion I had in https://github.com/google/oss-fuzz/issues/14437).
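
To illustrate that last point, here is a minimal sketch of what "encoding the metadata in the fuzz payload" could look like. The header layout, the enum values and the names (PagePayload, ParsePagePayload) are all hypothetical and made up for this example; none of this is an existing Arrow or Parquet API. The idea is only that a small fixed-size prefix of the fuzz input selects the column properties, and the remaining bytes are treated as the page data.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical payload layout (an assumption, not an existing format):
//   byte 0   : physical type tag
//   byte 1   : encoding tag
//   byte 2   : max definition level
//   byte 3   : max repetition level
//   bytes 4..: raw page bytes
enum class PhysicalType : uint8_t { kBoolean, kInt32, kInt64, kFloat, kDouble, kByteArray, kCount };
enum class Encoding : uint8_t { kPlain, kRleDictionary, kDeltaBinaryPacked, kByteStreamSplit, kCount };

struct PagePayload {
  PhysicalType type;
  Encoding encoding;
  uint8_t max_def_level;
  uint8_t max_rep_level;
  const uint8_t* page_data;  // points into the fuzz input, not owned
  size_t page_size;
};

const size_t kHeaderSize = 4;

// Splits the fuzz input into metadata header and page bytes.
// Returns false if the input is too short or a tag is out of range.
bool ParsePagePayload(const uint8_t* data, size_t size, PagePayload* out) {
  if (size < kHeaderSize) return false;
  if (data[0] >= static_cast<uint8_t>(PhysicalType::kCount)) return false;
  if (data[1] >= static_cast<uint8_t>(Encoding::kCount)) return false;
  out->type = static_cast<PhysicalType>(data[0]);
  out->encoding = static_cast<Encoding>(data[1]);
  out->max_def_level = data[2];
  out->max_rep_level = data[3];
  out->page_data = data + kHeaderSize;
  out->page_size = size - kHeaderSize;
  return true;
}

// Standard libFuzzer entry point.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  PagePayload payload;
  if (!ParsePagePayload(data, size, &payload)) return 0;  // reject, not a crash
  // A real target would build a column descriptor from the header fields
  // and hand payload.page_data to a page decoder here (omitted in this sketch).
  return 0;
}
```

Rejecting out-of-range tags early (rather than wrapping them with a modulo) keeps the seed files readable by hand, at the cost of the fuzzer wasting some inputs on invalid headers.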


I am also toying with the idea of an encoding/decoding fuzzer that roundtrips data (see "function/inverse pairs" in https://blog.regehr.org/archives/856). The open question is which format the fuzzer should accept as input for the encoding step: Parquet files, which would mean a decoding/encoding/decoding roundtrip, or Arrow IPC files, which are a simpler format?
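
As a rough illustration of the function/inverse pair idea, here is a minimal self-contained sketch. It sidesteps the input-format question by treating the fuzz input as raw byte values, and it uses a toy run-length codec (ToyEncode / ToyDecode, both made up for this example) as a stand-in for a real Parquet encoder and decoder. The property being checked is simply decode(encode(x)) == x.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Toy run-length encoder (a stand-in, not a Parquet encoding):
// emits (run length, value) byte pairs, with runs capped at 255.
std::vector<uint8_t> ToyEncode(const std::vector<uint8_t>& values) {
  std::vector<uint8_t> out;
  size_t i = 0;
  while (i < values.size()) {
    size_t run = 1;
    while (i + run < values.size() && values[i + run] == values[i] && run < 255) {
      ++run;
    }
    out.push_back(static_cast<uint8_t>(run));
    out.push_back(values[i]);
    i += run;
  }
  return out;
}

// Inverse of ToyEncode. Returns false on malformed input
// (odd length or a zero-length run).
bool ToyDecode(const std::vector<uint8_t>& encoded, std::vector<uint8_t>* out) {
  out->clear();
  if (encoded.size() % 2 != 0) return false;
  for (size_t i = 0; i < encoded.size(); i += 2) {
    if (encoded[i] == 0) return false;
    out->insert(out->end(), static_cast<size_t>(encoded[i]), encoded[i + 1]);
  }
  return true;
}

// The roundtrip property: decoding the encoded data yields the original.
bool RoundTrips(const std::vector<uint8_t>& values) {
  std::vector<uint8_t> decoded;
  return ToyDecode(ToyEncode(values), &decoded) && decoded == values;
}

// Standard libFuzzer entry point: abort so the fuzzer reports a mismatch.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  std::vector<uint8_t> values(data, data + size);
  if (!RoundTrips(values)) std::abort();
  return 0;
}
```

With a real codec, the fuzz input would first have to be turned into typed values (for example by reading a Parquet or Arrow IPC file), which is exactly where the format question above comes in.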

Regards

Antoine.


On 14/12/2025 at 07:11, Micah Kornfield wrote:
Hi Antoine,
I haven't looked at the seed corpus in a while, but one idea could be to
make sure we have seeds that fuzz columns of a single type +
encoding, to reduce the search space the fuzzer would need to explore to
find issues with any specific encoding (another approach would be to
potentially have a page-level fuzzer).

Cheers,
Micah

On Mon, Dec 8, 2025 at 1:10 AM Antoine Pitrou <[email protected]> wrote:


Hello,

We have been fuzzing the C++ Parquet reader for years as part of fuzzing
Arrow C++ on OSS-Fuzz (1). This has helped us find dozens of issues and
make the Parquet reader more robust against fringe cases, corrupt
or invalid files.

However, the fuzzing setup has remained largely unchanged, despite the
Parquet reader accruing additional features and complexity.

Recently, my employer QuantStack secured some funding from the Sovereign
Tech Fund for various initiatives on the Arrow project (2). One of them
is to improve the fuzzing setup, and part of that is to improve the
Parquet fuzz target.

The work has already started and we have integrated a number of changes
to test more features and variations, and expand our seed corpus. For
example, we will now be able to fuzz the reading of Parquet encrypted
files (3).

We welcome any suggestions for further improvements on Parquet fuzzing.

Regards

Antoine.


(1) https://arrow.apache.org/docs/developers/cpp/fuzzing.html

(2) https://medium.com/@QuantStack/sovereign-tech-agency-invests-in-apache-arrows-future-with-quantstack-d2f84c21c2cc

(3) https://github.com/apache/arrow/pull/48336





