Re: Fuzzing Parquet C++

Antoine Pitrou Mon, 15 Dec 2025 00:59:44 -0800


Hi Micah,

Thanks for the suggestions.

1. Fuzz seeds for single columns: yes, this was added in https://github.com/apache/arrow/pull/47892

2. Have more encodings in the seed corpus: that's in the plans, but not done yet (by the way, see the overarching issue tracking this at https://github.com/apache/arrow/issues/43709)

3. A page-level fuzzer: that's quite an intriguing idea. Unfortunately, the Parquet C++ APIs don't allow reading from a single page without giving external metadata (such as the column descriptor), and the fuzz target wouldn't have anywhere to get this metadata from. Basically, all the data-specific info required by a fuzzer should be encoded in the fuzz payload (see the discussion I had in https://github.com/google/oss-fuzz/issues/14437).
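
To illustrate what "encoded in the fuzz payload" could look like, here is a minimal sketch in which a small fixed header in front of the page bytes carries the metadata a page decoder would need. Everything in it is an assumption made for illustration: the 4-byte header layout, the tag ranges, and the names PagePayload and ParsePagePayload are hypothetical and do not exist in Parquet C++. Only the LLVMFuzzerTestOneInput entry point is the standard libFuzzer convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical header layout (NOT part of Parquet or of any Arrow API):
//   byte 0: physical type tag
//   byte 1: encoding tag
//   byte 2: max definition level
//   byte 3: max repetition level
//   remaining bytes: the raw page body
struct PagePayload {
  uint8_t physical_type;
  uint8_t encoding;
  uint8_t max_def_level;
  uint8_t max_rep_level;
  const uint8_t* page_data;
  size_t page_size;
};

constexpr size_t kHeaderSize = 4;
constexpr uint8_t kNumPhysicalTypes = 8;  // assumed tag range
constexpr uint8_t kNumEncodings = 10;     // assumed tag range

// Split the fuzz input into metadata and page bytes.
// Returns std::nullopt when the header is truncated or a tag is out of range.
std::optional<PagePayload> ParsePagePayload(const uint8_t* data, size_t size) {
  if (size < kHeaderSize) return std::nullopt;
  if (data[0] >= kNumPhysicalTypes || data[1] >= kNumEncodings) {
    return std::nullopt;
  }
  return PagePayload{data[0], data[1], data[2], data[3],
                     data + kHeaderSize, size - kHeaderSize};
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  std::optional<PagePayload> payload = ParsePagePayload(data, size);
  if (!payload) return 0;  // reject malformed headers early
  // A real target would build a column descriptor from the header fields
  // and hand page_data to the page decoder; that part is omitted here.
  return 0;
}
```

With such a layout, each seed file would need the same header prepended, so the seed corpus would have to be generated specifically for this target.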

I am also toying with the idea of an encoding/decoding fuzzer that roundtrips data (see "function/inverse pairs" in https://blog.regehr.org/archives/856). The question becomes in which format the fuzzer would accept input data for the encoding step (as Parquet files, which would mean a decoding/encoding/decoding roundtrip? as Arrow IPC files, which are a simpler format?).
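
As a rough sketch of the function/inverse-pair idea, the fuzz target below encodes its input, decodes the result, and aborts if the roundtrip does not reproduce the original values. The codec is a toy run-length scheme standing in for a real encoder/decoder pair; ToyEncode and ToyDecode are hypothetical names, not Parquet C++ functions, and the input is taken as raw bytes, which sidesteps the input-format question above rather than answering it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Toy run-length encoder (assumption: stands in for a real Parquet encoder).
// Output is a sequence of (run length, value) byte pairs, runs capped at 255.
std::vector<uint8_t> ToyEncode(const std::vector<uint8_t>& values) {
  std::vector<uint8_t> out;
  size_t i = 0;
  while (i < values.size()) {
    size_t run = 1;
    while (i + run < values.size() && values[i + run] == values[i] &&
           run < 255) {
      ++run;
    }
    out.push_back(static_cast<uint8_t>(run));
    out.push_back(values[i]);
    i += run;
  }
  return out;
}

// Inverse of ToyEncode: expand each (run length, value) pair.
std::vector<uint8_t> ToyDecode(const std::vector<uint8_t>& encoded) {
  std::vector<uint8_t> out;
  for (size_t i = 0; i + 1 < encoded.size(); i += 2) {
    out.insert(out.end(), static_cast<size_t>(encoded[i]), encoded[i + 1]);
  }
  return out;
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  std::vector<uint8_t> values(data, data + size);
  // The property under test: decode(encode(x)) == x for every input x.
  if (ToyDecode(ToyEncode(values)) != values) {
    std::abort();  // the fuzzer reports this as a crash
  }
  return 0;
}
```

The appeal of this kind of target is that it needs no reference output: any input that breaks the roundtrip is a bug in either the encoder or the decoder.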


Regards

Antoine.


On 14/12/2025 at 07:11, Micah Kornfield wrote:

Hi Antoine,
I haven't looked at the seed corpus in a while, but one idea could be to
make sure we have seeds that fuzz on columns of a single type +
encoding, to lower the search space the encoder would need to find issues
with any specific encoding (another approach would be to potentially have a
page-level fuzzer).

Cheers,
Micah

On Mon, Dec 8, 2025 at 1:10 AM Antoine Pitrou <[email protected]> wrote:


Hello,

We have been fuzzing the C++ Parquet reader for years as part of fuzzing
Arrow C++ on OSS-Fuzz (1). This has helped us find dozens of issues and
make the Parquet reader more robust against fringe cases and corrupt
or invalid files.

However, the fuzzing setup had remained largely unchanged, despite the
Parquet reader accruing additional features and complexity.

Recently, my employer QuantStack secured some funding from the Sovereign
Tech Fund for various initiatives on the Arrow project (2). One of them
is to improve the fuzzing setup, and part of that is to improve the
Parquet fuzz target.

The work has already started and we have integrated a number of changes
to test more features and variations, and expand our seed corpus. For
example, we will now be able to fuzz the reading of Parquet encrypted
files (3).

We welcome any suggestions for further improvements on Parquet fuzzing.

Regards

Antoine.


(1) https://arrow.apache.org/docs/developers/cpp/fuzzing.html

(2) https://medium.com/@QuantStack/sovereign-tech-agency-invests-in-apache-arrows-future-with-quantstack-d2f84c21c2cc

(3) https://github.com/apache/arrow/pull/48336
