Hey Parquet Devs,

I would like to introduce a proposal that addresses issues arising from the
Parquet format's physical layout requirement that column data be stored
contiguously.

Over the years, several improvements introduced to solve other challenges
have, in effect, already captured the information Parquet needs to lift the
contiguity requirement on pages and column chunks.

Other formats recognize these challenges and embrace a model where
individual column segments are tracked at the metadata level without
relying on physical contiguity in the file.

The core problem is writer memory pressure caused by wide schemas and
asymmetric column sizes. Today a writer must buffer every column chunk in
memory until a row group is complete, because each column chunk must land
as a single contiguous byte range. For wide schemas, or schemas mixing
small fixed-width columns with very large variable-length values, this can
drive high memory usage even when individual pages are fully encoded,
compressed, and ready to flush, or it can result in row groups being
produced at inconsistent or inefficient boundaries.
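To make the buffering concrete, below is a minimal, hypothetical sketch (not
code from parquet-java; the class and method names are invented for
illustration) of the state a writer has to hold today: every encoded page of
every column stays in memory until the row group closes, because each column
chunk must be written as one contiguous byte range.

import java.util.ArrayList;
import java.util.List;

class ContiguousRowGroupWriter {
    // One in-memory buffer of encoded/compressed pages per column. None of
    // them can be flushed until the whole row group closes, because each
    // column chunk must land as a single contiguous range in the file.
    private final List<List<byte[]>> bufferedPages = new ArrayList<>();

    ContiguousRowGroupWriter(int numColumns) {
        for (int i = 0; i < numColumns; i++) {
            bufferedPages.add(new ArrayList<>());
        }
    }

    // Pages arrive fully encoded and ready to write, yet must stay resident.
    void addEncodedPage(int column, byte[] page) {
        bufferedPages.get(column).add(page);
    }

    // Peak memory is the sum of all buffered pages across all columns; a few
    // large variable-length columns dominate even when most columns are tiny.
    long bufferedBytes() {
        long total = 0;
        for (List<byte[]> pages : bufferedPages) {
            for (byte[] p : pages) total += p.length;
        }
        return total;
    }

    // Only at row-group close can the columns be written out, one contiguous
    // column chunk after another.
    void flushRowGroup(java.io.OutputStream out) throws java.io.IOException {
        for (List<byte[]> pages : bufferedPages) {
            for (byte[] p : pages) out.write(p);
            pages.clear();
        }
    }
}

If page flushing were decoupled from column-chunk contiguity, the large
buffers above could be written out as soon as they are ready rather than held
until row-group close.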

This pressure is even more pronounced for emerging AI/ML use cases, which
rely on data types and sizes atypical of traditional analytic workloads.

The document linked below includes a comprehensive proposal. Looking
forward to your feedback.

Proposal:
https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA

Thanks,
Dan
