Thank you -- I really like the idea and left some comments.

> The core problem is writer memory pressure caused by wide schemas and
> asymmetric column sizes.

I think another important problem that this proposal addresses is that the
current layout can force scattered reads when doing point lookups, which
are increasingly important for AI workloads.

Weston Pace, from LanceDB, also gave a talk recently [1] about the
challenges of using Parquet that basically describes these same issues, in
case anyone would like a higher-bandwidth (and amusing) take on the topic.

[1] https://www.youtube.com/watch?v=fDrmfDuPK3s

Andrew


On Wed, May 6, 2026 at 3:27 AM Will Edwards via dev <[email protected]>
wrote:

> Hi Daniel,
>
> Interesting problem, it's good that we are thinking about this.
>
> Memory pressure is definitely a problem, particularly for row-wise writers
> that must buffer all the column chunks for the row group before writing
> them.
>
> Trivia: Most open-source column-wise writers I've seen also buffer all the
> column chunks for the row group before writing them, although they don't
> strictly need to.  DuckDB has a really interesting hybrid that serializes
> eight columns at a time, which I'd really like to understand better.
> Arrow-cpp's WriteTable processes one column at a time, which offers memory
> advantages.
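>
> To make the buffering point concrete, here is a minimal sketch (plain
> Python, not any real writer's API) of the usual pattern, where nothing can
> leave the writer until every column chunk of the row group is assembled:
>
>     import io, zlib
>
>     # Hypothetical row-group writer: each column chunk must land as one
>     # contiguous byte range, so every chunk is held until the end and peak
>     # memory is roughly the sum of all chunks in the row group.
>     def write_row_group(out, columns):
>         # columns: list of lists of already-encoded page payloads (bytes)
>         chunks = []
>         for pages in columns:
>             compressed = [zlib.compress(p) for p in pages]  # ready early...
>             chunks.append(b"".join(compressed))             # ...but still held
>         for chunk in chunks:                                # only now flushed
>             out.write(chunk)
>
>     # toy usage: a "wide schema" of 1000 single-page columns
>     sink = io.BytesIO()
>     write_row_group(sink, [[b"x" * 1024] for _ in range(1000)])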
>
> Also, many modern writers split into row groups based on row count rather
> than byte size; of those that do split by size, some use the compressed
> size while others use the uncompressed size.  Either way, they generally
> aren't prepared for rows with very large values in particular columns.
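>
> A tiny sketch of those split policies (invented names and thresholds, not
> taken from any particular writer), showing why a row-count check never
> notices a handful of huge values:
>
>     ROW_LIMIT = 1_000_000              # "split by row count"
>     BYTE_LIMIT = 128 * 1024 * 1024     # "split by (un)compressed size"
>
>     def should_flush(buffered_rows: int, buffered_bytes: int) -> bool:
>         # A pure row-count policy ignores buffered_bytes entirely, so one
>         # column of multi-megabyte values can blow far past any byte budget.
>         return buffered_rows >= ROW_LIMIT or buffered_bytes >= BYTE_LIMIT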
>
> Of course, the row group size is not just about the writer's memory
> management; it's also a contract with the reader about how much memory the
> reader will need.  A writer that can efficiently produce very large row
> groups can easily create Parquet files that even mainstream readers with
> ample RAM cannot manage to read.
>
> One implementation detail concerns using `data_page_offset = -1` as a
> marker.  Most open-source readers I've seen fail (often with an I/O error)
> if you access such a column.  Arrow Rust will actually panic if any column
> in the footer has data_page_offset = -1, even if that column isn't in the
> projection (and a panic can map to SIGABRT in Rust, depending on the panic
> strategy the binary was built with).  DuckDB silently corrupts data.
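>
> For what it's worth, the sentinel is easy to spot from the footer before
> touching any column data; a minimal sketch using pyarrow (the file name is
> hypothetical), just to see what a writer actually recorded:
>
>     import pyarrow.parquet as pq
>
>     md = pq.read_metadata("example.parquet")
>     for rg in range(md.num_row_groups):
>         row_group = md.row_group(rg)
>         for ci in range(row_group.num_columns):
>             col = row_group.column(ci)
>             if col.data_page_offset < 0:
>                 # the readers above tend to error, panic, or worse on these
>                 print("sentinel offset in column:", col.path_in_schema)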
>
> Other approaches might exist.  For example, there could be a new DataPageV3
> (yes, the point is precisely that readers which don't know about it fail
> when they encounter it!).  DataPageV3 wouldn't store data inline; instead,
> it would contain the OffsetIndex (relaxing the expectation, which seems a
> strange insistence in the spec, that the OffsetIndex sits outside the row
> groups).  Engines unaware of DataPageV3 that always seek via the offset
> index would still reach the real data.  Most don't, though, so they fail.
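>
> To make the "always seek via the OffsetIndex" idea concrete, the structure
> the format already defines looks roughly like this (field names are from
> the Thrift spec; the lookup function is just my sketch):
>
>     from bisect import bisect_right
>     from dataclasses import dataclass
>
>     @dataclass
>     class PageLocation:              # one entry per page in the OffsetIndex
>         offset: int                  # absolute byte offset of the page
>         compressed_page_size: int
>         first_row_index: int
>
>     def page_for_row(locations: list[PageLocation], row: int) -> PageLocation:
>         # A reader that resolves every page through these offsets never
>         # assumes pages are contiguous, so relocated pages are still found.
>         i = bisect_right([loc.first_row_index for loc in locations], row) - 1
>         return locations[i]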
>
> It's easy to imagine future Parquet files that have one big row group and
> discontinuous column chunks.  It's basically halfway to Lance?
>
> Many engines today pay no attention to page indexes, or if they do, they
> use them for stats rather than for seeking.  The new discontinuous column
> chunks might force a significant architectural change for those engines.
>
> Fun thinking about this kind of stuff,
> Will
>
>
> On Tue, 5 May 2026 at 01:18, Daniel Weeks <[email protected]>
> wrote:
>
> > Hey Parquet Devs,
> >
> > I would like to introduce a proposal that addresses the issues arising
> > from the physical layout requirements in the Parquet format that force
> > columnar data to be stored contiguously.
> >
> > Over the years, several improvements were introduced to solve other
> > challenges, and together they effectively capture the information Parquet
> > needs to lift the contiguity requirement on pages and column chunks.
> >
> > Other formats recognize these challenges and embrace a model where
> > individual column segments are tracked at the metadata level but do not
> > rely on physical contiguity in the file.
> >
> > The core problem is writer memory pressure caused by wide schemas and
> > asymmetric column sizes. Today a writer must buffer every column chunk in
> > memory until a row group is complete, because each column chunk must land
> > as a single contiguous byte range. For wide schemas, or schemas mixing
> > small fixed-width columns with very large variable-length values, this
> > can drive high memory usage even when individual pages are fully encoded,
> > compressed, and ready to flush, or it can result in row groups being
> > produced at inconsistent or inefficient boundaries.
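> >
> > As a rough, made-up illustration (ignoring encoding and compression): a
> > schema with 2,000 small 8-byte columns plus one ~4 KB embedding column,
> > written in 1M-row row groups, forces the writer to hold on the order of
> >
> >     2_000 * 1_000_000 * 8      # ~16 GB of small fixed-width columns
> >     + 1_000_000 * 4_096        # ~4 GB for the single large column
> >
> > bytes in memory before a single byte of the row group can be flushed.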
> >
> > This characteristic is more pronounced for emerging AI/ML use cases that
> > rely on data types and sizes atypical for traditional analytic use cases.
> >
> > The document linked below includes a comprehensive proposal. Looking
> > forward to your feedback.
> >
> > Proposal:
> > https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA
> >
> > Thanks,
> > Dan
> >
>
