> Going out on a limb here, but maybe storing individual values that are
> hundreds of megabytes isn't really the best fit for Parquet files. Or at
> least this isn't a common-enough use case for shared/public files to
> warrant a complicating change in the format.
As Micah indicated, scenarios where values are hundreds of megabytes are
rather extreme (likely degenerate), but many variations involving
asymmetric data or wide tables cause problems. This isn't a new problem,
and the workarounds typically involve undesirable tradeoffs or
regressions.

> Given the requests/proposals of late, I wonder if there isn't good
> reason for someone to come up with another file format that is made
> specifically to handle rows with tons of columns and/or very large
> values.

I would take the opposite position given the same evidence of increased
requests and proposals: Parquet is uniquely positioned to evolve and
address new use cases, and there is growing interest in investing to
improve the format through new data types and optimising the layout. This
proposal is particularly appealing because it requires very limited
changes while building on prior work.

There is significant overlap between use cases where Parquet has
traditionally performed well and emerging use cases spanning analytics
and AI/ML. With a few targeted investments, Parquet can excel across all
these use cases without forcing users to pair tailored formats with
specific data.

Parquet has the right foundation and community to address recent shifts
in data, so I'm optimistic as long as there is a shared willingness to
evolve the format.

-Dan

On Tue, May 5, 2026 at 11:05 AM Micah Kornfield <[email protected]> wrote:

> Thanks Dan, Adrian and Andrew,
>
> Some responses inline.
>
> > I would like to introduce a proposal that addresses the issues arising
> > from the physical layout requirements in the Parquet format that
> > necessitate contiguous data for columnar data.
>
> I left some comments on the doc but IMO this looks promising. I think we
> should see how much churn this actually causes for reference
> implementations (and maybe adjust the approach based on that if
> necessary).
>
> > Going out on a limb here, but maybe storing individual values that are
> > hundreds of megabytes isn't really the best fit for Parquet files. Or
> > at least this isn't a common-enough use case for shared/public files
> > to warrant a complicating change in the format.
>
> The use cases this covers aren't just large blobs. Another more common
> use case is repeated moderately sized data, or at least data with high
> skew. Adrian already touched on this below. Examples that would likely
> benefit from this approach (we should definitely benchmark as we get
> further along): web data, Wikipedia data, LLM chat logs. If you have one
> column that is O(10-100 KB) per row, you end up with 1,200 to 12,000
> rows per row group, which is on the small side for storing/processing
> the smaller columns.
>
> There are different operational trade-offs in general for inlining vs
> referencing, but providing consumers of Parquet flexibility is useful.
>
> > I’d be interested in trying it on our system.
>
> This would be awesome.
>
> > Given the requests/proposals of late, I wonder if there isn't good
> > reason for someone to come up with another file format that is made
> > specifically to handle rows with tons of columns and/or very large
> > values.
>
> The number of proposals to evolve Parquet is a conscious choice that we
> discussed a while ago [1]
> <https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo>
> (admittedly progress has been slow).
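To make the row-group sizing above concrete, a quick back-of-the-envelope
sketch in Python; the 128 MB row-group target is an assumed common writer
default, not a figure given in the thread:

    # Rows per row group when one column dominates the size budget.
    # Assumes a 128 MB row-group target -- a common writer default,
    # not a number taken from this thread.
    ROW_GROUP_TARGET = 128 * 1024 * 1024  # bytes

    for per_row_kb in (10, 100):  # the O(10-100 KB) per-row column
        rows = ROW_GROUP_TARGET // (per_row_kb * 1024)
        print(f"{per_row_kb:>3} KB/row -> ~{rows:,} rows per row group")

    # 10 KB/row -> ~13,107 rows; 100 KB/row -> ~1,310 rows, i.e. roughly
    # the 1,200-12,000 range quoted above. At ~1,300 rows, a plain INT64
    # column contributes only ~10 KB per column chunk, so every narrow
    # column in the schema ends up as many tiny chunks: more metadata
    # overhead, and less efficient IO when scanning just those columns.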
> Similar sentiments about why not just use something else were expressed,
> but there was enough interest back then to at least try to evolve
> Parquet to keep it relevant in the data/AI ecosystem.
>
> Cheers,
> Micah
>
> [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
>
> On Tue, May 5, 2026 at 7:51 AM Adrian Garcia Badaracco via dev <
> [email protected]> wrote:
>
> > I think the point that AI data = “regular” data nowadays and AI data
> > tends to have these large blobs is valid. This seems to be a shift in
> > the industry, e.g. AI observability data = “regular” observability
> > data (https://opentelemetry.io/docs/specs/semconv/gen-ai/) now. I
> > think it’s also true in general that data volumes are growing over
> > time, so what might have been 1 kB values 5 years ago is now 5 MB
> > values.
> >
> > We can say Parquet is not the right format for that, but I think that
> > diminishes the use cases for Parquet in the future. Putting this data
> > in external files and linking to it from Parquet is doable, but adds
> > a lot of complexity to implementations, especially if they want to
> > support queries like `time_col > now() - ‘5 min’ and large_text like
> > ‘%foo%’`.
> >
> > This proposal is interesting but focuses a lot on the write side of
> > things. I’m more interested in the read side, but haven’t really
> > explored how big values impact reading. It seems to me that the
> > problem there would be more along the lines of row group structures,
> > which can force inefficient IO patterns with a mix of small id-like
> > columns and large blobs (I think). The point about offloading to
> > local temp files on disk is interesting. For someone running on fast
> > SSDs that might be a viable solution; I’d be interested in trying it
> > on our system. We have this problem but have mostly solved it by
> > flushing more frequently if there are large blobs. That may be
> > hurting us in other ways, though...
> >
> > > On May 5, 2026, at 9:39 AM, Andrew Bell <[email protected]>
> > > wrote:
> > >
> > > Hi,
> > >
> > > Going out on a limb here, but maybe storing individual values that
> > > are hundreds of megabytes isn't really the best fit for Parquet
> > > files. Or at least this isn't a common-enough use case for
> > > shared/public files to warrant a complicating change in the format.
> > >
> > > Given the requests/proposals of late, I wonder if there isn't good
> > > reason for someone to come up with another file format that is made
> > > specifically to handle rows with tons of columns and/or very large
> > > values.
> > >
> > > On Mon, May 4, 2026 at 7:17 PM Daniel Weeks <[email protected]>
> > > wrote:
> > >
> > > > Hey Parquet Devs,
> > > >
> > > > The core problem is writer memory pressure caused by wide schemas
> > > > and asymmetric column sizes. Today a writer must buffer every
> > > > column chunk in memory until a row group is complete, because
> > > > each column chunk must land as a single contiguous byte range.
> > > > For wide schemas, or schemas mixing small fixed-width columns
> > > > with very large variable-length values, this can drive high
> > > > memory usage even when individual pages are fully encoded,
> > > > compressed, and ready to flush, or it can result in row groups
> > > > being produced at inconsistent or inefficient boundaries.
> > >
> > > --
> > > Andrew Bell
> > > [email protected]
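As a rough illustration of the workaround Adrian describes (flushing more
frequently when large blobs accumulate), here is a minimal pyarrow
sketch; the schema, batch source, and 64 MB threshold are illustrative
assumptions rather than details from the thread:

    # Minimal sketch of the "flush more often when blobs are large"
    # workaround. Schema, batch source, and the 64 MB threshold are
    # illustrative assumptions, not details from the thread.
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        ("id", pa.int64()),
        ("ts", pa.timestamp("us")),
        ("large_text", pa.string()),  # the asymmetric, blob-like column
    ])

    FLUSH_THRESHOLD = 64 * 1024 * 1024  # bytes of buffered blob data

    def write_batches(path, batches):
        """Write an iterable of pa.RecordBatch, cutting a row group early
        whenever the buffered blob column grows past FLUSH_THRESHOLD."""
        with pq.ParquetWriter(path, schema) as writer:
            pending, blob_bytes = [], 0
            for batch in batches:
                pending.append(batch)
                # get_total_buffer_size() can double-count shared buffers,
                # but is a cheap proxy for writer memory held per column.
                blob_bytes += batch.column("large_text").get_total_buffer_size()
                if blob_bytes >= FLUSH_THRESHOLD:
                    # Each write_table call here emits one (small) row group.
                    writer.write_table(pa.Table.from_batches(pending))
                    pending, blob_bytes = [], 0
            if pending:
                writer.write_table(pa.Table.from_batches(pending))

The sketch bounds writer memory but yields the small, frequent row groups
discussed above, which is exactly the tradeoff the proposal aims to
remove by relaxing the requirement that each column chunk land as a
single contiguous byte range.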
