I think this is a very interesting idea.  It could open the door to things
like adding compute kernels for these compressed representations to Arrow or
DataFusion, though it isn't without some challenges.

> It seems FSSTStringVector/Array could potentially be modelled
> as an extension type
> ...
> This would however require a fixed dictionary, so might not
> be desirable.
> ...
> ALPFloatingPointVector and bit-packed vectors/arrays are more challenging
> to represent as extension types.
> ...
> Each batch of values has a different metadata parameter set.

I think these are basically the same problem.  From what I've seen in
implementations, a format will typically introduce some kind of small-batch
concept (every 1024 values in the FastLanes paper, IIRC).  So either we need
an individual record batch for each small batch (in which case the Arrow
representation is more straightforward but the batches are quite small) or we
need some concept of a batched array in Arrow.  If we go with individual
record batches per small batch, that requires the batch sizes (in number of
rows) to be consistent across columns, and I don't know if that's always
true.
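
To make that concrete, here is a rough pyarrow sketch of the "one record
batch per small batch" option.  The metadata key ("alp.exponent") is a made-up
placeholder for whatever per-batch parameters an ALP-style encoding would
actually carry:

    import pyarrow as pa

    # A column of floats that an ALP-style encoding would compress in
    # mini-batches of 1024 values, each with its own parameters.
    table = pa.table({"price": pa.array([1.25, 2.5, 3.75] * 1000,
                                        type=pa.float64())})

    batches = []
    for i, batch in enumerate(table.to_batches(max_chunksize=1024)):
        # Each 1024-row record batch carries its own (hypothetical)
        # encoding parameters in the batch-level metadata.
        batches.append(
            batch.replace_schema_metadata({"alp.exponent": str(2 + i)}))

    for b in batches:
        print(b.num_rows, b.schema.metadata)

The obvious downside is what I mentioned above: every column in the stream is
forced onto the same 1024-row boundaries.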

> One of the discussion items is to allow late materialization: to allow
> keeping data in encoded format beyond the filter stage (for example in
> Datafusion).

> Vortex seems to show that it is possible to support advanced
> encodings (like ALP, FSST, or others) by separating the logical type
> from the physical encoding.

Pierre brings up another challenge in achieving this goal, which may be more
significant.  The compression and encoding techniques typically vary from
page to page within Parquet (this is even more true in formats like FastLanes
and Vortex).  A column might use ALP for one page and then PLAIN encoding for
the next.  This makes it difficult to represent a stream of data with the
typical Arrow schema we have today.  I think we would need a "semantic
schema" or "logical schema" that indicates the logical type but not the
physical representation.  Still, that can be an orthogonal discussion to the
FSST and ALP representation.
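
As a rough illustration of why this is awkward today (pyarrow, with
dictionary encoding standing in for an advanced encoding): two batches can
hold the same logical string data in different physical types, so they cannot
share a single stream schema, whereas a "logical schema" would only say
"column s is a string":

    import pyarrow as pa

    # Same logical data, two different physical representations.
    plain = pa.RecordBatch.from_arrays(
        [pa.array(["a", "b", "a"])], names=["s"])
    encoded = pa.RecordBatch.from_arrays(
        [pa.array(["a", "b", "a"]).dictionary_encode()], names=["s"])

    print(plain.schema.field("s").type)    # string
    print(encoded.schema.field("s").type)  # dictionary<values=string, ...>

    # One IPC stream schema cannot describe both batches today.
    print(plain.schema.equals(encoded.schema))  # False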

>  We could also experiment with Opaque vectors.

This could be an interesting approach too.  I don't know if they could be
entirely opaque, though.  Arrow users typically expect to be able to perform
operations like "slice" and "take", which require some knowledge of the
underlying type.  Do you think we would come up with a semi-opaque array that
could be sliced?  Or would we introduce the concept of an unsliceable array?
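
For what it's worth, here is a rough pyarrow sketch of what a "semi-opaque"
array could look like: the per-value payload is opaque bytes, but because the
storage is a plain binary array, generic slice/take can still operate on
offsets without understanding the payload.  The type name and the (empty)
serialized metadata are made up for illustration:

    import pyarrow as pa

    class OpaqueFsstType(pa.ExtensionType):
        """Hypothetical extension type: per-value opaque compressed bytes."""

        def __init__(self):
            # A real design would carry the symbol table / parameters in the
            # serialized type metadata instead of an empty payload.
            super().__init__(pa.binary(), "example.opaque_fsst")

        def __arrow_ext_serialize__(self):
            return b""

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls()

    pa.register_extension_type(OpaqueFsstType())

    storage = pa.array([b"\x01\x02", b"\x03", b"\x04\x05"], type=pa.binary())
    arr = pa.ExtensionArray.from_storage(OpaqueFsstType(), storage)

    print(arr.slice(1, 2))             # works: slicing only adjusts offsets
    print(arr.take(pa.array([2, 0])))  # operates on the storage; the payload
                                       # itself stays opaque

An array whose buffers are not value-addressable at all would not get
slice/take for free like this, which I think is where the "unsliceable array"
question comes in.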


On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]> wrote:

> Hi all,
>
> I am relatively new to this space, so I apologize if I am missing some
> context or history here. I wanted to share some observations from what I
> see happening with projects like Vortex.
>
> Vortex seems to show that it is possible to support advanced encodings
> (like ALP, FSST, or others) by separating the logical type from the
> physical encoding. If the consumer engine supports the advanced encoding,
> it stays compressed and fast. If not, the data is "canonicalized" to
> standard Arrow arrays at the edge.
>
> As Parquet adopts these novel encodings, the current Arrow approach forces
> us to "densify" or decompress data immediately, even if the engine could
> have operated on the encoded data.
>
> Is there a world where Arrow could offer some sort of negotiation
> mechanism? The goal would be to guarantee the data can always be read as
> standard "safe" physical types (paying a cost only at the boundary), while
> allowing systems that understand the advanced encoding to let the data flow
> through efficiently.
>
> This sounds like it keeps the safety of interoperability - Arrow making
> sure new encodings have a canonical representation - and it leaves the onus
> of implementing the efficient flow to the consumer - decoupling efficiency
> from interoperability.
>
> Thanks !
>
> Pierre
>
> On 2025/12/11 06:49:30 Micah Kornfield wrote:
> > I think this is an interesting idea.  Julien, do you have a proposal for
> > scope?  Is the intent to be 1:1 with any new encoding that is added to
> > Parquet?  For instance would the intent be to also put cascading encodings
> > in Arrow?
> >
> > We could also experiment with Opaque vectors.
> >
> >
> > Did you mean this as a new type? I think this would be necessary for ALP.
> >
> > It seems FSSTStringVector/Array could potentially be modelled as an
> > extension type (dictionary stored as part of the type metadata?) on top of
> > a byte array. This would however require a fixed dictionary, so might not
> > be desirable.
> >
> > ALPFloatingPointVector and bit-packed vectors/arrays are more challenging
> > to represent as extension types.
> >
> > 1.  There is no natural alignment with any of the existing types (and the
> > bit-packing width can effectively vary by batch).
> > 2.  Each batch of values has a different metadata parameter set.
> >
> > So it seems there is no easy way out for the ALP encoding and we either
> > need to pay the cost of adding a new type (which is not necessarily
> > trivial) or we would have to do some work to literally make a new opaque
> > "Custom" Type, which would have a buffer that is only interpretable based
> > on its extension type.  An easy way of shoe-horning this in would be to add
> > a ParquetScalar extension type, which simply contains the decompressed but
> > encoded Parquet page with repetition and definition levels stripped out.
> > The latter also has its obvious down-sides.
> >
> > Cheers,
> > Micah
> >
> > [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
> > [2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> >
> > On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
> >
> > > I forgot to mention that those encodings have the particularity of
> > > allowing random access without decoding previous values.
> > >
> > > On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
> > >
> > > > Hello,
> > > > Parquet is in the process of adopting new encodings [1] (Currently in
> > > > POC stage), specifically ALP [2] and FSST [3].
> > > > One of the discussion items is to allow late materialization: to allow
> > > > keeping data in encoded format beyond the filter stage (for example in
> > > > Datafusion).
> > > > There are several advantages to this:
> > > > - For example, if I summarize FSST as a variation of dictionary
> > > > encoding on substrings in the values, one can evaluate some operations
> > > > on encoded values without decoding them, saving memory and CPU.
> > > > - Similarly, simplifying for brevity, ALP converts floating point
> > > > values to small integers that are then bitpacked.
> > > > The Vortex project argues that keeping encoded values in in-memory
> > > > vectors opens up opportunities for performance improvements. [4] a
> > > > third party blog argues it's a problem as well [5]
> > > >
> > > > So I wanted to start a discussion to suggest, we might consider adding
> > > > some additional vectors to support such encoded Values like an
> > > > FSSTStringVector for example. This would not be too different from the
> > > > dictionary encoding, or an ALPFloatingPointVector with a bit packed
> > > > scheme not too different from what we use for nullability.
> > > > We could also experiment with Opaque vectors.
> > > >
> > > > For reference, similarly motivated improvements have been done in the
> > > > past [6]
> > > >
> > > > Thoughts?
> > > >
> > > > See:
> > > > [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> > > > [2] https://github.com/apache/arrow/pull/48345
> > > > [3] https://github.com/apache/arrow/pull/48232
> > > > [4] https://docs.vortex.dev/#in-memory
> > > > [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> > > > [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
> > > >
> > >
> >
>
