I worked with Julien on a version of this post for a more academic audience
that I wanted to share with you.

In the presentation, I tried to delve deeper into which "AI" use cases are
driving the pressure for new features in Parquet. I explain "feature
store", RAG, and vector search as use cases motivating new Parquet features
such as faster metadata parsing, single-row lookups, and SIMD/GPU-friendly
encodings.
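To make the "single-row lookup" point concrete: Parquet files carry footer metadata with per-row-group statistics, so a point lookup can skip most of the file without scanning it. Here is a toy sketch of that idea in plain Python; the dict layout, function names, and `id`/`v` fields are all hypothetical illustrations, not Parquet's actual format or API:

```python
# Conceptual sketch (NOT the real Parquet layout or API): a "file" is a
# list of row groups, each carrying min/max statistics for a key column,
# as Parquet footers do. A point lookup reads only the stats, prunes
# row groups whose range excludes the key, and scans the one that matches.

def write_file(rows, group_size=2):
    """Split rows into fixed-size row groups with min/max key stats."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        keys = [r["id"] for r in chunk]
        groups.append({"min": min(keys), "max": max(keys), "rows": chunk})
    return {"groups": groups}  # footer-style stats plus the data

def lookup(file, key):
    """Point lookup: prune with footer stats, scan only matching groups."""
    for g in file["groups"]:
        if g["min"] <= key <= g["max"]:  # cheap metadata check
            for row in g["rows"]:
                if row["id"] == key:
                    return row
    return None

f = write_file([{"id": i, "v": i * 10} for i in range(6)])
print(lookup(f, 3))  # {'id': 3, 'v': 30}
```

The cost of the metadata check is why faster footer parsing matters: a workload doing millions of point lookups pays it on every query.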

recording: https://www.youtube.com/watch?v=k9uhw7yqPsQ
slides:
https://docs.google.com/presentation/d/19F-XvNJ8sgIpIeIduA3PhbsWp4pC-P632J2eJV1cLG8

Andrew

On Fri, Dec 12, 2025 at 7:20 PM Julien Le Dem <[email protected]> wrote:

> The post:
> https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
> Feel free to reach out if you see errors/omissions.
>
> On Tue, Dec 9, 2025 at 18:47 Julien Le Dem <[email protected]> wrote:
>
> > Hello all,
> > I'm writing a blog post on my personal blog and I have a section where I
> > use Variant as an example of collaboration (see content below). I'm
> trying
> > to give credit to everyone involved but I'm sure I'm forgetting someone.
> > Could you please tell me if you think I should change something or add
> > someone? Either on this thread or privately. I'll be happy to fix it.
> > (NB: This is not a substitute for a Variant post on the Parquet blog that
> > some of you would get the fame of being the author of. nudge nudge :) )
> > Thank you!
> >  The excerpt:
> >
> >> ## Case Study: The Variant Type
> >
> >
> >> To give you an example of how bigger changes make their way into Parquet,
> >> about a year ago, engineers made an initial proposal to find a neutral home
> >> for the [variant type](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
> >> that was [at the time in Spark](https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md).
> >> Variant is akin to a binary representation of JSON. It separates the field
> >> names in one column and the values in another. You can selectively shred a
> >> subset of the fields into their own columns. It is useful when you have
> >> unknown field cardinality or too many sparse fields in your data.
> >> The big question was [whether this new type should be defined in Spark,
> >> Arrow, Iceberg or Parquet](https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z).
> >> What made the most sense, knowing that all of those projects (and more)
> >> would end up using it?
> >
> >
> >> We agreed to put it in Parquet. Then we worked as a community to
> >> [finalize a consensus on the spec](
> >> https://lists.apache.org/thread/obn1yzhgm5zlznwrdpg7f66mswwooxw7). We
> >> needed to make sure everybody was on the same page. We changed a few
> >> things, made sure we all agreed, and then implemented it across the
> >> ecosystem. (Thanks to Gang, Aihua, Gene, Micah, Andrew, Ryan, Yufei,
> >> Jiaying, Martin, Aditya, Matt, Antoine, Daniel, Russell and many others)
> >
> >
> >> The community produced multiple implementations in multiple systems,
> >> open source or not, and collaborated on cross-compatibility tests to make
> >> sure we were building compatible systems. This included individuals from
> >> Databricks, Snowflake, Google, Tabular, Datadog, CMU, InfluxData, Dremio
> >> and more (I'm sorry if I forgot you, please reach out and I'll add you
> >> here).
> >
> >
> >> Now we know that when a Variant is written in one system, it's going to
> >> be read correctly in another. From Databricks to Snowflake and BigQuery,
> >> and from DataFusion to DuckDB and Spark, no surprises. (And Dremio, and
> >> InfluxDB, etc.)
> >
> >
>
