I worked with Julien on a version of this post for a more academic audience that I wanted to share with you.
In the presentation, I tried to delve deeper into which "AI" use cases are driving the pressure for new features in Parquet. I try to explain "feature store", RAG, and vector search as use cases motivating new Parquet features such as faster metadata parsing, single-row lookups, and SIMD/GPU-friendly encodings.

recording: https://www.youtube.com/watch?v=k9uhw7yqPsQ
slides: https://docs.google.com/presentation/d/19F-XvNJ8sgIpIeIduA3PhbsWp4pC-P632J2eJV1cLG8

Andrew

On Fri, Dec 12, 2025 at 7:20 PM Julien Le Dem <[email protected]> wrote:

> The post:
> https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
> Feel free to reach out if you see errors/omissions.
>
> On Tue, Dec 9, 2025 at 18:47 Julien Le Dem <[email protected]> wrote:
>
> > Hello all,
> > I'm writing a blog post on my personal blog and I have a section where I
> > use Variant as an example of collaboration (see content below). I'm trying
> > to give credit to everyone involved but I'm sure I'm forgetting someone.
> > Could you please tell me if you think I should change something or add
> > someone? Either on this thread or privately. I'll be happy to fix it.
> > (NB: This is not a substitute for a Variant post on the Parquet blog that
> > some of you would get the fame of being the author of. nudge nudge :) )
> > Thank you!
> > The excerpt:
> >
> >> ## Case Study: The Variant Type
> >
> >> To give you an example of how bigger changes make their way into Parquet,
> >> about a year ago, engineers made an initial proposal to find a neutral home
> >> for the [variant type](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
> >> that was [at the time in Spark](https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md).
> >> Variant is akin to a binary representation of JSON. It separates the field
> >> names in one column and the values in another. You can selectively shred a
> >> subset of the fields into their own column. It is useful when you have
> >> unknown field cardinality or too many sparse fields in your data.
> >> The big question was [whether this new type should be defined in Spark,
> >> Arrow, Iceberg or Parquet](https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z).
> >> What made the most sense, knowing that all of those projects (and more)
> >> would end up using it?
> >
> >> We agreed to put it in Parquet. Then we worked as a community to
> >> [finalize a consensus on the spec](https://lists.apache.org/thread/obn1yzhgm5zlznwrdpg7f66mswwooxw7).
> >> We needed to make sure everybody was on the same page. We changed a few
> >> things, made sure we all agreed, and then implemented it across the
> >> ecosystem. (Thanks to Gang, Aihua, Gene, Micah, Andrew, Ryan, Yufei,
> >> Jiaying, Martin, Aditya, Matt, Antoine, Daniel, Russell and many others)
> >
> >> The community produced multiple implementations in multiple systems, open
> >> source or not and collaborated on cross-compatibility tests to make sure we
> >> were building compatible systems. This included individuals from
> >> Databricks, Snowflake, Google, Tabular, Datadog, CMU, InfluxData, Dremio
> >> and more (I'm sorry, if I forgot you, please reach out and I'll add you
> >> here).
> >
> >> Now we know that when a Variant is written in one system, it's going to
> >> be read correctly in another. From Databricks to Snowflake and BigQuery and
> >> from Datafusion to Duckdb and Spark, No surprises. (And Dremio, and
> >> InfluxDB, etc)
