The post: https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
Feel free to reach out if you see errors/omissions.

On Tue, Dec 9, 2025 at 18:47 Julien Le Dem <[email protected]> wrote:
> Hello all,
> I'm writing a blog post on my personal blog and I have a section where I
> use Variant as an example of collaboration (see content below). I'm trying
> to give credit to everyone involved but I'm sure I'm forgetting someone.
> Could you please tell me if you think I should change something or add
> someone? Either on this thread or privately. I'll be happy to fix it.
> (NB: This is not a substitute for a Variant post on the Parquet blog that
> some of you would get the fame of being the author of. nudge nudge :) )
> Thank you!
> The excerpt:
>
>> ## Case Study: The Variant Type
>
>> To give you an example of how bigger changes make their way into Parquet,
>> about a year ago, engineers made an initial proposal to find a neutral home
>> for the [variant type](
>> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
>> that was [at the time in Spark](
>> https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md).
>> Variant is akin to a binary representation of JSON. It separates the field
>> names in one column and the values in another. You can selectively shred a
>> subset of the fields into their own column. It is useful when you have
>> unknown field cardinality or too many sparse fields in your data.
>> The big question was [whether this new type should be defined in Spark,
>> Arrow, Iceberg or Parquet](
>> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z). What
>> made the most sense, knowing that all of those projects (and more) would
>> end up using it?
>
>> We agreed to put it in Parquet. Then we worked as a community to
>> [finalize a consensus on the spec](
>> https://lists.apache.org/thread/obn1yzhgm5zlznwrdpg7f66mswwooxw7). We
>> needed to make sure everybody was on the same page. We changed a few
>> things, made sure we all agreed, and then implemented it across the
>> ecosystem. (Thanks to Gang, Aihua, Gene, Micah, Andrew, Ryan, Yufei,
>> Jiaying, Martin, Aditya, Matt, Antoine, Daniel, Russell and many others)
>
>> The community produced multiple implementations in multiple systems, open
>> source or not, and collaborated on cross-compatibility tests to make sure
>> we were building compatible systems. This included individuals from
>> Databricks, Snowflake, Google, Tabular, Datadog, CMU, InfluxData, Dremio
>> and more (I'm sorry if I forgot you; please reach out and I'll add you
>> here).
>
>> Now we know that when a Variant is written in one system, it's going to
>> be read correctly in another. From Databricks to Snowflake and BigQuery and
>> from DataFusion to DuckDB and Spark, no surprises. (And Dremio, and
>> InfluxDB, etc.)
