Re: Accuracy review on Variant contribution description.

Julien Le Dem Fri, 12 Dec 2025 16:20:54 -0800

The post:
https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
Feel free to reach out if you see errors/omissions.


On Tue, Dec 9, 2025 at 18:47 Julien Le Dem <[email protected]> wrote:

> Hello all,
> I'm writing a blog post on my personal blog and I have a section where I
> use Variant as an example of collaboration (see content below). I'm trying
> to give credit to everyone involved but I'm sure I'm forgetting someone.
> Could you please tell me if you think I should change something or add
> someone? Either on this thread or privately. I'll be happy to fix it.
> (NB: This is not a substitute for a Variant post on the Parquet blog that
> some of you would get the fame of being the author of. nudge nudge :) )
> Thank you!
> The excerpt:
>
>> ## Case Study: The Variant Type
>
>
>> Here is an example of how bigger changes make their way into Parquet:
>> about a year ago, engineers made an initial proposal to find a neutral home
>> for the [variant type](
>> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
>> that was [at the time in Spark](
>> https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md).
>> Variant is akin to a binary representation of JSON. It separates the field
>> names in one column and the values in another. You can selectively shred a
>> subset of the fields into their own columns. It is useful when you have
>> unknown field cardinality or too many sparse fields in your data.
>> The big question was [whether this new type should be defined in Spark,
>> Arrow, Iceberg or Parquet](
>> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z). What
>> made the most sense, knowing that all of those projects (and more) would
>> end up using it?
>
>
>> We agreed to put it in Parquet. Then we worked as a community to
>> [finalize a consensus on the spec](
>> https://lists.apache.org/thread/obn1yzhgm5zlznwrdpg7f66mswwooxw7). We
>> needed to make sure everybody was on the same page. We changed a few
>> things, made sure we all agreed, and then implemented it across the
>> ecosystem. (Thanks to Gang, Aihua, Gene, Micah, Andrew, Ryan, Yufei,
>> Jiaying, Martin, Aditya, Matt, Antoine, Daniel, Russell and many others.)
>
>
>> The community produced multiple implementations in multiple systems, open
>> source or not, and collaborated on cross-compatibility tests to make sure we
>> were building compatible systems. This included individuals from
>> Databricks, Snowflake, Google, Tabular, Datadog, CMU, InfluxData, Dremio
>> and more (I'm sorry if I forgot you; please reach out and I'll add you
>> here).
>
>
>> Now we know that when a Variant is written in one system, it's going to
>> be read correctly in another. From Databricks to Snowflake and BigQuery, and
>> from DataFusion to DuckDB and Spark: no surprises. (And Dremio, and
>> InfluxDB, etc.)
>
>
