Re: [DISCUSS] Iceberg Variant - Tracking Document & Sync Proposal

Steve Loughran Mon, 20 Apr 2026 13:49:13 -0700

+ regarding the rust, go and cpp impls, a status from each team would be
great!


I've been reviewing the arrow parquet variant work and it is all there,
including some benchmarks and optimisations, which may put it ahead of
the others.

It also has some special handling for sorted variants, as key search there
is straightforward. AFAIK the others don't do that, nor do I see them going
to any effort to sort fields in an object. I think sorting would be good,
but you would have to handle the case where there are duplicate keys. It's
allowed in the spec, and seems like it could creep in from nested variants.
Has anyone looked at this?
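
To make the duplicate-key concern concrete, here is a minimal sketch of a
binary search over sorted field names that tolerates duplicates. It is not
taken from any of the implementations: the parallel-list layout, the
function names and the "pick" policy are all hypothetical, and which
duplicate should win is an assumption rather than a statement of what the
spec requires.

```python
from bisect import bisect_left, bisect_right


def find_key_range(sorted_keys, key):
    """Return the half-open index range [lo, hi) of entries equal to key."""
    lo = bisect_left(sorted_keys, key)
    # Everything before lo is < key, so the second search can start there.
    hi = bisect_right(sorted_keys, key, lo)
    return lo, hi


def lookup(sorted_keys, values, key, pick="last"):
    """Look up key in parallel sorted_keys/values lists.

    pick is a hypothetical policy knob: which duplicate wins is an
    assumption of this sketch, not something taken from the spec.
    """
    lo, hi = find_key_range(sorted_keys, key)
    if lo == hi:
        return None  # key is absent
    return values[hi - 1] if pick == "last" else values[lo]


keys = ["a", "b", "b", "c"]
vals = [1, 2, 3, 4]
print(find_key_range(keys, "b"))           # (1, 3)
print(lookup(keys, vals, "b"))             # 3
print(lookup(keys, vals, "b", pick="first"))  # 2
```

A plain "stop at the first hit" binary search would return an arbitrary one
of the duplicates, so results could differ between readers; returning the
whole range makes the policy an explicit choice.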

Also: has anyone created malformed parquet files with a shredded variant
and a metadata entry of the same name? The requirement is "ignore the
metadata one", but that's something to test. You'd have to write a shredded
file and then edit the binary content to achieve this, or manually create
one and put it into the parquet-testing repository under bad-data/.
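
The expected reader behaviour can be modelled as a toy with plain dicts. This
is a sketch only: it assumes "the metadata one" means the copy of the field
that is not in the shredded column, and reconstruct_object and its two
arguments are hypothetical names, not an API from any parquet or iceberg
library.

```python
def reconstruct_object(shredded_fields, residual_fields):
    """Merge the two halves of a partially shredded object.

    shredded_fields: dict of field name -> value read from shredded columns.
    residual_fields: dict of field name -> value decoded from the
        non-shredded part of the variant.
    Returns (merged, conflicts); conflicts lists names present in both,
    which a well-formed file should never contain.
    """
    conflicts = sorted(set(shredded_fields) & set(residual_fields))
    merged = dict(residual_fields)
    merged.update(shredded_fields)  # the shredded value wins on a name clash
    return merged, conflicts


merged, conflicts = reconstruct_object({"a": 1}, {"a": 99, "b": 2})
print(merged)     # {'a': 1, 'b': 2}
print(conflicts)  # ['a']
```

A cross-implementation test over a deliberately malformed file would then
assert that every reader returns the shredded value for the clashing name
rather than failing or returning the other copy.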


On Mon, 20 Apr 2026 at 19:08, Qiegang Long <[email protected]> wrote:

> Thanks for the doc to track the status! +1 on the dedicated
> sync—definitely feels like there’s a lot of work before we see Variant’s
> full potential.
>
> Qiegang
>
> On Mon, Apr 20, 2026 at 11:09 AM Steve Loughran <[email protected]>
> wrote:
>
>>
>> This is great, we need that tracker as it is a cross-project piece of work
>> to say "this is ready".
>>
>> I did have an agenda item from last month's community call which didn't
>> get through. If we can retain that open time slot we could do a very quick
>> summary of where we are (summary slides of Qiegang's results and mine, key
>> outstanding issues and next steps), then we can start that monthly session
>> on it.
>>
>> Meanwhile, I have both parquet and iceberg PRs for benchmarks which I
>> think are ready for review; please take a look.
>>
>> Finally, I'm thinking about interop of those many, many variant readers
>> out there. Has anyone explicitly cross-tested their implementations of
>> variant? What about consistent handling of invalid data? That includes
>> iceberg-rust, parquet-cpp and more...
>>
>> Steve
>>
>> On Sun, 19 Apr 2026 at 21:57, Neelesh Salian <[email protected]>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> The Variant umbrella issue (#10392
>>> <https://github.com/apache/iceberg/issues/10392>) hasn't been updated
>>> in a while, and with active work happening across multiple PRs in Iceberg,
>>> Spark, and Parquet, it's been hard to keep track of where things stand.
>>>
>>> Since a few of us are actively working on variant features, I thought it
>>> would help to put together a tracking document so the community has a
>>> single place to see the current state, open work, and benchmark findings. I
>>> plan to update this on a weekly basis to keep track of the issues and PRs
>>> that are updated.
>>>
>>> Iceberg Variant Community Document
>>> <https://docs.google.com/document/d/1IuhLRxw1rcPD_f4jgHuGe3SwFgy7Y5wgEGvLzf6311s/edit?usp=sharing>
>>>
>>> The document has three tabs:
>>>
>>>    1. Overview - what shipped in 1.10, what's merged to main, open work
>>>    areas, and the dependency graph across Iceberg, Spark, and Parquet
>>>    2. Tracker - all open variant issues and PRs across Iceberg,
>>>    Parquet-Java, Parquet-Format, and Spark with authors and status
>>>    3. Benchmarks - summary of three independent benchmark efforts
>>>    (details below)
>>>
>>> *Benchmark findings*
>>>
>>> Three independent benchmarks have measured variant performance. All
>>> converge on the same picture: variant is a modest improvement over JSON
>>> strings today (1.1-1.7x faster reads), but 15-17x slower than typed columns.
>>>
>>>    1. Qiegang Long - 14 queries on GitHub Archive, 5 configs:
>>>    https://qlong.github.io/posts/2026-03-30-variant-early-results
>>>    2. Steve Loughran - JMH microbenchmarks, profiler-driven
>>>    optimization:
>>>    https://steveloughran.github.io/benchmarking-variants/benchmarking-variants.html
>>>    3. Neelesh Salian - Controlled baseline, 10M+100M rows, write +
>>>    read:
>>>    https://github.com/nssalian/iceberg/tree/iceberg-variant-benchmark/benchmark
>>>
>>> If you're working on variant-related changes, please chime in or let me
>>> know and I'll add it to the tracker. Feedback on the benchmarks or anything
>>> else is welcome.
>>>
>>> I've been giving variant updates during the Iceberg Spark Sync
>>> (Tuesdays, 10 AM PT), but given that this work now spans Iceberg, Spark,
>>> Parquet, and Flink, I think it deserves its own forum. I'd like to propose
>>> a monthly Variant Sync: a short call where contributors can share progress,
>>> surface blockers, and coordinate across repos. If there's interest, I'll
>>> set one up and share an invite on this thread.
>>>
>>> Thanks,
>>> Neelesh Salian.
>>>
>>
