Re: Iceberg / Spark syncs

Aihua Xu Tue, 20 Jan 2026 13:50:00 -0800

Thanks Anurag for driving this and for including "Support writing shredded
variant in Iceberg-Spark" (#14297
<https://github.com/apache/iceberg/pull/14297>). I would appreciate it if
folks from the Spark side could help review the PR.


Thanks,
Aihua



On Tue, Jan 20, 2026 at 1:17 PM karuppayya <[email protected]> wrote:

> Thanks Anurag for driving this. Unfortunately, I wasn't able to attend
> today. I reviewed the video.
>
> @prashant, regarding
>
>    -
>
>    Alpha family aggregate support - #52551
>    <https://github.com/apache/spark/pull/52551>
>
> The Iceberg spec [1] requires that NDV be computed via an Alpha-family
> theta sketch. We currently have a ThetaSketchAgg expression in the Iceberg
> code that uses the Alpha family [2].
> Spark recently introduced support for ThetaSketch aggregates [3] (*which
> currently supports only the QuickSelect family*); this PR introduces *Alpha
> family support*.
> Once we have this, we no longer need to maintain the code in Iceberg, and
> we also benefit from Spark community improvements (either to Spark
> Catalyst in general or to the expression specifically).
>
> With regards to
>
>    -
>
>    Codegen for MergeRowsExec - #52399
>    <https://github.com/apache/spark/pull/52399>
>
> This would help speed up merge execution (and also simplify SplitIterator
> <https://github.com/apache/spark/pull/52399/changes#diff-a572ff40254b26b4a903f101ee466dd2dff9b8c7954a3b957fe5fc25b87ee10aR241-R242>
> logic).
>
> - Karuppayya
> [1] -
> https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type
> [2] -
> https://github.com/apache/iceberg/blob/main/spark/v4.1/spark/src/main/scala/org/apache/spark/sql/stats/ThetaSketchAgg.scala#L66-L68
> [3] -
> https://github.com/karuppayya/spark/commit/6ff9edcaf16c90007508f15de98fac361e234381
>
>
>
>
> On Tue, Jan 20, 2026 at 12:28 PM Anurag Mantripragada <
> [email protected]> wrote:
>
>> Thanks everyone for joining the first Iceberg/Spark community sync.
>>
>> Here is the recording: https://youtu.be/g4n2hwdFosE?si=n9hVRhCThshuOqd5
>>
>>
>> Below are the discussion highlights.
>>
>> Datafusion Comet integration
>>
>>    - Spark: Encapsulate parquet objects for Comet (#13786
>>      <https://github.com/apache/iceberg/pull/13786>)
>>    - Future of Iceberg support in Comet (datafusion-comet#2921
>>      <https://github.com/apache/datafusion-comet/issues/2921>)
>>       - Mailing List Discussion
>>         <https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd>
>>    - Notes:
>>       - Rust vs Java - Discuss and vote in the dev list
>>       - To move forward with (#13786
>>         <https://github.com/apache/iceberg/pull/13786>) - Discuss in
>>         FileFormat API sync if there are any pending items this PR needs
>>         updates on.
>>       - Make a decision to merge the PR vs waiting for FileFormat API
>>
>>
>> Spark 3.4 Deprecation
>>
>>    - Spark: Remove Spark 3.4 support (#14122
>>      <https://github.com/apache/iceberg/pull/14122>)
>>    - Notes:
>>       - Wait until comet integration is resolved.
>>
>> Spark 4.1/4.2
>>
>>    - Spark: Add support for 4.2.0-preview (#14984
>>      <https://github.com/apache/iceberg/pull/14984>)
>>    - Spark 4.1: Initial support for MERGE INTO schema evolution (#14970
>>      <https://github.com/apache/iceberg/pull/14970>)
>>    - Notes:
>>       - 4.1 is the current latest version. New PRs must go to it
>>       - Spark 4.1 introduces a version framework. Anton is working on
>>         integrating it with Iceberg. This greatly simplifies Iceberg
>>         lifecycle management but requires non-trivial integration work.
>>       - Prefer not to make any releases with 4.1 until this is in.
>>
>>
>> DSv2 and sort order reporting
>>
>>    - Spark (4.0, 3.5): Set data file sort_order_id in manifest for
>>      writes from Spark (#14683
>>      <https://github.com/apache/iceberg/pull/14683>)
>>       - The rebase has many changes. Ask author to fix.
>>    - Spark 4.0: Implement SupportsReportOrdering DSv2 API (#14948
>>      <https://github.com/apache/iceberg/pull/14948>)
>>       - Move to 4.1 for easier review
>>
>>
>> Compaction/Table maintenance/DR
>>
>>    - Spark 4.0: RewriteTablePath support for multiple source and
>>      destination prefixes (#14355
>>      <https://github.com/apache/iceberg/pull/14355>)
>>    - Spark 4.0: Optional switch to log expire data files during
>>      ExpireSnapshots action (#14354
>>      <https://github.com/apache/iceberg/pull/14354>)
>>    - Notes:
>>       - Trace level logging
>>       - How about logging it to another Iceberg table?
>>       - Use the dataframe of files and log separately?
>>
>>
>> V3 spec implementation
>>
>>    - Spark: Support writing shredded variant in Iceberg-Spark (#14297
>>      <https://github.com/apache/iceberg/pull/14297>)
>>    - Notes:
>>       - Status of Variant type support - consolidate and track somewhere
>>       - Filter pushdown not implemented
>>       - The write support PR is new, will review. It should have Iceberg
>>         metadata changes to indicate the variant shredding so Spark can
>>         use it.
>>       - #14297 <https://github.com/apache/iceberg/pull/14297> Will be
>>         reviewed
>>
>>
>> Spark UDF Support
>>
>>    - SQL UDF support Stage 1 (#14954
>>      <https://github.com/apache/iceberg/pull/14954>) (The corresponding
>>      Spark SPIP: SPIP: Catalog-backed Code-Literal Functions (SQL and
>>      Python) with Catalog SPI and CRUD
>>      <https://docs.google.com/document/d/186cTAZxoXp1p8vaSunIaJmVLXcPR-FxSiLiDUl8kK8A/edit?tab=t.0#heading=h.for1fb3tezo3>)
>>    - Notes:
>>       - Waiting for the proposal vote and spark side SPIP related to
>>         this.
>>    - Spark 4.0: Spark UDF POC (#14505
>>      <https://github.com/apache/iceberg/pull/14505>)
>>       - Huaxin to delete this PR. This version is hacky.
>>
>>
>> DDL and schema evolution
>>
>>    - CREATE TABLE LIKE support (#14269
>>      <https://github.com/apache/iceberg/pull/14269>)
>>    - Notes:
>>       - Recommend not to add SQL extensions in Iceberg code anymore.
>>          - They are fragile and need maintenance and have to work well
>>            with Spark
>>       - Alternatively, consider writing a procedure to do this until
>>         Spark has native support.
>>       - Native Spark support for CREATE TABLE LIKE is not yet
>>         implemented.
>>
>>
>> Spark PRs
>>
>>    - Alpha family aggregate support - #52551
>>      <https://github.com/apache/spark/pull/52551>
>>       - Notes:
>>          - Okay to have Spark only changes that can potentially help in
>>            Iceberg use-cases
>>          - Elaborate on the use of this? How does this integrate with
>>            Iceberg?
>>    - Codegen for MergeRowsExec - #52399
>>      <https://github.com/apache/spark/pull/52399>
>>       - Notes:
>>          - This is a heavily used Exec node in Iceberg so this is good to
>>            have.
>>          - The community will review this
>>
>>
>>
>>
>> Thanks,
>> ~Anurag
>>
>> On Thu, Jan 15, 2026 at 6:48 PM Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> If anyone has long-standing PRs related to Spark, it may be a good forum
>>> to get some reviews and help from the community.
>>>
>>> On Wed, Jan 14, 2026 at 11:23 Anurag Mantripragada <
>>> [email protected]> wrote:
>>>
>>>> Thanks Kevin,
>>>>
>>>> All, please review the doc
>>>> <https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0>
>>>> and add any agenda items I may have missed. See you on Tuesday.
>>>>
>>>> ~ Anurag
>>>>
>>>> On Wed, Jan 14, 2026 at 11:20 AM Kevin Liu <[email protected]>
>>>> wrote:
>>>>
>>>>> Connected with Anurag on Slack. I just added a new event to the
>>>>> Iceberg Dev calendar for next Tuesday, Jan 20th, from 10 AM to 11 AM PT:
>>>>> "*Iceberg - Spark Community Sync*". It's a monthly recurring meeting, and
>>>>> the Google Meet link is open to the public.
>>>>> Happy to make changes based on feedback.
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>>
>>>>> On Wed, Jan 14, 2026 at 10:57 AM Kevin Liu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Looking at the current Iceberg dev calendar, we have a slot next
>>>>>> Tuesday or Friday for a monthly recurring sync. Wednesday coincides
>>>>>> with the main Community Sync in some weeks.
>>>>>> Please let me know the preferred day and time and I can help set it
>>>>>> up!
>>>>>>
>>>>>> Best,
>>>>>> Kevin Liu
>>>>>>
>>>>>> On Tue, Jan 13, 2026 at 10:58 AM Anurag Mantripragada <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Kevin,
>>>>>>>
>>>>>>> I'm open to ideas, but I think we could start with a monthly cadence
>>>>>>> for Spark syncs and increase the frequency if the community feels we
>>>>>>> need to meet more often. Could you please set up a time on the Iceberg
>>>>>>> dev calendar?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anurag
>>>>>>>
>>>>>>> On Fri, Jan 9, 2026 at 10:16 AM Anurag Mantripragada <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Anton and Kevin,
>>>>>>>>
>>>>>>>> I wrote a doc with general themes from the Spark PRs and issues I
>>>>>>>> browsed in the repo. Please feel free to add more if I missed
>>>>>>>> anything.
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0
>>>>>>>>
>>>>>>>> Looking forward to meeting you all and talking about all things
>>>>>>>> Spark!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anurag
>>>>>>>>
>>>>>>>> On Fri, Jan 9, 2026 at 10:03 AM Kevin Liu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 great idea!
>>>>>>>>> Let's start a doc with potential discussion items and find a time
>>>>>>>>> on the calendar. I have permission to add events to the "iceberg dev
>>>>>>>>> events" calendar. Happy to help with the logistics once the time and
>>>>>>>>> cadence are decided.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kevin Liu
>>>>>>>>>
>>>>>>>>> On Wed, Jan 7, 2026 at 4:35 PM Anton Okolnychyi <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> YES! I have been meaning to suggest the same.
>>>>>>>>>>
>>>>>>>>>> Can you start a doc with a pool of items to which everyone can
>>>>>>>>>> contribute?
>>>>>>>>>>
>>>>>>>>>> - Anton
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 7, 2026 at 15:30 Anurag Mantripragada <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi folks, happy new year!
>>>>>>>>>>>
>>>>>>>>>>> (Sorry if I sent this email more than once; my attempts to send
>>>>>>>>>>> it from a different email address failed.)
>>>>>>>>>>>
>>>>>>>>>>> There are a few Spark changes the community is working on
>>>>>>>>>>> including
>>>>>>>>>>> - Sort order reporting [1], [2]
>>>>>>>>>>> - Spark 4.1 support [3]
>>>>>>>>>>> - Future of Datafusion-Comet support [4] [5]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Community members interested in the Spark integration have been
>>>>>>>>>>> discussing it in smaller groups. However, we believe that the general
>>>>>>>>>>> community sync should include all updates, and discussing
>>>>>>>>>>> Spark-specific matters may not be the most effective use of that sync.
>>>>>>>>>>> I was wondering whether it would be useful to create a Spark-Iceberg
>>>>>>>>>>> integration-specific sync on the calendar, similar to what we have for
>>>>>>>>>>> individual proposals. This sync will not replace the community sync,
>>>>>>>>>>> which will still be used for broader discussions, including any new
>>>>>>>>>>> Spark topics that come out of the Spark sync.
>>>>>>>>>>>
>>>>>>>>>>> If there’s interest in doing these Spark breakout syncs, I’m
>>>>>>>>>>> happy to volunteer to run them. Please let me know what you all
>>>>>>>>>>> think.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> ~ Anurag
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [1] - https://github.com/apache/iceberg/pull/14683
>>>>>>>>>>> [2] - https://github.com/apache/iceberg/pull/14948
>>>>>>>>>>> [3] - https://github.com/apache/iceberg/pull/14970
>>>>>>>>>>> [4] - https://github.com/apache/datafusion-comet/issues/2921
>>>>>>>>>>> [5] -
>>>>>>>>>>> https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd
>>>>>>>>>>>
>>>>>>>>>>
