Re: Iceberg / Spark syncs

karuppayya Tue, 20 Jan 2026 13:18:35 -0800

Thanks Anurag for driving this. Unfortunately, I wasn't able to attend
today, but I reviewed the video.


@prashant, regarding

   - Alpha family aggregate support - #52551
     <https://github.com/apache/spark/pull/52551>

The Iceberg spec[1] requires that NDV be computed via an Alpha family
sketch. We currently have a ThetaSketchAgg expression in the Iceberg code
that uses the Alpha family[2].
Spark recently introduced support for ThetaSketch aggregates[3] (*which
currently support only the QuickSelect family*); this PR adds *Alpha
family support*.
Once we have this, we no longer need to maintain that code in Iceberg, and
we also benefit from the Spark community's work (improvements to Spark
Catalyst in general or to this expression specifically).
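
For readers less familiar with sketch-based NDV, here is a toy illustration of the general idea: keep a bounded set of the smallest hash values and estimate the distinct count from the largest retained hash. This is a simplified K-minimum-values sketch in the same spirit as a theta sketch. It is not the Alpha or QuickSelect algorithm from Apache DataSketches, and it is not the code in the Spark PR or in Iceberg. `KmvSketch` and its methods are hypothetical names chosen for this example.

```python
import hashlib
import heapq


class KmvSketch:
    """Toy K-minimum-values sketch (hypothetical; not the DataSketches Alpha algorithm)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (negated values) holding the k smallest hashes
        self._members = set()  # the same hashes, used for duplicate detection

    @staticmethod
    def _hash(value):
        # Map a value to a pseudo-uniform float in the unit interval.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _offer(self, h):
        # Retain h only if it is new and among the k smallest hashes seen.
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, value):
        self._offer(self._hash(value))

    def merge(self, other):
        # The k smallest hashes of the union summarize the union of both inputs.
        merged = KmvSketch(self.k)
        for h in self._members | other._members:
            merged._offer(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact count
        theta = -self._heap[0]             # k-th smallest hash seen so far
        return (self.k - 1) / theta


# Two partial aggregates with overlapping values, merged into one estimate.
day1 = KmvSketch()
day2 = KmvSketch()
for user_id in range(0, 30_000):
    day1.update(user_id)
for user_id in range(20_000, 50_000):
    day2.update(user_id)
print(round(day1.merge(day2).estimate()))  # approximate NDV of the union (true value: 50000)
```

The `merge` step is what makes this kind of sketch usable as an aggregate: partial sketches built per task or per file can be combined without rescanning the data.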

With regard to

   - Codegen for MergeRowsExec - #52399
     <https://github.com/apache/spark/pull/52399>

This would help speed up merge execution (and also simplify SplitIterator
<https://github.com/apache/spark/pull/52399/changes#diff-a572ff40254b26b4a903f101ee466dd2dff9b8c7954a3b957fe5fc25b87ee10aR241-R242>
logic).

- Karuppayya
[1] -
https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type
[2] -
https://github.com/apache/iceberg/blob/main/spark/v4.1/spark/src/main/scala/org/apache/spark/sql/stats/ThetaSketchAgg.scala#L66-L68
[3] -
https://github.com/karuppayya/spark/commit/6ff9edcaf16c90007508f15de98fac361e234381




On Tue, Jan 20, 2026 at 12:28 PM Anurag Mantripragada <
[email protected]> wrote:

> Thanks everyone for joining the first Iceberg/Spark community sync.
>
> Here is the recording: https://youtu.be/g4n2hwdFosE?si=n9hVRhCThshuOqd5
>
>
> Below are the discussion highlights.
>
> Datafusion Comet integration
>
>    - Spark: Encapsulate parquet objects for Comet (#13786
>      <https://github.com/apache/iceberg/pull/13786>)
>    - Future of Iceberg support in Comet (datafusion-comet#2921
>      <https://github.com/apache/datafusion-comet/issues/2921>)
>       - Mailing List Discussion
>         <https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd>
>    - Notes:
>       - Rust vs Java - Discuss and vote in the dev list
>       - To move forward with (#13786
>         <https://github.com/apache/iceberg/pull/13786>) - Discuss in
>         FileFormat API sync if there are any pending items this PR needs
>         updates on.
>       - Make a decision to merge the PR vs waiting for FileFormat API
>
> Spark 3.4 Deprecation
>
>    - Spark: Remove Spark 3.4 support (#14122
>      <https://github.com/apache/iceberg/pull/14122>)
>    - Notes:
>       - Wait until comet integration is resolved.
>
> Spark 4.1/4.2
>
>    - Spark: Add support for 4.2.0-preview (#14984
>      <https://github.com/apache/iceberg/pull/14984>)
>    - Spark 4.1: Initial support for MERGE INTO schema evolution (#14970
>      <https://github.com/apache/iceberg/pull/14970>)
>    - Notes:
>       - 4.1 is the current latest version. New PRs must go to it
>       - Spark 4.1 introduces a version framework. Anton is working on
>         integrating it with Iceberg. This greatly simplifies Iceberg
>         lifecycle management but requires non-trivial integration work.
>       - Prefer not to make any releases with 4.1 until this is in.
>
>
> DSv2 and sort order reporting
>
>    - Spark (4.0, 3.5): Set data file sort_order_id in manifest for
>      writes from Spark (#14683
>      <https://github.com/apache/iceberg/pull/14683>)
>       - The rebase has many changes. Ask author to fix.
>    - Spark 4.0: Implement SupportsReportOrdering DSv2 API (#14948
>      <https://github.com/apache/iceberg/pull/14948>)
>       - Move to 4.1 for easier review
>
>
> Compaction/Table maintenance/DR
>
>    - Spark 4.0: RewriteTablePath support for multiple source and
>      destination prefixes (#14355
>      <https://github.com/apache/iceberg/pull/14355>)
>    - Spark 4.0: Optional switch to log expire data files during
>      ExpireSnapshots action (#14354
>      <https://github.com/apache/iceberg/pull/14354>)
>    - Notes:
>       - Trace level logging
>       - How about logging it to another Iceberg table?
>       - Use the dataframe of files and log separately?
>
>
> V3 spec implementation
>
>    - Spark: Support writing shredded variant in Iceberg-Spark (#14297
>      <https://github.com/apache/iceberg/pull/14297>)
>    - Notes:
>       - Status of Variant type support - consolidate and track
>         somewhere
>       - Filter pushdown not implemented
>       - The write support PR is new, will review. It should have
>         Iceberg metadata changes to indicate the variant shredding so
>         Spark can use it.
>       - #14297 <https://github.com/apache/iceberg/pull/14297> Will be
>         reviewed
>
>
> Spark UDF Support
>
>    - SQL UDF support Stage 1 (#14954
>      <https://github.com/apache/iceberg/pull/14954>) (The
>      corresponding Spark SPIP: SPIP: Catalog-backed Code-Literal
>      Functions (SQL and Python) with Catalog SPI and CRUD
>      <https://docs.google.com/document/d/186cTAZxoXp1p8vaSunIaJmVLXcPR-FxSiLiDUl8kK8A/edit?tab=t.0#heading=h.for1fb3tezo3>
>      )
>    - Notes:
>       - Waiting for the proposal vote and spark side SPIP related to
>         this.
>    - Spark 4.0: Spark UDF POC (#14505
>      <https://github.com/apache/iceberg/pull/14505>)
>       - Huaxin to delete this PR. This version is hacky.
>
>
> DDL and schema evolution
>
>    - CREATE TABLE LIKE support (#14269
>      <https://github.com/apache/iceberg/pull/14269>)
>    - Notes:
>       - Recommend not to add SQL extensions in Iceberg code anymore.
>          - They are fragile and need maintenance and have to work
>            well with Spark
>       - Alternatively, consider writing a procedure to do this until
>         Spark has native support.
>       - Native Spark support for CREATE TABLE LIKE is not yet
>         implemented.
>
> Spark PRs
>
>    - Alpha family aggregate support - #52551
>      <https://github.com/apache/spark/pull/52551>
>       - Notes:
>          - Okay to have Spark only changes that can potentially help in
>            Iceberg use-cases
>          - Elaborate on the use of this? How does this integrate with
>            Iceberg?
>    - Codegen for MergeRowsExec - #52399
>      <https://github.com/apache/spark/pull/52399>
>       - Notes:
>          - This is a heavily used Exec node in Iceberg so this is good
>            to have.
>          - The community will review this
>
>
>
>
> Thanks,
> ~Anurag
>
> On Thu, Jan 15, 2026 at 6:48 PM Anton Okolnychyi <[email protected]>
> wrote:
>
>> If anyone has long-standing PRs related to Spark, it may be a good forum
>> to get some reviews and help from the community.
>>
>> On Wed, Jan 14, 2026 at 11:23 AM Anurag Mantripragada <
>> [email protected]> wrote:
>>
>>> Thanks Kevin,
>>>
>>> All, please review the doc
>>> <https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0>
>>> and add any agenda items I may have missed. See you on Tuesday.
>>>
>>> ~ Anurag
>>>
>>> On Wed, Jan 14, 2026 at 11:20 AM Kevin Liu <[email protected]>
>>> wrote:
>>>
>>>> Connected with Anurag on Slack. I just added a new event to the Iceberg
>>>> Dev calendar for next week Tuesday Jan 20th from 10AM - 11AM PT, "*Iceberg
>>>> - Spark Community Sync*". It's a monthly recurring meeting and the
>>>> google meets link is set to open to the public.
>>>> Happy to make changes based on feedback.
>>>>
>>>> Best,
>>>> Kevin Liu
>>>>
>>>>
>>>> On Wed, Jan 14, 2026 at 10:57 AM Kevin Liu <[email protected]>
>>>> wrote:
>>>>
>>>>> Looking at the current Iceberg dev calendar schedule, we have a slot
>>>>> next week Tuesday or Friday for a monthly recurring sync. Wednesday
>>>>> corresponds with the main Community Sync in some weeks.
>>>>> Please let me know the preferred day and time and I can help set it
>>>>> up!
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>> On Tue, Jan 13, 2026 at 10:58 AM Anurag Mantripragada <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Kevin,
>>>>>>
>>>>>> I'm open to ideas, but I think we could start with monthly cadence
>>>>>> for Spark syncs and increase the frequency if the community feels we need
>>>>>> to meet more often. Could you please set up a time on the Iceberg dev
>>>>>> calendar?
>>>>>>
>>>>>> Thanks,
>>>>>> Anurag
>>>>>>
>>>>>> On Fri, Jan 9, 2026 at 10:16 AM Anurag Mantripragada <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks Anton and Kevin,
>>>>>>>
>>>>>>> I wrote a doc with general themes from the Spark PRs and Issues I
>>>>>>> browsed in the repo. Please feel free to add more if I may have missed
>>>>>>> anything.
>>>>>>>
>>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0
>>>>>>>
>>>>>>> Looking forward to meeting you all and talking about all things
>>>>>>> Spark!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anurag
>>>>>>>
>>>>>>> On Fri, Jan 9, 2026 at 10:03 AM Kevin Liu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 great idea!
>>>>>>>> Let's start a doc with potential discussion items and find a time
>>>>>>>> on the calendar. I have permission to add events to the "iceberg dev
>>>>>>>> events" calendar. Happy to help with the logistics once the time and
>>>>>>>> cadence is decided.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kevin Liu
>>>>>>>>
>>>>>>>> On Wed, Jan 7, 2026 at 4:35 PM Anton Okolnychyi <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> YES! I have been meaning to suggest the same.
>>>>>>>>>
>>>>>>>>> Can you start a doc with the pool of items to which everyone can
>>>>>>>>> contribute?
>>>>>>>>>
>>>>>>>>> - Anton
>>>>>>>>>
>>>>>>>>> On Wed, Jan 7, 2026 at 3:30 PM Anurag Mantripragada <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi folks, happy new year!
>>>>>>>>>>
>>>>>>>>>> (Sorry if I sent this email more than once, my attempts of
>>>>>>>>>> sending this from a different email failed)
>>>>>>>>>>
>>>>>>>>>> There are a few Spark changes the community is working on
>>>>>>>>>> including
>>>>>>>>>> - Sort order reporting [1], [2]
>>>>>>>>>> - Spark 4.1 support [3]
>>>>>>>>>> - Future of Datafusion-Comet support [4] [5]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Community members interested in the Spark integration have been
>>>>>>>>>> discussing it in smaller groups. However, we believe that the general
>>>>>>>>>> community sync should include all updates, and discussing
>>>>>>>>>> Spark-specific matters may not be the most effective use of that sync.
>>>>>>>>>> I was wondering if it will be useful to create a Spark-Iceberg
>>>>>>>>>> integration-specific sync on the calendar, similar to what we have for
>>>>>>>>>> individual proposals. This sync will not replace the community sync,
>>>>>>>>>> which will still be used for broader discussions including any new
>>>>>>>>>> spark topics that come out of the spark sync.
>>>>>>>>>>
>>>>>>>>>> If there’s interest in doing these spark breakout syncs, I’m
>>>>>>>>>> happy to volunteer to run them. Please let me know what you all
>>>>>>>>>> think.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> ~ Anurag
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] - https://github.com/apache/iceberg/pull/14683
>>>>>>>>>> [2] - https://github.com/apache/iceberg/pull/14948
>>>>>>>>>> [3] - https://github.com/apache/iceberg/pull/14970
>>>>>>>>>> [4] - https://github.com/apache/datafusion-comet/issues/2921
>>>>>>>>>> [5] -
>>>>>>>>>> https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd
>>>>>>>>>>
>>>>>>>>>
