Re: Iceberg / Spark syncs

Anurag Mantripragada Tue, 17 Feb 2026 10:43:30 -0800

Hi all,

Thanks for joining the Iceberg-Spark community sync today. You can find the
recording here <https://youtu.be/vNvOHMHXHGw>.


Below are the highlights from the meeting:

   - Comet Integration: Thanks to everyone involved in reviewing and
   merging the FileFormat API PRs to unblock the Comet team. The final
   required PR is #15328 <https://github.com/apache/iceberg/pull/15328>.
   - Variant Support: There are questions regarding the status of the
   variant support PR (#14297 <https://github.com/apache/iceberg/pull/14297>),
   specifically whether it completes the full end-to-end implementation for
   reading and writing variants in Spark and if it will be included in the
   next release.
   - DSv2 Version Framework: We discussed the DSv2 version framework PRs
   and whether they are required for the upcoming release.

~ Anurag

On Tue, Jan 20, 2026 at 1:49 PM Aihua Xu <[email protected]> wrote:

> Thanks, Anurag, for driving this and for including "Support writing shredded
> variant in Iceberg-Spark" (#14297
> <https://github.com/apache/iceberg/pull/14297>). I would appreciate it if
> folks from the Spark side could help review the PR.
>
> Thanks,
> Aihua
>
>
>
> On Tue, Jan 20, 2026 at 1:17 PM karuppayya <[email protected]>
> wrote:
>
>> Thanks Anurag for driving this. Unfortunately, I wasn't able to attend
>> today. I reviewed the video.
>>
>> @prashant, regarding
>>
>>    - Alpha family aggregate support - #52551
>>    <https://github.com/apache/spark/pull/52551>
>>
>> The Iceberg spec[1] requires that NDV be computed via an Alpha family
>> sketch. We currently have a ThetaSketchAgg expression in the Iceberg code
>> that uses the Alpha family[2].
>> Spark recently introduced support for ThetaSketch aggregates[3] (*which
>> currently only support the quickselect family*); this PR introduces *Alpha
>> family support*.
>> Once we have this, we no longer need to maintain the code in Iceberg, and
>> we also benefit from the Spark community's work (improvements either to
>> Spark Catalyst in general or to the expression specifically).
>>
>> With regard to
>>
>>    - Codegen for MergeRowsExec - #52399
>>    <https://github.com/apache/spark/pull/52399>
>>
>> This would help speed up merge execution (and also simplify SplitIterator
>> <https://github.com/apache/spark/pull/52399/changes#diff-a572ff40254b26b4a903f101ee466dd2dff9b8c7954a3b957fe5fc25b87ee10aR241-R242>
>> logic).
>>
>> - Karuppayya
>> [1] -
>> https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type
>> [2] -
>> https://github.com/apache/iceberg/blob/main/spark/v4.1/spark/src/main/scala/org/apache/spark/sql/stats/ThetaSketchAgg.scala#L66-L68
>> [3] -
>> https://github.com/karuppayya/spark/commit/6ff9edcaf16c90007508f15de98fac361e234381
>>
>>
>>
>>
>> On Tue, Jan 20, 2026 at 12:28 PM Anurag Mantripragada <
>> [email protected]> wrote:
>>
>>> Thanks everyone for joining the first Iceberg/Spark community sync.
>>>
>>> Here is the recording: https://youtu.be/g4n2hwdFosE?si=n9hVRhCThshuOqd5
>>>
>>>
>>> Below are the discussion highlights.
>>>
>>>    - Datafusion Comet integration
>>>       - Spark: Encapsulate parquet objects for Comet (#13786
>>>       <https://github.com/apache/iceberg/pull/13786>)
>>>       - Future of Iceberg support in Comet (datafusion-comet#2921
>>>       <https://github.com/apache/datafusion-comet/issues/2921>)
>>>          - Mailing List Discussion
>>>          <https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd>
>>>       - Notes:
>>>          - Rust vs Java - Discuss and vote in the dev list
>>>          - To move forward with (#13786
>>>          <https://github.com/apache/iceberg/pull/13786>) - Discuss in
>>>          FileFormat API sync if there are any pending items this PR needs
>>>          updates on.
>>>          - Make a decision to merge the PR vs waiting for FileFormat API
>>>
>>>    - Spark 3.4 Deprecation
>>>       - Spark: Remove Spark 3.4 support (#14122
>>>       <https://github.com/apache/iceberg/pull/14122>)
>>>       - Notes:
>>>          - Wait until comet integration is resolved.
>>>
>>>    - Spark 4.1/4.2
>>>       - Spark: Add support for 4.2.0-preview (#14984
>>>       <https://github.com/apache/iceberg/pull/14984>)
>>>       - Spark 4.1: Initial support for MERGE INTO schema evolution (#14970
>>>       <https://github.com/apache/iceberg/pull/14970>)
>>>       - Notes:
>>>          - 4.1 is the current latest version. New PRs must go to it
>>>          - Spark 4.1 introduces a version framework. Anton is working on
>>>          integrating it with Iceberg. This greatly simplifies Iceberg
>>>          lifecycle management but requires non-trivial integration work.
>>>          - Prefer not to make any releases with 4.1 until this is in.
>>>
>>>
>>>    - DSv2 and sort order reporting
>>>       - Spark (4.0, 3.5): Set data file sort_order_id in manifest for
>>>       writes from Spark (#14683
>>>       <https://github.com/apache/iceberg/pull/14683>)
>>>          - The rebase has many changes. Ask author to fix.
>>>       - Spark 4.0: Implement SupportsReportOrdering DSv2 API (#14948
>>>       <https://github.com/apache/iceberg/pull/14948>)
>>>          - Move to 4.1 for easier review
>>>
>>>
>>>    - Compaction/Table maintenance/DR
>>>       - Spark 4.0: RewriteTablePath support for multiple source and
>>>       destination prefixes (#14355
>>>       <https://github.com/apache/iceberg/pull/14355>)
>>>       - Spark 4.0: Optional switch to log expire data files during
>>>       ExpireSnapshots action (#14354
>>>       <https://github.com/apache/iceberg/pull/14354>)
>>>       - Notes:
>>>          - Trace level logging
>>>          - How about logging it to another Iceberg table?
>>>          - Use the dataframe of files and log separately?
>>>
>>>
>>>    - V3 spec implementation
>>>       - Spark: Support writing shredded variant in Iceberg-Spark (#14297
>>>       <https://github.com/apache/iceberg/pull/14297>)
>>>       - Notes:
>>>          - Status of Variant type support - consolidate and track somewhere
>>>          - Filter pushdown not implemented
>>>          - The write support PR is new, will review. It should have Iceberg
>>>          metadata changes to indicate the variant shredding so Spark can
>>>          use it.
>>>          - #14297 <https://github.com/apache/iceberg/pull/14297> Will be
>>>          reviewed
>>>
>>>
>>>    - Spark UDF Support
>>>       - SQL UDF support Stage 1 (#14954
>>>       <https://github.com/apache/iceberg/pull/14954>) (The corresponding
>>>       Spark SPIP: SPIP: Catalog-backed Code-Literal Functions (SQL and
>>>       Python) with Catalog SPI and CRUD
>>>       <https://docs.google.com/document/d/186cTAZxoXp1p8vaSunIaJmVLXcPR-FxSiLiDUl8kK8A/edit?tab=t.0#heading=h.for1fb3tezo3>)
>>>       - Notes:
>>>          - Waiting for the proposal vote and spark side SPIP related to
>>>          this.
>>>       - Spark 4.0: Spark UDF POC (#14505
>>>       <https://github.com/apache/iceberg/pull/14505>)
>>>          - Huaxin to delete this PR. This version is hacky.
>>>
>>>
>>>    - DDL and schema evolution
>>>       - CREATE TABLE LIKE support (#14269
>>>       <https://github.com/apache/iceberg/pull/14269>)
>>>       - Notes:
>>>          - Recommend not to add SQL extensions in Iceberg code anymore.
>>>             - They are fragile and need maintenance and have to work well
>>>             with Spark
>>>          - Alternatively, consider writing a procedure to do this until
>>>          Spark has native support.
>>>          - Native Spark support for CREATE TABLE LIKE is not yet
>>>          implemented.
>>>    - Spark PRs
>>>       - Alpha family aggregate support - #52551
>>>       <https://github.com/apache/spark/pull/52551>
>>>          - Notes:
>>>             - Okay to have Spark only changes that can potentially help in
>>>             Iceberg use-cases
>>>             - Elaborate on the use of this? How does this integrate with
>>>             Iceberg?
>>>       - Codegen for MergeRowsExec - #52399
>>>       <https://github.com/apache/spark/pull/52399>
>>>          - Notes:
>>>             - This is a heavily used Exec node in Iceberg so this is good
>>>             to have.
>>>             - The community will review this
>>>
>>>
>>>
>>>
>>> Thanks,
>>> ~Anurag
>>>
>>> On Thu, Jan 15, 2026 at 6:48 PM Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> If anyone has long-standing PRs related to Spark, it may be a good
>>>> forum to get some reviews and help from the community.
>>>>
>>>> On Wed, Jan 14, 2026 at 11:23 AM Anurag Mantripragada <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Kevin,
>>>>>
>>>>> All, please review the doc
>>>>> <https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0>
>>>>> and add any agenda items I may have missed. See you on Tuesday.
>>>>>
>>>>> ~ Anurag
>>>>>
>>>>> On Wed, Jan 14, 2026 at 11:20 AM Kevin Liu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Connected with Anurag on Slack. I just added a new event to the
>>>>>> Iceberg Dev calendar for next Tuesday, Jan 20th, from 10AM - 11AM PT:
>>>>>> "*Iceberg - Spark Community Sync*". It's a monthly recurring meeting and
>>>>>> the Google Meet link is open to the public.
>>>>>> Happy to make changes based on feedback.
>>>>>>
>>>>>> Best,
>>>>>> Kevin Liu
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 14, 2026 at 10:57 AM Kevin Liu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Looking at the current Iceberg dev calendar schedule, we have a slot
>>>>>>> next week on Tuesday or Friday for a monthly recurring sync. Wednesday
>>>>>>> coincides with the main Community Sync in some weeks.
>>>>>>> Please let me know the preferred day and time and I can help set it
>>>>>>> up!
>>>>>>>
>>>>>>> Best,
>>>>>>> Kevin Liu
>>>>>>>
>>>>>>> On Tue, Jan 13, 2026 at 10:58 AM Anurag Mantripragada <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Kevin,
>>>>>>>>
>>>>>>>> I'm open to ideas, but I think we could start with a monthly cadence
>>>>>>>> for Spark syncs and increase the frequency if the community feels we
>>>>>>>> need to meet more often. Could you please set up a time on the Iceberg
>>>>>>>> dev calendar?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anurag
>>>>>>>>
>>>>>>>> On Fri, Jan 9, 2026 at 10:16 AM Anurag Mantripragada <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks Anton and Kevin,
>>>>>>>>>
>>>>>>>>> I wrote a doc with general themes from the Spark PRs and Issues I
>>>>>>>>> browsed in the repo. Please feel free to add more if I may have missed
>>>>>>>>> anything.
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0
>>>>>>>>>
>>>>>>>>> Looking forward to meeting you all and talking about all things
>>>>>>>>> Spark!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anurag
>>>>>>>>>
>>>>>>>>> On Fri, Jan 9, 2026 at 10:03 AM Kevin Liu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 great idea!
>>>>>>>>>> Let's start a doc with potential discussion items and find a time
>>>>>>>>>> on the calendar. I have permission to add events to the "iceberg dev
>>>>>>>>>> events" calendar. Happy to help with the logistics once the time and
>>>>>>>>>> cadence are decided.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kevin Liu
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 7, 2026 at 4:35 PM Anton Okolnychyi <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> YES! I have been meaning to suggest the same.
>>>>>>>>>>>
>>>>>>>>>>> Can you start a doc with a pool of items to which everyone can
>>>>>>>>>>> contribute?
>>>>>>>>>>>
>>>>>>>>>>> - Anton
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 7, 2026 at 3:30 PM Anurag Mantripragada <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi folks, happy new year!
>>>>>>>>>>>>
>>>>>>>>>>>> (Sorry if I sent this email more than once; my attempts to send
>>>>>>>>>>>> it from a different email address failed.)
>>>>>>>>>>>>
>>>>>>>>>>>> There are a few Spark changes the community is working on
>>>>>>>>>>>> including
>>>>>>>>>>>> - Sort order reporting [1], [2]
>>>>>>>>>>>> - Spark 4.1 support [3]
>>>>>>>>>>>> - Future of Datafusion-Comet support [4] [5]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Community members interested in the Spark integration have been
>>>>>>>>>>>> discussing it in smaller groups. However, we believe that the
>>>>>>>>>>>> general community sync should include all updates, and discussing
>>>>>>>>>>>> Spark-specific matters may not be the most effective use of that
>>>>>>>>>>>> sync. I was wondering if it would be useful to create a
>>>>>>>>>>>> Spark-Iceberg integration-specific sync on the calendar, similar
>>>>>>>>>>>> to what we have for individual proposals. This sync will not
>>>>>>>>>>>> replace the community sync, which will still be used for broader
>>>>>>>>>>>> discussions, including any new Spark topics that come out of the
>>>>>>>>>>>> Spark sync.
>>>>>>>>>>>>
>>>>>>>>>>>> If there’s interest in doing these spark breakout syncs, I’m
>>>>>>>>>>>> happy to volunteer to run them. Please let me know what you all
>>>>>>>>>>>> think.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> ~ Anurag
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [1] - https://github.com/apache/iceberg/pull/14683
>>>>>>>>>>>> [2] - https://github.com/apache/iceberg/pull/14948
>>>>>>>>>>>> [3] - https://github.com/apache/iceberg/pull/14970
>>>>>>>>>>>> [4] - https://github.com/apache/datafusion-comet/issues/2921
>>>>>>>>>>>> [5] -
>>>>>>>>>>>> https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd
>>>>>>>>>>>>
>>>>>>>>>>>
