Thanks, Anurag, for driving this and for including "Support writing shredded variant in Iceberg-Spark" (#14297 <https://github.com/apache/iceberg/pull/14297>). I'd appreciate it if folks from the Spark side could help review the PR.
Thanks,
Aihua

On Tue, Jan 20, 2026 at 1:17 PM karuppayya <[email protected]> wrote:

> Thanks Anurag for driving this. Unfortunately, I wasn't able to attend
> today. I reviewed the video.
>
> @prashant, regarding
>
>    - Alpha family aggregate support - #52551
>      <https://github.com/apache/spark/pull/52551>
>
> The Iceberg spec[1] requires that NDV be computed via an Alpha sketch. We
> currently have a ThetaAgg expression in the Iceberg code that uses the
> Alpha family[2].
> Spark recently introduced support for ThetaSketch aggregates[3] (*which
> currently only supports the quickselect family*); this PR introduces *Alpha
> family support*.
> Once we have this, we don't need to maintain the code in Iceberg and also
> get the benefits from the Spark community (either through improvements to
> Spark Catalyst in general or to the expression specifically).
>
> With regards to
>
>    - Codegen for MergeRowsExec - #52399
>      <https://github.com/apache/spark/pull/52399>
>
> This would help speed up merge execution (and also simplify SplitIterator
> <https://github.com/apache/spark/pull/52399/changes#diff-a572ff40254b26b4a903f101ee466dd2dff9b8c7954a3b957fe5fc25b87ee10aR241-R242>
> logic).
>
> - Karuppayya
> [1] - https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type
> [2] - https://github.com/apache/iceberg/blob/main/spark/v4.1/spark/src/main/scala/org/apache/spark/sql/stats/ThetaSketchAgg.scala#L66-L68
> [3] - https://github.com/karuppayya/spark/commit/6ff9edcaf16c90007508f15de98fac361e234381
>
> On Tue, Jan 20, 2026 at 12:28 PM Anurag Mantripragada <[email protected]> wrote:
>
>> Thanks everyone for joining the first Iceberg/Spark community sync.
>>
>> Here is the recording: https://youtu.be/g4n2hwdFosE?si=n9hVRhCThshuOqd5
>>
>> Below are the discussion highlights.
>>
>> - Datafusion Comet integration
>>    - Spark: Encapsulate parquet objects for Comet (#13786
>>      <https://github.com/apache/iceberg/pull/13786>)
>>    - Future of Iceberg support in Comet (datafusion-comet#2921
>>      <https://github.com/apache/datafusion-comet/issues/2921>)
>>    - Mailing List Discussion
>>      <https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd>
>>    - Notes:
>>       - Rust vs Java - Discuss and vote in the dev list
>>       - To move forward with (#13786
>>         <https://github.com/apache/iceberg/pull/13786>) - Discuss in
>>         FileFormat API sync if there are any pending items this PR needs
>>         updates on.
>>       - Make a decision to merge the PR vs waiting for FileFormat API
>>
>> - Spark 3.4 Deprecation
>>    - Spark: Remove Spark 3.4 support (#14122
>>      <https://github.com/apache/iceberg/pull/14122>)
>>    - Notes:
>>       - Wait until Comet integration is resolved.
>>
>> - Spark 4.1/4.2
>>    - Spark: Add support for 4.2.0-preview (#14984
>>      <https://github.com/apache/iceberg/pull/14984>)
>>    - Spark 4.1: Initial support for MERGE INTO schema evolution (#14970
>>      <https://github.com/apache/iceberg/pull/14970>)
>>    - Notes:
>>       - 4.1 is the current latest version. New PRs must go to it
>>       - Spark 4.1 introduces a version framework. Anton is working
>>         on integrating it with Iceberg. This greatly simplifies Iceberg
>>         lifecycle management but requires non-trivial integration work.
>>       - Prefer not to make any releases with 4.1 until this is in.
>>
>> - DSv2 and sort order reporting
>>    - Spark (4.0, 3.5): Set data file sort_order_id in manifest for
>>      writes from Spark (#14683
>>      <https://github.com/apache/iceberg/pull/14683>)
>>       - The rebase has many changes. Ask author to fix.
>>    - Spark 4.0: Implement SupportsReportOrdering DSv2 API (#14948
>>      <https://github.com/apache/iceberg/pull/14948>)
>>       - Move to 4.1 for easier review
>>
>> - Compaction/Table maintenance/DR
>>    - Spark 4.0: RewriteTablePath support for multiple source and
>>      destination prefixes (#14355
>>      <https://github.com/apache/iceberg/pull/14355>)
>>    - Spark 4.0: Optional switch to log expire data files during
>>      ExpireSnapshots action (#14354
>>      <https://github.com/apache/iceberg/pull/14354>)
>>    - Notes:
>>       - Trace level logging
>>       - How about logging it to another Iceberg table?
>>       - Use the dataframe of files and log separately?
>>
>> - V3 spec implementation
>>    - Spark: Support writing shredded variant in Iceberg-Spark (#14297
>>      <https://github.com/apache/iceberg/pull/14297>)
>>    - Notes:
>>       - Status of Variant type support - consolidate and track
>>         somewhere
>>       - Filter pushdown not implemented
>>       - The write support PR is new, will review. It should have
>>         Iceberg metadata changes to indicate the variant shredding so
>>         Spark can use it.
>>       - #14297 <https://github.com/apache/iceberg/pull/14297> will
>>         be reviewed
>>
>> - Spark UDF Support
>>    - SQL UDF support Stage 1 (#14954
>>      <https://github.com/apache/iceberg/pull/14954>) (The
>>      corresponding Spark SPIP: SPIP: Catalog-backed Code-Literal
>>      Functions (SQL and Python) with Catalog SPI and CRUD
>>      <https://docs.google.com/document/d/186cTAZxoXp1p8vaSunIaJmVLXcPR-FxSiLiDUl8kK8A/edit?tab=t.0#heading=h.for1fb3tezo3>)
>>    - Notes:
>>       - Waiting for the proposal vote and spark side SPIP related to
>>         this.
>>    - Spark 4.0: Spark UDF POC (#14505
>>      <https://github.com/apache/iceberg/pull/14505>)
>>       - Huaxin to delete this PR. This version is hacky.
>>
>> - DDL and schema evolution
>>    - CREATE TABLE LIKE support (#14269
>>      <https://github.com/apache/iceberg/pull/14269>)
>>    - Notes:
>>       - Recommend not to add SQL extensions in Iceberg code anymore.
>>          - They are fragile and need maintenance and have to work
>>            well with Spark
>>       - Alternatively, consider writing a procedure to do this until
>>         Spark has native support.
>>       - Native Spark support for CREATE TABLE LIKE is not yet
>>         implemented.
>>
>> - Spark PRs
>>    - Alpha family aggregate support - #52551
>>      <https://github.com/apache/spark/pull/52551>
>>    - Notes:
>>       - Okay to have Spark only changes that can potentially help in
>>         Iceberg use-cases
>>       - Elaborate on the use of this? How does this integrate with
>>         Iceberg?
>>    - Codegen for MergeRowsExec - #52399
>>      <https://github.com/apache/spark/pull/52399>
>>    - Notes:
>>       - This is a heavily used Exec node in Iceberg so this is good
>>         to have.
>>       - The community will review this
>>
>> Thanks,
>> ~Anurag
>>
>> On Thu, Jan 15, 2026 at 6:48 PM Anton Okolnychyi <[email protected]> wrote:
>>
>>> If anyone has long-standing PRs related to Spark, it may be a good forum
>>> to get some reviews and help from the community.
>>>
>>> On Wed, Jan 14, 2026 at 11:23 AM Anurag Mantripragada <[email protected]> wrote:
>>>
>>>> Thanks Kevin,
>>>>
>>>> All, please review the doc
>>>> <https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0>
>>>> and add any agenda items I may have missed. See you on Tuesday.
>>>>
>>>> ~ Anurag
>>>>
>>>> On Wed, Jan 14, 2026 at 11:20 AM Kevin Liu <[email protected]> wrote:
>>>>
>>>>> Connected with Anurag on Slack. I just added a new event to the
>>>>> Iceberg Dev calendar for next week Tuesday Jan 20th from 10AM - 11AM PT,
>>>>> "*Iceberg - Spark Community Sync*".
>>>>> It's a monthly recurring meeting and the Google Meet link is open to
>>>>> the public.
>>>>> Happy to make changes based on feedback.
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>> On Wed, Jan 14, 2026 at 10:57 AM Kevin Liu <[email protected]> wrote:
>>>>>
>>>>>> Looking at the current Iceberg dev calendar schedule, we have a slot
>>>>>> next week Tuesday or Friday for a monthly recurring sync. Wednesday
>>>>>> corresponds with the main Community Sync in some weeks.
>>>>>> Please let me know the preferred day and time and I can help set it
>>>>>> up!
>>>>>>
>>>>>> Best,
>>>>>> Kevin Liu
>>>>>>
>>>>>> On Tue, Jan 13, 2026 at 10:58 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Kevin,
>>>>>>>
>>>>>>> I'm open to ideas, but I think we could start with a monthly cadence
>>>>>>> for Spark syncs and increase the frequency if the community feels we
>>>>>>> need to meet more often. Could you please set up a time on the
>>>>>>> Iceberg dev calendar?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anurag
>>>>>>>
>>>>>>> On Fri, Jan 9, 2026 at 10:16 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Anton and Kevin,
>>>>>>>>
>>>>>>>> I wrote a doc with general themes from the Spark PRs and Issues I
>>>>>>>> browsed in the repo. Please feel free to add more if I have missed
>>>>>>>> anything.
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0
>>>>>>>>
>>>>>>>> Looking forward to meeting you all and talking about all things
>>>>>>>> Spark!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anurag
>>>>>>>>
>>>>>>>> On Fri, Jan 9, 2026 at 10:03 AM Kevin Liu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> +1 great idea!
>>>>>>>>> Let's start a doc with potential discussion items and find a time
>>>>>>>>> on the calendar. I have permission to add events to the "iceberg dev
>>>>>>>>> events" calendar.
>>>>>>>>> Happy to help with the logistics once the time and
>>>>>>>>> cadence are decided.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kevin Liu
>>>>>>>>>
>>>>>>>>> On Wed, Jan 7, 2026 at 4:35 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> YES! I have been meaning to suggest the same.
>>>>>>>>>>
>>>>>>>>>> Can you start a doc with the pool of items to which everyone can
>>>>>>>>>> contribute?
>>>>>>>>>>
>>>>>>>>>> - Anton
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 7, 2026 at 3:30 PM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi folks, happy new year!
>>>>>>>>>>>
>>>>>>>>>>> (Sorry if I sent this email more than once; my attempts at
>>>>>>>>>>> sending this from a different email failed.)
>>>>>>>>>>>
>>>>>>>>>>> There are a few Spark changes the community is working on,
>>>>>>>>>>> including:
>>>>>>>>>>> - Sort order reporting [1], [2]
>>>>>>>>>>> - Spark 4.1 support [3]
>>>>>>>>>>> - Future of Datafusion-Comet support [4] [5]
>>>>>>>>>>>
>>>>>>>>>>> Community members interested in the Spark integration have been
>>>>>>>>>>> discussing it in smaller groups. However, we believe that the
>>>>>>>>>>> general community sync should include all updates, and discussing
>>>>>>>>>>> Spark-specific matters may not be the most effective use of that
>>>>>>>>>>> sync. I was wondering if it would be useful to create a
>>>>>>>>>>> Spark-Iceberg integration-specific sync on the calendar, similar
>>>>>>>>>>> to what we have for individual proposals. This sync will not
>>>>>>>>>>> replace the community sync, which will still be used for broader
>>>>>>>>>>> discussions, including any new Spark topics that come out of the
>>>>>>>>>>> Spark sync.
>>>>>>>>>>>
>>>>>>>>>>> If there’s interest in doing these Spark breakout syncs, I’m
>>>>>>>>>>> happy to volunteer to run them. Please let me know what you all
>>>>>>>>>>> think.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> ~ Anurag
>>>>>>>>>>>
>>>>>>>>>>> [1] - https://github.com/apache/iceberg/pull/14683
>>>>>>>>>>> [2] - https://github.com/apache/iceberg/pull/14948
>>>>>>>>>>> [3] - https://github.com/apache/iceberg/pull/14970
>>>>>>>>>>> [4] - https://github.com/apache/datafusion-comet/issues/2921
>>>>>>>>>>> [5] - https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd
