Thanks Wing. That makes sense. I see that the commit you linked changed the default Spark version from 3.5 to 4.0:

```
systemProp.defaultSparkVersions=4.0
```
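For anyone reproducing the build locally, here is a minimal sketch of how that property is typically exercised, assuming the standard Iceberg Gradle setup where `gradle.properties` supplies the default and a `-DsparkVersions` system property overrides it on the command line; the exact module coordinates and Scala suffixes below are illustrative, so check `settings.gradle` for the real ones:

```bash
# Build and test against the default Spark version (4.0 after the linked commit):
./gradlew :iceberg-spark:iceberg-spark-4.0_2.13:check

# Override the default and target the Spark 3.5 modules instead:
./gradlew -DsparkVersions=3.5 :iceberg-spark:iceberg-spark-3.5_2.12:check
```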
I did a pass to double-check that all the features we've implemented in Spark 3.5 are also available in Spark 4.0. I wanted to make sure that we have feature parity for the upcoming 1.10.0 release.

I'm happy to report that all the new features in 3.5 are also available in 4.0. Thank you to everyone for being diligent about backporting :) This check was made 1000% easier because of that.

If anyone's interested in the details, here's the list of commits I've looked at for both 3.5 and 4.0:
https://docs.google.com/spreadsheets/d/1TZ3Hi5gfUjg0fGf25Yb3p0wQCWYio4nylyKrQDow1JI/edit?gid=0#gid=0

This list was generated by running these commands on the latest main branch (a reusable wrapper for this check is sketched at the end of this message, below the quoted thread):

```
# in 3.5 but not in 4.0
git log --oneline -- spark/v3.5 | grep -vFf <(git log --oneline -- spark/v4.0 | cut -d' ' -f1)

# in 4.0 but not in 3.5
git log --oneline -- spark/v4.0 | grep -vFf <(git log --oneline -- spark/v3.5 | cut -d' ' -f1)
```

I've only looked at commits after "8353ac8f8 Spark: Copy back 4.0 as 3.5". Here's the doc where I match the commits by feature parity:
https://docs.google.com/document/d/1Ox3ptj6dGdI_8CTYqfl6SOGWNKbDb11x_xqkFf2pJvY/edit?tab=t.0

There are a few commits (2 in 3.5, 1 in 4.0) that only implement tests in a specific version. I'll work on adding the corresponding tests.

Also, a big thanks to Amogh for helping to verify this.

Best,
Kevin Liu

On Tue, Jul 15, 2025 at 12:12 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:

> Kevin,
>
> Just a minor clarification:
>
>> I want to point out that Spark 4.0 is in an interesting state right now. Spark 4.0 is not yet the "latest supported version" since Iceberg 1.10 should be the first version that works with Spark 4.0 according to #13162 <https://github.com/apache/iceberg/issues/13162#issuecomment-2912307091>. So before the next release, Spark 3.5 is the "latest supported version".
>
> By "latest supported version", I did not mean in released Iceberg; I meant in the branch under development. So once Huaxin added support for Spark 4.0 and Spark 4.0 was released (https://github.com/apache/iceberg/commit/b504f9c51c6c0e0a5c0c5ff53f295e69b67d8e59), the latest supported version became 4.0.
>
> - Wing Yew
>
> On Tue, Jul 15, 2025 at 11:44 AM Kevin Liu <kevinjq...@apache.org> wrote:
>
>> Thanks for the context, Wing and Anton!
>>
>> My main concern was around feature parity between the different Spark versions, especially when a feature is only implemented in an older version of Spark.
>>
>>> I believe the general practice is to implement a feature in the latest supported version (currently Spark 4.0). Once the PR is merged, the author may choose to backport the feature to older supported versions, but is not obligated to. If the author does not backport it, others who want the feature in an older version can choose to backport it.
>>
>> This is great! This process addresses my concern.
>>
>> I want to point out that Spark 4.0 is in an interesting state right now. Spark 4.0 is not yet the "latest supported version" since Iceberg 1.10 should be the first version that works with Spark 4.0 according to #13162 <https://github.com/apache/iceberg/issues/13162#issuecomment-2912307091>. So before the next release, Spark 3.5 is the "latest supported version". I'll do a pass to make sure that newly added features in Spark 3.5 are also available in Spark 4.0 so there's no discrepancy between the 2 versions.
>>
>>> Shall we be more aggressive with dropping old Spark versions if we feel the quality of those integrations is not on the expected level? For instance, can we deprecate 3.4 in the upcoming 1.10 release? Spark 3.4.0 was released in April of 2023 and is not maintained by Spark as of October of 2024. I can imagine maintaining 3.5 for a bit longer as it is the last 3.x release.
>>
>> That makes sense to me. I think we should keep at least 1 of the Spark 3.x versions. Not sure if there's value in keeping both 3.4 and 3.5. Let's start a separate thread to solicit some feedback from the community.
>>
>> Best,
>> Kevin Liu
>>
>> On Fri, Jul 11, 2025 at 11:38 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>
>>> I agree with what Wing Yew said. It has always been the agreement to actively develop against the latest supported version of Spark, and folks who are interested in older Spark versions can backport features of their interest. That said, we do try to fix correctness bugs and prevent corruptions across all maintained Spark versions.
>>>
>>> Shall we be more aggressive with dropping old Spark versions if we feel the quality of those integrations is not on the expected level? For instance, can we deprecate 3.4 in the upcoming 1.10 release? Spark 3.4.0 was released in April of 2023 and is not maintained by Spark as of October of 2024. I can imagine maintaining 3.5 for a bit longer as it is the last 3.x release.
>>>
>>> - Anton
>>>
>>> On Fri, Jul 11, 2025 at 11:13 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> Hi Kevin,
>>>>
>>>> I believe the general practice is to implement a feature in the latest supported version (currently Spark 4.0). Once the PR is merged, the author may choose to backport the feature to older supported versions, but is not obligated to. If the author does not backport it, others who want the feature in an older version can choose to backport it.
>>>>
>>>> Sometimes a change is simple enough that it makes sense to implement it for all supported versions at once (in one PR). In addition, if a change requires changes in core Iceberg that in turn require the same change in other Spark versions, the change is implemented for all Spark versions in one PR.
>>>>
>>>> Sometimes a feature depends on changes in the latest supported Spark version and so cannot be backported.
>>>>
>>>> Finally, sometimes a PR has already been in progress for a long time and the latest supported Spark version changes in the meantime. It may still get merged and then be forward-ported.
>>>>
>>>> I understand that your intent is to ensure that features/fixes that *can* be backported are backported. A diff of git logs by itself cannot tell you whether a missing change is portable or not. How and when do you propose to do this diff, and does the result of the diff cause anything to be blocked or any action to be taken? Do you perhaps envision this to be done as a kind of pre-release audit (with enough time to address missing backports)?
>>>>
>>>> - Wing Yew
>>>>
>>>> On Thu, Jul 10, 2025 at 6:43 PM Kevin Liu <kevinjq...@apache.org> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We currently maintain 3 different versions of Spark under https://github.com/apache/iceberg/tree/main/spark. I've seen this issue a couple of times where a feature would be implemented for only one of the Spark versions.
>>>>> For example, see https://github.com/apache/iceberg/pull/13324 and https://github.com/apache/iceberg/pull/13459. It's hard to remember that there are 3 different versions of Spark.
>>>>>
>>>>> Do we want to verify that features are implemented across all 3 versions where possible? If so, we can diff the git logs between Spark 3.4 <https://github.com/apache/iceberg/commits/main/spark/v3.4>, 3.5 <https://github.com/apache/iceberg/commits/main/spark/v3.5>, and 4.0 <https://github.com/apache/iceberg/commits/main/spark/v4.0>.
>>>>>
>>>>> Best,
>>>>> Kevin Liu
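As mentioned above, here is a sketch of the parity check wrapped into a reusable script. This is illustrative only, not something from the repo: the script name and the RANGE variable are made up for this sketch, and the filtering is the same `grep -vFf` / `cut` technique as the two commands earlier in this message, extended pairwise to all three maintained Spark trees.

```bash
#!/usr/bin/env bash
# spark-parity-check.sh (hypothetical name): pairwise-diff commit logs across
# the maintained Spark trees of an apache/iceberg checkout. Run from the repo root.
set -euo pipefail

# Optionally restrict the window, e.g. to commits after the copy-back commit:
#   RANGE="8353ac8f8..HEAD" ./spark-parity-check.sh
range="${RANGE:-HEAD}"

versions=(3.4 3.5 4.0)

for a in "${versions[@]}"; do
  for b in "${versions[@]}"; do
    [[ "$a" == "$b" ]] && continue
    echo "=== commits touching spark/v$a but not spark/v$b ==="
    # List one-line commits for one tree, then drop any whose hash also
    # appears in the other tree's log.
    git log --oneline "$range" -- "spark/v$a" \
      | grep -vFf <(git log --oneline "$range" -- "spark/v$b" | cut -d' ' -f1) \
      || true  # grep -v exits non-zero when every line is filtered out
  done
done
```

As Wing Yew notes above, the output is only a starting point: a commit that appears in one tree but not another may be intentionally unported, so each hit still needs a human judgment call.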