To add some user perspective, I wanted to share our experience automatically upgrading tens of thousands of jobs from Spark 2 to Spark 3 at Palantir:
We didn't mind "loud" changes that threw exceptions. We have some infra to try running jobs with Spark 3 and fall back to Spark 2 if there's an exception. For example, the datetime parsing and rebasing migration in Spark 3 was great: Spark threw a helpful exception but never silently changed results. Similarly, for things listed in the migration guide as silent changes (e.g., add_months's handling of the last day of the month), we wrote custom check rules that throw unless users acknowledge the change through a config.

Silent changes not in the migration guide were really bad for us: trusting the migration guide to be exhaustive, we automatically upgraded jobs which then "succeeded" but wrote incorrect results. For example, some expression increased timestamp precision in Spark 3; a query implicitly relied on the reduced precision and produced bad results on upgrade. It's a silly query, but a note in the migration guide would have helped.

To summarize: the migration guide was invaluable, we appreciated every entry, and we'd appreciate Wenchen's stricter definition of "behavior changes" (especially for silent ones).

From: Nimrod Ofek <ofek.nim...@gmail.com>
Date: Thursday, 2 May 2024 at 11:57
To: Wenchen Fan <cloud0...@gmail.com>
Cc: Erik Krogen <xkro...@apache.org>, Spark dev list <dev@spark.apache.org>
Subject: Re: [DISCUSS] clarify the definition of behavior changes
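The add_months change mentioned above is the Spark 3.0 behavior where the result is no longer adjusted to the last day of the month when the input was a last day. A minimal pure-Python sketch of the two semantics (an illustration only, not Spark's actual implementation):

```python
from calendar import monthrange
from datetime import date


def _shift(d: date, n: int):
    # Compute the target (year, month) after adding n months.
    y, m = divmod(d.month - 1 + n, 12)
    return d.year + y, m + 1


def add_months_spark2(d: date, n: int) -> date:
    # Spark 2.x semantics: if d is the last day of its month,
    # the result is pinned to the last day of the target month.
    y, m = _shift(d, n)
    last = monthrange(y, m)[1]
    if d.day == monthrange(d.year, d.month)[1]:
        return date(y, m, last)
    return date(y, m, min(d.day, last))


def add_months_spark3(d: date, n: int) -> date:
    # Spark 3.x semantics: keep the day-of-month, clamping only
    # when the target month is shorter.
    y, m = _shift(d, n)
    return date(y, m, min(d.day, monthrange(y, m)[1]))


print(add_months_spark2(date(2019, 2, 28), 1))  # 2019-03-31
print(add_months_spark3(date(2019, 2, 28), 1))  # 2019-03-28
```

Both calls succeed, which is exactly why this kind of change is "silent": only a check rule or a migration guide entry surfaces it.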
Hi Erik and Wenchen,

I think a good practice with public APIs (and with internal APIs that have big impact and a lot of usage) is to ease in changes: give new parameters defaults that keep the former behavior, keep a method with the previous signature under a deprecation notice, and delete that deprecated method in the next release. The actual break then happens one release later, after all libraries have had the chance to align with the API, and upgrades can be done while already using the new version.

Another thing: we should probably examine which private APIs are used externally, so we can provide a better experience and proper public APIs to meet those needs (for instance, applicative metrics and some way of creating custom-behavior columns).

Thanks,
Nimrod

On Thu, 2 May 2024 at 03:51, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi Erik,

Thanks for sharing your thoughts! Note: developer APIs are also public APIs (such as the Data Source V2 API, the Spark Listener API, etc.), so breaking changes should be avoided as much as we can, and new APIs should be mentioned in the release notes. Breaking binary compatibility is also a "functional change" and should be treated as a behavior change.

BTW, AFAIK some downstream libraries use private APIs such as Catalyst Expression and LogicalPlan. It's too much work to track all the changes to private APIs, and I think it's the downstream library's responsibility to check such changes in new Spark versions, or to avoid using private APIs. Exceptions can happen if certain private APIs are used too widely, and we should avoid breaking them.

Thanks,
Wenchen

On Wed, May 1, 2024 at 11:51 PM Erik Krogen <xkro...@apache.org> wrote:

Thanks for raising this important discussion Wenchen!
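Nimrod's ease-in pattern (new parameters with behavior-preserving defaults, old signature kept under a deprecation notice for one release) can be sketched in a few lines. The class and method names here are hypothetical, purely for illustration:

```python
import warnings


class Reader:
    """Hypothetical library class illustrating the deprecation pattern."""

    def load(self, path, strict=False):
        # New signature: 'strict' defaults to False, preserving the
        # former behavior, so existing callers are unaffected.
        return (path, strict)

    def load_lenient(self, path):
        # Previous entry point, kept for one release with a
        # deprecation notice, to be deleted in the release after.
        warnings.warn(
            "load_lenient() is deprecated; use load() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.load(path, strict=False)
```

Callers see the warning for a full release cycle before the old entry point disappears, so downstream libraries can migrate on their own schedule.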
Two points I would like to raise, though I'm fully supportive of any improvements in this regard, my points below notwithstanding -- I am not intending to let perfect be the enemy of good here.

On a similar note as Santosh's comment, we should consider how this relates to developer APIs. Say I am an end user relying on a library like frameless (https://github.com/typelevel/frameless), which relies on developer APIs in Spark. When we make a change to Spark's developer APIs that requires a corresponding change in frameless, I don't directly see that change as an end user, but it does impact me, because now I have to upgrade to a new version of frameless that supports those changes. This can have ripple effects across the ecosystem. Should we call out such changes so that end users understand the potential impact on the libraries they use?

Second point: what about binary compatibility? Currently our versioning policy says "Link-level compatibility is something we'll try to guarantee in future releases." (FWIW, it has said this since at least 2016: https://web.archive.org/web/20161127193643/https://spark.apache.org/versioning-policy.html ...) One step towards this would be to clearly call out any binary-incompatible changes in our release notes, to help users understand whether they may be impacted.
Similar to my first point, this has ripple effects across the ecosystem -- if I just use Spark itself, recompiling is probably not a big deal, but if I use N libraries that each depend on Spark, then after a binary-incompatible change I have to wait for all N libraries to publish new compatible versions before I can upgrade myself, presenting a nontrivial barrier to adoption.

On Wed, May 1, 2024 at 8:18 AM Santosh Pingale <santosh.ping...@adyen.com.invalid> wrote:

Thanks Wenchen for starting this! How do we define "the user" for Spark?

1. End users: some users consume Spark as a service from a provider.
2. Providers/Operators: some users provide Spark as a service for their internal (on-prem setups with YARN/K8s) or external (something like EMR) customers.
3. ?

Perhaps we need to consider infrastructure behavior changes as well, to accommodate the second group of users.

On 1 May 2024, at 06:08, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi all,

It's exciting to see innovations keep happening in the Spark community as Spark keeps evolving. To make these innovations available to more users, it's important to help users upgrade to newer Spark versions easily. We've done a good job on this: the PR template requires the author to write down user-facing behavior changes, and the migration guide contains behavior changes that need attention from users. Sometimes behavior changes come with a legacy config to restore the old behavior.

However, we still lack a clear definition of behavior changes, and I propose the following definition: behavior changes mean user-visible functional changes in a new release via public APIs. This means new features, and even bug fixes that eliminate an NPE or correct query results, are behavior changes. Things like performance improvements, code refactoring, and changes to unreleased APIs/features are not. All behavior changes should be called out in the PR description.
We need to write an item in the migration guide (and probably add a legacy config) for those changes that may break users when upgrading:

* Bug fixes that change query results. Users may need to backfill to correct the existing data and must know about these correctness fixes.
* Bug fixes that change query schemas. Users may need to update the schema of the tables in their data pipelines and must know about these changes.
* Removing configs.
* Renaming an error class/condition.
* Any change to the public Python/SQL/Scala/Java/R APIs: renaming a function, removing parameters, adding parameters, renaming parameters, changing parameter default values, etc. These changes should be avoided in general, or done in a compatible way, such as deprecating one function and adding a new one instead of renaming it.

Once we reach a conclusion, I'll document it in https://spark.apache.org/versioning-policy.html .

Thanks,
Wenchen
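As one concrete instance of the legacy-config mechanism discussed in this thread: Spark 3.0's datetime parsing change ships with spark.sql.legacy.timeParserPolicy, whose default (EXCEPTION) fails loudly and whose LEGACY value restores Spark 2 parsing. A hedged sketch of a try-then-fall-back upgrade run, in the spirit of the infra Palantir describes (my_job.py is a hypothetical job, not a real artifact from this thread):

```shell
# First attempt: run under Spark 3 defaults; the EXCEPTION policy
# makes the datetime parsing change loud rather than silent.
spark-submit my_job.py ||
# Fallback: acknowledge the change explicitly and restore the
# Spark 2 parsing behavior via the legacy config.
spark-submit --conf spark.sql.legacy.timeParserPolicy=LEGACY my_job.py
```

The value of the loud default is visible here: the pipeline either succeeds with the new behavior or forces a deliberate, auditable opt-in to the old one.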