Hi all, I've created a PR to put the behavior change guideline on the Spark website: https://github.com/apache/spark-website/pull/518 . Please leave comments if you have any, thanks!
On Wed, May 15, 2024 at 1:41 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> Thanks all for the feedback here! Let me put up a new version, which
> clarifies the definition of "users":
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. The "user" here is not only the user who writes queries and/or
> develops Spark plugins, but also the user who deploys and/or manages Spark
> clusters. New features, and even bug fixes that eliminate NPEs or correct
> query results, are behavior changes. Things like performance improvements,
> code refactoring, and changes to unreleased APIs/features are not. All
> behavior changes should be called out in the PR description. We need to
> write an item in the migration guide (and probably add a legacy config) for
> those that may break users when upgrading:
>
> - Bug fixes that change query results. Users may need to do a backfill
>   to correct the existing data and must know about these correctness fixes.
> - Bug fixes that change query schemas. Users may need to update the
>   schemas of the tables in their data pipelines and must know about these
>   changes.
> - Removing configs
> - Renaming an error class/condition
> - Any non-additive change to the public Python/SQL/Scala/Java/R APIs
>   (including developer APIs): renaming a function, removing parameters,
>   adding parameters, renaming parameters, changing parameter default
>   values, etc. These changes should be avoided in general, or done in a
>   binary-compatible way, like deprecating a function and adding a new one
>   instead of renaming it.
> - Any non-additive change to the way Spark is deployed and managed.
>
> The list above is not meant to be exhaustive. Anyone can raise a concern
> when reviewing a PR and ask the author to add a migration guide entry if
> they believe the change is risky and may break users.
> On Thu, May 2, 2024 at 10:25 PM Will Raschkowski <wraschkow...@palantir.com> wrote:
>> To add some user perspective, I wanted to share our experience from
>> automatically upgrading tens of thousands of jobs from Spark 2 to 3 at
>> Palantir:
>>
>> We didn't mind "loud" changes that threw exceptions. We have some infra
>> to try running jobs with Spark 3 and fall back to Spark 2 if there's an
>> exception. E.g., the datetime parsing and rebasing migration in Spark 3 was
>> great: Spark threw a helpful exception but never silently changed results.
>> Similarly, for things listed in the migration guide as silent changes
>> (e.g., add_months's handling of the last day of the month), we wrote custom
>> check rules to throw unless users acknowledged the change through a config.
>>
>> Silent changes *not* in the migration guide were really bad for us:
>> Trusting the migration guide to be exhaustive, we automatically upgraded
>> jobs which then "succeeded" but wrote incorrect results. For example, some
>> expression increased timestamp precision in Spark 3; a query implicitly
>> relied on the reduced precision, and then produced bad results on upgrade.
>> It's a silly query, but a note in the migration guide would have helped.
>>
>> To summarize: the migration guide was invaluable, we appreciated every
>> entry, and we'd appreciate Wenchen's stricter definition of "behavior
>> changes" (especially for silent ones).
>>
>> *From:* Nimrod Ofek <ofek.nim...@gmail.com>
>> *Date:* Thursday, 2 May 2024 at 11:57
>> *To:* Wenchen Fan <cloud0...@gmail.com>
>> *Cc:* Erik Krogen <xkro...@apache.org>, Spark dev list <dev@spark.apache.org>
>> *Subject:* Re: [DISCUSS] clarify the definition of behavior changes
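[Editor's aside: the "custom check rule" approach Will describes above might look roughly like this. The function names and config key are invented; Palantir's actual infrastructure is not public.]

```python
# Hypothetical check rule: before running a job on the new Spark version,
# refuse any query that uses a function whose silent behavior changed,
# unless the user has explicitly acknowledged the change through a config.

ACK_KEY = "myorg.acknowledged.addMonthsChange"  # hypothetical config key

def check_silent_changes(functions_used: set[str], conf: dict[str, str]) -> None:
    """Fail loudly instead of letting a silent behavior change through."""
    if "add_months" in functions_used and conf.get(ACK_KEY) != "true":
        raise RuntimeError(
            "add_months handles the last day of the month differently in "
            f"Spark 3; review the migration guide, then set {ACK_KEY}=true."
        )
```

Jobs that never call the affected function pass the check; affected jobs fail the upgrade attempt until someone reads the migration guide entry and opts in, which turns a silent result change into a loud one.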
>> Hi Erik and Wenchen,
>>
>> I think a good practice with public APIs, and with internal APIs that
>> have a big impact and a lot of usage, is to ease in changes by giving new
>> parameters defaults that keep the former behaviour, keeping a method with
>> the previous signature under a deprecation notice, and deleting that
>> deprecated method in the next release. The actual break then happens one
>> release later, after all libraries have had the chance to align with the
>> new API, and upgrades can be done while already using the new version.
>>
>> Another thing is that we should probably examine which private APIs are
>> used externally and provide proper public APIs to meet those needs (for
>> instance, application-level metrics and some way of creating custom
>> behaviour columns), to give a better experience.
>>
>> Thanks,
>> Nimrod
>>
>> On Thu, May 2, 2024 at 03:51, Wenchen Fan <cloud0...@gmail.com> wrote:
>> Hi Erik,
>>
>> Thanks for sharing your thoughts! Note: developer APIs are also public
>> APIs (such as the Data Source V2 API, the Spark Listener API, etc.), so
>> breaking changes should be avoided as much as we can, and new APIs should
>> be mentioned in the release notes. Breaking binary compatibility is also a
>> "functional change" and should be treated as a behavior change.
>>
>> BTW, AFAIK some downstream libraries use private APIs such as Catalyst
>> Expression and LogicalPlan. It's too much work to track all the changes to
>> private APIs, and I think it's the downstream library's responsibility to
>> check for such changes in new Spark versions, or to avoid using private
>> APIs. Exceptions can happen if certain private APIs are used too widely,
>> and we should avoid breaking them.
>>
>> Thanks,
>> Wenchen
>>
>> On Wed, May 1, 2024 at 11:51 PM Erik Krogen <xkro...@apache.org> wrote:
>> Thanks for raising this important discussion Wenchen!
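[Editor's aside: the deprecate-then-delete practice Nimrod describes can be sketched as below. The function names are invented for illustration, not real Spark APIs.]

```python
# Hypothetical sketch: add new capability via a parameter that defaults to
# the former behavior, and keep the old signature as a deprecated shim for
# one release so downstream libraries get a full release cycle to migrate.
import warnings

def load_table(path: str, *, infer_schema: bool = False) -> dict:
    """New API. The added parameter defaults to the former behavior,
    so existing call sites are unaffected."""
    return {"path": path, "infer_schema": infer_schema}

def loadTable(path: str) -> dict:
    """Old name, kept with the previous signature under a deprecation
    notice, then deleted in the next release."""
    warnings.warn(
        "loadTable is deprecated; use load_table instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return load_table(path)
```

With this staging, the actual break lands one release after the replacement appears, which is exactly the window Nimrod argues libraries need.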
>> Two points I would like to raise, though I'm fully supportive of any
>> improvements in this regard, my points below notwithstanding -- I am not
>> intending to let perfect be the enemy of good here.
>>
>> On a similar note as Santosh's comment, we should consider how this
>> relates to developer APIs. Let's say I am an end user relying on some
>> library like frameless <https://github.com/typelevel/frameless>, which
>> relies on developer APIs in Spark. When we make a change to Spark's
>> developer APIs that requires a corresponding change in frameless, I don't
>> directly see that change as an end user, but it *does* impact me, because
>> now I have to upgrade to a new version of frameless that supports those
>> new changes. This can have ripple effects across the ecosystem. Should we
>> call out such changes so that end users understand the potential impact on
>> libraries they use?
>>
>> Second point: what about binary compatibility? Currently our versioning
>> policy says "Link-level compatibility is something we'll try to guarantee
>> in future releases." (FWIW, it has said this since at least 2016
>> <https://web.archive.org/web/20161127193643/https://spark.apache.org/versioning-policy.html>...)
>> One step towards this would be to clearly call out any binary-incompatible
>> changes in our release notes, to help users understand whether they may be
>> impacted.
>> Similar to my first point, this has ripple effects across the
>> ecosystem -- if I just use Spark itself, recompiling is probably not a big
>> deal, but if I use N libraries that each depend on Spark, then after a
>> binary-incompatible change is made I have to wait for all N libraries to
>> publish new compatible versions before I can upgrade myself, presenting a
>> nontrivial barrier to adoption.
>>
>> On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
>> <santosh.ping...@adyen.com.invalid> wrote:
>> Thanks Wenchen for starting this!
>>
>> How do we define "the user" for Spark?
>> 1. End users: some users use Spark as a service from a provider.
>> 2. Providers/operators: some users provide Spark as a service for their
>>    internal (on-prem setups with YARN/K8s) or external (something like
>>    EMR) customers.
>> 3. ?
>>
>> Perhaps we need to consider infrastructure behavior changes as well, to
>> accommodate the second group of users.
>>
>> On 1 May 2024, at 06:08, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>> Hi all,
>>
>> It's exciting to see innovation keep happening in the Spark community as
>> Spark keeps evolving. To make these innovations available to more users,
>> it's important to help users upgrade to newer Spark versions easily. We've
>> done a good job on this: the PR template requires the author to write down
>> user-facing behavior changes, and the migration guide contains behavior
>> changes that need attention from users. Sometimes behavior changes come
>> with a legacy config to restore the old behavior. However, we still lack a
>> clear definition of behavior changes, so I propose the following
>> definition:
>>
>> Behavior changes mean user-visible functional changes in a new release
>> via public APIs. This means new features, and even bug fixes that
>> eliminate NPEs or correct query results, are behavior changes.
Things like performance >> improvement, code refactoring, and changes to unreleased APIs/features are >> not. All behavior changes should be called out in the PR description. We >> need to write an item in the migration guide (and probably legacy config) >> for those that may break users when upgrading: >> >> - Bug fixes that change query results. Users may need to do backfill >> to correct the existing data and must know about these correctness fixes. >> - Bug fixes that change query schema. Users may need to update the >> schema of the tables in their data pipelines and must know about these >> changes. >> - Remove configs >> - Rename error class/condition >> - Any change to the public Python/SQL/Scala/Java/R APIs: rename >> function, remove parameters, add parameters, rename parameters, change >> parameter default values, etc. These changes should be avoided in general, >> or do it in a compatible way like deprecating and adding a new function >> instead of renaming. >> >> Once we reach a conclusion, I'll document it in >> https://spark.apache.org/versioning-policy.html >> [spark.apache.org] >> <https://urldefense.com/v3/__https:/spark.apache.org/versioning-policy.html__;!!NkS9JGVQ2sDq!-aEkNYlil5TIzBQLHrkoCO3btFfp6ZE7SY2qMoTmWqr5T6oi-kay5SS5goSRSeM0SA0IYPk0YFcqoU59xUTjy2gb$> >> . >> >> >> >> Thanks, >> >> Wenchen >> >> >> >>