Hi all, I've created a PR to put the behavior change guideline on the Spark website: https://github.com/apache/spark-website/pull/518 . Please leave comments if you have any, thanks!
On Wed, May 15, 2024 at 1:41 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> Thanks all for the feedback here! Let me put up a new version, which
> clarifies the definition of "users":
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. The "user" here is not only the user who writes queries and/or
> develops Spark plugins, but also the user who deploys and/or manages Spark
> clusters. New features, and even bug fixes that eliminate NPEs or correct
> query results, are behavior changes. Things like performance improvements,
> code refactoring, and changes to unreleased APIs/features are not. All
> behavior changes should be called out in the PR description. We need to
> write an item in the migration guide (and probably add a legacy config) for
> those that may break users when upgrading:
>
> - Bug fixes that change query results. Users may need to do a backfill
>   to correct the existing data and must know about these correctness fixes.
> - Bug fixes that change query schemas. Users may need to update the
>   schemas of the tables in their data pipelines and must know about these
>   changes.
> - Removing configs
> - Renaming an error class/condition
> - Any non-additive change to the public Python/SQL/Scala/Java/R APIs
>   (including developer APIs): renaming a function, removing parameters,
>   adding parameters, renaming parameters, changing parameter default
>   values, etc. These changes should be avoided in general, or done in a
>   binary-compatible way, like deprecating a function and adding a new one
>   instead of renaming it.
> - Any non-additive change to the way Spark is deployed and managed.
>
> The list above is not meant to be exhaustive. Anyone can raise a concern
> when reviewing a PR and ask the author to add a migration guide entry if
> they believe the change is risky and may break users.
> On Thu, May 2, 2024 at 10:25 PM Will Raschkowski <wraschkow...@palantir.com> wrote:
>> To add some user perspective, I wanted to share our experience from
>> automatically upgrading tens of thousands of jobs from Spark 2 to 3 at
>> Palantir:
>>
>> We didn't mind "loud" changes that threw exceptions. We have some infra
>> to try running jobs with Spark 3 and fall back to Spark 2 if there's an
>> exception. E.g., the datetime parsing and rebasing migration in Spark 3 was
>> great: Spark threw a helpful exception but never silently changed results.
>> Similarly, for things listed in the migration guide as silent changes
>> (e.g., add_months's handling of the last day of the month), we wrote custom
>> check rules to throw unless users acknowledged the change through a config.
>>
>> Silent changes *not* in the migration guide were really bad for us:
>> Trusting the migration guide to be exhaustive, we automatically upgraded
>> jobs which then "succeeded" but wrote incorrect results. For example, some
>> expression increased timestamp precision in Spark 3; a query implicitly
>> relied on the reduced precision, and then produced bad results on upgrade.
>> It's a silly query, but a note in the migration guide would have helped.
>>
>> To summarize: the migration guide was invaluable, we appreciated every
>> entry, and we'd appreciate Wenchen's stricter definition of "behavior
>> changes" (especially for silent ones).
>>
>> *From:* Nimrod Ofek <ofek.nim...@gmail.com>
>> *Date:* Thursday, 2 May 2024 at 11:57
>> *To:* Wenchen Fan <cloud0...@gmail.com>
>> *Cc:* Erik Krogen <xkro...@apache.org>, Spark dev list <dev@spark.apache.org>
>> *Subject:* Re: [DISCUSS] clarify the definition of behavior changes
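[Editor's aside: the "custom check rule" approach Will describes above might look roughly like this. The function names and config key are invented; Palantir's actual infrastructure is not public.]

```python
# Hypothetical check rule: before running a job on the new Spark version,
# refuse any query that uses a function whose silent behavior changed,
# unless the user has explicitly acknowledged the change through a config.

ACK_KEY = "myorg.acknowledged.addMonthsChange"  # hypothetical config key

def check_silent_changes(functions_used: set[str], conf: dict[str, str]) -> None:
    """Fail loudly instead of letting a silent behavior change through."""
    if "add_months" in functions_used and conf.get(ACK_KEY) != "true":
        raise RuntimeError(
            "add_months handles the last day of the month differently in "
            f"Spark 3; review the migration guide, then set {ACK_KEY}=true."
        )
```

Jobs that never call the affected function pass the check; affected jobs fail the upgrade attempt until someone reads the migration guide entry and opts in, which turns a silent result change into a loud one.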
>> Hi Erik and Wenchen,
>>
>> I think a good practice with public APIs, and with internal APIs that
>> have a big impact and a lot of usage, is to ease in changes by giving new
>> parameters defaults that keep the former behaviour, keeping a method with
>> the previous signature under a deprecation notice, and deleting that
>> deprecated method in the next release. The actual break then happens one
>> release later, after all libraries have had the chance to align with the
>> new API, and upgrades can be done while already using the new version.
>>
>> Another thing is that we should probably examine which private APIs are
>> used externally and provide proper public APIs to meet those needs (for
>> instance, application-level metrics and some way of creating custom
>> behaviour columns), to give a better experience.
>>
>> Thanks,
>> Nimrod
>>
>> On Thu, May 2, 2024 at 03:51, Wenchen Fan <cloud0...@gmail.com> wrote:
>> Hi Erik,
>>
>> Thanks for sharing your thoughts! Note: developer APIs are also public
>> APIs (such as the Data Source V2 API, the Spark Listener API, etc.), so
>> breaking changes should be avoided as much as we can, and new APIs should
>> be mentioned in the release notes. Breaking binary compatibility is also a
>> "functional change" and should be treated as a behavior change.
>>
>> BTW, AFAIK some downstream libraries use private APIs such as Catalyst
>> Expression and LogicalPlan. It's too much work to track all the changes to
>> private APIs, and I think it's the downstream library's responsibility to
>> check for such changes in new Spark versions, or to avoid using private
>> APIs. Exceptions can happen if certain private APIs are used too widely,
>> and we should avoid breaking them.
>>
>> Thanks,
>> Wenchen
>>
>> On Wed, May 1, 2024 at 11:51 PM Erik Krogen <xkro...@apache.org> wrote:
>> Thanks for raising this important discussion Wenchen!
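[Editor's aside: the deprecate-then-delete practice Nimrod describes can be sketched as below. The function names are invented for illustration, not real Spark APIs.]

```python
# Hypothetical sketch: add new capability via a parameter that defaults to
# the former behavior, and keep the old signature as a deprecated shim for
# one release so downstream libraries get a full release cycle to migrate.
import warnings

def load_table(path: str, *, infer_schema: bool = False) -> dict:
    """New API. The added parameter defaults to the former behavior,
    so existing call sites are unaffected."""
    return {"path": path, "infer_schema": infer_schema}

def loadTable(path: str) -> dict:
    """Old name, kept with the previous signature under a deprecation
    notice, then deleted in the next release."""
    warnings.warn(
        "loadTable is deprecated; use load_table instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return load_table(path)
```

With this staging, the actual break lands one release after the replacement appears, which is exactly the window Nimrod argues libraries need.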
>> Two points I would like to raise, though I'm fully supportive of any
>> improvements in this regard, my points below notwithstanding -- I am not
>> intending to let perfect be the enemy of good here.
>>
>> On a similar note as Santosh's comment, we should consider how this
>> relates to developer APIs. Let's say I am an end user relying on some
>> library like frameless <https://github.com/typelevel/frameless>, which
>> relies on developer APIs in Spark. When we make a change to Spark's
>> developer APIs that requires a corresponding change in frameless, I don't
>> directly see that change as an end user, but it *does* impact me, because
>> now I have to upgrade to a new version of frameless that supports those
>> new changes. This can have ripple effects across the ecosystem. Should we
>> call out such changes so that end users understand the potential impact on
>> libraries they use?
>>
>> Second point: what about binary compatibility? Currently our versioning
>> policy says "Link-level compatibility is something we'll try to guarantee
>> in future releases." (FWIW, it has said this since at least 2016
>> <https://web.archive.org/web/20161127193643/https://spark.apache.org/versioning-policy.html>...)
>> One step towards this would be to clearly call out any binary-incompatible
>> changes in our release notes, to help users understand whether they may be
>> impacted.
>> Similar to my first point, this has ripple effects across the
>> ecosystem -- if I just use Spark itself, recompiling is probably not a big
>> deal, but if I use N libraries that each depend on Spark, then after a
>> binary-incompatible change is made I have to wait for all N libraries to
>> publish new compatible versions before I can upgrade myself, presenting a
>> nontrivial barrier to adoption.
>>
>> On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
>> <santosh.ping...@adyen.com.invalid> wrote:
>> Thanks Wenchen for starting this!
>>
>> How do we define "the user" for Spark?
>> 1. End users: some users use Spark as a service from a provider.
>> 2. Providers/operators: some users provide Spark as a service for their
>>    internal (on-prem setups with YARN/K8s) or external (something like
>>    EMR) customers.
>> 3. ?
>>
>> Perhaps we need to consider infrastructure behavior changes as well, to
>> accommodate the second group of users.
>>
>> On 1 May 2024, at 06:08, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>> Hi all,
>>
>> It's exciting to see innovation keep happening in the Spark community as
>> Spark keeps evolving. To make these innovations available to more users,
>> it's important to help users upgrade to newer Spark versions easily. We've
>> done a good job on this: the PR template requires the author to write down
>> user-facing behavior changes, and the migration guide contains behavior
>> changes that need attention from users. Sometimes behavior changes come
>> with a legacy config to restore the old behavior. However, we still lack a
>> clear definition of behavior changes, so I propose the following
>> definition:
>>
>> Behavior changes mean user-visible functional changes in a new release
>> via public APIs. This means new features, and even bug fixes that
>> eliminate NPEs or correct query results, are behavior changes.
Things like performance >> improvement, code refactoring, and changes to unreleased APIs/features are >> not. All behavior changes should be called out in the PR description. We >> need to write an item in the migration guide (and probably legacy config) >> for those that may break users when upgrading: >> >> - Bug fixes that change query results. Users may need to do backfill >> to correct the existing data and must know about these correctness fixes. >> - Bug fixes that change query schema. Users may need to update the >> schema of the tables in their data pipelines and must know about these >> changes. >> - Remove configs >> - Rename error class/condition >> - Any change to the public Python/SQL/Scala/Java/R APIs: rename >> function, remove parameters, add parameters, rename parameters, change >> parameter default values, etc. These changes should be avoided in general, >> or do it in a compatible way like deprecating and adding a new function >> instead of renaming. >> >> Once we reach a conclusion, I'll document it in >> https://spark.apache.org/versioning-policy.html >> [spark.apache.org] >> <https://urldefense.com/v3/__https:/spark.apache.org/versioning-policy.html__;!!NkS9JGVQ2sDq!-aEkNYlil5TIzBQLHrkoCO3btFfp6ZE7SY2qMoTmWqr5T6oi-kay5SS5goSRSeM0SA0IYPk0YFcqoU59xUTjy2gb$> >> . >> >> >> >> Thanks, >> >> Wenchen >> >> >> >>