To add some user perspective, I wanted to share our experience automatically upgrading tens of thousands of jobs from Spark 2 to Spark 3 at Palantir:
We didn't mind "loud" changes that threw exceptions. We have some infra to try running jobs with Spark 3 and fall back to Spark 2 if there's an exception. For example, the datetime parsing and rebasing migration in Spark 3 was great: Spark threw a helpful exception but never silently changed results. Similarly, for things listed in the migration guide as silent changes (e.g., add_months's handling of the last day of the month), we wrote custom check rules that throw unless users acknowledge the change through a config.

Silent changes not in the migration guide were really bad for us: trusting the migration guide to be exhaustive, we automatically upgraded jobs which then "succeeded" but wrote incorrect results. For example, some expression increased timestamp precision in Spark 3; a query implicitly relied on the reduced precision and produced bad results on upgrade. It's a silly query, but a note in the migration guide would have helped.

To summarize: the migration guide was invaluable, we appreciated every entry, and we'd appreciate Wenchen's stricter definition of "behavior changes" (especially for silent ones).

From: Nimrod Ofek <ofek.nim...@gmail.com>
Date: Thursday, 2 May 2024 at 11:57
To: Wenchen Fan <cloud0...@gmail.com>
Cc: Erik Krogen <xkro...@apache.org>, Spark dev list <dev@spark.apache.org>
Subject: Re: [DISCUSS] clarify the definition of behavior changes
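The add_months change mentioned above is the Spark 3.0 behavior where the result is no longer adjusted to the last day of the month when the input was a last day. A minimal pure-Python sketch of the two semantics (an illustration only, not Spark's actual implementation):

```python
from calendar import monthrange
from datetime import date


def _shift(d: date, n: int):
    # Compute the target (year, month) after adding n months.
    y, m = divmod(d.month - 1 + n, 12)
    return d.year + y, m + 1


def add_months_spark2(d: date, n: int) -> date:
    # Spark 2.x semantics: if d is the last day of its month,
    # the result is pinned to the last day of the target month.
    y, m = _shift(d, n)
    last = monthrange(y, m)[1]
    if d.day == monthrange(d.year, d.month)[1]:
        return date(y, m, last)
    return date(y, m, min(d.day, last))


def add_months_spark3(d: date, n: int) -> date:
    # Spark 3.x semantics: keep the day-of-month, clamping only
    # when the target month is shorter.
    y, m = _shift(d, n)
    return date(y, m, min(d.day, monthrange(y, m)[1]))


print(add_months_spark2(date(2019, 2, 28), 1))  # 2019-03-31
print(add_months_spark3(date(2019, 2, 28), 1))  # 2019-03-28
```

Both calls succeed, which is exactly why this kind of change is "silent": only a check rule or a migration guide entry surfaces it.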
Hi Erik and Wenchen,

I think a good practice with public APIs (and with internal APIs that have big impact and a lot of usage) is to ease in changes: give new parameters defaults that keep the former behavior, keep a method with the previous signature under a deprecation notice, and delete that deprecated method in the next release. The actual break then happens one release later, after all libraries have had the chance to align with the API, and upgrades can be done while already using the new version.

Another thing: we should probably examine which private APIs are used externally, so we can provide a better experience and proper public APIs to meet those needs (for instance, applicative metrics and some way of creating custom-behavior columns).

Thanks,
Nimrod

On Thu, 2 May 2024 at 03:51, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi Erik,

Thanks for sharing your thoughts! Note: developer APIs are also public APIs (such as the Data Source V2 API, the Spark Listener API, etc.), so breaking changes should be avoided as much as we can, and new APIs should be mentioned in the release notes. Breaking binary compatibility is also a "functional change" and should be treated as a behavior change.

BTW, AFAIK some downstream libraries use private APIs such as Catalyst Expression and LogicalPlan. It's too much work to track all the changes to private APIs, and I think it's the downstream library's responsibility to check such changes in new Spark versions, or to avoid using private APIs. Exceptions can happen if certain private APIs are used too widely, and we should avoid breaking them.

Thanks,
Wenchen

On Wed, May 1, 2024 at 11:51 PM Erik Krogen <xkro...@apache.org> wrote:

Thanks for raising this important discussion Wenchen!
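Nimrod's ease-in pattern (new parameters with behavior-preserving defaults, old signature kept under a deprecation notice for one release) can be sketched in a few lines. The class and method names here are hypothetical, purely for illustration:

```python
import warnings


class Reader:
    """Hypothetical library class illustrating the deprecation pattern."""

    def load(self, path, strict=False):
        # New signature: 'strict' defaults to False, preserving the
        # former behavior, so existing callers are unaffected.
        return (path, strict)

    def load_lenient(self, path):
        # Previous entry point, kept for one release with a
        # deprecation notice, to be deleted in the release after.
        warnings.warn(
            "load_lenient() is deprecated; use load() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.load(path, strict=False)
```

Callers see the warning for a full release cycle before the old entry point disappears, so downstream libraries can migrate on their own schedule.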
Two points I would like to raise, though I'm fully supportive of any improvements in this regard, my points below notwithstanding -- I am not intending to let perfect be the enemy of good here.

On a similar note as Santosh's comment, we should consider how this relates to developer APIs. Say I am an end user relying on a library like frameless (https://github.com/typelevel/frameless), which relies on developer APIs in Spark. When we make a change to Spark's developer APIs that requires a corresponding change in frameless, I don't directly see that change as an end user, but it does impact me, because now I have to upgrade to a new version of frameless that supports those changes. This can have ripple effects across the ecosystem. Should we call out such changes so that end users understand the potential impact on the libraries they use?

Second point: what about binary compatibility? Currently our versioning policy says "Link-level compatibility is something we'll try to guarantee in future releases." (FWIW, it has said this since at least 2016: https://web.archive.org/web/20161127193643/https://spark.apache.org/versioning-policy.html ...) One step towards this would be to clearly call out any binary-incompatible changes in our release notes, to help users understand whether they may be impacted.
Similar to my first point, this has ripple effects across the ecosystem -- if I just use Spark itself, recompiling is probably not a big deal, but if I use N libraries that each depend on Spark, then after a binary-incompatible change I have to wait for all N libraries to publish new compatible versions before I can upgrade myself, presenting a nontrivial barrier to adoption.

On Wed, May 1, 2024 at 8:18 AM Santosh Pingale <santosh.ping...@adyen.com.invalid> wrote:

Thanks Wenchen for starting this! How do we define "the user" for Spark?

1. End users: some users consume Spark as a service from a provider.
2. Providers/Operators: some users provide Spark as a service for their internal (on-prem setups with YARN/K8s) or external (something like EMR) customers.
3. ?

Perhaps we need to consider infrastructure behavior changes as well, to accommodate the second group of users.

On 1 May 2024, at 06:08, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi all,

It's exciting to see innovations keep happening in the Spark community as Spark keeps evolving. To make these innovations available to more users, it's important to help users upgrade to newer Spark versions easily. We've done a good job on this: the PR template requires the author to write down user-facing behavior changes, and the migration guide contains behavior changes that need attention from users. Sometimes behavior changes come with a legacy config to restore the old behavior.

However, we still lack a clear definition of behavior changes, and I propose the following definition: behavior changes mean user-visible functional changes in a new release via public APIs. This means new features, and even bug fixes that eliminate an NPE or correct query results, are behavior changes. Things like performance improvements, code refactoring, and changes to unreleased APIs/features are not. All behavior changes should be called out in the PR description.
We need to write an item in the migration guide (and probably add a legacy config) for those changes that may break users when upgrading:

* Bug fixes that change query results. Users may need to backfill to correct the existing data and must know about these correctness fixes.
* Bug fixes that change query schemas. Users may need to update the schema of the tables in their data pipelines and must know about these changes.
* Removing configs.
* Renaming an error class/condition.
* Any change to the public Python/SQL/Scala/Java/R APIs: renaming a function, removing parameters, adding parameters, renaming parameters, changing parameter default values, etc. These changes should be avoided in general, or done in a compatible way, such as deprecating one function and adding a new one instead of renaming it.

Once we reach a conclusion, I'll document it in https://spark.apache.org/versioning-policy.html .

Thanks,
Wenchen
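As one concrete instance of the legacy-config mechanism discussed in this thread: Spark 3.0's datetime parsing change ships with spark.sql.legacy.timeParserPolicy, whose default (EXCEPTION) fails loudly and whose LEGACY value restores Spark 2 parsing. A hedged sketch of a try-then-fall-back upgrade run, in the spirit of the infra Palantir describes (my_job.py is a hypothetical job, not a real artifact from this thread):

```shell
# First attempt: run under Spark 3 defaults; the EXCEPTION policy
# makes the datetime parsing change loud rather than silent.
spark-submit my_job.py ||
# Fallback: acknowledge the change explicitly and restore the
# Spark 2 parsing behavior via the legacy config.
spark-submit --conf spark.sql.legacy.timeParserPolicy=LEGACY my_job.py
```

The value of the loud default is visible here: the pipeline either succeeds with the new behavior or forces a deliberate, auditable opt-in to the old one.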