Thanks for raising this important discussion, Wenchen! I'd like to raise
two points, though I'm fully supportive of any improvements in this
regard, my points below notwithstanding -- I don't intend to let perfect
be the enemy of good here.

On a note similar to Santosh's comment, we should consider how this
relates to developer APIs. Say I am an end user relying on a library like
frameless <https://github.com/typelevel/frameless>, which in turn relies
on developer APIs in Spark. When we make a change to Spark's developer
APIs that requires a corresponding change in frameless, I don't see that
change directly as an end user, but it *does* impact me, because now I
have to upgrade to a new version of frameless that supports those changes.
This can have ripple effects across the ecosystem. Should we call out such
changes so that end users understand the potential impact on the libraries
they use?
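
As a sketch of what I mean (the trait and method names below are made up
for illustration; only the @DeveloperApi annotation is real, from the
spark-tags module):

    import org.apache.spark.annotation.DeveloperApi

    @DeveloperApi
    trait ColumnarReader {
      def read(): Unit
      // Suppose this abstract method is added in release N+1: every
      // implementor -- e.g. a library like frameless -- must update and
      // publish a new release before its end users can upgrade Spark.
      def readSchema(): Unit
    }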

Second point: what about binary compatibility? Our versioning policy
currently says "Link-level compatibility is something we’ll try to
guarantee in future releases." (FWIW, it has said this since at least 2016
<https://web.archive.org/web/20161127193643/https://spark.apache.org/versioning-policy.html>...)
One step in that direction would be to clearly call out any
binary-incompatible changes in our release notes, to help users understand
whether they may be impacted. As with my first point, this has ripple
effects across the ecosystem -- if I just use Spark itself, recompiling is
probably not a big deal, but if I use N libraries that each depend on
Spark, then after a binary-incompatible change I have to wait for all N
libraries to publish compatible versions before I can upgrade, which
presents a nontrivial barrier to adoption.
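
To make the second point concrete, here is a hypothetical Scala sketch of
a change that is source-compatible but binary-incompatible (Demo and load
are made-up names, not real Spark APIs):

    object Demo {
      // Version N:
      def load(path: String): Unit = println(s"loading $path")

      // Suppose version N+1 changes this to:
      //   def load(path: String, options: Map[String, String] = Map.empty): Unit
      // Callers that recompile are fine (the change is source-compatible),
      // but a library compiled against version N links to load(String), a
      // signature that no longer exists in N+1's bytecode, so users hit
      // NoSuchMethodError at runtime until that library republishes.
    }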

On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
<santosh.ping...@adyen.com.invalid> wrote:

> Thanks Wenchen for starting this!
>
> How do we define "the user" for Spark?
> 1. End users: some users consume Spark as a service from a provider.
> 2. Providers/operators: some users provide Spark as a service for their
> internal (on-prem setups with YARN/K8s) or external (something like EMR)
> customers.
> 3. ?
>
> Perhaps we need to consider infrastructure behavior changes as well to
> accommodate the second group of users.
>
> On 1 May 2024, at 06:08, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Hi all,
>
> It's exciting to see innovation keep happening in the Spark community as
> Spark continues to evolve. To make these innovations available to more
> users, it's important to help users upgrade to newer Spark versions easily.
> We've done a good job on this: the PR template requires the author to
> write down user-facing behavior changes, and the migration guide lists
> behavior changes that need attention from users. Sometimes behavior
> changes come with a legacy config to restore the old behavior. However,
> we still lack a clear definition of behavior changes, so I propose the
> following:
>
> Behavior changes are user-visible functional changes in a new release via
> public APIs. This means new features, and even bug fixes that eliminate
> NPEs or correct query results, are behavior changes. Things like
> performance improvements, code refactoring, and changes to unreleased
> APIs/features are not. All behavior changes should be called out in the PR
> description. We need to write an item in the migration guide (and probably
> add a legacy config) for those that may break users when upgrading:
>
>    - Bug fixes that change query results. Users may need to backfill to
>    correct existing data and must know about these correctness fixes.
>    - Bug fixes that change query schemas. Users may need to update the
>    schemas of the tables in their data pipelines and must know about these
>    changes.
>    - Removing configs.
>    - Renaming an error class/condition.
>    - Any change to the public Python/SQL/Scala/Java/R APIs: renaming a
>    function, removing parameters, adding parameters, renaming parameters,
>    changing parameter default values, etc. These changes should be avoided
>    in general, or made in a compatible way, e.g. by deprecating the old
>    function and adding a new one instead of renaming.
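>
>    For instance, the datetime parsing changes in Spark 3.0 followed this
>    pattern with a real legacy config (the snippet below assumes an active
>    SparkSession named spark):
>
>        // Restore pre-3.0 datetime parsing behavior while migrating:
>        spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")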
>
> Once we reach a conclusion, I'll document it in
> https://spark.apache.org/versioning-policy.html .
>
> Thanks,
> Wenchen
>