Re: [DISCUSS] clarify the definition of behavior changes

2024-05-28 Thread Wenchen Fan
Hi all,

I've created a PR to put the behavior change guideline on the Spark
website: https://github.com/apache/spark-website/pull/518 . Please leave
comments if you have any, thanks!


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-15 Thread Wenchen Fan
Thanks all for the feedback here! Let me put up a new version, which
clarifies the definition of "users":

Behavior changes mean user-visible functional changes in a new release via
public APIs. The "user" here is not only the user who writes queries and/or
develops Spark plugins, but also the user who deploys and/or manages Spark
clusters. New features, and even bug fixes that eliminate NPE or correct
query results, are behavior changes. Things like performance improvement,
code refactoring, and changes to unreleased APIs/features are not. All
behavior changes should be called out in the PR description. We need to
write an item in the migration guide (and probably legacy config) for those
that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Remove configs
   - Rename error class/condition
   - Any non-additive change to the public Python/SQL/Scala/Java/R APIs
   (including developer APIs): rename function, remove parameters, add
   parameters, rename parameters, change parameter default values, etc. These
   changes should be avoided in general, or done in a binary-compatible
   way like deprecating and adding a new function instead of renaming.
   - Any non-additive change to the way Spark should be deployed and
   managed.

The list above is not meant to be exhaustive. Anyone can raise a concern
when reviewing a PR and ask the author to add a migration guide entry if
they believe the change is risky and may break users.
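As a concrete illustration of the binary-compatible approach in the last bullet, here is a minimal sketch (function names are invented for illustration, not Spark APIs): instead of renaming a public function outright, add the new name and keep the old one as a deprecated forwarder for at least one release.

```python
import warnings

def trim_right(s: str) -> str:
    """New, preferred name."""
    return s.rstrip()

def rtrim(s: str) -> str:
    """Old name, kept as a forwarder so existing callers do not break."""
    warnings.warn("rtrim is deprecated; use trim_right instead",
                  DeprecationWarning, stacklevel=2)
    return trim_right(s)
```

Callers of the old name keep working across the upgrade and get a deprecation signal, so the actual removal can happen a release later.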


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Will Raschkowski
To add some user perspective, I wanted to share our experience from 
automatically upgrading tens of thousands of jobs from Spark 2 to 3 at Palantir:

We didn't mind "loud" changes that threw exceptions. We have some infra to try
running jobs with Spark 3 and fall back to Spark 2 if there's an exception.
E.g., the datetime parsing and rebasing migration in Spark 3 was great: Spark
threw a helpful exception but never silently changed results. Similarly, for
things listed in the migration guide as silent changes (e.g., add_months's
handling of last-day-of-month), we wrote custom check rules to throw unless
users acknowledged the change through config.
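A check rule of the kind described above could look roughly like this (the config key and message are invented for illustration, not Palantir's or Spark's actual implementation): fail loudly on a known silent behavior change unless the user has explicitly opted in via config.

```python
def check_add_months_acknowledged(conf: dict) -> None:
    """Raise unless the user has acknowledged the Spark 3 behavior change."""
    if not conf.get("myorg.migration.acknowledgeAddMonthsChange", False):
        raise RuntimeError(
            "add_months handles last-day-of-month differently in Spark 3; "
            "review affected queries, then set "
            "myorg.migration.acknowledgeAddMonthsChange=true to proceed.")
```

This turns a silent result change into a loud, actionable failure at upgrade time.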

Silent changes not in the migration guide were really bad for us: Trusting the 
migration guide to be exhaustive, we automatically upgraded jobs which then 
“succeeded” but wrote incorrect results. For example, some expression increased 
timestamp precision in Spark 3; a query implicitly relied on the reduced 
precision, and then produced bad results on upgrade. It’s a silly query but a 
note in the migration guide would have helped.

To summarize: the migration guide was invaluable, we appreciated every entry, 
and we'd appreciate Wenchen's stricter definition of "behavior changes" 
(especially for silent ones).


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Nimrod Ofek
Hi Erik and Wenchen,

I think a good practice with public APIs, and with internal APIs that have
big impact and heavy usage, is to ease changes in gradually: give new
parameters defaults that preserve the former behaviour, keep a method with
the previous signature under a deprecation notice, and delete the deprecated
function in the next release. The actual break then lands one release later,
after all libraries have had the chance to align with the API, and upgrades
can be done while already using the new version.
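A rough Python sketch of this practice (all names are invented): the new parameter defaults to the former behaviour, and the previous entry point survives one release behind a deprecation notice.

```python
import warnings

def read_rows(path: str, strict: bool = False) -> list:
    """New signature; strict=False preserves the former lenient behaviour."""
    rows = path.split(",")  # stand-in for real I/O
    return [r for r in rows if r] if strict else rows

def read_rows_v1(path: str) -> list:
    """Previous entry point, slated for deletion in the next release."""
    warnings.warn("read_rows_v1 is deprecated; use read_rows",
                  DeprecationWarning, stacklevel=2)
    return read_rows(path)
```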

Another thing is that we should probably examine which private APIs are used
externally, so we can provide a better experience and proper public APIs that
meet those needs (for instance, applicative metrics and some way of creating
custom behaviour columns).

Thanks,
Nimrod



Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Hi Erik,

Thanks for sharing your thoughts! Note: developer APIs are also public APIs
(such as the Data Source V2 API, the Spark Listener API, etc.), so breaking
changes should be avoided as much as possible, and new APIs should be
mentioned in the release notes. Breaking binary compatibility is also a
"functional change" and should be treated as a behavior change.

BTW, AFAIK some downstream libraries use private APIs such as Catalyst
Expression and LogicalPlan. It's too much work to track all the changes to
private APIs and I think it's the downstream library's responsibility to
check such changes in new Spark versions, or avoid using private APIs.
Exceptions can happen if certain private APIs are used too widely and we
should avoid breaking them.

Thanks,
Wenchen


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Erik Krogen
Thanks for raising this important discussion, Wenchen! Two points I would
like to raise, though I'm fully supportive of any improvements in this
regard, my points below notwithstanding -- I am not intending to let
perfect be the enemy of good here.

On a similar note as Santosh's comment, we should consider how this relates
to developer APIs. Let's say I am an end user relying on some library like
frameless, which relies on
developer APIs in Spark. When we make a change to Spark's developer APIs
that requires a corresponding change in frameless, I don't directly see
that change as an end user, but it *does* impact me, because now I have to
upgrade to a new version of frameless that supports those new changes. This
can have ripple effects across the ecosystem. Should we call out such
changes so that end users understand the potential impact to libraries they
use?

Second point, what about binary compatibility? Currently our versioning
policy says "Link-level compatibility is something we’ll try to guarantee
in future releases." (FWIW, it has said this since at least 2016
...)
One step towards this would be to clearly call out any binary-incompatible
changes in our release notes, to help users understand if they may be
impacted. Similar to my first point, this has ripple effects across the
ecosystem -- if I just use Spark itself, recompiling is probably not a big
deal, but if I use N libraries that each depend on Spark, then after a
binary-incompatible change is made I have to wait for all N libraries to
publish new compatible versions before I can upgrade myself, presenting a
nontrivial barrier to adoption.


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Good point, Santosh!

I was originally targeting end users who write queries with Spark, as this
is probably the largest user base. But we should definitely consider other
users who deploy and manage Spark clusters. Those users are usually more
tolerant of behavior changes and I think it should be sufficient to put
behavior changes in this area in the release notes.



Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Santosh Pingale
Thanks Wenchen for starting this!

How do we define "the user" for Spark?
1. End users: some users consume Spark as a service from a provider
2. Providers/operators: some users provide Spark as a service for their
internal (on-prem setups with YARN/K8s) or external (something like EMR)
customers
3. ?

Perhaps we need to consider infrastructure behavior changes as well, to
accommodate the second group of users.

On 1 May 2024, at 06:08, Wenchen Fan  wrote:

Hi all,

It's exciting to see innovation keep happening in the Spark community as
Spark continues to evolve. To make these innovations available to more
users, it's important to help users upgrade to newer Spark versions easily.
We've done a good job on it: the PR template requires the author to write
down user-facing behavior changes, and the migration guide contains
behavior changes that need attention from users. Sometimes behavior changes
come with a legacy config to restore the old behavior. However, we still
lack a clear definition of behavior changes and I propose the following
definition:

Behavior changes mean user-visible functional changes in a new release via
public APIs. This means new features, and even bug fixes that eliminate NPE
or correct query results, are behavior changes. Things like performance
improvement, code refactoring, and changes to unreleased APIs/features are
not. All behavior changes should be called out in the PR description. We
need to write an item in the migration guide (and probably legacy config)
for those that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Remove configs
   - Rename error class/condition
   - Any change to the public Python/SQL/Scala/Java/R APIs: rename
   function, remove parameters, add parameters, rename parameters, change
   parameter default values, etc. These changes should be avoided in general,
   or done in a compatible way, like deprecating and adding a new function
   instead of renaming.
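The "legacy config" mentioned above can be sketched as follows (the config name and behavior are invented for illustration): the fix ships enabled by default, with a flag that restores the pre-fix behaviour during migration.

```python
import math

def round_half(x: float, conf: dict) -> int:
    """Round .5 cases up; a legacy flag restores the old banker's rounding."""
    if conf.get("example.legacy.bankersRounding", False):
        return round(x)            # old behaviour: round half to even
    return math.floor(x + 0.5)     # fixed behaviour: round half up
```

Users hit by the corrected results can flip the flag to unblock the upgrade, then migrate at their own pace before the legacy path is removed.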

Once we reach a conclusion, I'll document it in
https://spark.apache.org/versioning-policy.html .

Thanks,
Wenchen