Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with
CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess
you must have created the view from Hive and are trying to drop it from
Spark, which is why you are running into the issue with DROP first.

There is some work in the Iceberg community to add this support to Spark
through SQL extensions, along with Iceberg support for views and
materialization tables. Some recent discussions can be found here [1], along
with a WIP Iceberg-Spark PR.

[1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc

Thanks,
Walaa.

On Thu, May 2, 2024 at 4:55 PM Mich Talebzadeh 
wrote:

> I encountered an issue while working with Materialized Views in Spark SQL.
> It appears that there is an inconsistency between the behavior of
> Materialized Views in Spark SQL and Hive.
>
> When attempting to execute a statement like DROP MATERIALIZED VIEW IF
> EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
> the keyword MATERIALIZED is not recognized. However, the same statement
> executes successfully in Hive without any errors.
>
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)
>
> == SQL ==
> DROP MATERIALIZED VIEW IF EXISTS test.mv
> -^^^
>
> Here are the versions I am using:
>
>
>
> Hive: 3.1.1
> Spark: 3.4
> my Spark session:
>
> spark = SparkSession.builder \
>   .appName("test") \
>   .enableHiveSupport() \
>   .getOrCreate()
>
> Has anyone seen this behaviour or encountered a similar issue? Are there
> any insights into why this discrepancy exists between Spark SQL and
> Hive?
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
>


Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
I encountered an issue while working with Materialized Views in Spark SQL.
It appears that there is an inconsistency between the behavior of
Materialized Views in Spark SQL and Hive.

When attempting to execute a statement like DROP MATERIALIZED VIEW IF
EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
the keyword MATERIALIZED is not recognized. However, the same statement
executes successfully in Hive without any errors.

pyspark.errors.exceptions.captured.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)

== SQL ==
DROP MATERIALIZED VIEW IF EXISTS test.mv
-^^^

Here are the versions I am using:



Hive: 3.1.1
Spark: 3.4
my Spark session:

spark = SparkSession.builder \
  .appName("test") \
  .enableHiveSupport() \
  .getOrCreate()
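One possible client-side workaround, sketched below: since Spark 3.4's parser has no MATERIALIZED VIEW grammar at all, a job can detect such DDL up front and route it to Hive (e.g. via a JDBC or beeline wrapper) instead of spark.sql(). This is an illustration only, not an official Spark feature; the pattern list and the hive_runner callable are assumptions supplied by the caller.

```python
import re

# Statements Spark SQL's parser rejects (as of Spark 3.4, there is no
# MATERIALIZED VIEW grammar). This list is an assumption for illustration,
# not an official Spark API.
HIVE_ONLY = [
    re.compile(r"^\s*(CREATE|DROP|ALTER)\s+MATERIALIZED\s+VIEW\b", re.IGNORECASE),
]

def needs_hive(sql: str) -> bool:
    """True if this DDL must be routed to Hive instead of spark.sql()."""
    return any(p.match(sql) for p in HIVE_ONLY)

def run_ddl(sql: str, spark, hive_runner):
    """Route DDL: spark.sql() where supported, otherwise a caller-supplied
    Hive runner (e.g. a JDBC/beeline wrapper -- hypothetical, not Spark)."""
    return hive_runner(sql) if needs_hive(sql) else spark.sql(sql)
```

With such a router, the same script can issue DROP MATERIALIZED VIEW against the Hive connection while all supported statements still go through the Spark session.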

Has anyone seen this behaviour or encountered a similar issue? Are there
any insights into why this discrepancy exists between Spark SQL and
Hive?

Thanks

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed. It is essential to note that, as with
any advice: "one test result is worth one-thousand expert opinions"
(Werner Von Braun).


Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread yangjie01
+1

From: Jungtaek Lim 
Date: Thursday, May 2, 2024, 10:21
To: Holden Karau 
Cc: Chao Sun , Xiao Li , Tathagata 
Das , Wenchen Fan , Cheng Pan 
, Nicholas Chammas , Dongjoon 
Hyun , Cheng Pan , Spark dev list 
, Anish Shrigondekar 
Subject: Re: [DISCUSS] Spark 4.0.0 release

+1 love to see it!

On Thu, May 2, 2024 at 10:08 AM Holden Karau <holden.ka...@gmail.com> wrote:
+1 :) yay previews

On Wed, May 1, 2024 at 5:36 PM Chao Sun <sunc...@apache.org> wrote:
+1

On Wed, May 1, 2024 at 5:23 PM Xiao Li <gatorsm...@gmail.com> wrote:
+1 for next Monday.

We can do more previews when the other features are ready for preview.

Tathagata Das <tathagata.das1...@gmail.com> wrote on Wednesday, May 1, 2024 at 08:46:
Next week sounds great! Thank you Wenchen!

On Wed, May 1, 2024 at 11:16 AM Wenchen Fan <cloud0...@gmail.com> wrote:
Yea I think a preview release won't hurt (without a branch cut). We don't need 
to wait for all the ongoing projects to be ready. How about we do a 4.0 preview 
release based on the current master branch next Monday?

On Wed, May 1, 2024 at 11:06 PM Tathagata Das <tathagata.das1...@gmail.com> wrote:
Hey all,

Reviving this thread, but Spark master has already accumulated a huge amount of 
changes.  As a downstream project maintainer, I want to really start testing 
the new features and other breaking changes, and it's hard to do that without a 
Preview release. So the sooner we make a Preview release, the faster we can 
start getting feedback for fixing things for a great Spark 4.0 final release.

So I urge the community to produce a Spark 4.0 Preview soon even if certain 
features targeting the Delta 4.0 release are still incomplete.

Thanks!


On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
Thank you all for the replies!

To @Nicholas Chammas: Thanks for cleaning up the error terminology and
documentation! I've merged the first PR and let's finish the others before
the 4.0 release.
To @Dongjoon Hyun: Thanks for driving the ANSI-on-by-default effort! Now that
the vote has passed, let's flip the config and finish the DataFrame error
context feature before 4.0.
To @Jungtaek Lim: Ack. We can treat the Streaming state store data source as
completed for 4.0 then.
To @Cheng Pan: Yea, we definitely should have a preview release. Let's
collect more feedback on the ongoing projects and then we can propose a date
for the preview release.

On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan <pan3...@gmail.com> wrote:
Will we have a preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?

Thanks,
Cheng Pan


> On Apr 15, 2024, at 09:58, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> W.r.t. state data source - reader (SPARK-45511), there are several follow-up 
> tickets, but we don't plan to address them soon. The current implementation 
> is the final shape for Spark 4.0.0, unless there are demands on the follow-up 
> tickets.
>
> We may want to check the plan for transformWithState - my understanding is 
> that we want to release the feature to 4.0.0, but there are several remaining 
> works to be done. While the tentative timeline for releasing is June 2024, 
> what would be the tentative timeline for the RC cut?
> (cc. Anish to add more context on the plan for transformWithState)
>
> On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> Hi all,
>
> It's close to the previously proposed 4.0.0 release date (June 2024), and I 
> think it's time to prepare for it and discuss the ongoing projects:
> • ANSI by default
> • Spark Connect GA
> • Structured Logging
> • Streaming state store data source
> • new data type VARIANT
> • STRING collation support
> • Spark k8s operator versioning
> Please add any items that are missing from this list. I would like to
> volunteer as the release manager for Apache Spark 4.0.0 if there is no
> objection. Thank you all for the great work that fills Spark 4.0!
>
> Wenchen Fan


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Mich Talebzadeh
   - Integration with additional external data sources or systems, say Hive
   - Enhancements to the Spark UI for improved monitoring and debugging
   - Enhancements to machine learning (MLlib) algorithms and capabilities,
   e.g. integration with frameworks like TensorFlow or PyTorch (if any are in
   the pipeline)

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Thu, 2 May 2024 at 17:02, Steve Loughran 
wrote:

> There's a new parquet RC up this week which would be good to pull in.
>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Steve Loughran
There's a new parquet RC up this week which would be good to pull in.

On Thu, 2 May 2024 at 03:20, Jungtaek Lim 
wrote:

> +1 love to see it!


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Will Raschkowski
To add some user perspective, I wanted to share our experience from 
automatically upgrading tens of thousands of jobs from Spark 2 to 3 at Palantir:

We didn't mind "loud" changes that threw exceptions. We have some infra to try
running jobs with Spark 3 and fall back to Spark 2 if there's an exception. E.g.,
the datetime parsing and rebasing migration in Spark 3 was great: Spark threw a
helpful exception but never silently changed results. Similarly, for things
listed in the migration guide as silent changes (e.g., add_months's handling of
last-day-of-month), we wrote custom check rules to throw unless users
acknowledged the change through config.
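The kind of check rule described above can be sketched roughly as follows. All names here (the config key, the rule function, the job-config dict) are hypothetical illustrations, not Palantir or Spark APIs: the rule refuses to run a job that calls add_months() until the owner explicitly acknowledges Spark 3's changed last-day-of-month semantics.

```python
# Hypothetical config key an owner sets after reviewing their job's results.
ACK_FLAG = "migration.ack.add-months-semantics"

def check_add_months(job_sql: str, job_conf: dict) -> None:
    """Raise unless the job either avoids add_months() or opts in via config."""
    if "add_months" in job_sql.lower() and job_conf.get(ACK_FLAG) != "true":
        raise RuntimeError(
            "add_months() changed last-day-of-month handling in Spark 3; "
            "review results and set " + ACK_FLAG + "=true to acknowledge."
        )
```

The point of the pattern is to convert a documented silent change back into a loud one, so the automated upgrade machinery can fall back rather than write wrong results.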

Silent changes not in the migration guide were really bad for us: Trusting the 
migration guide to be exhaustive, we automatically upgraded jobs which then 
“succeeded” but wrote incorrect results. For example, some expression increased 
timestamp precision in Spark 3; a query implicitly relied on the reduced 
precision, and then produced bad results on upgrade. It’s a silly query but a 
note in the migration guide would have helped.
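A minimal, hypothetical reproduction of that failure mode, with plain Python datetimes standing in for the engine's timestamps: two values that compared equal at second precision silently diverge once the engine starts carrying microseconds, and an explicit normalization step makes the intended comparison precision visible.

```python
from datetime import datetime

def truncate_to_seconds(ts: datetime) -> datetime:
    """Normalize timestamps before comparing, so an engine that silently
    starts carrying extra precision cannot flip equality results."""
    return ts.replace(microsecond=0)

# Under the old engine both values carried second precision and compared
# equal; the new engine keeps microseconds (values are made up).
old_engine = datetime(2024, 5, 2, 12, 0, 0)
new_engine = datetime(2024, 5, 2, 12, 0, 0, 123456)

assert old_engine != new_engine                      # the silent divergence
assert truncate_to_seconds(old_engine) == truncate_to_seconds(new_engine)
```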

To summarize: the migration guide was invaluable, we appreciated every entry, 
and we'd appreciate Wenchen's stricter definition of "behavior changes" 
(especially for silent ones).

From: Nimrod Ofek 
Date: Thursday, 2 May 2024 at 11:57
To: Wenchen Fan 
Cc: Erik Krogen , Spark dev list 
Subject: Re: [DISCUSS] clarify the definition of behavior changes

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Nimrod Ofek
Hi Erik and Wenchen,

I think a good practice with public APIs (and internal APIs with big impact
and a lot of usage) is to ease in changes: give new parameters defaults that
keep the former behaviour in a method with the previous signature, add a
deprecation notice, and delete the deprecated function in the next release.
The actual break then comes one release later, after all libraries have had
a chance to align with the API, and upgrades can be done while already using
the new version.

Another thing is that we should probably examine which private APIs are used
externally, and provide proper public APIs to meet those needs (for
instance, applicative metrics and some way of creating custom-behaviour
columns), to give a better experience.

Thanks,
Nimrod


On Thursday, May 2, 2024 at 03:51, Wenchen Fan wrote:

> Hi Erik,
>
> Thanks for sharing your thoughts! Note: developer APIs are also public
> APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking
> changes should be avoided as much as we can and new APIs should be
> mentioned in the release notes. Breaking binary compatibility is also a
> "functional change" and should be treated as a behavior change.
>
> BTW, AFAIK some downstream libraries use private APIs such as Catalyst
> Expression and LogicalPlan. It's too much work to track all the changes to
> private APIs and I think it's the downstream library's responsibility to
> check such changes in new Spark versions, or avoid using private APIs.
> Exceptions can happen if certain private APIs are used too widely and we
> should avoid breaking them.
>
> Thanks,
> Wenchen
>
> On Wed, May 1, 2024 at 11:51 PM Erik Krogen  wrote:
>
>> Thanks for raising this important discussion Wenchen! Two points I would
>> like to raise, though I'm fully supportive of any improvements in this
>> regard, my points below notwithstanding -- I am not intending to let
>> perfect be the enemy of good here.
>>
>> On a similar note as Santosh's comment, we should consider how this
>> relates to developer APIs. Let's say I am an end user relying on some
>> library like frameless, which
>> relies on developer APIs in Spark. When we make a change to Spark's
>> developer APIs that requires a corresponding change in frameless, I don't
>> directly see that change as an end user, but it *does* impact me,
>> because now I have to upgrade to a new version of frameless that supports
>> those new changes. This can have ripple effects across the ecosystem.
>> Should we call out such changes so that end users understand the potential
>> impact to libraries they use?
>>
>> Second point, what about binary compatibility? Currently our versioning
>> policy says "Link-level compatibility is something we’ll try to guarantee
>> in future releases." (FWIW, it has said this since at least 2016
>> ...)
>> One step towards this would be to clearly call out any binary-incompatible
>> changes in our release notes, to help users understand if they may be
>> impacted. Similar to my first point, this has ripple effects across the
>> ecosystem -- if I just use Spark itself, recompiling is probably not a big
>> deal, but if I use N libraries that each depend on Spark, then after a
>> binary-incompatible change is made I have to wait for all N libraries to
>> publish new compatible versions before I can upgrade myself, presenting a
>> nontrivial barrier to adoption.
>>
>> On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
>>  wrote:
>>
>>> Thanks Wenchen for starting this!
>>>
>>> How do we define "the user" for Spark?
>>> 1. End users: users that consume Spark as a service from a provider
>>> 2. Providers/Operators: users that provide Spark as a service for their
>>> internal (on-prem setups with YARN/K8s) / external (something like EMR)
>>> customers
>>> 3. ?
>>>
>>> Perhaps we need to consider infrastructure behavior changes as well to
>>> accommodate the second group of users.
>>>
>>> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>>>
>>> Hi all,
>>>
>>> It's exciting to see innovations keep happening in the Spark community
>>> and Spark keeps evolving itself. To make these innovations available to
>>> more users, it's important to help users upgrade to newer Spark versions
>>> easily. We've done a good job on it: the PR template requires the author to
>>> write down user-facing behavior changes, and the migration guide contains
>>> behavior changes that need attention from users. Sometimes behavior changes
>>> come with a legacy config to restore the old behavior. However, we still
>>> lack a clear definition of behavior changes and I propose the following
>>> definition:
>>>
>>> Behavior changes mean user-visible functional changes in a new release
>>> via public APIs. This means new features, and even bug fixes that eliminate
>>> NPE or correct query results,