Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Will Raschkowski
To add some user perspective, I wanted to share our experience from 
automatically upgrading tens of thousands of jobs from Spark 2 to 3 at Palantir:

We didn't mind "loud" changes that threw exceptions. We have some infra to try 
run jobs with Spark 3 and fallback to Spark 2 if there's an exception. E.g., 
the datetime parsing and rebasing migration in Spark 3 was great: Spark threw a 
helpful exception but never silently changed results. Similarly, for things 
listed in the migration guide as silent changes (e.g., add_months's handling of 
last-day-of-month), we wrote custom check rules to throw unless users 
acknowledged the change through config.
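
For illustration, a minimal sketch of that "acknowledge through config" pattern; the
config key and the check itself are hypothetical, not actual Spark (or Palantir) APIs:

import org.apache.spark.sql.SparkSession

object AddMonthsChangeCheck {
  // Hypothetical acknowledgment flag a job owner sets after reviewing the new behaviour.
  val AckConf = "example.migration.ack.addMonthsLastDayOfMonth"

  def assertAcknowledged(spark: SparkSession): Unit = {
    val acknowledged = spark.conf.getOption(AckConf).contains("true")
    if (!acknowledged) {
      throw new IllegalStateException(
        s"add_months changed its last-day-of-month handling in Spark 3; " +
          s"set $AckConf=true once the change has been reviewed.")
    }
  }
}

// Run once at job start-up, before any queries execute:
// AddMonthsChangeCheck.assertAcknowledged(spark)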

Silent changes not in the migration guide were really bad for us: Trusting the 
migration guide to be exhaustive, we automatically upgraded jobs which then 
“succeeded” but wrote incorrect results. For example, some expression increased 
timestamp precision in Spark 3; a query implicitly relied on the reduced 
precision, and then produced bad results on upgrade. It’s a silly query but a 
note in the migration guide would have helped.
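
To make that concrete, here's a contrived sketch (not the actual expression from our
jobs) of a query shape that implicitly depends on timestamp precision; if an upgrade
silently increases precision on one side, the join stops matching and the job still
"succeeds":

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Two sides that only "agree" while timestamps carry the same (truncated) precision.
val left = Seq(("a", Timestamp.valueOf("2024-05-02 10:00:00.123"))).toDF("k", "event_time")
val right = Seq(("a", Timestamp.valueOf("2024-05-02 10:00:00.123456"))).toDF("k", "event_time")

// Joining on the rendered string silently depends on how much precision survives
// upstream; more precision on one side means zero matching rows instead of one.
left.join(right, left("event_time").cast("string") === right("event_time").cast("string")).show()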

To summarize: the migration guide was invaluable, we appreciated every entry, 
and we'd appreciate Wenchen's stricter definition of "behavior changes" 
(especially for silent ones).

From: Nimrod Ofek 
Date: Thursday, 2 May 2024 at 11:57
To: Wenchen Fan 
Cc: Erik Krogen , Spark dev list 
Subject: Re: [DISCUSS] clarify the definition of behavior changes

Hi Erik and Wenchen,

I think a good practice with public APIs, and with internal APIs that have big
impact and wide usage, is to ease changes in: give new parameters defaults that
keep the former behaviour, keep a method with the previous signature under a
deprecation notice, and delete that deprecated method in the next release. The
actual break then happens only in the release after all libraries have had the
chance to align with the API, and upgrades can be done while already using the
new version.
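
A generic sketch of that pattern (the names are illustrative, not Spark APIs):

object TableWriter {
  // New signature with the additional parameter.
  def write(path: String, overwrite: Boolean): Unit = {
    println(s"writing to $path (overwrite=$overwrite)")
  }

  // Previous signature, kept with a deprecation notice for one release and
  // defaulting the new parameter to the former behaviour; deleted in the next release.
  @deprecated("Use write(path, overwrite) instead", since = "4.0.0")
  def write(path: String): Unit = write(path, overwrite = false)
}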

Another thing is that we should probably examine which private APIs are used
externally, so we can provide a better experience and proper public APIs to meet
those needs (for instance, applicative metrics and some way of creating
custom-behaviour columns).

Thanks,
Nimrod

On Thu, 2 May 2024 at 03:51, Wenchen Fan <cloud0...@gmail.com> wrote:
Hi Erik,

Thanks for sharing your thoughts! Note: developer APIs are also public APIs 
(such as Data Source V2 API, Spark Listener API, etc.), so breaking changes 
should be avoided as much as we can and new APIs should be mentioned in the 
release notes. Breaking binary compatibility is also a "functional change" and 
should be treated as a behavior change.
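
For context, a minimal example of the kind of downstream code that extends one of
these developer APIs (here the Spark Listener API), which is why source or binary
changes to them surface directly to users:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// A user-defined listener compiled against Spark's developer API.
class JobEndLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"Job ${jobEnd.jobId} finished with result ${jobEnd.jobResult}")
  }
}

// Registered from user code or a library:
// spark.sparkContext.addSparkListener(new JobEndLogger)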

BTW, AFAIK some downstream libraries use private APIs such as Catalyst 
Expression and LogicalPlan. It's too much work to track all the changes to 
private APIs and I think it's the downstream library's responsibility to check 
such changes in new Spark versions, or avoid using private APIs. Exceptions can 
happen if certain private APIs are used too widely and we should avoid breaking 
them.
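
As an illustration of that kind of coupling, a downstream helper that walks Catalyst's
LogicalPlan (a private API with no compatibility guarantee) might look like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

// Counts Filter nodes in the analyzed plan; any rename or restructuring of these
// internal classes breaks code like this at compile or link time.
def countFilterNodes(df: DataFrame): Int = {
  val plan: LogicalPlan = df.queryExecution.analyzed
  plan.collect { case _: Filter => 1 }.sum
}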

Thanks,
Wenchen

On Wed, May 1, 2024 at 11:51 PM Erik Krogen <xkro...@apache.org> wrote:
Thanks for raising this important discussion, Wenchen! There are two points I
would like to raise, though I'm fully supportive of any improvements in this
regard, my points below notwithstanding -- I am not intending to let perfect be
the enemy of good here.

On a similar note to Santosh's comment, we should consider how this relates to
developer APIs. Let's say I am an end user relying on some library like
frameless [github.com], which relies on developer APIs in Spark. When we make a
change to Spark's developer APIs that requires a corresponding change in
frameless, I don't directly see that change as an end user, but it does impact
me, because now I have to upgrade to a new version of frameless that supports
those changes. This can have ripple effects across the ecosystem. Should we call
out such changes so that end users understand the potential impact on the
libraries they use?

Second point, what about binary compatibility? Currently our versioning policy
says "Link-level compatibility is something we'll try to guarantee in future
releases." (FWIW, it has said this since at least 2016 [web.archive.org]...)
One step towards this would be to clearly call out any binary-incompatible

Re: Plans for built-in v2 data sources in Spark 4

2023-09-20 Thread Will Raschkowski
Thank you for linking that, Dongjoon!

I found SPARK-44518 (https://issues.apache.org/jira/browse/SPARK-44518) in that
list, which wants to turn Spark's Hive integration into a data source. To think
out loud: the big gaps between the built-in v1 and v2 data sources are support
for bucketing and partitioning. And the reason the v1 data sources support those
is that they're kind of interleaved with Spark's Hive integration. Separating
that Hive integration, or making it more data-source-ish, would put us closer to
supporting bucketing and partitioning in v2 and then defaulting to v2. (Just my
understanding -- curious if I'm thinking about this correctly.)
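
For reference, a sketch of the kind of partitioned and bucketed writes that currently
go through those v1 / Hive-integrated paths (table and column names are just for
illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "2023-09-20", "a"), (2, "2023-09-21", "b")).toDF("id", "day", "value")

df.write
  .partitionBy("day")   // Hive-style partition directories
  .bucketBy(4, "id")    // bucketing is only supported via saveAsTable today
  .sortBy("id")
  .format("parquet")
  .saveAsTable("events_bucketed")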

Anyway, thank you for the pointer.

From: Dongjoon Hyun 
Date: Friday, 15 September 2023 at 05:36
To: Will Raschkowski 
Cc: dev@spark.apache.org 
Subject: Re: Plans for built-in v2 data sources in Spark 4

Hi, Will.

According to the following JIRA, as of now there is no plan or ongoing
discussion to switch it.

https://issues.apache.org/jira/browse/SPARK-44111 (Prepare Apache Spark 4.0.0)

Thanks,
Dongjoon.


On Wed, Sep 13, 2023 at 9:02 AM Will Raschkowski wrote:
Hey everyone,

I was wondering what the plans are for Spark's built-in v2 file data sources in 
Spark 4.

Concretely, is the plan for Spark 4 to continue defaulting to the built-in v1
data sources? And if so, what are the blockers for defaulting to v2? I see, just
as an example, that writing Hive partitions is not supported in v2. Are there
other blockers or outstanding discussions?

Regards,
Will




Plans for built-in v2 data sources in Spark 4

2023-09-13 Thread Will Raschkowski
Hey everyone,

I was wondering what the plans are for Spark's built-in v2 file data sources in 
Spark 4.

Concretely, is the plan for Spark 4 to continue defaulting to the built-in v1
data sources? And if so, what are the blockers for defaulting to v2? I see, just
as an example, that writing Hive partitions is not supported in v2. Are there
other blockers or outstanding discussions?
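
(For context: which built-in file formats stay on v1 is controlled by
spark.sql.sources.useV1SourceList. A sketch of opting a local session into the v2
implementations for experimentation, with an illustrative output path:)

import org.apache.spark.sql.SparkSession

// Emptying spark.sql.sources.useV1SourceList routes the built-in file formats
// (parquet, orc, csv, json, ...) through their v2 implementations instead.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.sources.useV1SourceList", "")
  .getOrCreate()

spark.range(10).write.mode("overwrite").parquet("/tmp/v2-write-demo")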

Regards,
Will



Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Will Raschkowski
This would be great.

At least for logical nodes, would it be possible to re-use the existing
Utils.getCallSite to populate a field when nodes are created? I suppose most
value would come from eventually passing the call-sites along to physical nodes.
But maybe, just as a starting point, Spark could display the call-site only with
unoptimized logical plans? Users would still get a better sense of how the
plan's structure relates to their code.
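
To sketch what "populate a field when nodes are created" could look like, here is a
standalone approximation of the call-site capture (Spark's own Utils.getCallSite is
internal; the types and filtering below are simplified stand-ins, not Spark's
implementation):

// Simplified stand-in for the CallSite that Spark's internal Utils.getCallSite produces.
final case class CallSite(shortForm: String, longForm: String)

// Walk the current stack and keep the first frame that isn't framework/runtime code.
def captureCallSite(
    internalPrefixes: Seq[String] = Seq("org.apache.spark.", "scala.", "java.")): CallSite = {
  val frames = Thread.currentThread().getStackTrace.toSeq
    .filterNot(f => internalPrefixes.exists(p => f.getClassName.startsWith(p)))
  val short = frames.headOption
    .map(f => s"${f.getClassName}.${f.getMethodName} (${f.getFileName}:${f.getLineNumber})")
    .getOrElse("<unknown>")
  CallSite(shortForm = short, longForm = frames.take(10).mkString("\n"))
}

// A logical node could then carry something like `callSite: Option[CallSite]`, set at
// construction time, and the UI could render callSite.shortForm next to the node.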

From: mhawes 
Date: Friday, 21 May 2021 at 22:36
To: dev@spark.apache.org 
Subject: Re: Bridging gap between Spark UI and Code


Reviving this thread to ask whether any of the Spark maintainers would consider
helping to scope a solution for this. Michal outlines the problem in this
thread, but to clarify: the issue is that for very complex Spark applications,
where the logical plans often span many pages, it is extremely hard to figure
out how the stages and RDD operations in the Spark UI link to the logical plan
that generated them.

Now, obviously this is a hard problem to solve given the various optimisations
and transformations that happen between these two stages. However, I wanted to
raise it because I think it would be /extremely/ valuable for Spark users.

My two main ideas are either:
 - To carry a reference to the original plan around when planning/optimising.
 - To maintain a separate mapping for each planning/optimisation step that maps
from source to target. I'm thinking along the lines of JavaScript sourcemaps (a
rough sketch follows below).
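
A very rough sketch of that sourcemap-style mapping, with plain placeholder types
rather than Catalyst's actual TreeNode/Rule classes:

import java.util.IdentityHashMap

// Records, for each node produced by a planning/optimisation step, the node it came from.
final class PlanSourceMap[Node <: AnyRef] {
  private val origins = new IdentityHashMap[Node, Node]()

  // Record that `target` (in the transformed plan) was derived from `source`.
  def record(source: Node, target: Node): Unit = {
    if (source ne target) origins.put(target, source)
  }

  // Follow the recorded mappings back to the original (unoptimised) node.
  def originOf(node: Node): Node = {
    var current = node
    while (origins.containsKey(current)) current = origins.get(current)
    current
  }
}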

It would be great to get the opinion of an experienced Spark maintainer on
this, given the complexity.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org