[VOTE] Release Apache Spark 3.3.3 (RC1)

2023-08-04 Thread Yuming Wang
Please vote on releasing the following candidate as Apache Spark version
3.3.3.

The vote is open until 11:59pm Pacific time August 10th and passes if a
majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org

The tag to be voted on is v3.3.3-rc1 (commit
8c2b3319c6734250ff9d72f3d7e5cab56b142195):
https://github.com/apache/spark/tree/v3.3.3-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-bin

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1445

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-docs

The list of bug fixes going into 3.3.3 can be found at the following URL:
https://s.apache.org/rjci4

This release is using the release script of the tag v3.3.3-rc1.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
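
For the Java/Scala part, a minimal build.sbt along these lines is one way
to point a test project at the staging repository (an illustrative sketch
only; the Scala version and the spark-sql module are example choices, not
requirements):

    // build.sbt -- sketch for building a test workload against the RC
    ThisBuild / scalaVersion := "2.12.17"

    // Resolve the 3.3.3 RC1 artifacts from the staging repository above.
    resolvers += "Apache Spark 3.3.3 RC1 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1445/"

    // Compile and run your existing workload against the candidate version.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.3"

Remember to clear the locally cached RC artifacts afterwards, as noted above.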

===
What should happen to JIRA tickets still targeting 3.3.3?
===
The current list of open tickets targeted at 3.3.3 can be found by going to
https://issues.apache.org/jira/projects/SPARK and searching for "Target
Version/s" = 3.3.3.

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else, please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted, please ping me or a committer to
help target the issue.


[VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-04 Thread Yuanjian Li
Please vote on releasing the following candidate (RC1) as Apache Spark
version 3.5.0.

The vote is open until 11:59pm Pacific time Aug 9th and passes if a
majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.5.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.5.0-rc1 (commit
7e862c01fc9a1d3b47764df8b6a4b5c4cafb0807):
https://github.com/apache/spark/tree/v3.5.0-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1444

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-docs/

The list of bug fixes going into 3.5.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12352848

This release is using the release script of the tag v3.5.0-rc1.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
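
As an illustration only, a tiny Scala smoke test along the lines below is
the kind of check that can catch obvious problems once the RC is on the
classpath (the object name and the query are arbitrary examples):

    // RcSmokeTest.scala -- minimal sanity check against the 3.5.0 RC
    import org.apache.spark.sql.SparkSession

    object RcSmokeTest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]")
          .appName("spark-3.5.0-rc1-smoke-test")
          .getOrCreate()
        // Confirm the candidate build is the one actually on the classpath.
        assert(spark.version == "3.5.0", s"unexpected version: ${spark.version}")
        // Run a trivial workload end to end.
        spark.range(0, 1000).selectExpr("sum(id)").show()
        spark.stop()
      }
    }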

===
What should happen to JIRA tickets still targeting 3.5.0?
===
The current list of open tickets targeted at 3.5.0 can be found by going to
https://issues.apache.org/jira/projects/SPARK and searching for "Target
Version/s" = 3.5.0.

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else, please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted, please ping me or a committer to
help target the issue.

Thanks,

Yuanjian Li


Re: [Reminder] Spark 3.5 RC Cut

2023-08-04 Thread Dongjoon Hyun
Thank you again, Emil and Bjorn.

FYI, SPARK-44678 landed on branch-3.5 as follows.

https://github.com/apache/spark/pull/42345
[SPARK-44678][BUILD][3.5] Downgrade Hadoop to 3.3.4

Dongjoon.

On 2023/08/02 18:58:51 Bjørn Jørgensen wrote:
> @Dongjoon Hyun  FYI
> [image: image.png]
> 
> We better ask common-...@hadoop.apache.org.
> 
> On Wed 2 Aug 2023 at 18:03, Dongjoon Hyun wrote:
> 
> > Oh, I got it, Emil and Bjorn.
> >
> > Dongjoon.
> >
> > On Wed, Aug 2, 2023 at 12:32 AM Bjørn Jørgensen wrote:
> >
> >> "*As far as I can tell this makes both 3.3.5 and 3.3.6 unusable with s3
> >> without providing an alternative committer code.*"
> >>
> >> https://github.com/apache/hadoop/pull/5706#issuecomment-1619927992
> >>
> >> On Wed 2 Aug 2023 at 08:05, Emil Ejbyfeldt wrote:
> >>
> >>>  > Apache Spark is not affected by HADOOP-18757 because it is not a part
> >>>  > of both Apache Hadoop 3.3.5 and 3.3.6.
> >>>
> >>> I am not sure I am following what you are trying to say here. Is it that
> >>> the jira is saying that only 3.3.5 is affected? Here I think the jira is
> >>> just incorrect. The jira (and the PR with the fix) was created before
> >>> 3.3.6 was released, and I just think the jira has not been updated to
> >>> reflect the fact that 3.3.6 is also affected.
> >>>
> >>>  > HADOOP-18757 seems to be merged just two weeks ago and there is no
> >>>  > Apache Hadoop release with it, isn't it?
> >>>
> >>> That is correct, there is no hadoop release containing the fix.
> >>> Therefore 3.3.6 would also be affected by the regression.
> >>>
> >>> Best,
> >>> Emil
> >>>
> >>> On 02/08/2023 07:51, Dongjoon Hyun wrote:
> >>> > It's still invalid information, Emil.
> >>> >
> >>> > Apache Spark is not affected by HADOOP-18757 because it is not a part
> >>> > of both Apache Hadoop 3.3.5 and 3.3.6.
> >>> >
> >>> > HADOOP-18757 seems to be merged just two weeks ago and there is no
> >>> > Apache Hadoop release with it, isn't it?
> >>> >
> >>> > Could you check your local branch once more, please?
> >>> >
> >>> > Dongjoon.
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Aug 1, 2023 at 9:46 PM Emil Ejbyfeldt
> >>> > <eejbyfe...@liveintent.com> wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > Yes, sorry about that, I seem to have messed up the link. It should
> >>> > have been https://issues.apache.org/jira/browse/HADOOP-18757
> >>> >
> >>> > Best,
> >>> > Emil
> >>> >
> >>> > On 01/08/2023 19:08, Dongjoon Hyun wrote:
> >>> >  > Hi, Emil.
> >>> >  >
> >>> >  > HADOOP-18568 is still open and it seems to have never been a part
> >>> >  > of the Hadoop trunk branch.
> >>> >  >
> >>> >  > Do you mean another JIRA?
> >>> >  >
> >>> >  > Dongjoon.
> >>> >  >
> >>> >  >
> >>> >  >
> >>> >  > On Tue, Aug 1, 2023 at 2:59 AM Emil Ejbyfeldt
> >>> >  > <eejbyfe...@liveintent.com.invalid> wrote:
> >>> >  >
> >>> >  > Hi,
> >>> >  >
> >>> >  > We previously ran some experiments on builds from the 3.5 branch
> >>> >  > and noticed that Hadoop had a regression
> >>> >  > (https://issues.apache.org/jira/browse/HADOOP-18568) in their s3a
> >>> >  > committer affecting 3.3.5 and 3.3.6 (Spark 3.4 uses hadoop 3.3.4).
> >>> >  > This fix has been merged into Hadoop and will be part of the next
> >>> >  > release of Hadoop.
> >>> >  >
> >>> >  > From our testing, the regression when writing data to S3 with a
> >>> >  > large number of tasks is severe enough that we would need to
> >>> >  > revert to hadoop 3.3.4 in order to use the spark 3.5 release.
> >>> >  >
> >>> >  > Since it is only for S3, I am not sure it warrants changes in
> >>> >  > Spark (e.g. rolling back hadoop to 3.3.4). But it is probably
> >>> >  > something people testing the rc against s3 should be aware of.
> >>> >  >
> >>> >  > Best,
> >>> >  > Emil
> >>> >  >
> >>> >  > On 29/07/2023 10:29, Yuanjian Li wrote:
> >>> >  >  > Hi everyone,
> >>> >  >  >
> >>> >  >  > Following the release timeline, I will cut the RC on *Tuesday,
> >>> >  >  > Aug 1st at 1 pm PST* as scheduled.
> >>> >  >  >
> >>> >  >  > Date            Event
> >>> >  >  > July 17th 2023  Code freeze. Release branch cut.
> >>> >  >  > Late July 2023  QA period. Focus on bug fixes, tests, stability and docs.
> >>> 

Re: LLM script for error message improvement

2023-08-04 Thread Maciej
Besides, in case a separate discussion doesn't happen, our core 
responsibility is to follow the ASF guidelines, including the ASF 
Generative Tooling Guidance 
(https://www.apache.org/legal/generative-tooling.html).


As far as I understand it, neither the first acceptance condition (which
explicitly mentions ChatGPT) nor the third is satisfied by this and the
other mentioned PR.


On a side note, we should probably take a closer look at the following:

'When providing contributions authored using generative AI tooling, a 
recommended practice is for contributors to indicate the tooling used to 
create the contribution. This should be included as a token in the 
source control commit message, for example including the phrase 
“Generated-by: ”.'


and consider adjusting PR template / merge tool accordingly.

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 8/3/23 22:14, Maciej wrote:
I am sitting on the fence about that. In the linked PR, Xiao wrote the
following:
> We published the error guideline a few years ago, but not all
> contributors adhered to it, resulting in variable quality in error
> messages.
If a policy exists but is not enforced (if that's indeed the case, I
didn't go through the source to confirm that), it might be useful to
learn the reasons why that happens. Normally, I'd expect one of the following:
- The policy is too complex to enforce. In such a case, additional tooling
can be useful.
- The policy is not well known, and the people responsible for introducing
it are not committed to enforcing it.
- The policy or some of its components don't really reflect community
values and expectations.
If the problem of suspected violations was never raised on our
standard communication channel (and as far as I can tell, it has not been),
then introducing a new tool to enforce the policy seems a bit premature.
If these were the only considerations, I'd say that improving the 
overall consistency of the project outweighs possible risks, even if 
the case for such might be poorly supported.
However, there is an elephant in the room. It is another attempt,
after SPARK-44546, to embed generative tools directly within the Spark
dev workflow. In principle, I am not against such tools. In fact, it
is pretty clear that they are already used by Spark committers, and
even if we wanted to, there is little we can do to prevent that. In
such cases, decisions about which tools (if any) to use, to what extent,
and how to treat their output are the sole responsibility of contributors.
In contrast, these proposals try to push a proprietary tool, burdened
with serious privacy and ethical issues and likely to introduce
unclear liabilities, as a standard or even required developer tool.
I can't speak for others, but personally, I'm quite uneasy about it.
If we go this way, I strongly believe that it should be preceded by a
serious discussion, if not the development of a formal policy, about
what categories of tools are acceptable within the project, and to what
capacity and extent. Ideally, this would come with an official opinion
from the ASF as the copyright owner.

WDYT All? Shall we start a separate discussion?
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC
On 8/3/23 18:33, Haejoon Lee wrote:


Additional information:

Please check https://issues.apache.org/jira/browse/SPARK-37935 if you
want to start contributing to improving error messages.


You can create sub-tasks if you believe there are error messages that 
need improvement, in addition to the tasks listed in the umbrella JIRA.


You can also refer to https://github.com/apache/spark/pull/41504 and
https://github.com/apache/spark/pull/41455 as example PRs.



On Thu, Aug 3, 2023 at 1:10 PM Ruifeng Zheng  wrote:

+1 from my side, I'm fine to have it as a helper script

On Thu, Aug 3, 2023 at 10:53 AM Hyukjin Kwon wrote:

I think adding that dev tool script to improve the error
message is fine.

On Thu, 3 Aug 2023 at 10:24, Haejoon Lee wrote:

Dear contributors, I hope you are doing well!

I see there are contributors who are interested in
working on error message improvements and in contributing
persistently, so I want to share an LLM-based error
message improvement script to help with your contributions.

You can find details about the script at
https://github.com/apache/spark/pull/41711. I believe
this can help with your error message improvement work, so I
encourage you to take a look at the pull request and
leverage the script.

Please let me know if you have any questions or concerns.

Thanks all for your time and contributions!

Best regards,

Haejoon


