Re: [VOTE] Single-pass Analyzer for Catalyst

2024-09-30 Thread Dongjoon Hyun
Thank you for the swift clarification, Reynold and Xiao.

It seems that the Target Version was initially set by mistake.

I removed the `Target Version` from the SPIP JIRA.

https://issues.apache.org/jira/browse/SPARK-49834

I'm switching my vote to +1 for this SPIP vote.

Thanks,
Dongjoon.

On 2024/09/30 22:55:41 Xiao Li wrote:
> +1 in support of the direction of the Single-pass Analyzer for Catalyst.
> 
> I think we should not have a target version for the new Catalyst SPARK-49834
> <https://issues.apache.org/jira/browse/SPARK-49834>. It should not be a
> blocker for Spark 4.0. When implementing the new analyzer, the code changes
> must not affect users of the existing analyzer to avoid any user-facing
> impacts.
> 
> Reynold Xin wrote on Mon, Sep 30, 2024 at 15:39:
> 
> > I don't actually "lead" this. But I don't think this needs to target a
> > specific Spark version given it should not have any user facing
> > consequences?
> >
> >
> > On Mon, Sep 30, 2024 at 3:36 PM Dongjoon Hyun  wrote:
> >
> >> Thank you for leading this, Vladimir, Reynold, Herman.
> >>
> >> I'm wondering if this is really an achievable goal for Apache Spark 4.0.0.
> >>
> >> If it's expected that we are unable to deliver it, shall we postpone this
> >> vote until 4.1.0 planning?
> >>
> >> Anyway, since SPARK-49834 explicitly has Target Version 4.0.0,
> >>
> >> -1 from my side.
> >>
> >> Thanks,
> >> Dongjoon.
> >>
> >>
> >> On 2024/09/30 17:51:24 Herman van Hovell wrote:
> >> > +1
> >> >
> >> > On Mon, Sep 30, 2024 at 8:29 AM Reynold Xin wrote:
> >> >
> >> > > +1
> >> > >
> >> > > On Mon, Sep 30, 2024 at 6:47 AM Vladimir Golubev 
> >> > > wrote:
> >> > >
> >> > >> Hi all,
> >> > >>
> >> > >> I’d like to start a vote for a single-pass Analyzer for the Catalyst
> >> > >> project. This project will introduce a new analysis framework to the
> >> > >> Catalyst, which will eventually replace the fixed-point one.
> >> > >>
> >> > >> Please refer to the SPIP jira:
> >> > >> https://issues.apache.org/jira/browse/SPARK-49834
> >> > >>
> >> > >> [ ] +1: Accept the proposal
> >> > >> [ ] +0
> >> > >> [ ] -1: I don’t think this is a good idea because …
> >> > >>
> >> > >> Thanks!
> >> > >>
> >> > >> Vladimir
> >> > >>
> >> > >
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >>
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Single-pass Analyzer for Catalyst

2024-09-30 Thread Dongjoon Hyun
Thank you for leading this, Vladimir, Reynold, Herman.

I'm wondering if this is really an achievable goal for Apache Spark 4.0.0.

If it's expected that we are unable to deliver it, shall we postpone this vote 
until 4.1.0 planning?

Anyway, since SPARK-49834 explicitly has Target Version 4.0.0,

-1 from my side.

Thanks,
Dongjoon.


On 2024/09/30 17:51:24 Herman van Hovell wrote:
> +1
> 
> On Mon, Sep 30, 2024 at 8:29 AM Reynold Xin 
> wrote:
> 
> > +1
> >
> > On Mon, Sep 30, 2024 at 6:47 AM Vladimir Golubev 
> > wrote:
> >
> >> Hi all,
> >>
> >> I’d like to start a vote for a single-pass Analyzer for the Catalyst
> >> project. This project will introduce a new analysis framework to the
> >> Catalyst, which will eventually replace the fixed-point one.
> >>
> >> Please refer to the SPIP jira:
> >> https://issues.apache.org/jira/browse/SPARK-49834
> >>
> >> [ ] +1: Accept the proposal
> >> [ ] +0
> >> [ ] -1: I don’t think this is a good idea because …
> >>
> >> Thanks!
> >>
> >> Vladimir
> >>
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Officially Deprecate GraphX in Spark 4

2024-09-30 Thread Dongjoon Hyun
+1

Thank you, Holden.

Dongjoon.

On 2024/09/30 18:01:17 Holden Karau wrote:
> I think it has been de facto deprecated; we haven’t updated it meaningfully
> in several years. I think removing the API would be excessive, but
> deprecating it would give us the flexibility to remove it in the
> not-too-distant future.
> 
> That being said, this is not a vote to remove GraphX. I think that whenever
> that time comes (if it does), we should have a separate vote.
> 
> This VOTE will be open for a little more than one week, ending on October
> 8th*. To vote reply with:
> +1 Deprecate GraphX
> 0 I’m indifferent
> -1 Don’t deprecate GraphX because ABC
> 
> If you have a binding vote, please mark it with a * to simplify tallying
> at the end.
> 
> (*mostly because I’m going camping for my birthday)
> 
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> 
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Creating `branch-4.0` and Feature Freeze for Apache Spark 4.0

2024-09-27 Thread Dongjoon Hyun
Thank you for the detailed proposal with dates, Hyukjin. I guess it's already
aligned with Herman, too. I'm fine if we have a pre-defined and predictably
achievable schedule. :-)

We know the well-known goal of `Spark Connect`, and I'd love to have it. I've
been monitoring the progress, but I'm not sure whether the delivery schedule is
feasible given the current status. At this point, I'll take your and Herman's
word for it, because you are the leaders and contributors in that area.

For the proposed schedule, the worst case for the community would be a further
delay due to the risk of unknown personal schedules, because Apache Spark is a
community-driven project that depends on contributors' voluntary passion and
time. And we know that availability varies greatly during the following holiday
seasons. I also believe that you and Herman have already considered these in
your personal schedules.

- 2024.12.20 ~ 2025.01.05 (Christmas and Happy New Year Holiday)
- 2025.01.29 ~ 2025.02.03 (Chinese New Year)

> 2024-01-15 Creating `branch-4.0` (allowing backporting new features)
> 2024-02-01 Feature Freeze (allowing backporting bug fixes only)
> 2024-02-15 Starting Apache Spark 4.0.0 RC1

Thank you again for the replies. Let's wait to collect more opinions and then
update the Apache Spark website with the tentative schedule.

Dongjoon.

On 2024/09/27 03:00:49 Hyukjin Kwon wrote:
> I meant 2025 :-).
> 
> On Fri, Sep 27, 2024 at 11:15 AM Hyukjin Kwon  wrote:
> 
> > We're basically working on making Scala Spark Connect ready.
> > For example, I am working on having a parent class for both Spark Classic
> > and Spark Connect so users would face less breaking changes, and they can
> > run their application without changing anything.
> > In addition, I am also working on sharing the same test base between Spark
> > Classic and Spark Connect.
> > For those, I think it might take a couple of months to stabilize.
> >
> > What about the below schedule?
> >
> > - 2024-01-15 Creating `branch-4.0` (allowing backporting new features)
> > - 2024-02-01 Feature Freeze (allowing backporting bug fixes only)
> > - 2024-02-15 Starting Apache Spark 4.0.0 RC1
> >
> >
> > On Fri, 27 Sept 2024 at 07:35, Dongjoon Hyun 
> > wrote:
> >
> >> Thank you for the reply, Herman.
> >>
> >> Given that December and January would fall within your proposed schedule,
> >> I'm not sure which dates your proposal implies. Could you elaborate more?
> >> As we know, the community is less active during that winter period.
> >>
> >> In addition, although I know that you are actively leading that area with
> >> big refactorings, it would be greatly appreciated if you could share a
> >> more concrete progress status and delivery plan (or milestone) for the
> >> `Connect and Classic Scala interface` work with the community.
> >> Specifically, we are curious how much was achieved between `preview1` and
> >> `preview2`, and whether it will be stabilized enough in that time frame.
> >> Let's see what more you have.
> >>
> >> > We are working on unifying the Connect and Classic Scala interface
> >>
> >> Thank you again!
> >>
> >> Dongjoon.
> >>
> >> On Thu, Sep 26, 2024 at 12:23 PM Herman van Hovell 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Can we push back the dates by at least 2 months?
> >>>
> >>> We are working on unifying the Connect and Classic Scala interface, and
> >>> I would like to avoid rushing things.
> >>>
> >>> Kind regards,
> >>> Herman
> >>>
> >>> On Thu, Sep 26, 2024 at 3:19 PM Dongjoon Hyun 
> >>> wrote:
> >>>
> >>>> Hi, All.
> >>>>
> >>>> We've delivered two preview releases for Apache Spark 4.0 successfully.
> >>>> I believe it's time to discuss cutting branch-4.0 to stabilize further
> >>>> based on them and to schedule the feature freeze.
> >>>>
> >>>> - https://spark.apache.org/news/spark-4.0.0-preview1.html (June)
> >>>> - https://spark.apache.org/news/spark-4.0.0-preview2.html (September)
> >>>>
> >>>> I'd like to propose the following as a candidate:
> >>>>
> >>>> - 2024-10-01 Creating `branch-4.0` (allowing backporting new features)
> >>>> - 2024-10-15 Feature Freeze (allowing backporting bug fixes only)
> >>>> - 2024-11-01 Starting Apache Spark 4.0.0 RC1
> >>>>
> >>>> WDYT? Please let me know if you have release blockers for Spark 4 or
> >>>> other schedule candidates in mind.
> >>>>
> >>>> Thanks,
> >>>> Dongjoon.
> >>>>
> >>>
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Creating `branch-4.0` and Feature Freeze for Apache Spark 4.0

2024-09-26 Thread Dongjoon Hyun
Thank you for the reply, Herman.

Given that December and January would fall within your proposed schedule,
I'm not sure which dates your proposal implies. Could you elaborate more?
As we know, the community is less active during that winter period.

In addition, although I know that you are actively leading that area with
big refactorings, it would be greatly appreciated if you could share a more
concrete progress status and delivery plan (or milestone) for the `Connect
and Classic Scala interface` work with the community. Specifically, we are
curious how much was achieved between `preview1` and `preview2`, and whether
it will be stabilized enough in that time frame. Let's see what more you have.

> We are working on unifying the Connect and Classic Scala interface

Thank you again!

Dongjoon.

On Thu, Sep 26, 2024 at 12:23 PM Herman van Hovell 
wrote:

> Hi,
>
> Can we push back the dates by at least 2 months?
>
> We are working on unifying the Connect and Classic Scala interface, and I
> would like to avoid rushing things.
>
> Kind regards,
> Herman
>
> On Thu, Sep 26, 2024 at 3:19 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> We've delivered two preview releases for Apache Spark 4.0 successfully.
>> I believe it's time to discuss cutting branch-4.0 to stabilize further
>> based on them and to schedule the feature freeze.
>>
>> - https://spark.apache.org/news/spark-4.0.0-preview1.html (June)
>> - https://spark.apache.org/news/spark-4.0.0-preview2.html (September)
>>
>> I'd like to propose the following as a candidate:
>>
>> - 2024-10-01 Creating `branch-4.0` (allowing backporting new features)
>> - 2024-10-15 Feature Freeze (allowing backporting bug fixes only)
>> - 2024-11-01 Starting Apache Spark 4.0.0 RC1
>>
>> WDYT? Please let me know if you have release blockers for Spark 4 or
>> other schedule candidates in mind.
>>
>> Thanks,
>> Dongjoon.
>>
>


[DISCUSS] Creating `branch-4.0` and Feature Freeze for Apache Spark 4.0

2024-09-26 Thread Dongjoon Hyun
Hi, All.

We've delivered two preview releases for Apache Spark 4.0 successfully.
I believe it's time to discuss cutting branch-4.0 to stabilize further based
on them and to schedule the feature freeze.

- https://spark.apache.org/news/spark-4.0.0-preview1.html (June)
- https://spark.apache.org/news/spark-4.0.0-preview2.html (September)

I'd like to propose the following as a candidate:

- 2024-10-01 Creating `branch-4.0` (allowing backporting new features)
- 2024-10-15 Feature Freeze (allowing backporting bug fixes only)
- 2024-11-01 Starting Apache Spark 4.0.0 RC1

WDYT? Please let me know if you have release blockers for Spark 4 or other
schedule candidates in mind.

Thanks,
Dongjoon.


[ANNOUNCE] Announcing Apache Spark 4.0.0-preview2

2024-09-26 Thread Dongjoon Hyun
Hi, All.

To enable wide-scale community testing of the upcoming Spark 4.0 release,
the Apache Spark community has posted a Spark 4.0.0-preview2 release.
This preview is not a stable release in terms of either API or
functionality,
but it is meant to give the community early access to try the code that
will become Spark 4.0. If you would like to test the release, please
download it, and send feedback using either the mailing lists or JIRA.

There are a lot of exciting new features added to Spark 4.0, including ANSI
mode by default, Python data source, polymorphic Python UDTF, string
collation support, new VARIANT data type, streaming state store data
source, structured logging, Java 21 support, Java 17 by default,
and many more.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 4.0.0-preview2, head over to the download page:
https://archive.apache.org/dist/spark/spark-4.0.0-preview2

To view the documentation:
https://spark.apache.org/docs/preview/


Dongjoon Hyun
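
For a quick command-line download and smoke test, a minimal sketch (the exact
binary tarball name is an assumption based on prior release layouts):

$ wget https://archive.apache.org/dist/spark/spark-4.0.0-preview2/spark-4.0.0-preview2-bin-hadoop3.tgz
$ tar xzf spark-4.0.0-preview2-bin-hadoop3.tgz
$ spark-4.0.0-preview2-bin-hadoop3/bin/spark-shell --version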


Re: [ANNOUNCE] Apache Spark 3.5.3 released

2024-09-24 Thread Dongjoon Hyun
Thank you for the release, Haejoon.

Could you publish Docker images too, like the following?

- https://github.com/apache/spark-docker/pull/64
  (Publish 3.5.2 to docker registry)

Dongjoon.


On Tue, Sep 24, 2024 at 10:29 PM Haejoon Lee 
wrote:

> Hi, Yang!
>
> And thanks Dongjoon for answering the question!
>
> For the second question, I got the commit list from the JIRA release note
> <https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12354954>,
> but just realized that some commits are not resolved yet & missing from
> 3.5.3 as you mentioned.
> > The 3.5.3 release notes contain some commits that do not exist in
> > tag/v3.5.3; e.g., `[SPARK-49628]` exists in branch-3.5 but not in
> > tag/v3.5.3. Would you help explain more about that?
>
> Let me update the release notes soon to properly include only the list
> that was actually completed in 3.5.3.
>
> Thanks for the report!
>
> Haejoon
>
> On Wed, Sep 25, 2024 at 2:16 PM Dongjoon Hyun 
> wrote:
>
>> Hi, Yang.
>>
>> For this question, please try with `Private` mode or `Incognito` mode in
>> your browser.
>> > I find the download page does not contain the 3.5.3 link, but the
>> > release notes link exists.
>>
>> Dongjoon.
>>
>> On Tue, Sep 24, 2024 at 8:39 PM Yang Zhang  wrote:
>>
>>> Hi,
>>>
>>> I find the download page does not contain the 3.5.3 link, but the release
>>> notes link exists.
>>>
>>> The 3.5.3 release notes contain some commits that do not exist in
>>> tag/v3.5.3; e.g., `[SPARK-49628]` exists in branch-3.5 but not in
>>> tag/v3.5.3. Would you help explain more about that?
>>>
>>> Thank you
>>>
>>> On 2024/09/25 01:05:47 Haejoon Lee wrote:
>>> > We are happy to announce the availability of Apache Spark 3.5.3!
>>> >
>>> > Spark 3.5.3 is the third maintenance release containing security
>>> > and correctness fixes. This release is based on the branch-3.5
>>> > maintenance branch of Spark. We strongly recommend all 3.5 users
>>> > to upgrade to this stable release.
>>> >
>>> > To download Spark 3.5.3, head over to the download page:
>>> > https://spark.apache.org/downloads.html
>>> >
>>> > To view the release notes:
>>> > https://spark.apache.org/releases/spark-release-3-5-3.html
>>> >
>>> > We would like to acknowledge all community members for contributing to
>>> this
>>> > release. This release would not have been possible without you.
>>> >
>>> > Haejoon Lee
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.5.3 released

2024-09-24 Thread Dongjoon Hyun
Hi, Yang.

For this question, please try with `Private` mode or `Incognito` mode in
your browser.
> I find the download page does not contain the 3.5.3 link, but the release
> notes link exists.

Dongjoon.

On Tue, Sep 24, 2024 at 8:39 PM Yang Zhang  wrote:

> Hi,
>
> I find the download page does not contain the 3.5.3 link, but the release
> notes link exists.
>
> The 3.5.3 release notes contain some commits that do not exist in
> tag/v3.5.3; e.g., `[SPARK-49628]` exists in branch-3.5 but not in
> tag/v3.5.3. Would you help explain more about that?
>
> Thank you
>
> On 2024/09/25 01:05:47 Haejoon Lee wrote:
> > We are happy to announce the availability of Apache Spark 3.5.3!
> >
> > Spark 3.5.3 is the third maintenance release containing security
> > and correctness fixes. This release is based on the branch-3.5
> > maintenance branch of Spark. We strongly recommend all 3.5 users
> > to upgrade to this stable release.
> >
> > To download Spark 3.5.3, head over to the download page:
> > https://spark.apache.org/downloads.html
> >
> > To view the release notes:
> > https://spark.apache.org/releases/spark-release-3-5-3.html
> >
> > We would like to acknowledge all community members for contributing to
> this
> > release. This release would not have been possible without you.
> >
> > Haejoon Lee
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE][RESULT] Release Spark 4.0.0-preview2 (RC1)

2024-09-20 Thread Dongjoon Hyun
The vote passes with 14 +1s (9 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Zhou Jiang
- Holden Karau *
- Dongjoon Hyun *
- Liang-Chi Hsieh *
- Huaxin Gao *
- Xinrong Meng *
- John Zhuge
- Wenchen Fan *
- Cheng Pan
- Yuming Wang *
- Xiao Li *
- Gengliang Wang *
- Shubham Patel
- Yang Jie

+0: None

-1: None


Re: [VOTE] Release Spark 4.0.0-preview2 (RC1)

2024-09-20 Thread Dongjoon Hyun
Thank you all! I'll conclude this vote.

Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 4.0.0-preview2 (RC1)

2024-09-16 Thread Dongjoon Hyun
+1

Dongjoon

On Mon, Sep 16, 2024 at 10:57 AM Holden Karau 
wrote:

> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
>
> On Mon, Sep 16, 2024 at 10:55 AM Zhou Jiang 
> wrote:
>
>> + 1
>> Sent from my iPhone
>>
>> On Sep 16, 2024, at 01:04, Dongjoon Hyun  wrote:
>>
>> 
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 4.0.0-preview2.
>>
>> The vote is open until September 20th 1AM (PDT) and passes if a majority
>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 4.0.0-preview2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v4.0.0-preview2-rc1 (commit
>> f0d465e09b8d89d5e56ec21f4bd7e3ecbeeb318a)
>> https://github.com/apache/spark/tree/v4.0.0-preview2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview2-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1468/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview2-rc1-docs/
>>
>> The list of bug fixes going into 4.0.0-preview2 can be found at the
>> following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>
>> This release is using the release script of the tag v4.0.0-preview2-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env and install
>> the current RC and see if anything important breaks. In Java/Scala, you
>> can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>>


[VOTE] Release Spark 4.0.0-preview2 (RC1)

2024-09-16 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview2.

The vote is open until September 20th 1AM (PDT) and passes if a majority +1
PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v4.0.0-preview2-rc1 (commit
f0d465e09b8d89d5e56ec21f4bd7e3ecbeeb318a)
https://github.com/apache/spark/tree/v4.0.0-preview2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1468/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview2-rc1-docs/

The list of bug fixes going into 4.0.0-preview2 can be found at the
following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

This release is using the release script of the tag v4.0.0-preview2-rc1.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install
the current RC and see if anything important breaks. In Java/Scala, you
can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
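
For anyone following the PySpark suggestion above, a minimal sketch of the
virtual-env check (the pyspark tarball name under the -bin/ directory is an
assumption based on the 3.5.x naming pattern):

$ python3 -m venv /tmp/spark-rc-test
$ source /tmp/spark-rc-test/bin/activate
$ pip install "https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview2-rc1-bin/pyspark-4.0.0.dev2.tar.gz"
$ python -c 'from pyspark.sql import SparkSession; SparkSession.builder.getOrCreate().range(10).show()'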


Re: [VOTE] Release Apache Spark 3.5.3 (RC3)

2024-09-11 Thread Dongjoon Hyun
+1

Dongjoon

On 2024/09/11 13:51:23 Herman van Hovell wrote:
> +1
> 
> On Wed, Sep 11, 2024 at 3:30 AM Kent Yao  wrote:
> 
> > +1, thank you, Haejoon
> > Kent
> >
> > On 2024/09/11 06:12:19 Gengliang Wang wrote:
> > > +1
> > >
> > > On Mon, Sep 9, 2024 at 6:01 PM Wenchen Fan  wrote:
> > >
> > > > +1
> > > >
> > > > On Tue, Sep 10, 2024 at 7:42 AM Rui Wang wrote:
> > > >
> > > >> +1 (non-binding)
> > > >>
> > > >>
> > > >> -Rui
> > > >>
> > > >> On Mon, Sep 9, 2024 at 4:22 PM Hyukjin Kwon 
> > wrote:
> > > >>
> > > >>> +1
> > > >>>
> > > >>> On Tue, Sep 10, 2024 at 5:39 AM Haejoon Lee
> > > >>>  wrote:
> > > >>>
> > >  Hi, dev!
> > > 
> > >  Please vote on releasing the following candidate as Apache Spark
> > >  version 3.5.3 (RC3).
> > > 
> > >  The vote is open for the next 72 hours, and passes if a majority +1 PMC
> > >  votes are cast, with a minimum of 3 +1 votes.
> > > 
> > >  [ ] +1 Release this package as Apache Spark 3.5.3
> > >  [ ] -1 Do not release this package because ...
> > > 
> > >  To learn more about Apache Spark, please see
> > https://spark.apache.org/
> > > 
> > >  The tag to be voted on is v3.5.3-rc3 (commit
> > >  32232e9ed33bb16b93ad58cfde8b82e0f07c0970):
> > >  https://github.com/apache/spark/tree/v3.5.3-rc3
> > > 
> > >  The release files, including signatures, digests, etc. can be found
> > at:
> > >  https://dist.apache.org/repos/dist/dev/spark/v3.5.3-rc3-bin/
> > > 
> > >  Signatures used for Spark RCs can be found in this file:
> > >  https://dist.apache.org/repos/dist/dev/spark/KEYS
> > > 
> > >  The staging repository for this release can be found at:
> > > 
> > https://repository.apache.org/content/repositories/orgapachespark-1467/
> > > 
> > >  The documentation corresponding to this release can be found at:
> > >  https://dist.apache.org/repos/dist/dev/spark/v3.5.3-rc3-docs/
> > > 
> > >  The list of bug fixes going into 3.5.3 can be found at the following
> > >  URL:
> > >  https://issues.apache.org/jira/projects/SPARK/versions/12354954
> > > 
> > >  FAQ
> > > 
> > >  =
> > >  How can I help test this release?
> > >  =
> > > 
> > >  If you are a Spark user, you can help us test this release by taking
> > >  an existing Spark workload and running on this release candidate,
> > then
> > >  reporting any regressions.
> > > 
> > >  If you're working in PySpark you can set up a virtual env and
> > install
> > >  the current RC via "pip install
> > > 
> > https://dist.apache.org/repos/dist/dev/spark/v3.5.3-rc3-bin/pyspark-3.5.3.tar.gz
> > >  "
> > >  and see if anything important breaks.
> > >  In Java/Scala, you can add the staging repository to your project's
> > >  resolvers and test
> > >  with the RC (make sure to clean up the artifact cache before/after
> > so
> > >  you don't end up building with an out of date RC going forward).
> > > 
> > >  ===
> > >  What should happen to JIRA tickets still targeting 3.5.3?
> > >  ===
> > > 
> > >  The current list of open tickets targeted at 3.5.3 can be found at:
> > >  https://issues.apache.org/jira/projects/SPARK and search for
> > >  "Target Version/s" = 3.5.3
> > > 
> > >  Committers should look at those and triage. Extremely important bug
> > >  fixes, documentation, and API tweaks that impact compatibility
> > should
> > >  be worked on immediately. Everything else please retarget to an
> > >  appropriate release.
> > > 
> > >  ==
> > >  But my bug isn't fixed?
> > >  ==
> > > 
> > >  In order to make timely releases, we will typically not hold the
> > >  release unless the bug in question is a regression from the previous
> > >  release. That being said, if there is something which is a
> > regression
> > >  that has not been correctly targeted please ping me or a committer
> > to
> > >  help target the issue.
> > > 
> > >  Thanks!
> > >  Haejoon Lee
> > > 
> > > >>>
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Document and Feature Preview via GitHub Pages

2024-09-11 Thread Dongjoon Hyun
+1

Thank you, Kent.

Dongjoon.

On 2024/09/11 16:22:19 Gengliang Wang wrote:
> +1
> 
> On Wed, Sep 11, 2024 at 6:30 AM Wenchen Fan  wrote:
> 
> > +1
> >
> > On Wed, Sep 11, 2024 at 5:15 PM Martin Grund 
> > wrote:
> >
> >> +1
> >>
> >> On Wed, Sep 11, 2024 at 9:39 AM Kent Yao  wrote:
> >>
> >>> Hi all,
> >>>
> >>> Following the discussion[1], I'd like to start the vote for 'Document and
> >>> Feature Preview via GitHub Pages'
> >>>
> >>>
> >>> Please vote for the next 72 hours (excluding the next weekend):
> >>>
> >>>  [ ] +1: Accept the proposal
> >>>  [ ] +0
> >>>  [ ] -1: I don’t think this is a good idea because …
> >>>
> >>>
> >>>
> >>> Bests,
> >>> Kent Yao
> >>>
> >>> [1] https://lists.apache.org/thread/xojcdlw77pht9bs4mt4087ynq6k9sbqq
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>>
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Apache Spark 4.0.0-preview2 (?)

2024-09-06 Thread Dongjoon Hyun
Hi, All.

Since the Apache Spark 4.0.0-preview1 tag was created in May, it's been
over 3 months.

https://github.com/apache/spark/releases/tag/v4.0.0-preview1 (2024-05-28)

Almost 1k commits, including improvements, refactoring, and bug fixes, have
landed on the `master` branch.

$ git log --oneline master a78ef73..HEAD | wc -l
 965

Given the progress on SPARK-44111 and related issues, I believe we had better
release `Preview2` this month in order to get more feedback on the recent
progress.

- https://issues.apache.org/jira/browse/SPARK-44111
  (Prepare Apache Spark 4.0.0)

WDYT? I'm also volunteering as the release manager for Apache Spark
4.0.0-preview2.

Dongjoon.


Re: [DISCUSS] Document and Feature Preview via GitHub Pages

2024-09-04 Thread Dongjoon Hyun
+1

It looks like a good approach. I believe we can take advantage of it in
Spark subprojects like `spark-kubernetes-operator` too.

Thanks,
Dongjoon.

On 2024/09/04 10:26:58 Kent Yao wrote:
> Hi all,
> 
> In this discussion, I propose using GitHub Pages to host and
> deploy documentation built from the master branch, providing an easy
> and timely way for users to do technical previews, and for developers
> to do ad hoc post reviews of docs-related PRs.
> 
> Please refer to the document below for more information.
> 
> https://docs.google.com/document/d/1D6nGOsnZ5aI4YZr7SDbcC3TNljPq6jcmBs_r-RPkgmM/edit#heading=h.9dsuxyx77nf4
> 
> WDYT?
> 
> Bests,
> Kent Yao
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Deprecate SparkR

2024-08-21 Thread Dongjoon Hyun
+1

Dongjoon

On 2024/08/21 19:00:46 Holden Karau wrote:
> +1
> 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
> 
> 
> On Wed, Aug 21, 2024 at 8:59 PM Herman van Hovell
>  wrote:
> 
> > +1
> >
> > On Wed, Aug 21, 2024 at 2:55 PM Martin Grund 
> > wrote:
> >
> >> +1
> >>
> >> On Wed, Aug 21, 2024 at 20:26 Xiangrui Meng  wrote:
> >>
> >>> +1
> >>>
> >>> On Wed, Aug 21, 2024, 10:24 AM Mridul Muralidharan 
> >>> wrote:
> >>>
>  +1
> 
> 
>  Regards,
>  Mridul
> 
> 
>  On Wed, Aug 21, 2024 at 11:46 AM Reynold Xin
>   wrote:
> 
> > +1
> >
> > On Wed, Aug 21, 2024 at 6:42 PM Shivaram Venkataraman <
> > shivaram.venkatara...@gmail.com> wrote:
> >
> >> Hi all
> >>
> >> Based on the previous discussion thread [1], I hereby call a vote to
> >> deprecate the SparkR module in Apache Spark with the upcoming Spark 4
> >> release and remove it in the next major release Spark 5.
> >>
> >> [ ] +1: Accept the proposal
> >> [ ] +0
> >> [ ] -1: I don’t think this is a good idea because ..
> >>
> >> This vote will be open for the next 72 hours
> >>
> >> Thanks
> >> Shivaram
> >>
> >> [1] https://lists.apache.org/thread/qjgsgxklvpvyvbzsx1qr8o533j4zjlm5
> >>
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Deprecating SparkR

2024-08-12 Thread Dongjoon Hyun
+1

Dongjoon

On Mon, Aug 12, 2024 at 17:52 Holden Karau  wrote:

> +1
>
> Are the sparklyr folks on this list?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
>
> On Mon, Aug 12, 2024 at 5:22 PM Xiao Li  wrote:
>
>> +1
>>
>> Hyukjin Kwon wrote on Mon, Aug 12, 2024 at 16:18:
>>
> +1
>>>
>>> On Tue, Aug 13, 2024 at 7:04 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 And just for the record, the stats that I screenshotted in that thread I
 linked to showed the following page views for each sub-section under
 `docs/latest/api/`:

 - python: 758K
 - java: 66K
 - sql: 39K
 - scala: 35K
 - r: <1K

 I don’t recall over what time period those stats were collected, and there
 are certainly factors in how the stats are gathered and how the various
 language API docs are accessed that impact those numbers. So it’s by no
 means a solid, objective measure. But I thought it was an interesting
 signal nonetheless.


 On Aug 12, 2024, at 5:50 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

 Not an R user myself, but +1.

 I first wondered about the future of SparkR after noticing how low the
 visit stats were for the R API docs as compared to Python and Scala. (I
 can’t seem to find those visit stats for the API docs anymore.)


 On Aug 12, 2024, at 11:47 AM, Shivaram Venkataraman <
 shivaram.venkatara...@gmail.com> wrote:

 Hi

 About ten years ago, I created the original SparkR package as part of
 my research at UC Berkeley [SPARK-5654]. After my PhD I
 started as a professor at UW-Madison and my contributions to SparkR have
 been in the background given my availability. I continue to be involved in
 the community and teach a popular course at UW-Madison which uses Apache
 Spark for programming assignments.

 As the original contributor and author of a research paper on SparkR, I
 also continue to get private emails from users. A common question I get is
 whether one should use SparkR in Apache Spark or the sparklyr package
 (built on top of Apache Spark). You can also see this in StackOverflow
 questions and other blog posts online:
 https://www.google.com/search?q=sparkr+vs+sparklyr . While I have
 encouraged users to choose the SparkR package as it is maintained by the
 Apache project, the more I looked into sparklyr, the more I was convinced
 that it is a better choice for R users that want to leverage the power of
 Spark:

 (1) sparklyr is developed by a community of developers who understand
 the R programming language deeply, and as a result is more idiomatic. In
 hindsight, sparklyr’s more idiomatic approach would have been a better
 choice than the Scala-like API we have in SparkR.

 (2) Contributions to SparkR have decreased slowly. Over the last two
 years, there have been 65 commits on the Spark R codebase (compared to
 ~2200 on the Spark Python code base). In contrast, sparklyr has over 300
 commits in the same period.

 (3) Previously, using and deploying sparklyr had been cumbersome as it
 needed careful alignment of versions between Apache Spark and sparklyr.
 However, the sparklyr community has implemented a new Spark Connect based
 architecture which eliminates this issue.

 (4) The sparklyr community has maintained their package on CRAN – it
 takes some effort to do this as the CRAN release process requires passing a
 number of tests. While SparkR was on CRAN initially, we could not maintain
 that given our release process and cadence. This makes sparklyr much more
 accessible to the R community.

 So it is with a bittersweet feeling that I’m writing this email to
 propose that we deprecate SparkR, and recommend sparklyr as the R language
 binding for Spark. This will reduce complexity of our own codebase, and
 more importantly reduce confusion for users. As the sparklyr package is
 distributed using the same permissive license as Apache Spark, there should
 be no downside for existing SparkR users in adopting it.

 My proposal is to mark SparkR as deprecated in the upcoming Spark 4
 release, and remove it in the next major release.

Re: Welcome new Apache Spark committers

2024-08-12 Thread Dongjoon Hyun
Congratulations, Martin, Haejoon, Allison. :)

Dongjoon

On Mon, Aug 12, 2024 at 5:19 PM Hyukjin Kwon  wrote:

> Hi all,
>
> The Spark PMC recently voted to add three new committers. Please join me
> in welcoming them to their new role!
>
> - Martin Grund
> - Haejoon Lee
> - Allison Wang
>
> They consistently made contributions to the project and clearly showed
> their expertise. We are very excited to have them join as committers!
>
>


Re: Welcoming a new PMC member

2024-08-12 Thread Dongjoon Hyun
Congratulations, Kent.

Dongjoon.

On Mon, Aug 12, 2024 at 5:22 PM Xiao Li  wrote:

> Congratulations !
>
> Hyukjin Kwon wrote on Mon, Aug 12, 2024 at 17:20:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add a new PMC member, Kent Yao. Join me
>> in welcoming him to his new role!
>>
>>


Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Dongjoon Hyun
+1 for the proposals:
- enhancing the release process to put the docs into the `release` directory
so they get archived.
- uploading old releases' docs manually via SVN so they get archived.

Since deletion is not in the scope of this vote, I don't see any risk here.
Thank you, Kent.

Dongjoon.
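
For reference, a minimal sketch of both steps as plain SVN URL operations
(the exact source and destination paths follow the links in Kent's proposal
below, and are assumptions):

$ # relocate the staged docs into the release tree
$ svn move -m "Archive Spark 3.5.2 docs" \
    https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs \
    https://dist.apache.org/repos/dist/release/spark/docs/3.5.2

$ # import an old release's docs from a spark-website checkout
$ svn import -m "Archive Spark 2.4.8 docs" site/docs/2.4.8 \
    https://dist.apache.org/repos/dist/release/spark/docs/2.4.8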

On 2024/08/12 09:07:47 Kent Yao wrote:
> Archive Spark Documentations in Apache Archives
> 
> Hi dev,
> 
> To address the issue of the Spark website repository size
> reaching the storage limit for GitHub-hosted runners [1], I suggest
> enhancing step [2] in our release process by relocating the
> documentation releases from the dev[3] directory to the release
> directory[4]. Then it would be captured by the Apache Archives
> service[5] to create permanent links, which would be alternative
> endpoints for our documentation, like
> 
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html
> for
> https://spark.apache.org/docs/3.5.2/index.html
> 
> Note that the previous example still uses the staging repository,
> which will become
> https://archive.apache.org/dist/spark/docs/3.5.2/index.html.
> 
> For older releases hosted on the Spark website [6], we also need to
> upload them via SVN manually.
> 
> After that, when we reach the threshold again, we can delete some of
> the old ones on page [6], and update their links on page [7] or use
> redirection.
> 
> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-49209
> 
> Please vote on the idea of 'Archive Spark Documentations in
> Apache Archives' for the next 72 hours:
> 
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
> 
> Bests,
> Kent Yao
> 
> [1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq
> [2] 
> https://spark.apache.org/release-process.html#upload-to-apache-release-directory
> [3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
> [4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2
> [5] https://archive.apache.org/dist/spark/
> [6] https://github.com/apache/spark-website/tree/asf-site/site/docs
> [7] https://spark.apache.org/documentation.html
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Dongjoon Hyun
Yeah, I agree that we need to investigate what happened with the PySpark 3.5+ docs.

For the old Spark docs, the size seems negligible.

- All Spark 0.x docs:  231M
- All Spark 1.x docs: 1.3G
- All Spark 2.x docs: 3.4G

For example, the total size of all the old Spark docs above is less than that
of the following 4 releases' docs.

1.1G ./3.5.0
1.2G ./3.5.1
1.2G ./3.5.2 RC2
1.1G ./4.0.0-preview1

So, if we do start something, we had better focus on the latest docs first,
in reverse order.

Dongjoon
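
A minimal sketch of Sean's tarball idea against a spark-website checkout (the
version list here is illustrative only):

$ cd site/docs
$ # compress each old release's docs and drop the expanded HTML tree
$ for v in 1.6.3 2.4.8 3.0.3; do tar czf "$v.tgz" "$v" && rm -r "$v"; done
$ du -sh *.tgz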

On Thu, Aug 8, 2024 at 11:22 AM Sean Owen  wrote:

> Whoa! Is there any clear reason why 3.5 docs are so big? 1GB of docs / 10x
> jump seems crazy. Maybe we need to investigate and fix that also.
>
> I take it that the problem is the size of the repo once it's cloned into
> the docker container. Removing the .html files helps that, but, then we
> don't have .html docs in the published site!
> We can generate them in the build process, but I presume it's waaay too
> long to rebuild docs for every release every time.
>
> I do support at *least* tarring up old .html docs from old releases
> (<3.0?) and making them available somehow on the site, so that they're
> accessible if needed.
>
> Analytics says that page views for docs before 3.1 are quite minimal,
> probably hundreds of views this year at best vs 10M total views:
>
> https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=year&date=2024-08-07&category=General_Actions&subcategory=General_Pages
>
> On Thu, Aug 8, 2024 at 12:42 PM Dongjoon Hyun 
> wrote:
>
>> The culprit seems to be the PySpark documentation, which grew ~11x
>> at 3.5+
>>
>> $ du -h 3.4.3/api/python | tail -n1
>>  84M 3.4.3/api/python
>>
>> $ du -h 3.5.1/api/python | tail -n1
>> 943M 3.5.1/api/python
>>
>> Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x,
>> 4.1.x, the proposed tarball idea sounds promising to me too.
>>
>> $ ls -alh 3.5.1.tgz
>> -rw-r--r--  1 dongjoon  staff   103M Aug  8 10:22 3.5.1.tgz
>>
>> Specifically, shall we keep HTML files for only the latest version of
>> live releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?
>>
>> In other words, all 0.x ~ 3.4.2 and 3.5.0 would become tarball files in
>> the current status.
>>
>> Dongjoon.
>>
>>
>> On Thu, Aug 8, 2024 at 10:01 AM Sean Owen  wrote:
>>
>>> I agree with 'archiving', but what does that mean? delete from the repo
>>> and site?
>>> While I really doubt people are looking for docs for, say, 0.5.0, it'd
>>> be a big jump to totally remove it.
>>>
>>> What if we made a compressed tarball of old docs and put that in the
>>> repo, linked to it, and removed the docs files for many old releases?
>>> It's still in the repo and will be in the container when docs are built,
>>> but, compressed would be much smaller.
>>> That could buy a significant amount of time.
>>>
>>> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao  wrote:
>>>
>>>> Hi dev,
>>>>
>>>> The current size of the spark-website repository is approximately 16GB,
>>>> exceeding the storage limit of GitHub-hosted runners. GitHub Actions
>>>> have recently been failing in the actions/checkout step due to
>>>> 'No space left on device' errors.
>>>>
>>>> Filesystem  Size  Used Avail Use% Mounted on
>>>> overlay  73G   58G   16G  80% /
>>>> tmpfs64M 0   64M   0% /dev
>>>> tmpfs   7.9G 0  7.9G   0% /sys/fs/cgroup
>>>> shm  64M 0   64M   0% /dev/shm
>>>> /dev/root73G   58G   16G  80% /__w
>>>> tmpfs   1.6G  1.2M  1.6G   1% /run/docker.sock
>>>> tmpfs   7.9G 0  7.9G   0% /proc/acpi
>>>> tmpfs   7.9G 0  7.9G   0% /proc/scsi
>>>> tmpfs   7.9G 0  7.9G   0% /sys/firmware
>>>>
>>>>
>>>> The documentation for each version contributes the most volume. Since
>>>> version 3.5.0, the documentation size has grown to 3-4 times the size
>>>> of 3.4.x, at more than 1GB.
>>>>
>>>>
>>>> 9.9M ./0.6.0
>>>>  10M ./0.6.1
>>>>  10M ./0.6.2
>>>>  15M ./0.7.0
>>>>  16M ./0.7.2
>>>>  16M ./0.7.3
>>>>  20M ./0.8.0
>>>>  20M ./0.8.1
>>>>  38M ./0.9.0
>>>>  

Re: [VOTE] Release Spark 3.5.2 (RC5)

2024-08-08 Thread Dongjoon Hyun
+1

I'm resending my vote.

Dongjoon.

On 2024/08/06 16:06:00 Kent Yao wrote:
> Hi dev,
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 3.5.2.
> 
> The vote is open until Aug 9, 17:00:00 UTC, and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 3.5.2
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see https://spark.apache.org/
> 
> The tag to be voted on is v3.5.2-rc5 (commit
> bb7846dd487f259994fdc69e18e03382e3f64f42):
> https://github.com/apache/spark/tree/v3.5.2-rc5
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1462/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
> 
> The list of bug fixes going into 3.5.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353980
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/pyspark-3.5.2.tar.gz";
> and see if anything important breaks.
> In Java/Scala, you can add the staging repository to your project's
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 3.5.2?
> ===
> 
> The current list of open tickets targeted at 3.5.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.5.2
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
> 
> Thanks,
> Kent Yao
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
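
For anyone verifying the RC artifacts mentioned above, a minimal sketch of the
signature check against the published KEYS file (the binary tarball name is an
assumption based on prior releases):

$ wget https://dist.apache.org/repos/dist/dev/spark/KEYS
$ gpg --import KEYS
$ wget https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/spark-3.5.2-bin-hadoop3.tgz
$ wget https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/spark-3.5.2-bin-hadoop3.tgz.asc
$ gpg --verify spark-3.5.2-bin-hadoop3.tgz.asc spark-3.5.2-bin-hadoop3.tgz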



Re: [VOTE] Release Spark 3.5.2 (RC5)

2024-08-08 Thread Dongjoon Hyun
Hi, Kent and all.

It seems that the vote replies are not archived in the mailing list for
some reason.

https://lists.apache.org/list.html?dev@spark.apache.org
https://lists.apache.org/thread/chos58kswjg3x9cotp5rn0oc7hnfc6o4

Dongjoon.


On Wed, Aug 7, 2024 at 1:44 PM John Zhuge  wrote:

> +1 (non-binding)
>
> Thanks for the great work!
>
> On Wed, Aug 7, 2024 at 8:55 AM L. C. Hsieh  wrote:
>
>> +1
>>
>> Thanks Kent.
>>
>> On Wed, Aug 7, 2024 at 8:31 AM Dongjoon Hyun  wrote:
>> >
>> > +1
>> >
>> > Thank you, Kent.
>> >
>> > Dongjoon.
>> >
>> > On 2024/08/06 16:06:00 Kent Yao wrote:
>> > > Hi dev,
>> > >
>> > > Please vote on releasing the following candidate as Apache Spark
>> version 3.5.2.
>> > >
>> > > The vote is open until Aug 9, 17:00:00 UTC, and passes if a majority
>> +1
>> > > PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Release this package as Apache Spark 3.5.2
>> > > [ ] -1 Do not release this package because ...
>> > >
>> > > To learn more about Apache Spark, please see
>> https://spark.apache.org/
>> > >
>> > > The tag to be voted on is v3.5.2-rc5 (commit
>> > > bb7846dd487f259994fdc69e18e03382e3f64f42):
>> > > https://github.com/apache/spark/tree/v3.5.2-rc5
>> > >
>> > > The release files, including signatures, digests, etc. can be found
>> at:
>> > > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/
>> > >
>> > > Signatures used for Spark RCs can be found in this file:
>> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > >
>> > > The staging repository for this release can be found at:
>> > >
>> https://repository.apache.org/content/repositories/orgapachespark-1462/
>> > >
>> > > The documentation corresponding to this release can be found at:
>> > > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
>> > >
>> > > The list of bug fixes going into 3.5.2 can be found at the following
>> URL:
>> > > https://issues.apache.org/jira/projects/SPARK/versions/12353980
>> > >
>> > > FAQ
>> > >
>> > > =
>> > > How can I help test this release?
>> > > =
>> > >
>> > > If you are a Spark user, you can help us test this release by taking
>> > > an existing Spark workload and running on this release candidate, then
>> > > reporting any regressions.
>> > >
>> > > If you're working in PySpark you can set up a virtual env and install
>> > > the current RC via "pip install
>> > >
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/pyspark-3.5.2.tar.gz
>> "
>> > > and see if anything important breaks.
>> > > In Java/Scala, you can add the staging repository to your project's
>> > > resolvers and test
>> > > with the RC (make sure to clean up the artifact cache before/after so
>> > > you don't end up building with an out of date RC going forward).
>> > >
>> > > ===
>> > > What should happen to JIRA tickets still targeting 3.5.2?
>> > > ===
>> > >
>> > > The current list of open tickets targeted at 3.5.2 can be found at:
>> > > https://issues.apache.org/jira/projects/SPARK and search for
>> > > "Target Version/s" = 3.5.2
>> > >
>> > > Committers should look at those and triage. Extremely important bug
>> > > fixes, documentation, and API tweaks that impact compatibility should
>> > > be worked on immediately. Everything else please retarget to an
>> > > appropriate release.
>> > >
>> > > ==
>> > > But my bug isn't fixed?
>> > > ==
>> > >
>> > > In order to make timely releases, we will typically not hold the
>> > > release unless the bug in question is a regression from the previous
>> > > release. That being said, if there is something which is a regression
>> > > that has not been correctly targeted please ping me or a committer to
>> > > help target the issue.
>> > >
>> > > Thanks,
>> > > Kent Yao
>> > >
>> > > -
>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > >
>> > >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> John Zhuge
>


Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Dongjoon Hyun
The culprit seems to be the PySpark documentation, which grew ~11x at
3.5+

$ du -h 3.4.3/api/python | tail -n1
 84M 3.4.3/api/python

$ du -h 3.5.1/api/python | tail -n1
943M 3.5.1/api/python

Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x,
4.1.x, the proposed tarball idea sounds promising to me too.

$ ls -alh 3.5.1.tgz
-rw-r--r--  1 dongjoon  staff   103M Aug  8 10:22 3.5.1.tgz

Specifically, shall we keep HTML files for only the latest version of live
releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?

In other words, all 0.x ~ 3.4.2 and 3.5.0 would become tarball files in the
current status.

Dongjoon.
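
To see where the 3.5.x growth comes from, a quick drill-down sketch (the
directory layout is assumed from the du output above):

$ du -sh 3.5.1/api/python/* | sort -h | tail -n5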


On Thu, Aug 8, 2024 at 10:01 AM Sean Owen  wrote:

> I agree with 'archiving', but what does that mean? delete from the repo
> and site?
> While I really doubt people are looking for docs for, say, 0.5.0, it'd be
> a big jump to totally remove it.
>
> What if we made a compressed tarball of old docs and put that in the repo,
> linked to it, and removed the docs files for many old releases?
> It's still in the repo and will be in the container when docs are built,
> but, compressed would be much smaller.
> That could buy a significant amount of time.
>
> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao  wrote:
>
>> Hi dev,
>>
>> The current size of the spark-website repository is approximately 16GB,
>> exceeding the storage limit of GitHub-hosted runners. GitHub Actions
>> have recently been failing in the actions/checkout step due to
>> 'No space left on device' errors.
>>
>> Filesystem  Size  Used Avail Use% Mounted on
>> overlay  73G   58G   16G  80% /
>> tmpfs64M 0   64M   0% /dev
>> tmpfs   7.9G 0  7.9G   0% /sys/fs/cgroup
>> shm  64M 0   64M   0% /dev/shm
>> /dev/root73G   58G   16G  80% /__w
>> tmpfs   1.6G  1.2M  1.6G   1% /run/docker.sock
>> tmpfs   7.9G 0  7.9G   0% /proc/acpi
>> tmpfs   7.9G 0  7.9G   0% /proc/scsi
>> tmpfs   7.9G 0  7.9G   0% /sys/firmware
>>
>>
>> The documentation for each version contributes the most volume. Since
>> version 3.5.0, the documentation size has grown to 3-4 times the size of
>> 3.4.x, at more than 1GB.
>>
>>
>> 9.9M ./0.6.0
>>  10M ./0.6.1
>>  10M ./0.6.2
>>  15M ./0.7.0
>>  16M ./0.7.2
>>  16M ./0.7.3
>>  20M ./0.8.0
>>  20M ./0.8.1
>>  38M ./0.9.0
>>  38M ./0.9.1
>>  38M ./0.9.2
>>  36M ./1.0.0
>>  38M ./1.0.1
>>  38M ./1.0.2
>>  48M ./1.1.0
>>  48M ./1.1.1
>>  73M ./1.2.0
>>  73M ./1.2.1
>>  74M ./1.2.2
>>  69M ./1.3.0
>>  73M ./1.3.1
>>  68M ./1.4.0
>>  70M ./1.4.1
>>  80M ./1.5.0
>>  78M ./1.5.1
>>  78M ./1.5.2
>>  87M ./1.6.0
>>  87M ./1.6.1
>>  87M ./1.6.2
>>  86M ./1.6.3
>> 117M ./2.0.0
>> 119M ./2.0.0-preview
>> 118M ./2.0.1
>> 118M ./2.0.2
>> 121M ./2.1.0
>> 121M ./2.1.1
>> 122M ./2.1.2
>> 122M ./2.1.3
>> 130M ./2.2.0
>> 131M ./2.2.1
>> 132M ./2.2.2
>> 131M ./2.2.3
>> 141M ./2.3.0
>> 141M ./2.3.1
>> 141M ./2.3.2
>> 142M ./2.3.3
>> 142M ./2.3.4
>> 145M ./2.4.0
>> 146M ./2.4.1
>> 145M ./2.4.2
>> 144M ./2.4.3
>> 145M ./2.4.4
>> 143M ./2.4.5
>> 143M ./2.4.6
>> 143M ./2.4.7
>> 143M ./2.4.8
>> 197M ./3.0.0
>> 185M ./3.0.0-preview
>> 197M ./3.0.0-preview2
>> 198M ./3.0.1
>> 198M ./3.0.2
>> 205M ./3.0.3
>> 239M ./3.1.1
>> 239M ./3.1.2
>> 239M ./3.1.3
>> 840M ./3.2.0
>> 842M ./3.2.1
>> 282M ./3.2.2
>> 244M ./3.2.3
>> 282M ./3.2.4
>> 295M ./3.3.0
>> 297M ./3.3.1
>> 297M ./3.3.2
>> 297M ./3.3.3
>> 297M ./3.3.4
>> 314M ./3.4.0
>> 314M ./3.4.1
>> 328M ./3.4.2
>> 324M ./3.4.3
>> 1.1G ./3.5.0
>> 1.2G ./3.5.1
>> 1.1G ./4.0.0-preview1
>>
>> I'm concerned about publishing the documentation for version 3.5.2
>> to the asf-site. So, I have merged PR[2] to eliminate this potential
>> blocker.
>>
>> Considering that the problem still exists, should we temporarily archive
>> some of the outdated version documents? For example, only keep
>> the latest version for each feature release in the asf-site branch. Or
>> do you have any other suggestions?
>>
>>
>> Bests,
>> Kent Yao
>>
>>
>> [1]
>> https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
>> [2] https://github.com/apache/spark-website/pull/543
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Spark 3.5.2 (RC5)

2024-08-07 Thread Dongjoon Hyun
+1

Thank you, Kent.

Dongjoon.

On 2024/08/06 16:06:00 Kent Yao wrote:
> Hi dev,
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 3.5.2.
> 
> The vote is open until Aug 9, 17:00:00 UTC, and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 3.5.2
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see https://spark.apache.org/
> 
> The tag to be voted on is v3.5.2-rc5 (commit
> bb7846dd487f259994fdc69e18e03382e3f64f42):
> https://github.com/apache/spark/tree/v3.5.2-rc5
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1462/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
> 
> The list of bug fixes going into 3.5.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353980
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-bin/pyspark-3.5.2.tar.gz";
> and see if anything important breaks.
> In Java/Scala, you can add the staging repository to your project's
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 3.5.2?
> ===
> 
> The current list of open tickets targeted at 3.5.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.5.2
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
> 
> Thanks,
> Kent Yao
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [NOTICE] Progress of 3.5.2-RC5

2024-08-01 Thread Dongjoon Hyun
Thank you for summarizing them and leading the release, Kent. :)

Dongjoon.

On Wed, Jul 31, 2024 at 10:39 PM Kent Yao  wrote:

> Hi dev,
>
> Since version 3.5.2-RC4, we have received several reports regarding
> correctness issues, some of which are still unresolved. We will need a
> few days to address the unresolved issues listed below. The RC5 vote
> might be delayed until late next week.
>
> === FIXED ===
> https://issues.apache.org/jira/browse/SPARK-49000
> https://issues.apache.org/jira/browse/SPARK-49054
>
> === ONGOING ===
> https://issues.apache.org/jira/browse/SPARK-48950
> https://issues.apache.org/jira/browse/SPARK-49030
>
>
>
> Thanks,
> Kent Yao
>
>
> [1] https://lists.apache.org/thread/9lj57fh3zbo2h4koh5hr7nhdky21p6zg
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 3.5.2 (RC4)

2024-07-26 Thread Dongjoon Hyun
+1

Thank you, Kent.

Dongjoon.

On Fri, Jul 26, 2024 at 6:37 AM Kent Yao  wrote:

> Hi dev,
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.5.2.
>
> The vote is open until Jul 29, 14:00:00 UTC, and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.5.2-rc4 (commit
> 1edbddfadeb46581134fa477d35399ddc63b7163):
> https://github.com/apache/spark/tree/v3.5.2-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1460/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-docs/
>
> The list of bug fixes going into 3.5.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353980
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc4-bin/pyspark-3.5.2.tar.gz
> "
> and see if anything important breaks.
> In the Java/Scala, you can add the staging repository to your projects
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.5.2?
> ===
>
> The current list of open tickets targeted at 3.5.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.5.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Kent Yao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 3.5.2 (RC3)

2024-07-25 Thread Dongjoon Hyun
Hi, Kent.

Sorry, but I unavoidably need to cast -1 for RC3.

Unlike in RC0 and RC1 testing, I found that the RC3 distribution fails to
build the PySpark Docker image.

This is due to an OS change in the external Java 17 Docker image (which
happened two days ago), not the Spark binaries.

You can see the recent failures in branch-3.5 and branch-3.4 daily CIs, too.

- https://github.com/apache/spark/actions/workflows/build_branch35.yml
- https://github.com/apache/spark/actions/workflows/build_branch34.yml

The patch has now landed on all live release branches (branch-3.5 and
branch-3.4).

https://issues.apache.org/jira/browse/SPARK-49005
[SPARK-49005][K8S][3.5] Use 17-jammy tag instead of 17 to prevent Python
3.12
[SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent
Python 3.12

FYI, Python 3.12 support was added only in Apache Spark 4.0.0, so the
`master` branch is not affected.

Dongjoon.


On Thu, Jul 25, 2024 at 6:06 AM Kent Yao  wrote:

> Hi dev,
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.5.2.
>
> The vote is open until Jul 28, 13:00:00 UTC, and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.5.2-rc3 (commit
> ebda6a6a97bf0b3932b970801f4c2f5dc6ae81d4):
> https://github.com/apache/spark/tree/v3.5.2-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1459/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc3-docs/
>
> The list of bug fixes going into 3.5.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353980
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc3-bin/pyspark-3.5.2.tar.gz
> "
> and see if anything important breaks.
> In the Java/Scala, you can add the staging repository to your projects
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.5.2?
> ===
>
> The current list of open tickets targeted at 3.5.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.5.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Kent Yao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [外部邮件] [VOTE] Release Spark 3.5.2 (RC2)

2024-07-23 Thread Dongjoon Hyun
+1

Dongjoon.

On 2024/07/24 03:28:58 Wenchen Fan wrote:
> +1
> 
> On Wed, Jul 24, 2024 at 10:51 AM Kent Yao  wrote:
> 
> > +1(non-binding), I have checked:
> >
> > - Download links are OK
> > - Signatures, Checksums, and the KEYS file are OK
> > - LICENSE and NOTICE are present
> > - No unexpected binary files in source releases
> > - Successfully built from source
> >
> > Thanks,
> > Kent Yao
> >
> > On 2024/07/23 06:55:28 yangjie01 wrote:
> > > +1, Thanks Kent Yao ~
> > >
> > > On 2024/7/22 17:01, "Kent Yao" <y...@apache.org> wrote:
> > >
> > >
> > > Hi dev,
> > >
> > >
> > > Please vote on releasing the following candidate as Apache Spark version
> > 3.5.2.
> > >
> > >
> > > The vote is open until Jul 25, 09:00:00 AM UTC, and passes if a majority
> > +1
> > > PMC votes are cast, with
> > > a minimum of 3 +1 votes.
> > >
> > >
> > > [ ] +1 Release this package as Apache Spark 3.5.2
> > > [ ] -1 Do not release this package because ...
> > >
> > >
> > > To learn more about Apache Spark, please see https://spark.apache.org/
> > >
> > >
> > > The tag to be voted on is v3.5.2-rc2 (commit
> > > 6d8f511430881fa7a3203405260da174df424103):
> > > https://github.com/apache/spark/tree/v3.5.2-rc2
> > >
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-bin/
> > >
> > >
> > > Signatures used for Spark RCs can be found in this file:
> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >
> > >
> > > The staging repository for this release can be found at:
> > > https://repository.apache.org/content/repositories/orgapachespark-1458/
> > 
> > >
> > >
> > > The documentation corresponding to this release can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-docs/
> > >
> > >
> > > The list of bug fixes going into 3.5.2 can be found at the following URL:
> > > https://issues.apache.org/jira/projects/SPARK/versions/12353980
> > >
> > >
> > > FAQ
> > >
> > >
> > > =
> > > How can I help test this release?
> > > =
> > >
> > >
> > > If you are a Spark user, you can help us test this release by taking
> > > an existing Spark workload and running on this release candidate, then
> > > reporting any regressions.
> > >
> > >
> > > If you're working in PySpark you can set up a virtual env and install
> > > the current RC via "pip install
> > >
> > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc2-bin/pyspark-3.5.2.tar.gz"
> > > and see if anything important breaks.
> > > In the Java/Scala, you can add the staging repository to your projects
> > > resolvers and test
> > > with the RC (make sure to clean up the artifact cache before/after so
> > > you don't end up building with an out of date RC going forward).
> > >
> > >
> > > ===
> > > What should happen to JIRA tickets still targeting 3.5.2?
> > > ===
> > >
> > >
> > > The current list of open tickets targeted at 3.5.2 can be found at:
> > > https://issues.apache.org/jira/projects/SPARK and search for
> > > "Target Version/s" = 3.5.2
> > >
> > >
> > > Committers should look at those and triage. Extremely important bug
> > > fixes, documentation, and API tweaks that impact compatibility should
> > > be worked on immediately. Everything else please retarget to an
> > > appropriate release.
> > >
> > >
> > > ==
> > > But my bug isn't fixed?
> > > ==
> > >
> > >
> > > In order to make timely releases, we will typically not hold the
> > > release unless the bug in question is a regression from the previous
> > > release. That being said, if there is something which is a regression
> > > that has not been correctly targeted please ping me or a committer to
> > > help target the issue.
> > >
> > >
> > > Thanks,
> > > Kent Yao
> > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> > >
> > >
> > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-23 Thread Dongjoon Hyun
I'm bumping this thread because the overhead is already biting us back. Here is
a commit merged 3 hours ago.

https://github.com/apache/spark/pull/47453
[SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML 
reader/writer

In short, contrary to the original PRs' claims, this commit starts to create a
`SparkSession` in this layer. Although I understand why Hyukjin and Martin
claim that a `SparkSession` will be there anyway, this is an architectural
change that we need to decide on explicitly, not implicitly.
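
To make the difference concrete, here is a minimal Scala sketch of the two
calls involved (illustrative only, not the PR's exact code):

  import org.apache.spark.sql.SparkSession

  // getActiveSession only returns an already-existing session wrapped in an
  // Option; it never constructs one, so it cannot introduce a new
  // SparkSession into this layer.
  val maybeActive: Option[SparkSession] = SparkSession.getActiveSession

  // builder.getOrCreate() falls back to constructing a brand-new SparkSession
  // (and, if needed, its underlying SparkContext) when none is active yet.
  val session: SparkSession = SparkSession.builder().getOrCreate()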

> On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
> We actually get the active Spark session so it doesn't cause overhead. Also
> even we create, it will create once which should be pretty trivial overhead.

If this architectural change is truly inevitable and needs to happen in
Apache Spark 4.0.0, can we have a dev document about it? If there is no
proper place, we can simply add it to the ML migration guide.

Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [外部邮件] [VOTE] Differentiate Spark without Spark Connect from Spark Connect

2024-07-23 Thread Dongjoon Hyun
+1 for the proposed definition.

Thanks,
Dongjoon


On Tue, Jul 23, 2024 at 6:42 AM Xianjin YE  wrote:

> +1 (non-binding)
>
> On Jul 23, 2024, at 16:16, Jungtaek Lim 
> wrote:
>
> +1 (non-binding)
>
> On Tue, Jul 23, 2024 at 1:51 PM  wrote:
>
>>
>> +1
>>
>> On Jul 22, 2024, at 21:42, John Zhuge  wrote:
>>
>> 
>> +1 (non-binding)
>>
>> On Mon, Jul 22, 2024 at 8:16 PM yangjie01 
>> wrote:
>>
>>> +1
>>>
>>> On 2024/7/23 11:11, "Kent Yao" <y...@apache.org> wrote:
>>>
>>>
>>> +1
>>>
>>>
>>> On 2024/07/23 02:04:17 Herman van Hovell wrote:
>>> > +1
>>> >
>>> > On Mon, Jul 22, 2024 at 8:56 PM Wenchen Fan wrote:
>>> >
>>> > > +1
>>> > >
>>> > > On Tue, Jul 23, 2024 at 8:40 AM Xinrong Meng wrote:
>>> > >
>>> > >> +1
>>> > >>
>>> > >> Thank you @Hyukjin Kwon <gurwls...@apache.org> !
>>> > >>
>>> > >> On Mon, Jul 22, 2024 at 5:20 PM Gengliang Wang wrote:
>>> > >>
>>> > >>> +1
>>> > >>>
>>> > >>> On Mon, Jul 22, 2024 at 5:19 PM Hyukjin Kwon wrote:
>>> > >>>
>>> >  Starting with my own +1.
>>> > 
>>> >  On Tue, 23 Jul 2024 at 09:12, Hyukjin Kwon >> >
>>> >  wrote:
>>> > 
>>> > > Hi all,
>>> > >
>>> > > I’d like to start a vote for differentiating "Spark without Spark
>>> > > Connect" as "Spark Classic".
>>> > >
>>> > > Please also refer to:
>>> > >
>>> > > - Discussion thread:
>>> > > https://lists.apache.org/thread/ys7zsod8cs9c7qllmf0p0msk6z2mz2ym
>>> 
>>> > >
>>> > > Please vote on the SPIP for the next 72 hours:
>>> > >
>>> > > [ ] +1: Accept the proposal
>>> > > [ ] +0
>>> > > [ ] -1: I don’t think this is a good idea because …
>>> > >
>>> > > Thank you!
>>> > >
>>> > 
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> dev-unsubscr...@spark.apache.org>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> John Zhuge
>>
>>
>


Re: [DISCUSS] Differentiate Spark without Spark Connect from Spark Connect

2024-07-22 Thread Dongjoon Hyun
Thank you for opening this thread, Hyukjin.

In this discussion thread, we have three terminologies, (1) ~ (3).

> Spark Classic (vs. Spark Connect)

1. Spark
2. Spark Classic (= A proposal for Spark without Spark Connect)
3. Spark Connect

As Holden and Jungtaek mentioned, 

- (1) is definitely the existing code base, which includes everything (the RDD
API, the Spark Thrift Server, Spark Connect, and so on).

- (3) is a very specific use case where a user runs a Spark binary distribution
with the `--remote` option (or enables the related features). Like the Spark
Thrift Server, after the query planning steps there is no fundamental
difference on the execution side in Spark clusters or Spark jobs.

- (2) By the proposed definition, (2) `Spark Classic` is not (1) `Spark`. Like
`--remote`, it's one of the runnable modes.

To be clear, is the proposal aiming to have us say A instead of B in
our documentation?

A. Since `Spark Connect` mode has no RDD API, we need to use `Spark Classic` 
mode instead.
B. Since `Spark Connect` mode has no RDD API, we need to use `Spark without 
Spark Connect` mode instead.

Dongjoon.



On 2024/07/22 12:59:54 Sadha Chilukoori wrote:
> +1  (non-binding) for classic.
> 
> On Mon, Jul 22, 2024 at 3:59 AM Martin Grund 
> wrote:
> 
> > +1 for classic. It's simple, easy to understand and it doesn't have the
> > negative meanings like legacy for example.
> >
> > On Sun, Jul 21, 2024 at 23:48 Wenchen Fan  wrote:
> >
> >> Classic SGTM.
> >>
> >> On Mon, Jul 22, 2024 at 1:12 PM Jungtaek Lim <
> >> kabhwan.opensou...@gmail.com> wrote:
> >>
> >>> I'd propose not to change the name of "Spark Connect" - the name
> >>> represents the characteristic of the mode (separation of layer for client
> >>> and server). Trying to remove the part of "Connect" would just make
> >>> confusion.
> >>>
> >>> +1 for Classic to existing mode, till someone comes up with better
> >>> alternatives.
> >>>
> >>> On Mon, Jul 22, 2024 at 8:50 AM Hyukjin Kwon 
> >>> wrote:
> >>>
>  I was thinking about a similar option too but I ended up giving this up
>  .. It's quite unlikely at this moment but suppose that we have another
>  Spark Connect-ish component in the far future and it would be challenging
>  to come up with another name ... Another case is that we might have to 
>  cope
>  with the cases like Spark Connect, vs Spark (with Spark Connect) and 
>  Spark
>  (without Spark Connect) ..
> 
>  On Sun, 21 Jul 2024 at 09:59, Holden Karau 
>  wrote:
> 
> > I think perhaps Spark Connect could be phrased as “Basic* Spark” &
> > existing Spark could be “Full Spark” given the API limitations of Spark
> > connect.
> >
> > *I was also thinking Core here but we’ve used core to refer to the RDD
> > APIs for too long to reuse it here.
> >
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> > https://amzn.to/2MaRAG9  
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> >
> > On Sat, Jul 20, 2024 at 8:02 PM Xiao Li  wrote:
> >
> >> Classic is much better than Legacy. : )
> >>
> >> On Thu, Jul 18, 2024 at 16:58, Hyukjin Kwon wrote:
> >>
> >>> Hi all,
> >>>
> >>> I noticed that we need to standardize our terminology before moving
> >>> forward. For instance, when documenting, 'Spark without Spark 
> >>> Connect' is
> >>> too long and verbose. Additionally, I've observed that we use various 
> >>> names
> >>> for Spark without Spark Connect: Spark Classic, Classic Spark, Legacy
> >>> Spark, etc.
> >>>
> >>> I propose that we consistently refer to it as Spark Classic (vs.
> >>> Spark Connect).
> >>>
> >>> Please share your thoughts on this. Thanks!
> >>>
> >>
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Dongjoon Hyun
Hi, All.

Apache Spark's RDD API has played an essential and invaluable role from the
beginning, and it will continue to do so even though it's not supported by
Spark Connect.

I have a concern about recent activity which blindly replaces RDD usage with
SparkSession.

For instance,

https://github.com/apache/spark/pull/47328
[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe
read / write API

This PR doesn't look proper to me, in two ways:
- SparkSession is a heavier object than SparkContext.
- According to the following PR description, the background is also hidden
from the community.

  > # Why are the changes needed?
  > In databricks runtime, RDD read / write API has some issue for certain
storage types
  > that requires the account key, but Dataframe read / write API works.

In addition, we don't know whether this PR actually fixes the mentioned
unknown storage's issue, because it's not testable within the community's
test coverage.

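To illustrate the two styles being compared, here is a rough Scala sketch
(the helper names are hypothetical and for illustration only, not the PR's
actual code):

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SparkSession

  // RDD-based metadata write: needs only the already-available SparkContext.
  def saveWithRdd(sc: SparkContext, metadataJson: String, path: String): Unit =
    sc.parallelize(Seq(metadataJson), numSlices = 1).saveAsTextFile(path)

  // DataFrame-based metadata write: pulls in a SparkSession, a heavier object
  // that wraps the SparkContext together with SQL and catalog state.
  def saveWithDataFrame(spark: SparkSession, metadataJson: String,
      path: String): Unit = {
    import spark.implicits._
    Seq(metadataJson).toDF("value").write.text(path)
  }
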
I'm wondering if the Apache Spark community aims to move away from RDD
usage in favor of `Spark Connect`. Isn't it too early, given that `Spark
Connect` is not even GA in the community?

Dongjoon.


Re: [DISCUSS] Release Apache Spark 3.5.2

2024-07-11 Thread Dongjoon Hyun
Thank you for the head-up and volunteering, Kent.

+1 for 3.5.2 release.

I can help you with the release steps which require Spark PMC permissions.

Please let me know if you have any questions or hit any issues.

Thanks,
Dongjoon.


On Thu, Jul 11, 2024 at 2:04 AM Kent Yao  wrote:

> Hi dev,
>
> It's been approximately 5 months since Feb 23, 2024, when
> we released version 3.5.1 for branch-3.5. The patchset differing
> from 3.5.1 has grown significantly, now consisting of over 160
> commits.
>
> The JIRA[2] also indicates that more than 120 resolved tickets are aimed
> at version 3.5.2, including some blockers and critical issues.
>
> What do you think about releasing 3.5.2? I am volunteering to take on
> the role of
> release manager for 3.5.2.
>
>
> Bests,
> Kent Yao
>
> [1] https://spark.apache.org/news/spark-3-5-1-released.html
> [2] https://issues.apache.org/jira/projects/SPARK/versions/12353980
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-03 Thread Dongjoon Hyun
+1

Dongjoon

On Wed, Jul 3, 2024 at 10:58 Xinrong Meng  wrote:

> +1
>
> Thank you @Hyukjin Kwon  !
>
> On Wed, Jul 3, 2024 at 8:55 AM bo yang  wrote:
>
>> +1 (non-binding)
>>
>
>> On Tue, Jul 2, 2024 at 11:22 PM Cheng Pan  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> On Jul 3, 2024, at 08:59, Hyukjin Kwon  wrote:
>>>
>>> Hi all,
>>>
>>> I’d like to start a vote for moving Spark Connect server to builtin
>>> package (Client API layer stays external).
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/odlx9b552dp8yllhrdlp24pf9m9s4tmx
>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-48763
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thank you!
>>>
>>>
>>>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Dongjoon Hyun
+1

On Sun, May 12, 2024 at 3:50 PM huaxin gao  wrote:

> +1
>
> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>> >
>> > +1
>> >
>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>> >>
>> >> Please also refer to:
>> >>
>> >>- Discussion thread:
>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>> >>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> >>
>> >>
>> >> Please vote on the SPIP for the next 72 hours:
>> >>
>> >> [ ] +1: Accept the proposal as an official SPIP
>> >> [ ] +0
>> >> [ ] -1: I don’t think this is a good idea because …
>> >>
>> >>
>> >> Thank you!
>> >>
>> >> Liang-Chi Hsieh
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Please retry the upload, Wenchen. The ASF Infra team bumped up our upload
limit based on our request.

> Your upload limit has been increased to 650MB

Dongjoon.



On Thu, May 9, 2024 at 8:12 AM Wenchen Fan  wrote:

> I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776
>
> On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
> wrote:
>
>> In addition, FYI, I was the latest release manager with Apache Spark
>> 3.4.3 (2024-04-15 Vote)
>>
>> According to my work log, I uploaded the following binaries to SVN from
>> EC2 (us-west-2) without any issues.
>>
>> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
>> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
>> spark-3.4.3-bin-hadoop3-scala2.13.tgz
>> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
>> spark-3.4.3-bin-hadoop3.tgz
>> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
>> spark-3.4.3-bin-without-hadoop.tgz
>> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
>> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>>
>> Since Apache Spark 4.0.0-preview doesn't have a Scala 2.12 combination, the
>> total size should be smaller than the 3.4.3 binaries.
>>
>> Given that, if there is any INFRA change, that could happen after 4/15.
>>
>> Dongjoon.
>>
>> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Could you file an INFRA JIRA issue with the error message and context
>>> first, Wenchen?
>>>
>>> As you know, if we see something, we had better file a JIRA issue
>>> because it could be not only an Apache Spark project issue but also all ASF
>>> project issues.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>>>
>>>> UPDATE:
>>>>
>>>> After resolving a few issues in the release scripts, I can finally
>>>> build the release packages. However, I can't upload them to the staging SVN
>>>> repo due to a transmitting error, and it seems like a limitation from the
>>>> server side. I tried it on both my local laptop and remote AWS instance,
>>>> but neither works. These package binaries are like 300-400 MBs, and we just
>>>> did a release last month. Not sure if this is a new limitation due to cost
>>>> saving.
>>>>
>>>> While I'm looking for help to get unblocked, I'm wondering if we can
>>>> upload release packages to a public git repo instead, under the Apache
>>>> account?
>>>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
In addition, FYI, I was the latest release manager with Apache Spark 3.4.3
(2024-04-15 Vote)

According to my work log, I uploaded the following binaries to SVN from EC2
(us-west-2) without any issues.

-rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
-rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
spark-3.4.3-bin-hadoop3-scala2.13.tgz
-rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
spark-3.4.3-bin-hadoop3.tgz
-rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
spark-3.4.3-bin-without-hadoop.tgz
-rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
-rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz

Since Apache Spark 4.0.0-preview doesn't have a Scala 2.12 combination, the
total size should be smaller than the 3.4.3 binaries.

Given that, if there is any INFRA change, that could happen after 4/15.

Dongjoon.

On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
wrote:

> Could you file an INFRA JIRA issue with the error message and context
> first, Wenchen?
>
> As you know, if we see something, we had better file a JIRA issue because
> it could be not only an Apache Spark project issue but also all ASF project
> issues.
>
> Dongjoon.
>
>
> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>
>> UPDATE:
>>
>> After resolving a few issues in the release scripts, I can finally build
>> the release packages. However, I can't upload them to the staging SVN repo
>> due to a transmitting error, and it seems like a limitation from the server
>> side. I tried it on both my local laptop and remote AWS instance, but
>> neither works. These package binaries are like 300-400 MBs, and we just did
>> a release last month. Not sure if this is a new limitation due to cost
>> saving.
>>
>> While I'm looking for help to get unblocked, I'm wondering if we can
>> upload release packages to a public git repo instead, under the Apache
>> account?
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
y to keep the keys 
>>>>>>> safe
>>>>>>> (there was some concern from earlier release processes).
>>>>>>>
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Sorry for the novice question, Wenchen - the release is done
>>>>>>>> manually from a laptop? Not using a CI/CD process on a build server?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nimrod
>>>>>>>>
>>>>>>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> UPDATE:
>>>>>>>>>
>>>>>>>>> Unfortunately, it took me quite some time to set up my laptop and
>>>>>>>>> get it ready for the release process (docker desktop doesn't work 
>>>>>>>>> anymore,
>>>>>>>>> my pgp key is lost, etc.). I'll start the RC process tomorrow, my time. 
>>>>>>>>> Thanks
>>>>>>>>> for your patience!
>>>>>>>>>
>>>>>>>>> Wenchen
>>>>>>>>>
>>>>>>>>> On Fri, May 3, 2024 at 7:47 AM yangjie01 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* Jungtaek Lim 
>>>>>>>>>> *Date:* Thursday, May 2, 2024, 10:21
>>>>>>>>>> *To:* Holden Karau 
>>>>>>>>>> *Cc:* Chao Sun , Xiao Li <
>>>>>>>>>> gatorsm...@gmail.com>, Tathagata Das ,
>>>>>>>>>> Wenchen Fan , Cheng Pan ,
>>>>>>>>>> Nicholas Chammas , Dongjoon Hyun <
>>>>>>>>>> dongjoon.h...@gmail.com>, Cheng Pan , Spark
>>>>>>>>>> dev list , Anish Shrigondekar <
>>>>>>>>>> anish.shrigonde...@databricks.com>
>>>>>>>>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +1 love to see it!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau <
>>>>>>>>>> holden.ka...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 :) yay previews
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 for next Monday.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We can do more previews when the other features are ready for
>>>>>>>>>> preview.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 1, 2024 at 08:46, Tathagata Das wrote:
>>>>>>>>>>
>>>>>>>>>> Next week sounds great! Thank you Wenchen!
>>>>>>>>>>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Dongjoon Hyun
Thank you so much for the update, Wenchen!

Dongjoon.

On Tue, May 7, 2024 at 10:49 AM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get it
> ready for the release process (docker desktop doesn't work anymore, my pgp
> key is lost, etc.). I'll start the RC process tomorrow, my time. Thanks for
> your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Thursday, May 2, 2024, 10:21
>> *To:* Holden Karau 
>> *Cc:* Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>> Cheng Pan , Spark dev list ,
>> Anish Shrigondekar 
>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> On Wed, May 1, 2024 at 08:46, Tathagata Das wrote:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>> Thank you all for the replies!
>>
>>
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>>
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>>
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>>
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>>
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024)

Re: ASF board report draft for May

2024-05-05 Thread Dongjoon Hyun
+1 for Holden's comment. Yes, it would be great to mention it as coming "soon".
(If Wenchen releases it on Monday, we can simply mention the release.)

In addition, Apache Spark PMC received an official notice from ASF Infra
team.

https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
> [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
projects

To track and comply with the new ASF Infra Policy as much as possible, we
opened a blocker-level JIRA issue and have been working on it.
- https://infra.apache.org/github-actions-policy.html

Please include a sentence saying that the Apache Spark PMC is working on this
under the following umbrella JIRA issue.

https://issues.apache.org/jira/browse/SPARK-48094
> Reduce GitHub Action usage according to ASF project allowance

Thanks,
Dongjoon.


On Sun, May 5, 2024 at 3:45 PM Holden Karau  wrote:

> Do we want to include that we’re planning on having a preview release of
> Spark 4 so folks can see the APIs “soon”?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
> wrote:
>
>> It’s time for our quarterly ASF board report on Apache Spark this
>> Wednesday. Here’s a draft, feel free to suggest changes.
>>
>> 
>>
>> Description:
>>
>> Apache Spark is a fast and general purpose engine for large-scale data
>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
>> well as a rich set of libraries including stream processing, machine
>> learning, and graph analytics.
>>
>> Issues for the board:
>>
>> - None
>>
>> Project status:
>>
>> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and Spark
>> 3.4.2 on April 18, 2024.
>> - The votes on "SPIP: Structured Logging Framework for Apache Spark" and
>> "Pure Python Package in PyPI (Spark Connect)" have passed.
>> - The votes for two behavior changes have passed: "SPARK-4: Use ANSI
>> SQL mode by default" and "SPARK-46122: Set
>> spark.sql.legacy.createHiveTableByDefault to false".
>> - The community decided that upcoming Spark 4.0 release will drop support
>> for Python 3.8.
>> - We started a discussion about the definition of behavior changes that
>> is critical for version upgrades and user experience.
>> - We've opened a dedicated repository for the Spark Kubernetes Operator
>> at https://github.com/apache/spark-kubernetes-operator. We added a new
>> version in Apache Spark JIRA for versioning of the Spark operator based on
>> a vote result.
>>
>> Trademarks:
>>
>> - No changes since the last report.
>>
>> Latest releases:
>> - Spark 3.4.3 was released on April 18, 2024
>> - Spark 3.5.1 was released on February 28, 2024
>> - Spark 3.3.4 was released on December 16, 2023
>>
>> Committers and PMC:
>>
>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
>> Yikun Jiang).
>>
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Dongjoon Hyun
+1 for next Monday.

Dongjoon.

On Wed, May 1, 2024 at 8:46 AM Tathagata Das 
wrote:

> Next week sounds great! Thank you Wenchen!
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
>>>> Thank you all for the replies!
>>>>
>>>> To @Nicholas Chammas  : Thanks for
>>>> cleaning up the error terminology and documentation! I've merged the first
>>>> PR and let's finish others before the 4.0 release.
>>>> To @Dongjoon Hyun  : Thanks for driving the
>>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>>> and finish the DataFrame error context feature before 4.0.
>>>> To @Jungtaek Lim  : Ack. We can treat
>>>> the Streaming state store data source as completed for 4.0 then.
>>>> To @Cheng Pan  : Yea we definitely should have a
>>>> preview release. Let's collect more feedback on the ongoing projects and
>>>> then we can propose a date for the preview release.
>>>>
>>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>>
>>>>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>>
>>>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>> >
>>>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>>>> follow-up tickets, but we don't plan to address them soon. The current
>>>>> implementation is the final shape for Spark 4.0.0, unless there are 
>>>>> demands
>>>>> on the follow-up tickets.
>>>>> >
>>>>> > We may want to check the plan for transformWithState - my
>>>>> understanding is that we want to release the feature to 4.0.0, but there
>>>>> are several remaining works to be done. While the tentative timeline for
>>>>> releasing is June 2024, what would be the tentative timeline for the RC 
>>>>> cut?
>>>>> > (cc. Anish to add more context on the plan for transformWithState)
>>>>> >
>>>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>>>> wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > It's close to the previously proposed 4.0.0 release date (June
>>>>> 2024), and I think it's time to prepare for it and discuss the ongoing
>>>>> projects:
>>>>> > •
>>>>> > ANSI by default
>>>>> > • Spark Connect GA
>>>>> > • Structured Logging
>>>>> > • Streaming state store data source
>>>>> > • new data type VARIANT
>>>>> > • STRING collation support
>>>>> > • Spark k8s operator versioning
>>>>> > Please help to add more items to this list that are missed here. I
>>>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>>>> there is no objection. Thank you all for the great work that fills Spark
>>>>> 4.0!
>>>>> >
>>>>> > Wenchen Fan
>>>>>
>>>>>


[VOTE][RESULT] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Dongjoon Hyun
The vote passes with 11 +1s (6 binding +1s) and one -1.
Thanks to all who helped with the vote!

(* = binding)
+1:
- Dongjoon Hyun *
- Gengliang Wang *
- Liang-Chi Hsieh *
- Holden Karau *
- Zhou Jiang
- Cheng Pan
- Hyukjin Kwon *
- DB Tsai *
- Ye Xianjin
- XiDuo You
- Nimrod Ofek

+0: None

-1:
- Mich Talebzadeh


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
? I'm not sure why you think in that direction.

What I wrote was the following.

- You voted +1 for SPARK-4 on April 14th
  (https://lists.apache.org/thread/tp92yzf8y4yjfk6r3dkqjtlb060g82sy) 
- You voted -1 for SPARK-46122 on April 26th.
  (https://lists.apache.org/thread/2ybq1jb19j0c52rgo43zfd9br1yhtfj8)

You showed a double standard for the same kind of SQL votes within two weeks.

We always count all votes from all contributors
in order to keep a comprehensive record of all feedback.

Dongjoon.

On 2024/04/29 17:49:36 Mich Talebzadeh wrote:
> Your point
> 
> ".. t's a surprise to me to see that someone has different positions in a
> very short period of time in the community"
> 
> Well, I have been with Spark since 2015, and this is an article dated
> February 7, 2016, regarding both Hive and Spark, which was also presented
> at a Hortonworks meet-up.
> 
> Hive on Spark Engine Versus Spark Using Hive Metastore
> <https://www.linkedin.com/pulse/hive-spark-engine-versus-using-metastore-mich-talebzadeh-ph-d-/>
> 
> With regard to why I cast a +1 vote for one and -1 for the other, I think
> it is my prerogative how I vote, and we can leave it at that.
> 
> 
> 
> On Mon, 29 Apr 2024 at 17:32, Dongjoon Hyun  wrote:
> 
> > It's a surprise to me to see that someone has different positions
> > in a very short period of time in the community.
> >
> > Mitch casted +1 for SPARK-4 and -1 for SPARK-46122.
> > - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
> > - https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p
> >
> > To Mitch, what I'm interested in is the following specifically.
> > > 2. Compatibility: Changing the default behavior could potentially
> > >  break existing workflows or pipelines that rely on the current behavior.
> >
> > May I ask you the following questions?
> > A. What is the purpose of the migration guide in the ASF projects?
> >
> > B. Do you claim that there is incompatibility when you have
> >  spark.sql.legacy.createHiveTableByDefault=true which is described
> >  in the migration guide?
> >
> > C. Do you know that ANSI SQL has new RUNTIME exceptions
> >  which are harder than SPARK-46122?
> >
> > D. Or, did you cast +1 for SPARK-4 because
> >  you think there is no breaking change by default?
> >
> > I guess there is some misunderstanding on the proposal.
> >
> > Thanks,
> > Dongjoon.
> >
> >
> > On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh <
> > mich.talebza...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I would like to add a side note regarding the discussion process and the
> >> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
> >> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
> >> configuration parameter, which might lead some participants to overlook its
> >> broader implications (as was raised by myself and others). I believe that a
> >> more descriptive title, encompassing the broader discussion on default
> >> behaviours for creating Hive tables in Spark SQL, could enable greater
> >> engagement within the community. This is an important topic that deserves
> >> thorough consideration.
> >>
> >> HTH
> >>

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
>>>>>>>>>> anything.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 25, 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> My take regarding your question is that your mileage varies so
>>>>>>>>>>> to speak.
>>>>>>>>>>>
>>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog
>>>>>>>>>>> solution that integrates well with other components in the Hadoop
>>>>>>>>>>> ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say
>>>>>>>>>>> on-premise), using Hive may offer better compatibility and
>>>>>>>>>>> interoperability.
>>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users
>>>>>>>>>>> who are accustomed to traditional RDBMs. If your use case involves 
>>>>>>>>>>> complex
>>>>>>>>>>> SQL queries or existing SQL-based workflows, using Hive may be 
>>>>>>>>>>> advantageous.
>>>>>>>>>>> 3) If you are looking for performance, spark's native catalog
>>>>>>>>>>> tends to offer better performance for certain workloads, 
>>>>>>>>>>> particularly those
>>>>>>>>>>> that involve iterative processing or complex data 
>>>>>>>>>>> transformations.(my
>>>>>>>>>>> understanding). Spark's in-memory processing capabilities and 
>>>>>>>>>>> optimizations
>>>>>>>>>>> make it well-suited for interactive analytics and machine learning
>>>>>>>>>>> tasks.(my favourite)
>>>>>>>>>>> 4) Integration with Spark Workflows: If you primarily use Spark
>>>>>>>>>>> for data processing and analytics, using Spark's native catalog may
>>>>>>>>>>> simplify workflow management and reduce overhead, Spark's  tight
>>>>>>>>>>> integration with its catalog allows for seamless interaction with 
>>>>>>>>>>> Spark
>>>>>>>>>>> applications and libraries.
>>>>>>>>>>> 5) There seems to be some similarity with spark catalog and
>>>>>>>>>>> Databricks unity catalog, so that may favour the choice.
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I will also appreciate some material that describes the
>>>>>>>>>>>> differences between Spark native tables vs hive tables and why 
>>>>>>>>>>>> each should
>>>>>>>>>>>> be used...
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Nimrod
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I see a statement made as below  and I quote
>>>>>>>>>>>>>
>>>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of
>>>>>>>>>>>>> this
>>>>>>>>>>>>> configuration from `true` to `false` to use Spark native
>>>>>>>>>>>>> tables because
>>>>>>>>>>>>> we support better."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you please elaborate on the above specifically with regard
>>>>>>>>>>>>> to the phrase ".. because
>>>>>>>>>>>>> we support better."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you referring to the performance of Spark catalog (I
>>>>>>>>>>>>> believe it is internal) or integration with Spark?
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-4.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun wrote:
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0
>>>>>>>>>>>>>>> more and more.
>>>>>>>>>>>>>>> > Thank you all.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you
>>>>>>>>>>>>>>> from the subtasks
>>>>>>>>>>>>>>> > of SPARK-4 (Prepare Apache Spark 4.0.0),
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>>>> >Set `spark.sql.legacy.createHiveTableByDefault` to
>>>>>>>>>>>>>>> `false` by default
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > This legacy configuration is about `CREATE TABLE` SQL
>>>>>>>>>>>>>>> syntax without
>>>>>>>>>>>>>>> > `USING` and `STORED AS`, which is currently mapped to
>>>>>>>>>>>>>>> `Hive` table.
>>>>>>>>>>>>>>> > The proposal of SPARK-46122 is to switch the default value
>>>>>>>>>>>>>>> of this
>>>>>>>>>>>>>>> > configuration from `true` to `false` to use Spark native
>>>>>>>>>>>>>>> tables because
>>>>>>>>>>>>>>> > we support better.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > In other words, Spark will use the value of
>>>>>>>>>>>>>>> `spark.sql.sources.default`
>>>>>>>>>>>>>>> > as the table provider instead of `Hive` like the other
>>>>>>>>>>>>>>> Spark APIs. Of course,
>>>>>>>>>>>>>>> > the users can get all the legacy behavior by setting back
>>>>>>>>>>>>>>> to `true`.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Historically, this behavior change was merged once at
>>>>>>>>>>>>>>> Apache Spark 3.0.0
>>>>>>>>>>>>>>> > preparation via SPARK-30098 already, but reverted during
>>>>>>>>>>>>>>> the 3.0.0 RC period.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider
>>>>>>>>>>>>>>> for CREATE TABLE
>>>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default
>>>>>>>>>>>>>>> datasource as
>>>>>>>>>>>>>>> > provider for CREATE TABLE command
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about
>>>>>>>>>>>>>>> this and defined it
>>>>>>>>>>>>>>> > as one of legacy behavior via this configuration via
>>>>>>>>>>>>>>> reused ID, SPARK-30098.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > 2020-12-01:
>>>>>>>>>>>>>>> https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default
>>>>>>>>>>>>>>> datasource as
>>>>>>>>>>>>>>> > provider for CREATE TABLE command
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Last year, we received two additional requests twice to
>>>>>>>>>>>>>>> switch this because
>>>>>>>>>>>>>>> > Apache Spark 4.0.0 is a good time to make a decision for
>>>>>>>>>>>>>>> the future direction.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0
>>>>>>>>>>>>>>> idea
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR
>>>>>>>>>>>>>>> which is one line of main
>>>>>>>>>>>>>>> > code, one line of migration guide, and a few lines of test
>>>>>>>>>>>>>>> code.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>


Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
I'll start with my +1.

Dongjoon.

On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
> Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault
> to `false` by default. The technical scope is defined in the following PR.
> 
> - DISCUSSION:
> https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
> - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
> - PR: https://github.com/apache/spark/pull/46207
> 
> The vote is open until April 30th 1AM (PST) and passes
> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
> [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ...
> 
> Thank you in advance.
> 
> Dongjoon
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault
to `false` by default. The technical scope is defined in the following PR.

- DISCUSSION:
https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
- JIRA: https://issues.apache.org/jira/browse/SPARK-46122
- PR: https://github.com/apache/spark/pull/46207

The vote is open until April 30th 1AM (PST) and passes
if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
[ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ...

Thank you in advance.

Dongjoon


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
;>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* The information provided is correct to the best of
>>>>>>>> my knowledge but of course cannot be guaranteed . It is essential to 
>>>>>>>> note
>>>>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>>>>> expert opinions (Werner
>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the detailed answer.
>>>>>>>>> The thing I'm missing is this: let's say that the output format I
>>>>>>>>> choose is delta lake or iceberg or whatever format that uses parquet. 
>>>>>>>>> Where
>>>>>>>>> does the catalog implementation (which holds metadata afaik, same 
>>>>>>>>> metadata
>>>>>>>>> that iceberg and delta lake save for their tables about their columns)
>>>>>>>>> comes into play and why should it affect performance?
>>>>>>>>> Another thing is that if I understand correctly, and I might be
>>>>>>>>> totally wrong here, the internal spark catalog is a local 
>>>>>>>>> installation of
>>>>>>>>> hive metastore anyway, so I'm not sure what the catalog has to do with
>>>>>>>>> anything.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> בתאריך יום ה׳, 25 באפר׳ 2024, 16:14, מאת Mich Talebzadeh ‏<
>>>>>>>>> mich.talebza...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> My take regarding your question is that your mileage varies so to
>>>>>>>>>> speak.
>>>>>>>>>>
>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog
>>>>>>>>>> solution that integrates well with other components in the Hadoop
>>>>>>>>>> ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say
>>>>>>>>>> S(say
>>>>>>>>>> on-premise), using Hive may offer better compatibility and
>>>>>>>>>> interoperability.
>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users
>>>>>>>>>> who are accustomed to traditional RDBMs. If your use case involves 
>>>>>>>>>> complex
>>>>>>>>>> SQL queries or existing SQL-based workflows, using Hive may be 
>>>>>>>>>> advantageous.
>>>>>>>>>> 3) If you are looking for performance, spark's native catalog
>>>>>>>>>> tends to offer better performance for certain workloads, 
>>>>>>>>>> particularly those
>>>>>>>>>> that involve iterative processing or complex data transformations.(my
>>>>>>>>>> understanding). Spark's in-memory processing capabilities and 
>>>>>>>>>> optimizations
>>>>>>>>>> make it well-suited for interactive analytics and machine learning
>>>>>>>>>> tasks.(my favourite)
>>>>>>>>>> 4) Integration with Spark Workflows: If you primarily use Spark
>>>>>>>>>> for data processing and analytics, using Spark's native catalog may
>>>>>>>>>> simplify workflow management and reduce overhead, Spark's  tight
>>>>>>>>>> integration with its catalog allows for seamless interaction with 
>>>>>>>>>> Spark
>>>>>>>>>> applications 

[FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Dongjoon Hyun
FYI, there is a proposal to drop Python 3.8 because its EOL is October 2024.

https://github.com/apache/spark/pull/46228
[SPARK-47993][PYTHON] Drop Python 3.8

Since it's still alive and there will be an overlap between the lifecycle
of Python 3.8 and Apache Spark 4.0.0, please give us your feedback on the
PR, if you have any concerns.

From my side, I agree with this decision.

Thanks,
Dongjoon.


[DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-24 Thread Dongjoon Hyun
Hi, All.

It's great to see community activities to polish 4.0.0 more and more.
Thank you all.

I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks
of SPARK-44111 (Prepare Apache Spark 4.0.0),

- https://issues.apache.org/jira/browse/SPARK-46122
   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default

This legacy configuration is about `CREATE TABLE` SQL syntax without
`USING` and `STORED AS`, which is currently mapped to `Hive` table.
The proposal of SPARK-46122 is to switch the default value of this
configuration from `true` to `false` so that Spark native tables are used,
because Spark supports them better.

In other words, Spark will use the value of `spark.sql.sources.default`
as the table provider instead of `Hive` like the other Spark APIs. Of
course,
the users can get all the legacy behavior by setting back to `true`.
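
Concretely, a minimal Scala sketch of the difference (the table names here are
hypothetical, and the legacy path assumes a Hive-enabled session):

  // New default (spark.sql.legacy.createHiveTableByDefault=false):
  // CREATE TABLE without USING/STORED AS picks up spark.sql.sources.default.
  spark.conf.set("spark.sql.sources.default", "parquet")
  spark.sql("CREATE TABLE t_native (id INT)")   // native Parquet table

  // Legacy behavior: the same statement creates a Hive table.
  spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
  spark.sql("CREATE TABLE t_hive (id INT)")     // Hive table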

Historically, this behavior change was merged once at Apache Spark 3.0.0
preparation via SPARK-30098 already, but reverted during the 3.0.0 RC
period.

2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as
provider for CREATE TABLE command

At Apache Spark 3.1.0, we had another discussion about this and defined it
as one of the legacy behaviors behind this configuration, via the reused ID
SPARK-30098.

2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
2020-12-03: SPARK-30098 Add a configuration to use default datasource as
provider for CREATE TABLE command

Last year, we received two additional requests to switch this because
Apache Spark 4.0.0 is a good time to make a decision about the future
direction.

2023-02-27: SPARK-42603 as an independent idea.
2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea


WDYT? The technical scope is defined in the following PR, which is one line
of main code, one line of migration guide, and a few lines of test code.

- https://github.com/apache/spark/pull/46207

Dongjoon.


[FYI] SPARK-47046: Apache Spark 4.0.0 Dependency Audit and Cleanup

2024-04-21 Thread Dongjoon Hyun
Hi, All.

As a part of Apache Spark 4.0.0 (SPARK-44111), we have been doing dependency
audits. Today, we want to share the current readiness of Apache Spark 4.0.0
and get your feedback for further completeness.

https://issues.apache.org/jira/browse/SPARK-44111
Prepare Apache Spark 4.0.0

Dependency audit (SPARK-47046) started this February (on 14/Feb/24), and
we have only one remaining JIRA, about Apache Hive 2.3.10, as of now.

https://issues.apache.org/jira/browse/SPARK-47046
Apache Spark 4.0.0 Dependency Audit and Cleanup

https://issues.apache.org/jira/browse/SPARK-47018
Upgrade built-in Hive to 2.3.10 (WIP)


Although we have historically received Common Vulnerabilities and Exposures
(CVE) reports due to our dependencies, and only some of them affect us in
practice, we take all reports seriously and want to address as many as
possible in Apache Spark 4.0.0 as a new milestone.

Here, we share the full audit list for your awareness.

+----------------+---------------------+-------------------+
| CVE_ID         | GHSA_ID             | SPARK_JIRA_ID     |
+----------------+---------------------+-------------------+
| CVE-2018-10237 | GHSA-mvr2-9pj6-7w5j | SPARK-47025       |
| CVE-2018-10237 | GHSA-mvr2-9pj6-7w5j | SPARK-47058       |
| CVE-2018-1330  | GHSA-95q3--r683     | SPARK-2           |
| CVE-2019-0205  | GHSA-rj7p-rfgp-852x | SPARK-27029       |
| CVE-2019-10172 | GHSA-r6j9-8759-g62w | SPARK-47119       |
| CVE-2019-10202 | GHSA-c27h-mcmw-48hv | SPARK-47119       |
| CVE-2020-13949 | GHSA-g2fg-mr77-6vrm | SPARK-47018 (WIP) |
| CVE-2020-15522 | GHSA-6xx3-rg99-gc3p | SPARK-1           |
| CVE-2020-8908  | GHSA-5mg8-w23w-74h3 | SPARK-39102       |
| CVE-2020-8908  | GHSA-5mg8-w23w-74h3 | SPARK-47025       |
| CVE-2021-22569 | GHSA-wrvw-hg22-4m67 | SPARK-43489       |
| CVE-2021-22569 | GHSA-wrvw-hg22-4m67 | SPARK-47038       |
| CVE-2021-22570 | GHSA-77rm-9x9h-xj3g | SPARK-45991       |
| CVE-2021-42392 | GHSA-h376-j262-vhq6 | SPARK-38287       |
| CVE-2022-1941  | GHSA-8gq9-2x98-w8hf | SPARK-40552       |
| CVE-2022-1941  | GHSA-8gq9-2x98-w8hf | SPARK-41240       |
| CVE-2022-2047  | GHSA-cj7v-27pg-wf7q | SPARK-39725       |
| CVE-2022-21363 | GHSA-g76j-4cxx-23h9 | SPARK-39540       |
| CVE-2022-21724 | GHSA-673j-qm5f-xpv8 | SPARK-38291       |
| CVE-2022-21724 | GHSA-v7wg-cpwc-24m4 | SPARK-38291       |
| CVE-2022-23221 | GHSA-45hx-wfhj-473x | SPARK-38287       |
| CVE-2022-23437 | GHSA-h65f-jvqw-m9fj | SPARK-39183       |
| CVE-2022-25883 | GHSA-c2qf-rxjj-qqgw | SPARK-44279       |
| CVE-2022-3171  | GHSA-h4h5-3hr4-j3g2 | SPARK-40665       |
| CVE-2022-3171  | GHSA-h4h5-3hr4-j3g2 | SPARK-41076       |
| CVE-2022-3171  | GHSA-h4h5-3hr4-j3g2 | SPARK-41247       |
| CVE-2022-3171  | GHSA-h4h5-3hr4-j3g2 | SPARK-43489       |
| CVE-2022-3171  | GHSA-h4h5-3hr4-j3g2 | SPARK-47038       |
| CVE-2022-3509  | GHSA-g5ww-5jh7-63cx | SPARK-43489       |
| CVE-2022-3509  | GHSA-g5ww-5jh7-63cx | SPARK-47038       |
| CVE-2022-3510  | GHSA-4gg5-vx3j-xwc7 | SPARK-43489       |
| CVE-2022-3510  | GHSA-4gg5-vx3j-xwc7 | SPARK-47038       |
| CVE-2022-3517  | GHSA-f8q6-p94x-37v3 | SPARK-41634       |
| CVE-2022-36944 | GHSA-8qv5-68g4-248j | SPARK-40497       |
| CVE-2022-37865 | GHSA-94rr-4jr5-9h2p | SPARK-41030       |
| CVE-2022-37866 | GHSA-wv7w-rj2x-556x | SPARK-41030       |
| CVE-2022-41946 | GHSA-562r-vg33-8x8h | SPARK-41245       |
| CVE-2022-42889 | GHSA-599f-7c49-w659 | SPARK-40801       |
| CVE-2022-45868 | GHSA-22wj-vf5f-wrvj | SPARK-44393       |
| CVE-2022-46337 | GHSA-rcjc-c4pj-xxrp | SPARK-47108       |
| CVE-2022-46751 | GHSA-2jc4-r94c-rp7h | SPARK-44914       |
| CVE-2023-1428  | GHSA-6628-q6j9-w8vg | SPARK-44222       |
| CVE-2023-26119 | GHSA-3xrr-7m6p-p7xh | SPARK-5           |
| CVE-2023-2976  | GHSA-7g45-4rm6-3mm3 | SPARK-47025       |
| CVE-2023-2976  | GHSA-7g45-4rm6-3mm3 | SPARK-47056       |
| CVE-2023-32731 | GHSA-cfgp-2977-2fmm | SPARK-44222       |
| CVE-2023-32732 | GHSA-9hxf-ppjv-w6rq | SPARK-44222       |
| CVE-2023-33201 | GHSA-hr8g-6v94-x4m9 | SPARK-46411       |
| CVE-2023-34453 | GHSA-pqr6-cmr2-h8hf | SPARK-44070       |
| CVE-2023-34454 | GHSA-fjpj-2g6w-x25r | SPARK-44070       |
| CVE-2023-34455 | GHSA-qcwq-55hx-v3vh | SPARK-44070       |
| CVE-2023-42503 | GHSA-cgwf-w82q-5jrr | SPARK-45172       |
| CVE-2023-43642 | GHSA-55g7-9cwv-5qfv | SPARK-45323       |
| CVE-2023-44981 | GHSA-7286-pgfv-vxvh | SPARK-45956       |
| CVE-2023-44981 | GHSA-7286-pgfv-vxvh | SPARK-46305       |
| CVE-2024-21503 | GHSA-fj7x-q9j7-g6q6 | INVALID*          |
| CVE-2024-26308 | GHSA-4265-ccf5-phj5 | SPARK-47109       |
+----------------+---------------------+-------------------+
* `black` is used only in `dev/lint-python` script


Please report to us via `priv...@spark.apache.org` if you have any concerns
about the above reports or have new ones for Apache Spark 4.0.0.

Dongjoon Hyun


Re: [DISCUSS] Un-deprecate Trigger.Once

2024-04-19 Thread Dongjoon Hyun
For that case, I believe it's enough for us to revise only the deprecation
message, making it clear that Apache Spark will keep the API, without
removal, for backward-compatibility purposes. That's what the users asked
for, isn't it?

> deprecation  of Trigger.Once confuses users that the trigger won't be
available sooner (though we rarely remove public API).

The feature was deprecated in Apache Spark 3.4.0, and `undeprecation(?)` may
cause further confusion in the community, not only for Trigger.Once but
also for all historic `Deprecated` items.
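
For reference, a minimal Scala sketch contrasting the two triggers (`df`, the
paths, and the sink format are hypothetical):

  import org.apache.spark.sql.streaming.Trigger

  // Trigger.Once: exactly one micro-batch, which ignores rate limits such
  // as maxFilesPerTrigger -- the semantic issue mentioned above.
  df.writeStream
    .format("parquet")
    .option("checkpointLocation", "/tmp/ckpt-once")
    .trigger(Trigger.Once())
    .start("/tmp/out-once")

  // Trigger.AvailableNow: drains all currently available data, possibly
  // across several rate-limited micro-batches, then stops.
  df.writeStream
    .format("parquet")
    .option("checkpointLocation", "/tmp/ckpt-available-now")
    .trigger(Trigger.AvailableNow())
    .start("/tmp/out-available-now")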

Dongjoon.


On Fri, Apr 19, 2024 at 7:44 PM Jungtaek Lim 
wrote:

> Hi dev,
>
> I'd like to raise a discussion to un-deprecate Trigger.Once in future
> releases.
>
> I've proposed deprecation of Trigger.Once because it's semantically broken
> and we made a change, but we've realized that there are really users who
> strictly require the behavior of Trigger.Once (only run a single batch in
> whatever reason) despite the semantic issue, and workaround with
> Trigger.AvailableNow is arguably much more hacky or sometimes not even
> possible.
>
> I still think we have to advise using Trigger.AvailableNow whenever
> feasible, but deprecation  of Trigger.Once confuses users that the trigger
> won't be available sooner (though we rarely remove public API). So maybe
> warning log on usage sounds to me as a reasonable alternative.
>
> Thoughts?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3!

Spark 3.4.3 is a maintenance release containing many fixes including
security and correctness domains. This release is based on the
branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.

To download Spark 3.4.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


[VOTE][RESULT] Release Spark 3.4.3 (RC2)

2024-04-18 Thread Dongjoon Hyun
The vote passes with 10 +1s (8 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Dongjoon Hyun *
- Mridul Muralidharan *
- Wenchen Fan *
- Liang-Chi Hsieh *
- Gengliang Wang *
- Hyukjin Kwon *
- Bo Yang
- DB Tsai *
- Kent Yao
- Huaxin Gao *

+0: None

-1: None


Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-18 Thread Dongjoon Hyun
This vote passed.

I'll conclude this vote.

Dongjoon

On 2024/04/17 03:11:36 huaxin gao wrote:
> +1
> 
> On Tue, Apr 16, 2024 at 6:55 PM Kent Yao  wrote:
> 
> > +1(non-binding)
> >
> > Thanks,
> > Kent Yao
> >
> > > bo yang wrote on Wednesday, April 17, 2024 at 09:49:
> > >
> > > +1
> > >
> > > On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon 
> > wrote:
> > >>
> > >> +1
> > >>
> > >> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:
> > >>>
> > >>> +1
> > >>>
> > >>> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan 
> > wrote:
> > >>> >
> > >>> > +1
> > >>> >
> > >>> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun 
> > wrote:
> > >>> >>
> > >>> >> I'll start with my +1.
> > >>> >>
> > >>> >> - Checked checksum and signature
> > >>> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
> > >>> >> - Checked published Maven artifacts
> > >>> >> - All CIs passed.
> > >>> >>
> > >>> >> Thanks,
> > >>> >> Dongjoon.
> > >>> >>
> > >>> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> > >>> >> > Please vote on releasing the following candidate as Apache Spark
> > version
> > >>> >> > 3.4.3.
> > >>> >> >
> > >>> >> > The vote is open until April 18th 1AM (PDT) and passes if a
> > majority +1 PMC
> > >>> >> > votes are cast, with a minimum of 3 +1 votes.
> > >>> >> >
> > >>> >> > [ ] +1 Release this package as Apache Spark 3.4.3
> > >>> >> > [ ] -1 Do not release this package because ...
> > >>> >> >
> > >>> >> > To learn more about Apache Spark, please see
> > https://spark.apache.org/
> > >>> >> >
> > >>> >> > The tag to be voted on is v3.4.3-rc2 (commit
> > >>> >> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
> > >>> >> > https://github.com/apache/spark/tree/v3.4.3-rc2
> > >>> >> >
> > >>> >> > The release files, including signatures, digests, etc. can be
> > found at:
> > >>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
> > >>> >> >
> > >>> >> > Signatures used for Spark RCs can be found in this file:
> > >>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >>> >> >
> > >>> >> > The staging repository for this release can be found at:
> > >>> >> >
> > https://repository.apache.org/content/repositories/orgapachespark-1453/
> > >>> >> >
> > >>> >> > The documentation corresponding to this release can be found at:
> > >>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
> > >>> >> >
> > >>> >> > The list of bug fixes going into 3.4.3 can be found at the
> > following URL:
> > >>> >> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
> > >>> >> >
> > >>> >> > This release is using the release script of the tag v3.4.3-rc2.
> > >>> >> >
> > >>> >> > FAQ
> > >>> >> >
> > >>> >> > =
> > >>> >> > How can I help test this release?
> > >>> >> > =
> > >>> >> >
> > >>> >> > If you are a Spark user, you can help us test this release by
> > taking
> > >>> >> > an existing Spark workload and running on this release candidate,
> > then
> > >>> >> > reporting any regressions.
> > >>> >> >
> > >>> >> > If you're working in PySpark you can set up a virtual env and
> > install
> > >>> >> > the current RC and see if anything important breaks, in the
> > Java/Scala
> > >>> >> > you can add the staging repository to your projects resolvers and
> > test
> > &

[VOTE][RESULT] SPARK-44444: Use ANSI SQL mode by default

2024-04-17 Thread Dongjoon Hyun
The vote passes with 24 +1s (13 binding +1s).
Thanks to all who helped with the vote!

(* = binding)
+1:
- Dongjoon Hyun *
- Gengliang Wang *
- Chao Sun *
- Hyukjin Kwon *
- Liang-Chi Hsieh *
- Holden Karau *
- Huaxin Gao *
- Denny Lee
- Xiao Li *
- Mich Talebzadeh
- Christiano Anderson
- Yang Jie
- Wenchen Fan *
- Jungtaek Lim
- John Zhuge
- Cheng Pan
- Peter Toth
- Jiaan Geng
- Xinrong Meng *
- Rui Wang
- Maciej Szymkiewicz *
- Takuya Ueshin *
- Josh Rosen *
- Arun Dakua

+0: None

-1: None


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-17 Thread Dongjoon Hyun
Thank you all. The vote passed.

I'll conclude this vote.

Dongjoon.

On 2024/04/16 04:58:39 Arun Dakua wrote:
> +1
> 
> On Tue, Apr 16, 2024 at 12:50 AM Josh Rosen  wrote:
> 
> > +1
> >
> > On Mon, Apr 15, 2024 at 11:26 AM Maciej  wrote:
> >
> >> +1
> >>
> >> Best regards,
> >> Maciej Szymkiewicz
> >>
> >> Web: https://zero323.net
> >> PGP: A30CEF0C31A501EC
> >>
> >> On 4/15/24 8:16 PM, Rui Wang wrote:
> >>
> >> +1, non-binding.
> >>
> >> Thanks Dongjoon to drive this!
> >>
> >>
> >> -Rui
> >>
> >> On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng  wrote:
> >>
> >>> +1
> >>>
> >>> Thank you @Dongjoon Hyun  !
> >>>
> >>> On Mon, Apr 15, 2024 at 6:33 AM beliefer  wrote:
> >>>
> >>>> +1
> >>>>
> >>>>
> >>>> On 2024-04-15 15:54:07, "Peter Toth" wrote:
> >>>>
> >>>> +1
> >>>>
> >>>> Wenchen Fan wrote on Monday, April 15, 2024 at 9:08:
> >>>>
> >>>>> +1
> >>>>>
> >>>>> On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun 
> >>>>> wrote:
> >>>>>
> >>>>>> I'll start from my +1.
> >>>>>>
> >>>>>> Dongjoon.
> >>>>>>
> >>>>>> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
> >>>>>> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
> >>>>>> > The technical scope is defined in the following PR which is
> >>>>>> > one line of code change and one line of migration guide.
> >>>>>> >
> >>>>>> > - DISCUSSION:
> >>>>>> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> >>>>>> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
> >>>>>> > - PR: https://github.com/apache/spark/pull/46013
> >>>>>> >
> >>>>>> > The vote is open until April 17th 1AM (PST) and passes
> >>>>>> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >>>>>> >
> >>>>>> > [ ] +1 Use ANSI SQL mode by default
> >>>>>> > [ ] -1 Do not use ANSI SQL mode by default because ...
> >>>>>> >
> >>>>>> > Thank you in advance.
> >>>>>> >
> >>>>>> > Dongjoon
> >>>>>> >
> >>>>>>
> >>>>>> -
> >>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>>>>
> >>>>>>
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-14 Thread Dongjoon Hyun
I'll start with my +1.

- Checked checksum and signature
- Checked Scala/Java/R/Python/SQL Document's Spark version 
- Checked published Maven artifacts
- All CIs passed.

Thanks,
Dongjoon.

On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 3.4.3.
> 
> The vote is open until April 18th 1AM (PDT) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 3.4.3
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see https://spark.apache.org/
> 
> The tag to be voted on is v3.4.3-rc2 (commit
> 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
> https://github.com/apache/spark/tree/v3.4.3-rc2
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1453/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
> 
> The list of bug fixes going into 3.4.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353987
> 
> This release is using the release script of the tag v3.4.3-rc2.
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 3.4.3?
> ===
> 
> The current list of open tickets targeted at 3.4.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.3
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Spark 3.4.3 (RC2)

2024-04-14 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version
3.4.3.

The vote is open until April 18th 1AM (PDT) and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.4.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.4.3-rc2 (commit
1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
https://github.com/apache/spark/tree/v3.4.3-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1453/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/

The list of bug fixes going into 3.4.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353987

This release is using the release script of the tag v3.4.3-rc2.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
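
For Java/Scala users, a minimal build.sbt sketch for pulling the RC from the
staging repository above (the coordinates are illustrative):

  resolvers += "Apache Spark 3.4.3 RC2 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1453/"
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.3"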

===
What should happen to JIRA tickets still targeting 3.4.3?
===

The current list of open tickets targeted at 3.4.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.4.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Dongjoon Hyun
I'll start with my +1.

Dongjoon.

On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
> Please vote on SPARK-44444 to use ANSI SQL mode by default.
> The technical scope is defined in the following PR which is
> one line of code change and one line of migration guide.
> 
> - DISCUSSION:
> https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
> - PR: https://github.com/apache/spark/pull/46013
> 
> The vote is open until April 17th 1AM (PST) and passes
> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Use ANSI SQL mode by default
> [ ] -1 Do not use ANSI SQL mode by default because ...
> 
> Thank you in advance.
> 
> Dongjoon
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Dongjoon Hyun
Please vote on SPARK-44444 to use ANSI SQL mode by default.
The technical scope is defined in the following PR which is
one line of code change and one line of migration guide.

- DISCUSSION:
https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
- JIRA: https://issues.apache.org/jira/browse/SPARK-44444
- PR: https://github.com/apache/spark/pull/46013

The vote is open until April 17th 1AM (PST) and passes
if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Use ANSI SQL mode by default
[ ] -1 Do not use ANSI SQL mode by default because ...

Thank you in advance.

Dongjoon


Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Dongjoon Hyun
Thank you for your opinions, Gengliang, Liang-Chi, Wenchen, Huaxin, Serge,
Nicholas.

To Nicholas: the Apache Spark community already decided not to pursue the
PostgreSQL dialect.

>  I’m flagging this since Spark’s behavior differs in these cases from 
> Postgres,
> as described in the ticket.

Please see the following thread (November 26, 2019).

https://lists.apache.org/thread/v1fx1wkxh5sp6odjcyohppr5x67cyrov
[DISCUSS] PostgreSQL dialect

Given the AS-IS consensus, I'll proceed to start a vote for this topic.

Thanks,
Dongjoon.

On 2024/04/12 17:31:49 Nicholas Chammas wrote:
> This is a side issue, but I’d like to bring people’s attention to 
> SPARK-28024. 
> 
> Cases 2, 3, and 4 described in that ticket are still problems today on master 
> (I just rechecked) even with ANSI mode enabled.
> 
> Well, maybe not problems, but I’m flagging this since Spark’s behavior 
> differs in these cases from Postgres, as described in the ticket.
> 
> 
> > On Apr 12, 2024, at 12:09 AM, Gengliang Wang  wrote:
> > 
> > 
> > +1, enabling Spark's ANSI SQL mode in version 4.0 will significantly 
> > enhance data quality and integrity. I fully support this initiative.
> > 
> > > In other words, the current Spark ANSI SQL implementation becomes the 
> > > first implementation for Spark SQL users to face at first while providing
> > `spark.sql.ansi.enabled=false` in the same way without losing any
> > capability.
> > 
> > BTW, the try_* 
> > <https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#useful-functions-for-ansi-mode>
> >  functions and SQL Error Attribution Framework 
> > <https://issues.apache.org/jira/browse/SPARK-38615> will also be beneficial 
> > in migrating to ANSI SQL mode.
> > 
> > 
> > Gengliang
> > 
> > 
> > On Thu, Apr 11, 2024 at 7:56 PM Dongjoon Hyun  > <mailto:dongjoon.h...@gmail.com>> wrote:
> >> Hi, All.
> >> 
> >> Thanks to you, we've been achieving many things and have on-going SPIPs.
> >> I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly
> >> by asking your opinions about Apache Spark's ANSI SQL mode.
> >> 
> >> https://issues.apache.org/jira/browse/SPARK-44111
> >> Prepare Apache Spark 4.0.0
> >> 
> >> SPARK-44444 was proposed last year (on 15/Jul/23) as the one of desirable
> >> items for 4.0.0 because it's a big behavior.
> >> 
> >> https://issues.apache.org/jira/browse/SPARK-44444
> >> Use ANSI SQL mode by default
> >> 
> >> Historically, spark.sql.ansi.enabled was added at Apache Spark 3.0.0 and 
> >> has
> >> been aiming to provide a better Spark SQL compatibility in a standard way.
> >> We also have a daily CI to protect the behavior too.
> >> 
> >> https://github.com/apache/spark/actions/workflows/build_ansi.yml
> >> 
> >> However, it's still behind the configuration with several known issues, 
> >> e.g.,
> >> 
> >> SPARK-41794 Reenable ANSI mode in test_connect_column
> >> SPARK-41547 Reenable ANSI mode in test_connect_functions
> >> SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard
> >> 
> >> To be clear, we know that many DBMSes have their own implementations of
> >> SQL standard and not the same. Like them, SPARK-4 aims to enable
> >> only the existing Spark's configuration, `spark.sql.ansi.enabled=true`.
> >> There is nothing more than that.
> >> 
> >> In other words, the current Spark ANSI SQL implementation becomes the first
> >> implementation for Spark SQL users to face at first while providing
> >> `spark.sql.ansi.enabled=false` in the same way without losing any 
> >> capability.
> >> 
> >> If we don't want this change for some reasons, we can simply exclude
> >> SPARK-44444 from SPARK-44111 as a part of Apache Spark 4.0.0 preparation.
> >> It's time just to make a go/no-go decision for this item for the global 
> >> optimization
> >> for Apache Spark 4.0.0 release. After 4.0.0, it's unlikely for us to aim
> >> for this again for the next four years until 2028.
> >> 
> >> WDYT?
> >> 
> >> Bests,
> >> Dongjoon
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Spark 4.0.0 release

2024-04-12 Thread Dongjoon Hyun
Thank you for volunteering, Wenchen.

Dongjoon.

On 2024/04/12 15:11:04 Wenchen Fan wrote:
> Hi all,
> 
> It's close to the previously proposed 4.0.0 release date (June 2024), and I
> think it's time to prepare for it and discuss the ongoing projects:
> 
>- ANSI by default
>- Spark Connect GA
>- Structured Logging
>- Streaming state store data source
>- new data type VARIANT
>- STRING collation support
>- Spark k8s operator versioning
> 
> Please help to add more items to this list that are missed here. I would
> like to volunteer as the release manager for Apache Spark 4.0.0 if there is
> no objection. Thank you all for the great work that fills Spark 4.0!
> 
> Wenchen Fan
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread Dongjoon Hyun
+1

Thank you!

I hope we can customize `dev/merge_spark_pr.py` script per repository after 
this PR.

Dongjoon.

On 2024/04/12 03:28:36 "L. C. Hsieh" wrote:
> Hi all,
> 
> Thanks for all discussions in the thread of "Versioning of Spark
> Operator": https://lists.apache.org/thread/zhc7nb2sxm8jjxdppq8qjcmlf4rcsthh
> 
> I would like to create this vote to get the consensus for versioning
> of the Spark Kubernetes Operator.
> 
> The proposal is to use an independent versioning for the Spark
> Kubernetes Operator.
> 
> Please vote on adding new `Versions` in Apache Spark JIRA which can be
> used for places like "Fix Version/s" in the JIRA tickets of the
> operator.
> 
> The new `Versions` will be `kubernetes-operator-` prefix, for example
> `kubernetes-operator-0.1.0`.
> 
> The vote is open until April 15th 1AM (PST) and passes if a majority
> +1 PMC votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Adding the new `Versions` for Spark Kubernetes Operator in
> Apache Spark JIRA
> [ ] -1 Do not add the new `Versions` because ...
> 
> Thank you.
> 
> 
> Note that this is not a SPIP vote and also not a release vote. I don't
> find similar votes in previous threads. This is made similarly like a
> SPIP or a release vote. So I think it should be okay. Please correct
> me if this vote format is not good for you.
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-11 Thread Dongjoon Hyun
Hi, All.

Thanks to you, we've been achieving many things and have on-going SPIPs.
I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly
by asking your opinions about Apache Spark's ANSI SQL mode.

https://issues.apache.org/jira/browse/SPARK-44111
Prepare Apache Spark 4.0.0

SPARK-44444 was proposed last year (on 15/Jul/23) as one of the desirable
items for 4.0.0 because it's a big behavior change.

https://issues.apache.org/jira/browse/SPARK-44444
Use ANSI SQL mode by default

Historically, spark.sql.ansi.enabled was added in Apache Spark 3.0.0 and has
been aiming to provide better Spark SQL compatibility in a standard way.
We also have a daily CI to protect the behavior.

https://github.com/apache/spark/actions/workflows/build_ansi.yml

However, it's still behind the configuration with several known issues,
e.g.,

SPARK-41794 Reenable ANSI mode in test_connect_column
SPARK-41547 Reenable ANSI mode in test_connect_functions
SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard

To be clear, we know that many DBMSes have their own implementations of
the SQL standard, and they are not all the same. Like them, SPARK-44444 aims
to enable only Spark's existing configuration, `spark.sql.ansi.enabled=true`.
There is nothing more to it than that.

In other words, the current Spark ANSI SQL implementation becomes the first
behavior that Spark SQL users face, while `spark.sql.ansi.enabled=false`
remains available in the same way without losing any capability.
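
To make the flip concrete, a minimal Scala sketch (illustrative values only):

  spark.conf.set("spark.sql.ansi.enabled", "true")
  spark.sql("SELECT CAST('abc' AS INT)").show()      // fails under ANSI mode
  spark.sql("SELECT TRY_CAST('abc' AS INT)").show()  // NULL: non-failing form

  spark.conf.set("spark.sql.ansi.enabled", "false")
  spark.sql("SELECT CAST('abc' AS INT)").show()      // NULL (legacy behavior)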

If we don't want this change for some reason, we can simply exclude
SPARK-44444 from SPARK-44111 as a part of Apache Spark 4.0.0 preparation.
It's time to make a go/no-go decision on this item as part of the overall
planning for the Apache Spark 4.0.0 release. After 4.0.0, it's unlikely
that we would aim for this again in the next four years, until 2028.

WDYT?

Bests,
Dongjoon


Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Dongjoon Hyun
I'm interested in your claim.

Could you elaborate or provide some evidence for your claim, *a door for
all native libraries*, Binwei?

For example, is there any POC for that claim? Maybe, did I miss something
in that SPIP?

Dongjoon.

On Wed, Apr 10, 2024 at 8:19 PM Binwei Yang  wrote:

>
> The SPIP is not for current Gluten, but open a door for all native
> libraries and accelerators support.
>
> On 2024/04/11 00:27:43 Weiting Chen wrote:
> > Yes, the 1st Apache release(v1.2.0) for Gluten will be in September.
> > For Spark version support, currently Gluten v1.1.1 support Spark3.2 and
> 3.3.
> > We are planning to support Spark3.4 and 3.5 in Gluten v1.2.0.
> > Spark4.0 support for Gluten is depending on the release schedule in
> Spark community.
> >
> > On 2024/04/09 07:14:13 Dongjoon Hyun wrote:
> > > Thank you for sharing, Weiting.
> > >
> > > Do you think you can share the future milestone of Apache Gluten?
> > > I'm wondering when the first stable release will come and how we can
> > > coordinate across the ASF communities.
> > >
> > > > This project is still under active development now, and doesn't have
> a
> > > stable release.
> > > > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> > >
> > > In the Apache Spark community, Apache Spark 3.2 and 3.3 is the end of
> > > support.
> > > And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release) is
> > > scheduled in October.
> > >
> > > For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if
> there
> > > is something we need to do from Spark side.
> > >
> > > Thanks,
> > > Dongjoon.
> > >
> > >
> > > On Mon, Apr 8, 2024 at 11:19 PM WeitingChen 
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are excited to introduce a new Apache incubating project called
> Gluten.
> > > > Gluten serves as a middleware layer designed to offload Spark to
> native
> > > > engines like Velox or ClickHouse.
> > > > For more detailed information, please visit the project repository at
> > > > https://github.com/apache/incubator-gluten
> > > >
> > > > Additionally, a new Spark SPIP related to Spark + Gluten
> collaboration has
> > > > been proposed at https://issues.apache.org/jira/browse/SPARK-47773.
> > > > We eagerly await feedback from the Spark community.
> > > >
> > > > Thanks,
> > > > Weiting.
> > > >
> > > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Versioning of Spark Operator

2024-04-10 Thread Dongjoon Hyun
Ya, that would work.

Inevitably, I looked at Apache Flink K8s Operator's JIRA and GitHub repo.

It looks reasonable to me.

Although they share the same JIRA, they choose different patterns per place.

1. In POM file and Maven Artifact, independent version number, e.g.
`<version>1.8.0</version>`

2. Tag is also based on the independent version number
https://github.com/apache/flink-kubernetes-operator/tags
- release-1.8.0
- release-1.7.0

3. JIRA Fix Version uses the `kubernetes-operator-` prefix.
https://issues.apache.org/jira/browse/FLINK-34957
> Fix Version/s: kubernetes-operator-1.9.0

Maybe, we can borrow this pattern.

I guess we need a vote for any further decision because we need to create
new `Versions` in Apache Spark JIRA.

Dongjoon.


Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Do we have a compatibility matrix for the Apache Spark Connect Go client already, Bo?

Specifically, I'm wondering which versions the existing Apache Spark Connect Go 
repository is able to support as of now.

We know that it is supposed to always be compatible, but do we have a way to
actually verify that via CI inside the Go repository?

Dongjoon.

On 2024/04/09 21:35:45 bo yang wrote:
> Thanks Liang-Chi for the Spark Operator work, and also the discussion here!
> 
> For Spark Operator and Connector Go Client, I am guessing they need to
> support multiple versions of Spark? e.g. same Spark Operator may support
> running multiple versions of Spark, and Connector Go Client might support
> multiple versions of Spark driver as well.
> 
> How do people think of using the minimum supported Spark version as the
> version name for Spark Operator and Connector Go Client? For example,
> Spark Operator 3.5.x supports Spark 3.5 and above.
> 
> Best,
> Bo
> 
> 
> On Tue, Apr 9, 2024 at 10:14 AM Dongjoon Hyun  wrote:
> 
> > Ya, that's simple and possible.
> >
> > However, it may cause many confusions because it implies that new `Spark
> > K8s Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic
> > Versioning` policy like Apache Spark 4.0.0.
> >
> > In addition, `Versioning` is directly related to the Release Cadence. It's
> > unlikely for us to have `Spark K8s Operator` and `Spark Connect Go`
> > releases at every Apache Spark maintenance release. For example, there is
> > no commit in Spark Connect Go repository.
> >
> > I believe the versioning and release cadence is related to those
> > subprojects' maturity more.
> >
> > Dongjoon.
> >
> > On 2024/04/09 16:59:40 DB Tsai wrote:
> > >  Aligning with Spark releases is sensible, as it allows us to guarantee
> > that the Spark operator functions correctly with the new version while also
> > maintaining support for previous versions.
> > >
> > > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> > >
> > > > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan 
> > wrote:
> > > >
> > > >
> > > >   I am trying to understand if we can simply align with Spark's
> > version for this ?
> > > > Makes the release and jira management much more simpler for developers
> > and intuitive for users.
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > >
> > > > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > <mailto:dongj...@apache.org>> wrote:
> > > >> Hi, Liang-Chi.
> > > >>
> > > >> Thank you for leading Apache Spark K8s operator as a shepherd.
> > > >>
> > > >> I took a look at `Apache Spark Connect Go` repo mentioned in the
> > thread. Sadly, there is no release at all and no activity since last 6
> > months. It seems to be the first time for Apache Spark community to
> > consider these sister repositories (Go and K8s Operator).
> > > >>
> > > >> https://github.com/apache/spark-connect-go/commits/master/
> > > >>
> > > >> Dongjoon.
> > > >>
> > > >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > > >> > Hi all,
> > > >> >
> > > >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > > >> > and the first PR is created.
> > > >> > Thank you for the review from the community so far.
> > > >> >
> > > >> > About the versioning of Spark Operator, there are questions.
> > > >> >
> > > >> > As we are using Spark JIRA, when we are going to merge PRs, we need
> > to
> > > >> > choose a Spark version. However, the Spark Operator is versioning
> > > >> > differently than Spark. I'm wondering how we deal with this?
> > > >> >
> > > >> > Not sure if Connect also has its versioning different to Spark? If
> > so,
> > > >> > maybe we can follow how Connect does.
> > > >> >
> > > >> > Can someone who is familiar with Connect versioning give some
> > suggestions?
> > > >> >
> > > >> > Thank you.
> > > >> >
> > > >> > Liang-Chi
> > > >> >
> > > >> >
> > -
> > > >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  > dev-unsubscr...@spark.apache.org>
> > > >> >
> > > >> >
> > > >>
> > > >> -
> > > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  > dev-unsubscr...@spark.apache.org>
> > > >>
> > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Ya, that's simple and possible.

However, it may cause much confusion because it implies that a new `Spark K8s
Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic
Versioning` policy as Apache Spark 4.0.0.

In addition, `Versioning` is directly related to the Release Cadence. It's 
unlikely for us to have `Spark K8s Operator` and `Spark Connect Go` releases at 
every Apache Spark maintenance release. For example, there is no commit in 
Spark Connect Go repository.

I believe the versioning and release cadence are related more to those
subprojects' maturity.

Dongjoon.

On 2024/04/09 16:59:40 DB Tsai wrote:
>  Aligning with Spark releases is sensible, as it allows us to guarantee that 
> the Spark operator functions correctly with the new version while also 
> maintaining support for previous versions.
>  
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> 
> > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan  wrote:
> > 
> > 
> >   I am trying to understand if we can simply align with Spark's version for 
> > this ?
> > Makes the release and jira management much more simpler for developers and 
> > intuitive for users.
> > 
> > Regards,
> > Mridul
> > 
> > 
> > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > <mailto:dongj...@apache.org>> wrote:
> >> Hi, Liang-Chi.
> >> 
> >> Thank you for leading Apache Spark K8s operator as a shepherd. 
> >> 
> >> I took a look at `Apache Spark Connect Go` repo mentioned in the thread. 
> >> Sadly, there is no release at all and no activity since last 6 months. It 
> >> seems to be the first time for Apache Spark community to consider these 
> >> sister repositories (Go and K8s Operator).
> >> 
> >> https://github.com/apache/spark-connect-go/commits/master/
> >> 
> >> Dongjoon.
> >> 
> >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> >> > Hi all,
> >> > 
> >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> >> > and the first PR is created.
> >> > Thank you for the review from the community so far.
> >> > 
> >> > About the versioning of Spark Operator, there are questions.
> >> > 
> >> > As we are using Spark JIRA, when we are going to merge PRs, we need to
> >> > choose a Spark version. However, the Spark Operator is versioning
> >> > differently than Spark. I'm wondering how we deal with this?
> >> > 
> >> > Not sure if Connect also has its versioning different to Spark? If so,
> >> > maybe we can follow how Connect does.
> >> > 
> >> > Can someone who is familiar with Connect versioning give some 
> >> > suggestions?
> >> > 
> >> > Thank you.
> >> > 
> >> > Liang-Chi
> >> > 
> >> > -
> >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> >> > <mailto:dev-unsubscr...@spark.apache.org>
> >> > 
> >> > 
> >> 
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> >> <mailto:dev-unsubscr...@spark.apache.org>
> >> 
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Hi, Liang-Chi.

Thank you for leading Apache Spark K8s operator as a shepherd. 

I took a look at the `Apache Spark Connect Go` repo mentioned in the thread.
Sadly, there is no release at all and no activity in the last 6 months. It
seems to be the first time for the Apache Spark community to consider these
sister repositories (Go and K8s Operator).

https://github.com/apache/spark-connect-go/commits/master/

Dongjoon.

On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> Hi all,
> 
> We've opened the dedicated repository of Spark Kubernetes Operator,
> and the first PR is created.
> Thank you for the review from the community so far.
> 
> About the versioning of Spark Operator, there are questions.
> 
> As we are using Spark JIRA, when we are going to merge PRs, we need to
> choose a Spark version. However, the Spark Operator is versioning
> differently than Spark. I'm wondering how we deal with this?
> 
> Not sure if Connect also has its versioning different to Spark? If so,
> maybe we can follow how Connect does.
> 
> Can someone who is familiar with Connect versioning give some suggestions?
> 
> Thank you.
> 
> Liang-Chi
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Dongjoon Hyun
Thank you for sharing, Jia.

I have the same questions like the previous Weiting's thread.

Do you think you can share the future milestone of Apache Gluten?
I'm wondering when the first stable release will come and how we can
coordinate across the ASF communities.

> This project is still under active development now, and doesn't have a
stable release.
> https://github.com/apache/incubator-gluten/releases/tag/v1.1.1

In the Apache Spark community, Apache Spark 3.2 and 3.3 are at the end of
support.
And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release) is
scheduled in October.

For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if there
is something we need to do from Spark side.

Thanks,
Dongjoon.


On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:

> Apache Spark currently lacks an official mechanism to support
> cross-platform execution of physical plans. The Gluten project offers a
> mechanism that utilizes the Substrait standard to convert and optimize
> Spark's physical plans. By introducing Gluten's plan conversion,
> validation, and fallback mechanisms into Spark, we can significantly
> enhance the portability and interoperability of Spark's physical plans,
> enabling them to operate across a broader spectrum of execution
> environments without requiring users to migrate, while also improving
> Spark's execution efficiency through the utilization of Gluten's advanced
> optimization techniques. And the integration of Gluten into Spark has
> already shown significant performance improvements with ClickHouse and
> Velox backends and has been successfully deployed in production by several
> customers.
>
> References:
> JIRA Ticket
> SPIP Doc
> 
>
> Your feedback and comments are welcome and appreciated.  Thanks.
>
> Thanks,
> Jia Ke
>


Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-09 Thread Dongjoon Hyun
Thank you for sharing, Weiting.

Do you think you can share the future milestone of Apache Gluten?
I'm wondering when the first stable release will come and how we can
coordinate across the ASF communities.

> This project is still under active development now, and doesn't have a
stable release.
> https://github.com/apache/incubator-gluten/releases/tag/v1.1.1

In the Apache Spark community, Apache Spark 3.2 and 3.3 are at the end of
support.
And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release) is
scheduled in October.

For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if there
is something we need to do from Spark side.

Thanks,
Dongjoon.


On Mon, Apr 8, 2024 at 11:19 PM WeitingChen  wrote:

> Hi all,
>
> We are excited to introduce a new Apache incubating project called Gluten.
> Gluten serves as a middleware layer designed to offload Spark to native
> engines like Velox or ClickHouse.
> For more detailed information, please visit the project repository at
> https://github.com/apache/incubator-gluten
>
> Additionally, a new Spark SPIP related to Spark + Gluten collaboration has
> been proposed at https://issues.apache.org/jira/browse/SPARK-47773.
> We eagerly await feedback from the Spark community.
>
> Thanks,
> Weiting.
>
>


Re: Apache Spark 3.4.3 (?)

2024-04-08 Thread Dongjoon Hyun
Thank you, Holden, Mridul, Kent, Liang-Chi, Mich, Jungtaek.

I added `Target Version: 3.4.3` to SPARK-47318 and am going to continue to 
prepare for RC1 (April 15th).

Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Apache Spark 3.4.3 (?)

2024-04-06 Thread Dongjoon Hyun
Hi, All.

Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
commits including important security and correctness patches like
SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.

https://github.com/apache/spark/releases/tag/v3.4.2

$ git log --oneline v3.4.2..HEAD | wc -l
  85

SPARK-45580 Subquery changes the output schema of the outer query
SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
results
SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
ntz
SPARK-46794 Incorrect results due to inferred predicate from checkpoint
with subquery
SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
SPARK-45445 Upgrade snappy to 1.1.10.5
SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
SPARK-46239 Hide `Jetty` info


Currently, I'm checking more applicable patches for branch-3.4. I'd like to
propose to release Apache Spark 3.4.3 and volunteer as the release manager
for Apache Spark 3.4.3. If there are no additional blockers, the first
tentative RC1 vote date is April 15th (Monday).

WDYT?

Dongjoon.


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Dongjoon Hyun
+1

Thank you, Hyukjin.

Dongjoon

On Sun, Mar 31, 2024 at 19:07 Haejoon Lee
 wrote:

> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: The dedicated repository for Kubernetes Operator for Apache Spark

2024-03-28 Thread Dongjoon Hyun
Thank you, Liang-Chi!

Dongjoon.

On Wed, Mar 27, 2024 at 10:56 PM L. C. Hsieh  wrote:

> Hi all,
>
> For the passed SPIP: An Official Kubernetes Operator for Apache Spark,
> the developers have been working on code cleaning and refactoring for
> open source in the last few months. They are ready to contribute the
> code to Spark now.
>
> As we discussed, I will go to create a dedicated repository for the
> Kubernetes Operator for Apache Spark. I think the repository name will
> be "spark-kubernetes-operator". I will try to create the repository
> tomorrow.
>
> After that, they will contribute the code as an initial PR for review
> from the Spark community.
>
> Thank you.
>
> Liang-Chi
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] MySQL version support policy

2024-03-24 Thread Dongjoon Hyun
Hi, Cheng.

Thank you for the suggestion. It seems to have at least two themes.

A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
LTS Versions Support.
B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)

And, it brings me three questions.

1. For (A), do you mean MySQL LTS versions are not properly supported by
Apache Spark releases due to an improper test suite?
2. For (B), why does Apache Spark need to drop non-LTS MySQL support?
3. What about MariaDB? Do we need to stick to some versions?

To be clear, if needed, we can have daily GitHub Action CIs easily like
Python CI (Python 3.8/3.10/3.11/3.12).

-
https://github.com/apache/spark/blob/master/.github/workflows/build_python.yml
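
For context, the MySQL dialect under discussion is exercised through Spark's
standard JDBC reader; a minimal Scala sketch against a MySQL 8.0 (LTS)
instance (the URL, table, and credentials are hypothetical):

  // Requires the MySQL JDBC driver on the classpath.
  val users = spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/testdb")
    .option("dbtable", "users")
    .option("user", "root")
    .option("password", "secret")
    .load()
  users.show()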

Thanks,
Dongjoon.


On Sun, Mar 24, 2024 at 10:29 PM Cheng Pan  wrote:

> Hi, Spark community,
>
> I noticed that the Spark JDBC connector MySQL dialect is testing against
> the 8.3.0[1] now, a non-LTS version.
>
> MySQL changed the version policy recently[2], which is now very similar to
> the Java version policy. In short, 5.5, 5.6, 5.7, 8.0 is the LTS version,
> 8.1, 8.2, 8.3 is non-LTS, and the next LTS version is 8.4.
>
> I would say that MySQL is one of the most important infrastructures today,
> I checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version
> support policy, and both only support 5.7 and 8.0.
>
> Also, Spark officially only supports LTS Java versions, like JDK 17 and
> 21, but not 22. I would recommend using MySQL 8.0 for testing until the
> next MySQL LTS version (8.4) is available.
>
> Additional discussion can be found at [3]
>
> [1] https://issues.apache.org/jira/browse/SPARK-47453
> [2]
> https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
> [3] https://github.com/apache/spark/pull/45581
> [4] https://aws.amazon.com/rds/mysql/
> [5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy
>
> Thanks,
> Cheng Pan
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Dongjoon Hyun
Ya, I also have a similar opinion with Mridul.

+1

Thank you, Gengliang.

Dongjoon.


On Mon, Mar 11, 2024 at 1:34 PM Mridul Muralidharan 
wrote:

>
>   I am supportive of the proposal - this is a step in the right direction !
> Additional metadata (explicit and inferred) for log records, and exposing
> them for indexing is extremely useful.
>
> The specifics of the API still need some work IMO and does not need to be
> this disruptive, but I consider that is orthogonal to this vote itself -
> and something we need to iterate upon during PR reviews.
>
> +1
>
> Regards,
> Mridul
>
>
> On Mon, Mar 11, 2024 at 11:09 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> +1
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand
>> expert opinions" (Werner Von Braun).
>>
>>
>> On Mon, 11 Mar 2024 at 09:27, Hyukjin Kwon  wrote:
>>
>>> +1
>>>
>>> On Mon, 11 Mar 2024 at 18:11, yangjie01 
>>> wrote:
>>>
 +1



 Jie Yang



 *From:* Haejoon Lee 
 *Date:* Monday, March 11, 2024, 17:09
 *To:* Gengliang Wang 
 *Cc:* dev 
 *Subject:* Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark



 +1



 On Mon, Mar 11, 2024 at 10:36 AM Gengliang Wang 
 wrote:

 Hi all,

 I'd like to start the vote for SPIP: Structured Logging Framework for
 Apache Spark


 References:

- JIRA ticket

 
- SPIP doc

 
- Discussion thread

 

 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks!

 Gengliang Wang




Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Dongjoon Hyun
BTW, Jungtaek.

The PySpark documentation seems to show the wrong branch. At this time, it shows `master`.

https://spark.apache.org/docs/3.5.1/api/python/index.html

PySpark Overview
<https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>

   Date: Feb 24, 2024 Version: master



Could you do the follow-up, please?

Thank you in advance.

Dongjoon.


On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:

> Excellent work, congratulations!
>
> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
> wrote:
>
>> Congratulations!
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>
>>> Congratulations!
>>>
>>>
>>>
>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> We are happy to announce the availability of Spark 3.5.1!
>>>
>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>> strongly
>>> recommend all 3.5 users to upgrade to this stable release.
>>>
>>> To download Spark 3.5.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Jungtaek Lim
>>>
>>> ps. Yikun is helping us through releasing the official docker image for
>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>>
>>>
>
> --
> John Zhuge
>


Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-02-29 Thread Dongjoon Hyun
Please use the URL as the full string, including the '()' part.

Or you can search directly in ASF Jira with the 'Spark' project and three
labels: 'Correctness', 'correctness', and 'data-loss'.

Dongjoon

On Thu, Feb 29, 2024 at 11:54 Prem Sahoo  wrote:

> Hello Dongjoon,
> Thanks for emailing me.
> Could you please share a list of fixes, as the link you provided is
> not working.
>
> On Thu, Feb 29, 2024 at 11:27 AM Dongjoon Hyun 
> wrote:
>
>> Hi,
>>
>> If you are observing correctness issues, you may hit some old (and fixed)
>> correctness issues.
>>
>> For example, from Apache Spark 3.2.1 to 3.2.4, we fixed 31 correctness
>> issues.
>>
>>
>> https://issues.apache.org/jira/issues/?filter=12345390&jql=project%20%3D%20SPARK%20AND%20fixVersion%20in%20(3.2.1%2C%203.2.2%2C%203.2.3%2C%203.2.4)%20AND%20labels%20in%20(Correctness%2C%20correctness%2C%20data-loss)
>>
>> There are more fixes in 3.3 and 3.4 and 3.5, too.
>>
>> Please use the latest version, Apache Spark 3.5.1, because Apache Spark
>> 3.2 and 3.3 are in the End-Of-Support status of the community.
>>
>> It would be helpful if you could report any correctness issues with Apache
>> Spark 3.5.1.
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2024/02/29 15:04:41 Prem Sahoo wrote:
>> > When a Spark job shows FetchFailedException, it creates some duplicate
>> > data, and we also see some data missing; please explain why. We have a
>> > scenario where the Spark job complains about FetchFailedException because
>> > one of the data nodes got rebooted in the middle of the job run.
>> >
>> > Now, due to this, we have some duplicate data and some missing data. Why
>> > is Spark not handling this scenario correctly? We shouldn't miss any
>> > data, and we shouldn't create duplicate data.
>> >
>> > I am using the Spark 3.2.0 version.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-02-29 Thread Dongjoon Hyun
Hi,

If you are observing correctness issues, you may hit some old (and fixed) 
correctness issues.

For example, from Apache Spark 3.2.1 to 3.2.4, we fixed 31 correctness issues.

https://issues.apache.org/jira/issues/?filter=12345390&jql=project%20%3D%20SPARK%20AND%20fixVersion%20in%20(3.2.1%2C%203.2.2%2C%203.2.3%2C%203.2.4)%20AND%20labels%20in%20(Correctness%2C%20correctness%2C%20data-loss)

There are more fixes in 3.3 and 3.4 and 3.5, too.

Please use the latest version, Apache Spark 3.5.1, because Apache Spark 3.2 and 
3.3 are in the End-Of-Support status of the community.

It would be helpful if you could report any correctness issues with Apache Spark
3.5.1.

Thanks,
Dongjoon.

On 2024/02/29 15:04:41 Prem Sahoo wrote:
> When a Spark job shows FetchFailedException, it creates some duplicate data,
> and we also see some data missing; please explain why. We have a scenario
> where the Spark job complains about FetchFailedException because one of the
> data nodes got rebooted in the middle of the job run.
> 
> Now, due to this, we have some duplicate data and some missing data. Why is
> Spark not handling this scenario correctly? We shouldn't miss any data, and
> we shouldn't create duplicate data.
> 
> I am using the Spark 3.2.0 version.
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Dongjoon Hyun
Congratulations!

Bests,
Dongjoon.

On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:

> Congratulations!
>
>
>
> At 2024-02-28 17:43:25, "Jungtaek Lim" 
> wrote:
>
> Hi everyone,
>
> We are happy to announce the availability of Spark 3.5.1!
>
> Spark 3.5.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.5 maintenance branch of Spark. We strongly
> recommend all 3.5 users to upgrade to this stable release.
>
> To download Spark 3.5.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-5-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Jungtaek Lim
>
> ps. Yikun is helping us through releasing the official docker image for
> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>
>


Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-23 Thread Dongjoon Hyun
Hi, All.

Unfortunately, the Apache Spark `3.5.1 RC2` document artifact seems to be
generated from unknown source code instead of the correct source code of
the tag, `3.5.1`.

https://spark.apache.org/docs/3.5.1/


Dongjoon.



On Wed, Feb 21, 2024 at 7:15 AM Jungtaek Lim 
wrote:

> Thanks everyone for participating the vote! The vote passed.
> I'll send out the vote result and proceed to the next steps.
>
> On Wed, Feb 21, 2024 at 4:36 PM Maxim Gekk 
> wrote:
>
>> +1
>>
>> On Wed, Feb 21, 2024 at 9:50 AM Hyukjin Kwon 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, 20 Feb 2024 at 22:00, Cheng Pan  wrote:
>>>
 +1 (non-binding)

 - Build successfully from source code.
 - Pass integration tests with Spark ClickHouse Connector[1]

 [1] https://github.com/housepower/spark-clickhouse-connector/pull/299

 Thanks,
 Cheng Pan


 > On Feb 20, 2024, at 10:56, Jungtaek Lim 
 wrote:
 >
 > Thanks Sean, let's continue the process for this RC.
 >
 > +1 (non-binding)
 >
 > - downloaded all files from URL
 > - checked signature
 > - extracted all archives
 > - ran all tests from source files in source archive file, via running
 "sbt clean test package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9.
 >
 > Also bumping to dev@ to encourage participation - looks like the timing
 is not good for US folks, but let's wait a few more days.
 >
 >
 > On Sat, Feb 17, 2024 at 1:49 AM Sean Owen  wrote:
 > Yeah, let's get that fix in, but it seems to be a minor test-only
 issue, so it should not block the release.
 >
 > On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
 > Very sorry. When I was fixing SPARK-45242 (
 https://github.com/apache/spark/pull/43594), I noticed that the
 `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
 didn't realize that it had also been merged into branch-3.5, so I didn't
 advocate for SPARK-45357 to be backported to branch-3.5.
 >  As far as I know, the condition to trigger this test failure is:
 when using Maven to test the `connect` module, if  `sparkTestRelation` in
 `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
 then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
 is indeed related to the order in which Maven executes the test cases in
 the `connect` module.
 >  I have submitted a backport PR to branch-3.5, and if necessary, we
 can merge it to fix this test issue.
 >  Jie Yang
 >   From: Jungtaek Lim 
 > Date: Friday, February 16, 2024, 22:15
 > To: Sean Owen , Rui Wang 
 > Cc: dev 
 > Subject: Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
 >   I traced back relevant changes and got a sense of what happened.
 >   Yangjie figured out the issue via link. It's a tricky issue
 according to the comments from Yangjie - the test depends on the ordering
 of execution of the test suites. He said it does not fail in sbt, hence the CI
 build couldn't catch it.
 > He fixed it via link, but we missed that the offending commit was
 also ported back to 3.5, hence the fix wasn't ported back to 3.5.
 >   Surprisingly, I can't reproduce locally even with maven. In my
 attempt to reproduce, SparkConnectProtoSuite was executed third -
 after SparkConnectStreamingQueryCacheSuite and ExecuteEventsManagerSuite.
 Maybe it is very specific to the environment, not
 just maven? My env: MBP M1 Pro chip, macOS 14.3.1, OpenJDK 17.0.9. I used
 build/mvn (Maven 3.8.8).
 >   I'm not 100% sure this is something we should fail the release for, as
 it's test-only and sounds very environment-dependent, but I'll respect
 your call on the vote.
 >   Btw, looks like Rui also made a relevant fix via link (not to fix
 the failing test but to fix other issues), but this also wasn't ported back
 to 3.5. @Rui Wang Do you think this is a regression issue and warrants a
 new RC?
 > On Fri, Feb 16, 2024 at 11:38 AM Sean Owen 
 wrote:
 > Is anyone seeing this Spark Connect test failure? then again, I have
 some weird issue with this env that always fails 1 or 2 tests that nobody
 else can replicate.
 >   - Test observe *** FAILED ***
 >   == FAIL: Plans do not match ===
 >   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
 max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric,
 [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L],
 44
 >+- LocalRelation , [id#0, name#0]
+- LocalRelation , [id#0,
 name#0] (PlanTest.scala:179)
 >   On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:
 > DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I late

Re: ASF board report draft for February

2024-02-18 Thread Dongjoon Hyun
+1, it looks good to me.

Thank you, Matei.

Dongjoon

On Sat, Feb 17, 2024 at 11:21 AM Matei Zaharia 
wrote:

> Hi all,
>
> I missed some reminder emails about our board report this month, but here
> is my draft. I’ll submit it tomorrow if that’s ok.
>
> ==
>
> Issues for the board:
>
> - None
>
> Project status:
>
> - We made two patch releases: Spark 3.3.4 (EOL release) on December 16,
> 2023, and Spark 3.4.2 on November 30, 2023.
> - We have begun voting for a Spark 3.5.1 maintenance release.
> - The vote on "SPIP: Structured Streaming - Arbitrary State API v2" has
> passed.
> - We transitioned to an ASF-hosted analytics service, Matomo. For details,
> visit
> https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40
> .
> - Project Comet, a plugin designed to accelerate Spark query execution by
> leveraging DataFusion and Arrow, has been open-sourced under the Apache
> Arrow project. For more information, visit
> https://github.com/apache/arrow-datafusion-comet.
>
> Trademarks:
>
> - No changes since the last report.
>
> Latest releases:
>
> - Spark 3.3.4 was released on December 16, 2023
> - Spark 3.4.2 was released on November 30, 2023
> - Spark 3.5.0 was released on September 13, 2023
>
> Committers and PMC:
>
> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
> Yikun Jiang).
>
> ==
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Heads-up: Update on Spark 3.5.1 RC

2024-02-13 Thread Dongjoon Hyun
Thank you for the update, Jungtaek.

Dongjoon.

On Tue, Feb 13, 2024 at 7:29 AM Jungtaek Lim 
wrote:

> Hi,
>
> Just a heads-up, since I didn't give an update for a week after the last
> update in the discussion thread.
>
> I've been following the automated release process and encountered several
> issues. Maybe I will file JIRA tickets and follow up with PRs.
>
> Issues I have figured out so far are 1) a Python library version issue in the
> release docker image, and 2) a doc build failure in PySpark ML for Spark Connect.
> I'm deferring submitting fixes till I see the dry-run succeed.
>
> Btw, I optimistically ran the process without a dry-run as GA had
> passed (my bad), and the tag for RC1 was created before I saw the
> issues. Maybe I'll need to start with RC2 after things are sorted out and
> the necessary fixes have landed in branch-3.5.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
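
For reference, the release process mentioned above can be exercised end-to-end
without publishing anything. A minimal sketch, assuming the -d (working
directory) and -n (dry-run) options of the release script and a placeholder
path:

$ ./dev/create-release/do-release-docker.sh -d /path/to/release-workdir -n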


Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Dongjoon Hyun
+1

On Sat, Feb 3, 2024 at 9:18 PM yangjie01 
wrote:

> +1
>
> On 2024/2/4 13:13, "Kent Yao"  wrote:
>
>
> +1
>
>
> Jungtaek Lim  wrote on Sat, Feb 3, 2024 at 21:14:
> >
> > Hi dev,
> >
> > looks like there are a huge number of commits being pushed to branch-3.5
> after 3.5.0 was released, 200+ commits.
> >
> > $ git log --oneline v3.5.0..HEAD | wc -l
> > 202
> >
> > Also, there are 180 JIRA tickets containing 3.5.1 as the fix version, and
> 10 resolved issues are either marked as blocker (even correctness issues)
> or critical, which justifies the release.
> > https://issues.apache.org/jira/projects/SPARK/versions/12353495
> >
> > What do you think about releasing 3.5.1 with the current head of
> branch-3.5? I'm happy to volunteer as the release manager.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[ANNOUNCE] Apache Spark 3.3.4 released

2023-12-16 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.3.4!

Spark 3.3.4 is the last maintenance release based on the
branch-3.3 maintenance branch of Spark. It contains many fixes,
including ones in the security and correctness domains. We strongly
recommend all 3.3 users upgrade to this or a higher stable release.

To download Spark 3.3.4, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-4.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

Dongjoon Hyun


[VOTE][RESULT] Release Spark 3.3.4 (RC1)

2023-12-15 Thread Dongjoon Hyun
The vote passes with 6 +1s (3 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Dongjoon Hyun *
- Yuming Wang *
- Kent Yao
- Liang-Chi Hsieh *
- Yang Jie
- Malcolm Decuire

+0: None

-1: None


Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-15 Thread Dongjoon Hyun
Thank you all. This vote passed.

Let me conclude.

Dongjoon

On 2023/12/11 23:58:28 Malcolm Decuire wrote:
> +1
> 
> On Mon, Dec 11, 2023 at 6:21 PM Yang Jie  wrote:
> 
> > +1
> >
> > On 2023/12/11 03:03:39 "L. C. Hsieh" wrote:
> > > +1
> > >
> > > On Sun, Dec 10, 2023 at 6:15 PM Kent Yao  wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > > Kent Yao
> > > >
> > > > > Yuming Wang  wrote on Mon, Dec 11, 2023 at 09:33:
> > > > >
> > > > > +1
> > > > >
> > > > > On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun 
> > wrote:
> > > > >>
> > > > >> +1
> > > > >>
> > > > >> Dongjoon
> > > > >>
> > > > >> On 2023/12/08 21:41:00 Dongjoon Hyun wrote:
> > > > >> > Please vote on releasing the following candidate as Apache Spark
> > version
> > > > >> > 3.3.4.
> > > > >> >
> > > > >> > The vote is open until December 15th 1AM (PST) and passes if a
> > majority +1
> > > > >> > PMC votes are cast, with a minimum of 3 +1 votes.
> > > > >> >
> > > > >> > [ ] +1 Release this package as Apache Spark 3.3.4
> > > > >> > [ ] -1 Do not release this package because ...
> > > > >> >
> > > > >> > To learn more about Apache Spark, please see
> > https://spark.apache.org/
> > > > >> >
> > > > >> > The tag to be voted on is v3.3.4-rc1 (commit
> > > > >> > 18db204995b32e87a650f2f09f9bcf047ddafa90)
> > > > >> > https://github.com/apache/spark/tree/v3.3.4-rc1
> > > > >> >
> > > > >> > The release files, including signatures, digests, etc. can be
> > found at:
> > > > >> >
> > > > >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/
> > > > >> >
> > > > >> >
> > > > >> > Signatures used for Spark RCs can be found in this file:
> > > > >> >
> > > > >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > > > >> >
> > > > >> >
> > > > >> > The staging repository for this release can be found at:
> > > > >> >
> > > > >> >
> > https://repository.apache.org/content/repositories/orgapachespark-1451/
> > > > >> >
> > > > >> >
> > > > >> > The documentation corresponding to this release can be found at:
> > > > >> >
> > > > >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/
> > > > >> >
> > > > >> >
> > > > >> > The list of bug fixes going into 3.3.4 can be found at the
> > following URL:
> > > > >> >
> > > > >> > https://issues.apache.org/jira/projects/SPARK/versions/12353505
> > > > >> >
> > > > >> >
> > > > >> > This release is using the release script of the tag v3.3.4-rc1.
> > > > >> >
> > > > >> >
> > > > >> > FAQ
> > > > >> >
> > > > >> >
> > > > >> > =
> > > > >> >
> > > > >> > How can I help test this release?
> > > > >> >
> > > > >> > =
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > If you are a Spark user, you can help us test this release by
> > taking
> > > > >> >
> > > > >> > an existing Spark workload and running on this release candidate,
> > then
> > > > >> >
> > > > >> > reporting any regressions.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > If you're working in PySpark you can set up a virtual env and
> > install
> > > > >> >
> > > > >> > the current RC and see if anything important breaks, in the
> > Java/Scala
> > > > >> &

Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-11 Thread Dongjoon Hyun
lk_global_ops(co)}
> ^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
> line 334, in 
> out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
>  ~^^^
> IndexError: tuple index out of range
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py", line
> 468, in dumps
> raise pickle.PicklingError(msg)
> _pickle.PicklingError: Could not serialize object: IndexError: tuple index
> out of range
> - UNSUPPORTED_FEATURE: Using Python UDF with unsupported join condition
> *** FAILED ***
>
>
>
> On Sun, Dec 10, 2023 at 9:05 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Sun, Dec 10, 2023 at 6:15 PM Kent Yao  wrote:
>> >
>> > +1 (non-binding)
>> >
>> > Kent Yao
>> >
>> > > Yuming Wang  wrote on Mon, Dec 11, 2023 at 09:33:
>> > >
>> > > +1
>> > >
>> > > On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun 
>> wrote:
>> > >>
>> > >> +1
>> > >>
>> > >> Dongjoon
>> > >>
>> > >> On 2023/12/08 21:41:00 Dongjoon Hyun wrote:
>> > >> > Please vote on releasing the following candidate as Apache Spark
>> version
>> > >> > 3.3.4.
>> > >> >
>> > >> > The vote is open until December 15th 1AM (PST) and passes if a
>> majority +1
>> > >> > PMC votes are cast, with a minimum of 3 +1 votes.
>> > >> >
>> > >> > [ ] +1 Release this package as Apache Spark 3.3.4
>> > >> > [ ] -1 Do not release this package because ...
>> > >> >
>> > >> > To learn more about Apache Spark, please see
>> https://spark.apache.org/
>> > >> >
>> > >> > The tag to be voted on is v3.3.4-rc1 (commit
>> > >> > 18db204995b32e87a650f2f09f9bcf047ddafa90)
>> > >> > https://github.com/apache/spark/tree/v3.3.4-rc1
>> > >> >
>> > >> > The release files, including signatures, digests, etc. can be
>> found at:
>> > >> >
>> > >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/
>> > >> >
>> > >> >
>> > >> > Signatures used for Spark RCs can be found in this file:
>> > >> >
>> > >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > >> >
>> > >> >
>> > >> > The staging repository for this release can be found at:
>> > >> >
>> > >> >
>> https://repository.apache.org/content/repositories/orgapachespark-1451/
>> > >> >
>> > >> >
>> > >> > The documentation corresponding to this release can be found at:
>> > >> >
>> > >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/
>> > >> >
>> > >> >
>> > >> > The list of bug fixes going into 3.3.4 can be found at the
>> following URL:
>> > >> >
>> > >> > https://issues.apache.org/jira/projects/SPARK/versions/12353505
>> > >> >
>> > >> >
>> > >> > This release is using the release script of the tag v3.3.4-rc1.
>> > >> >
>> > >> >
>> > >> > FAQ
>> > >> >
>> > >> >
>> > >> > =
>> > >> >
>> > >> > How can I help test this release?
>> > >> >
>> > >> > =
>> > >> >
>> > >> >
>> > >> >
>> > >> > If you are a Spark user, you can help us test this release by
>> taking
>> > >> >
>> > >> > an existing Spark workload and running on this release candidate,
>> then
>> > >> >
>> > >> > reporting any regressions.
>> > >> >
>> > >> >
>> > >> >
>> > >> > If you're working in PySpark you can set up a virtual env and
>> install
>> > 

Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-10 Thread Dongjoon Hyun
+1

Dongjoon

On 2023/12/08 21:41:00 Dongjoon Hyun wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 3.3.4.
> 
> The vote is open until December 15th 1AM (PST) and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 3.3.4
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see https://spark.apache.org/
> 
> The tag to be voted on is v3.3.4-rc1 (commit
> 18db204995b32e87a650f2f09f9bcf047ddafa90)
> https://github.com/apache/spark/tree/v3.3.4-rc1
> 
> The release files, including signatures, digests, etc. can be found at:
> 
> https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/
> 
> 
> Signatures used for Spark RCs can be found in this file:
> 
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> 
> The staging repository for this release can be found at:
> 
> https://repository.apache.org/content/repositories/orgapachespark-1451/
> 
> 
> The documentation corresponding to this release can be found at:
> 
> https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/
> 
> 
> The list of bug fixes going into 3.3.4 can be found at the following URL:
> 
> https://issues.apache.org/jira/projects/SPARK/versions/12353505
> 
> 
> This release is using the release script of the tag v3.3.4-rc1.
> 
> 
> FAQ
> 
> 
> =
> 
> How can I help test this release?
> 
> =
> 
> 
> 
> If you are a Spark user, you can help us test this release by taking
> 
> an existing Spark workload and running on this release candidate, then
> 
> reporting any regressions.
> 
> 
> 
> If you're working in PySpark you can set up a virtual env and install
> 
> the current RC and see if anything important breaks, in the Java/Scala
> 
> you can add the staging repository to your projects resolvers and test
> 
> with the RC (make sure to clean up the artifact cache before/after so
> 
> you don't end up building with an out-of-date RC going forward).
> 
> 
> 
> ===
> 
> What should happen to JIRA tickets still targeting 3.3.4?
> 
> ===
> 
> 
> 
> The current list of open tickets targeted at 3.3.4 can be found at:
> 
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.4
> 
> 
> Committers should look at those and triage. Extremely important bug
> 
> fixes, documentation, and API tweaks that impact compatibility should
> 
> be worked on immediately. Everything else please retarget to an
> 
> appropriate release.
> 
> 
> 
> ==
> 
> But my bug isn't fixed?
> 
> ==
> 
> 
> 
> In order to make timely releases, we will typically not hold the
> 
> release unless the bug in question is a regression from the previous
> 
> release. That being said, if there is something which is a regression
> 
> that has not been correctly targeted please ping me or a committer to
> 
> help target the issue.
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark on Yarn with Java 17

2023-12-09 Thread Dongjoon Hyun
Please simply try Apache Spark 3.3+ (SPARK-33772) with Java 17 on your
cluster, Jason.

I believe you can set up your Spark 3.3+ jobs to run with Java 17 while
your cluster (DataNode/NameNode/ResourceManager/NodeManager) is still
sitting on Java 8.
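
For illustration, a minimal spark-submit sketch of such a setup, assuming JDK 17
is installed at the same path on every node (the paths, class, and jar below are
placeholders; `spark.executorEnv.*` and `spark.yarn.appMasterEnv.*` are the
standard per-environment configurations):

# Point the YARN ApplicationMaster and the executors at JDK 17.
$ spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/jdk-17 \
    --conf spark.executorEnv.JAVA_HOME=/opt/jdk-17 \
    --class com.example.MyApp \
    my-app.jar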

Dongjoon.

On Fri, Dec 8, 2023 at 11:12 PM Jason Xu  wrote:

> Dongjoon, thank you for the fast response!
>
> Apache Spark 4.0.0 depends only on the Apache Hadoop client library.
>
> To better understand your answer, does that mean a Spark application built
> with Java 17 can successfully run on a Hadoop 3.3 cluster with a
> Java 8 runtime?
>
> On Fri, Dec 8, 2023 at 4:33 PM Dongjoon Hyun  wrote:
>
>> Hi, Jason.
>>
>> Apache Spark 4.0.0 depends only on the Apache Hadoop client library.
>>
>> You can track all `Apache Spark 4` activities including Hadoop dependency
>> here.
>>
>> https://issues.apache.org/jira/browse/SPARK-44111
>> (Prepare Apache Spark 4.0.0)
>>
>> According to the release history, the original suggested timeline was
>> June, 2024.
>> - Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
>> - Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
>> - Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
>> - Spark 4: 2024.06 (4.0.0, NEW)
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2023/12/08 23:50:15 Jason Xu wrote:
>> > Hi Spark devs,
>> >
>> > According to the Spark 3.5 release notes, Spark 4 will no longer support
>> > Java 8 and 11 (link:
>> https://spark.apache.org/releases/spark-release-3-5-0.html#upcoming-removal).
>> >
>> > My company is using Spark on Yarn with Java 8 now. When considering a
>> > future upgrade to Spark 4, one issue we face is that the latest version
>> of
>> > Hadoop (3.3) does not yet support Java 17. There is an open ticket (
>> > HADOOP-17177 <https://issues.apache.org/jira/browse/HADOOP-17177>) for
>> this
>> > issue, which has been open for over two years.
>> >
>> > My question is: Does the release of Spark 4 depend on the availability
>> of
>> > Java 17 support in Hadoop? Additionally, do we have a rough estimate for
>> > the release of Spark 4? Thanks!
>> >
>> >
>> > Cheers,
>> >
>> > Jason Xu
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Spark on Yarn with Java 17

2023-12-08 Thread Dongjoon Hyun
Hi, Jason.

Apache Spark 4.0.0 depends only on the Apache Hadoop client library.

You can track all `Apache Spark 4` activities including Hadoop dependency here.

https://issues.apache.org/jira/browse/SPARK-44111
(Prepare Apache Spark 4.0.0)

According to the release history, the original suggested timeline was June, 
2024.
- Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
- Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
- Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
- Spark 4: 2024.06 (4.0.0, NEW)

Thanks,
Dongjoon.

On 2023/12/08 23:50:15 Jason Xu wrote:
> Hi Spark devs,
> 
> According to the Spark 3.5 release notes, Spark 4 will no longer support
> Java 8 and 11 (link).
> 
> My company is using Spark on Yarn with Java 8 now. When considering a
> future upgrade to Spark 4, one issue we face is that the latest version of
> Hadoop (3.3) does not yet support Java 17. There is an open ticket
> (HADOOP-17177) for this
> issue, which has been open for over two years.
> 
> My question is: Does the release of Spark 4 depend on the availability of
> Java 17 support in Hadoop? Additionally, do we have a rough estimate for
> the release of Spark 4? Thanks!
> 
> 
> Cheers,
> 
> Jason Xu
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Spark 3.3.4 (RC1)

2023-12-08 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version
3.3.4.

The vote is open until December 15th 1AM (PST) and passes if a majority +1
PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.4
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.3.4-rc1 (commit
18db204995b32e87a650f2f09f9bcf047ddafa90)
https://github.com/apache/spark/tree/v3.3.4-rc1

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/


Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS
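
For example, the signatures can be verified along these lines (a minimal
sketch; the exact artifact file name under the RC directory is an assumption):

# Import the release managers' keys, then check a binary artifact.
$ curl -sLO https://dist.apache.org/repos/dist/dev/spark/KEYS
$ gpg --import KEYS
$ curl -sLO https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop3.tgz
$ curl -sLO https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop3.tgz.asc
$ gpg --verify spark-3.3.4-bin-hadoop3.tgz.asc spark-3.3.4-bin-hadoop3.tgz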


The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1451/


The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/


The list of bug fixes going into 3.3.4 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12353505


This release is using the release script of the tag v3.3.4-rc1.


FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install
the current RC and see if anything important breaks; in the Java/Scala
world, you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
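
For example, a minimal sketch of the PySpark check (the tarball name under the
RC bin directory is an assumption):

# Install the RC into a throwaway virtual env and smoke-test the import.
$ python -m venv /tmp/spark-3.3.4-rc1 && source /tmp/spark-3.3.4-rc1/bin/activate
$ pip install https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz
$ python -c "import pyspark; print(pyspark.__version__)"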



===
What should happen to JIRA tickets still targeting 3.3.4?
===

The current list of open tickets targeted at 3.3.4 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.3.4

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted, please ping me or a committer to
help target the issue.


Re: Apache Spark 3.3.4 EOL Release?

2023-12-08 Thread Dongjoon Hyun
Thank you, Mridul, and Kent, too.

Additionally, thank you for volunteering as a release manager, Jungtaek.

For the 3.3.4 EOL release, I've already been testing and preparing for one
week since my first email.

So, why don't you proceed with the Apache Spark 3.5.1 release? It has 142
patches already.

$ git log --oneline v3.5.0..HEAD | wc -l
 142

I'd like to recommend that you proceed by sending an independent discussion
email to the dev mailing list.

I'd love to see Apache Spark 3.5.1 in December, too.

BTW, as you mentioned, there is no strict timeline for 3.5.1, so take your
time.

Thanks,
Dongjoon.



On Fri, Dec 8, 2023 at 2:04 AM Jungtaek Lim 
wrote:

> +1 to release 3.3.4 and consider 3.3 as EOL.
>
> Btw, it'd probably be ideal if we could offer the opportunity of
> experiencing the release process to people who haven't had a chance to go
> through it (when there are people who are happy to take it). If you don't mind
> and we are not very strict on the timeline, I'd be happy to volunteer and
> give it a try.
>
> On Tue, Dec 5, 2023 at 12:12 PM Kent Yao  wrote:
>
>> +1
>>
>> Thank you for driving this EOL release, Dongjoon!
>>
>> Kent Yao
>>
>> On 2023/12/04 19:40:10 Mridul Muralidharan wrote:
>> > +1
>> >
>> > Regards,
>> > Mridul
>> >
>> > On Mon, Dec 4, 2023 at 11:40 AM L. C. Hsieh  wrote:
>> >
>> > > +1
>> > >
>> > > Thanks Dongjoon!
>> > >
>> > > On Mon, Dec 4, 2023 at 9:26 AM Yang Jie  wrote:
>> > > >
>> > > > +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
>> > > >
>> > > > Jie Yang
>> > > >
>> > > > On 2023/12/04 15:08:25 Tom Graves wrote:
>> > > > >  +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
>> > > > > Tom
>> > > > > On Friday, December 1, 2023 at 02:48:22 PM CST, Dongjoon Hyun
>> <
>> > > dongjoon.h...@gmail.com> wrote:
>> > > > >
>> > > > >  Hi, All.
>> > > > >
>> > > > > Since the Apache Spark 3.3.0 RC6 vote passed on Jun 14, 2022,
>> > > branch-3.3 has been maintained and served well until now.
>> > > > >
>> > > > > - https://github.com/apache/spark/releases/tag/v3.3.0 (tagged on
>> Jun
>> > > 9th, 2022)
>> > > > > -
>> https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm
>> > > (vote result on June 14th, 2022)
>> > > > >
>> > > > > As of today, branch-3.3 has 56 additional patches after v3.3.3
>> (tagged
>> > > on Aug 3rd, about 4 months ago) and reaches end-of-life this month
>> > > according to the Apache Spark release cadence,
>> > > https://spark.apache.org/versioning-policy.html .
>> > > > >
>> > > > > $ git log --oneline v3.3.3..HEAD | wc -l
>> > > > > 56
>> > > > >
>> > > > > Along with the recent Apache Spark 3.4.2 release, I hope the
>> users can
>> > > get a chance to have these last bits of Apache Spark 3.3.x, and I'd
>> like to
>> > > propose to have Apache Spark 3.3.4 EOL Release vote on December 11th
>> and
>> > > volunteer as the release manager.
>> > > > >
>> > > > > WDYT?
>> > > > >
>> > > > > Please let us know if you need more patches on branch-3.3.
>> > > > >
>> > > > > Thanks,
>> > > > > Dongjoon.
>> > > > >
>> > > >
>> > > >
>> -
>> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > > >
>> > >
>> > > -
>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > >
>> > >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Apache Spark 3.3.4 EOL Release?

2023-12-04 Thread Dongjoon Hyun
Thank you all.

Dongjoon.

On Mon, Dec 4, 2023 at 9:40 AM L. C. Hsieh  wrote:

> +1
>
> Thanks Dongjoon!
>
> On Mon, Dec 4, 2023 at 9:26 AM Yang Jie  wrote:
> >
> > +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
> >
> > Jie Yang
> >
> > On 2023/12/04 15:08:25 Tom Graves wrote:
> > >  +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
> > > Tom
> > > On Friday, December 1, 2023 at 02:48:22 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
> > >
> > >  Hi, All.
> > >
> > > Since the Apache Spark 3.3.0 RC6 vote passed on Jun 14, 2022,
> branch-3.3 has been maintained and served well until now.
> > >
> > > - https://github.com/apache/spark/releases/tag/v3.3.0 (tagged on Jun
> 9th, 2022)
> > > - https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm
> (vote result on June 14th, 2022)
> > >
> > > As of today, branch-3.3 has 56 additional patches after v3.3.3 (tagged
> on Aug 3rd, about 4 months ago) and reaches end-of-life this month
> according to the Apache Spark release cadence,
> https://spark.apache.org/versioning-policy.html .
> > >
> > > $ git log --oneline v3.3.3..HEAD | wc -l
> > > 56
> > >
> > > Along with the recent Apache Spark 3.4.2 release, I hope the users can
> get a chance to have these last bits of Apache Spark 3.3.x, and I'd like to
> propose to have Apache Spark 3.3.4 EOL Release vote on December 11th and
> volunteer as the release manager.
> > >
> > > WDYT?
> > >
> > > Please let us know if you need more patches on branch-3.3.
> > >
> > > Thanks,
> > > Dongjoon.
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


`orc-format` 1.0 (ORC-1531) for Apache ORC 2.0

2023-12-03 Thread Dongjoon Hyun
Hi, All.

As one of the key parts of Apache ORC 2.0, we've been discussing a new
repository and module, `orc-format`, in the following issue:

https://github.com/apache/orc/issues/1543

Now, we are ready to create a new repo.

Please take a look at the POC repo and code and let us know your thoughts.

Bests,
Dongjoon


  1   2   3   4   5   6   7   8   9   >