Re: [DISCUSS] clarify the definition of behavior changes

2024-05-28 Thread Wenchen Fan
Hi all,

I've created a PR to put the behavior change guideline on the Spark
website: https://github.com/apache/spark-website/pull/518 . Please leave
comments if you have any, thanks!

On Wed, May 15, 2024 at 1:41 AM Wenchen Fan  wrote:

> Thanks all for the feedback here! Let me put up a new version, which
> clarifies the definition of "users":
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. The "user" here is not only the user who writes queries and/or
> develops Spark plugins, but also the user who deploys and/or manages Spark
> clusters. New features, and even bug fixes that eliminate NPE or correct
> query results, are behavior changes. Things like performance improvement,
> code refactoring, and changes to unreleased APIs/features are not. All
> behavior changes should be called out in the PR description. We need to
> write an item in the migration guide (and probably legacy config) for those
> that may break users when upgrading:
>
>- Bug fixes that change query results. Users may need to do backfill
>to correct the existing data and must know about these correctness fixes.
>- Bug fixes that change query schema. Users may need to update the
>schema of the tables in their data pipelines and must know about these
>changes.
>- Remove configs
>- Rename error class/condition
>- Any non-additive change to the public Python/SQL/Scala/Java/R APIs
>(including developer APIs): rename function, remove parameters, add
>parameters, rename parameters, change parameter default values, etc. These
>changes should be avoided in general, or done in a binary-compatible
>way like deprecating and adding a new function instead of renaming.
>- Any non-additive change to the way Spark should be deployed and
>managed.
>
> The list above is not supposed to be comprehensive. Anyone can raise a
> concern when reviewing PRs and ask the PR author to add a migration guide
> entry if they believe the change is risky and may break users.
>
> On Thu, May 2, 2024 at 10:25 PM Will Raschkowski <
> wraschkow...@palantir.com> wrote:
>
>> To add some user perspective, I wanted to share our experience from
>> automatically upgrading tens of thousands of jobs from Spark 2 to 3 at
>> Palantir:
>>
>>
>>
>> We didn't mind "loud" changes that threw exceptions. We have some infra
>> to try running jobs with Spark 3 and fall back to Spark 2 if there's an
>> exception. E.g., the datetime parsing and rebasing migration in Spark 3 was
>> great: Spark threw a helpful exception but never silently changed results.
>> Similarly, for things listed in the migration guide as silent changes
>> (e.g., add_months's handling of last-day-of-month), we wrote custom check
>> rules to throw unless users acknowledged the change through config.
>>
>>
>>
>> Silent changes *not* in the migration guide were really bad for us:
>> Trusting the migration guide to be exhaustive, we automatically upgraded
>> jobs which then “succeeded” but wrote incorrect results. For example, some
>> expression increased timestamp precision in Spark 3; a query implicitly
>> relied on the reduced precision, and then produced bad results on upgrade.
>> It’s a silly query but a note in the migration guide would have helped.
>>
>>
>>
>> To summarize: the migration guide was invaluable, we appreciated every
>> entry, and we'd appreciate Wenchen's stricter definition of "behavior
>> changes" (especially for silent ones).
>>
>>
>>
>> *From: *Nimrod Ofek 
>> *Date: *Thursday, 2 May 2024 at 11:57
>> *To: *Wenchen Fan 
>> *Cc: *Erik Krogen , Spark dev list <
>> dev@spark.apache.org>
>> *Subject: *Re: [DISCUSS] clarify the definition of behavior changes
>>
>>
>> Hi Erik and Wenchen,
>>
>>
>>
>> I think a good practice with public APIs, and with internal APIs that have
>> a big impact and a lot of usage, is to ease in changes: provide defaults
>> for new parameters that keep the former behaviour, keep a method with the
>> previous signature with a deprecation notice, and delete that deprecated
>> function in the next release. That way the actual break happens in the next
>> release, after all libraries have had a chance to align with the API, and
>> upgrades can be done while already using the new version.
>>
>>

Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-28 Thread Wenchen Fan
one correction: "The tag to be voted on is v4.0.0-preview1-rc2 (commit
7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66)" should be "The tag to be voted
on is v4.0.0-preview1-rc3 (commit
7a7a8bc4bab591ac8b98b2630b38c57adf619b82):"

On Tue, May 28, 2024 at 11:48 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 4.0.0-preview1.
>
> The vote is open until May 31 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v4.0.0-preview1-rc2 (commit
> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1456/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
>
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env, install
> the current RC, and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>


[VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-28 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview1.

The vote is open until May 31 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v4.0.0-preview1-rc2 (commit
7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
https://github.com/apache/spark/tree/v4.0.0-preview1-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1456/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install
the current RC, and see if anything important breaks; in Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
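
For Java/Scala, a minimal build.sbt sketch for testing against this RC's
staging repository could look like the following (illustrative only; it
assumes the staging artifacts are published under the version string
4.0.0-preview1):

    // build.sbt -- sketch for resolving the RC from the staging repository
    resolvers += "Apache Spark 4.0.0-preview1 RC3 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1456/"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0-preview1"

As noted above, clean up the local artifact cache (e.g., ~/.ivy2 and the
Coursier cache) before and after testing.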


Re: [VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Wenchen Fan
Thanks for the quick reply! I'm cutting RC3 now.

On Tue, May 28, 2024 at 2:28 AM Kent Yao  wrote:

> -1
>
> You've updated your key in [2]  with a new one [1]. I believe you should
> add your new key without removing the old one. Otherwise, users cannot
> verify those archived releases you published.
>
> Thanks,
> Kent Yao
>
> [1] https://dist.apache.org/repos/dist/dev/spark/KEYS
> [2] https://downloads.apache.org/spark/KEYS
>
> On 2024/05/28 07:52:45 Yi Wu wrote:
> > -1
> > I think we should include this bug fix
> >
> https://github.com/apache/spark/commit/6cd1ccc56321dfa52672cd25f4cfdf2bbc86b3ea
> .
> > The bug can lead to the unrecoverable job failure.
> >
> > Thanks,
> > Yi
> >
> > On Tue, May 28, 2024 at 3:45 PM Wenchen Fan  wrote:
> >
> > > Please vote on releasing the following candidate as Apache Spark
> version
> > > 4.0.0-preview1.
> > >
> > > The vote is open until May 31 PST and passes if a majority +1 PMC votes
> > > are cast, with
> > > a minimum of 3 +1 votes.
> > >
> > > [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see http://spark.apache.org/
> > >
> > > The tag to be voted on is v4.0.0-preview1-rc2 (commit
> > > 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> > > https://github.com/apache/spark/tree/v4.0.0-preview1-rc2
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-bin/
> > >
> > > Signatures used for Spark RCs can be found in this file:
> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >
> > > The staging repository for this release can be found at:
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1455/
> > >
> > > The documentation corresponding to this release can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-docs/
> > >
> > > The list of bug fixes going into 4.0.0 can be found at the following
> URL:
> > > https://issues.apache.org/jira/projects/SPARK/versions/12353359
> > >
> > > FAQ
> > >
> > > =
> > > How can I help test this release?
> > > =
> > >
> > > If you are a Spark user, you can help us test this release by taking
> > > an existing Spark workload and running it on this release candidate, then
> > > reporting any regressions.
> > >
> > > If you're working in PySpark you can set up a virtual env, install
> > > the current RC, and see if anything important breaks; in Java/Scala
> > > you can add the staging repository to your project's resolvers and test
> > > with the RC (make sure to clean up the artifact cache before/after so
> > > you don't end up building with an out-of-date RC going forward).
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview1.

The vote is open until May 31 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v4.0.0-preview1-rc2 (commit
7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
https://github.com/apache/spark/tree/v4.0.0-preview1-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1455/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install
the current RC, and see if anything important breaks; in Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-15 Thread Wenchen Fan
Thanks all for the feedback here! Let me put up a new version, which
clarifies the definition of "users":

Behavior changes mean user-visible functional changes in a new release via
public APIs. The "user" here is not only the user who writes queries and/or
develops Spark plugins, but also the user who deploys and/or manages Spark
clusters. New features, and even bug fixes that eliminate NPE or correct
query results, are behavior changes. Things like performance improvement,
code refactoring, and changes to unreleased APIs/features are not. All
behavior changes should be called out in the PR description. We need to
write an item in the migration guide (and probably legacy config) for those
that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Remove configs
   - Rename error class/condition
   - Any non-additive change to the public Python/SQL/Scala/Java/R APIs
   (including developer APIs): rename function, remove parameters, add
   parameters, rename parameters, change parameter default values, etc. These
   changes should be avoided in general, or done in a binary-compatible
   way like deprecating and adding a new function instead of renaming.
   - Any non-additive change to the way Spark should be deployed and
   managed.

The list above is not supposed to be comprehensive. Anyone can raise a
concern when reviewing PRs and ask the PR author to add a migration guide
entry if they believe the change is risky and may break users. A minimal
sketch of the "deprecate and add" approach from the list is included below.
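
Here is a minimal Scala sketch of that approach (hypothetical names, not from
the Spark codebase): the old entry point stays as a deprecated forwarder so
existing callers keep compiling, and the new name is added alongside it.

    object AggFunctions {
      // New API with the desired name.
      def approxCountDistinct(column: String): Long = {
        // real implementation would go here
        0L
      }

      // Old API kept as a deprecated forwarder instead of being renamed away,
      // which keeps the change source- and binary-compatible for one release.
      @deprecated("Use approxCountDistinct instead", "4.0.0")
      def countDistinctApprox(column: String): Long = approxCountDistinct(column)
    }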

On Thu, May 2, 2024 at 10:25 PM Will Raschkowski 
wrote:

> To add some user perspective, I wanted to share our experience from
> automatically upgrading tens of thousands of jobs from Spark 2 to 3 at
> Palantir:
>
>
>
> We didn't mind "loud" changes that threw exceptions. We have some infra to
> try running jobs with Spark 3 and fall back to Spark 2 if there's an exception.
> E.g., the datetime parsing and rebasing migration in Spark 3 was great:
> Spark threw a helpful exception but never silently changed results.
> Similarly, for things listed in the migration guide as silent changes
> (e.g., add_months's handling of last-day-of-month), we wrote custom check
> rules to throw unless users acknowledged the change through config.
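>
> As a rough illustration, such a check rule can be as small as the following
> Scala sketch (the config key and helper are hypothetical, not the actual
> implementation we used):
>
>     import org.apache.spark.sql.SparkSession
>
>     // Refuse to run until the user explicitly acknowledges a known silent change.
>     def requireAcknowledged(spark: SparkSession, changeId: String): Unit = {
>       val key = s"myorg.migration.acknowledged.$changeId"
>       if (!spark.conf.get(key, "false").toBoolean) {
>         throw new IllegalStateException(
>           s"Silent behavior change '$changeId' must be acknowledged by setting $key=true")
>       }
>     }
>
>     // e.g. requireAcknowledged(spark, "add_months-last-day-of-month")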
>
>
>
> Silent changes *not* in the migration guide were really bad for us:
> Trusting the migration guide to be exhaustive, we automatically upgraded
> jobs which then “succeeded” but wrote incorrect results. For example, some
> expression increased timestamp precision in Spark 3; a query implicitly
> relied on the reduced precision, and then produced bad results on upgrade.
> It’s a silly query but a note in the migration guide would have helped.
>
>
>
> To summarize: the migration guide was invaluable, we appreciated every
> entry, and we'd appreciate Wenchen's stricter definition of "behavior
> changes" (especially for silent ones).
>
>
>
> *From: *Nimrod Ofek 
> *Date: *Thursday, 2 May 2024 at 11:57
> *To: *Wenchen Fan 
> *Cc: *Erik Krogen , Spark dev list <
> dev@spark.apache.org>
> *Subject: *Re: [DISCUSS] clarify the definition of behavior changes
>
>
> Hi Erik and Wenchen,
>
>
>
> I think a good practice with public APIs, and with internal APIs that have
> a big impact and a lot of usage, is to ease in changes: provide defaults
> for new parameters that keep the former behaviour, keep a method with the
> previous signature with a deprecation notice, and delete that deprecated
> function in the next release. That way the actual break happens in the next
> release, after all libraries have had a chance to align with the API, and
> upgrades can be done while already using the new version.
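>
> As a rough Scala sketch of that pattern (hypothetical names): the new
> parameter's default keeps the former behaviour, and the previous signature
> is kept for one release with a deprecation notice before being deleted.
>
>     object DataLoader {
>       // The default for the new parameter preserves the old behaviour,
>       // so existing callers are unaffected.
>       def load(path: String, validateSchema: Boolean = false): Unit = {
>         if (validateSchema) println(s"validating schema for $path")
>         println(s"loading $path")
>       }
>
>       // Previous signature, deprecated now and deleted in the next release.
>       @deprecated("Use load(path, validateSchema) instead", "x.y.0")
>       def load(path: String): Unit = load(path, validateSchema = false)
>     }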
>
>
>
> Another thing is that we should probably examine which private APIs are
> used externally, to provide a better experience and proper public APIs
> to meet those needs (for instance, applicative metrics and some way of
> creating custom behaviour columns).
>
>
>
> Thanks,
>
> Nimrod
>
>
>
> On Thu, May 2, 2024 at 03:51, Wenchen Fan  wrote:
>
> Hi Erik,
>
>
>
> Thanks for sharing your thoughts! Note: developer APIs are also public
> APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking
> changes should be avoided as much as we can and new APIs should be
> mentioned

Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-15 Thread Wenchen Fan
RC1 failed because of this issue. I'll cut RC2 after we downgrade Jetty to
9.x.

On Sat, May 11, 2024 at 3:37 PM Cheng Pan  wrote:

> -1 (non-binding)
>
> A small question: the tag is orphaned, but I suppose it should belong to the
> master branch.
>
> It seems YARN integration is broken due to the javax => jakarta namespace
> migration. I filed SPARK-48238 and left some comments on
> https://github.com/apache/spark/pull/45154
>
> Caused by: java.lang.IllegalStateException: class
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a
> jakarta.servlet.Filter
> at
> org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
> ~[?:?]
> at
> java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
> ~[?:?]
> at
> java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
> ~[?:?]
> at
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> ... 38 more
>
> Thanks,
> Cheng Pan
>
>
> > On May 11, 2024, at 13:55, Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 4.0.0-preview1.
> >
> > The vote is open until May 16 PST and passes if a majority +1 PMC votes
> are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v4.0.0-preview1-rc1 (commit
> 7dcf77c739c3854260464d732dbfb9a0f54706e7):
> > https://github.com/apache/spark/tree/v4.0.0-preview1-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1454/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/
> >
> > The list of bug fixes going into 4.0.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353359
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running it on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env, install
> > the current RC, and see if anything important breaks; in Java/Scala
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Wenchen Fan
+1

On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:

> +1 (non-binding)
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> *Zhou JIANG*
>
>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-13 Thread Wenchen Fan
Hi Nicholas,

Thanks for your help! I'm definitely interested in participating in this
unification work. Let me know how I can help.

Wenchen

On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas 
wrote:

> Re: unification
>
> We also have a long-standing problem with how we manage Python
> dependencies, something I’ve tried (unsuccessfully
> <https://github.com/apache/spark/pull/27928>) to fix in the past.
>
> Consider, for example, how many separate places this numpy dependency is
> installed:
>
> 1.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
> 2.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
> 3.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
> 4.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
> 5.
> https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
> 6.
> https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
> 7.
> https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
> 8.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
> 9.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
> 10.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
> 11.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
> 12.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>
> None of those installations reference a unified version requirement, so
> naturally they are inconsistent across all these different lines. Some say
> `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In
> several cases there is no version requirement specified at all.
>
> I’m interested in trying again to fix this problem, but it needs to be in
> collaboration with a committer since I cannot fully test the release
> scripts. (This testing gap is what doomed my last attempt at fixing this
> problem.)
>
> Nick
>
>
> On May 13, 2024, at 12:18 AM, Wenchen Fan  wrote:
>
> After finishing the 4.0.0-preview1 RC1, I have more experience with this
> topic now.
>
> In fact, the main job of the release process, building packages and
> documents, is already tested in GitHub Actions jobs. However, the way we
> test them there is different from what we do in the release scripts.
>
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile:
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile:
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release
> process needs to set up more things so it may not be viable to use a single
> Dockerfile for both.
>
> 2. the execution code is different. Use building documents as an example:
> The release scripts:
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job:
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify
> them.
>
> It's better if we can run the release scripts as Github Action jobs, but I
> think it's more important to do the unification now.
>
> Thanks,
> Wenchen
>
>
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala  wrote:
>
>> Hello,
>>
>> I can answer some of your common questions with other Apache projects.
>>
>> > Who currently has permissions for Github actions? Is there a specific
>> owner for that today or a different volunteer each time?
>>
>> The Apache organization owns Github Actions, and committers (contributors
>> with write permissions) can retrigger/cancel a Github Actions workflow, but
>> Github Actions runners are managed by the Apache infra team.
>>
>> > What are the current limits of GitHub Actions, who set them - and what
>> is the process to change those (if possible at all, but I presume not all
>> Apache projects have the same limits)?
>>
>> For limits, I don't think there is

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Wenchen Fan
+1

On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:

> +1
>
> On Mon, May 13, 2024 at 08:39, Dongjoon Hyun  wrote:
> >
> > +1
> >
> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
> wrote:
> >>
> >> +1
> >>
> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
> >>>
> >>> +1
> >>>
> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
> >>> >
> >>> > +1
> >>> >
> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
> wrote:
> >>> >>
> >>> >> Hi all,
> >>> >>
> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
> Catalogs.
> >>> >>
> >>> >> Please also refer to:
> >>> >>
> >>> >>- Discussion thread:
> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
> >>> >>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >>> >>
> >>> >>
> >>> >> Please vote on the SPIP for the next 72 hours:
> >>> >>
> >>> >> [ ] +1: Accept the proposal as an official SPIP
> >>> >> [ ] +0
> >>> >> [ ] -1: I don’t think this is a good idea because …
> >>> >>
> >>> >>
> >>> >> Thank you!
> >>> >>
> >>> >> Liang-Chi Hsieh
> >>> >>
> >>> >>
> -
> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Wenchen Fan
After finishing the 4.0.0-preview1 RC1, I have more experience with this
topic now.

In fact, the main job of the release process, building packages and
documents, is already tested in GitHub Actions jobs. However, the way we
test them there is different from what we do in the release scripts.

1. the execution environment is different:
The release scripts define the execution environment with this Dockerfile:
https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
However, Github Action jobs use a different Dockerfile:
https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
We should figure out a way to unify it. The docker image for the release
process needs to set up more things so it may not be viable to use a single
Dockerfile for both.

2. the execution code is different. Use building documents as an example:
The release scripts:
https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
The Github Action job:
https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
I don't know which one is more correct, but we should definitely unify them.

It's better if we can run the release scripts as Github Action jobs, but I
think it's more important to do the unification now.

Thanks,
Wenchen


On Fri, May 10, 2024 at 12:34 AM Hussein Awala  wrote:

> Hello,
>
> I can answer some of your common questions with other Apache projects.
>
> > Who currently has permissions for Github actions? Is there a specific
> owner for that today or a different volunteer each time?
>
> The Apache organization owns Github Actions, and committers (contributors
> with write permissions) can retrigger/cancel a Github Actions workflow, but
> Github Actions runners are managed by the Apache infra team.
>
> > What are the current limits of GitHub Actions, who set them - and what
> is the process to change those (if possible at all, but I presume not all
> Apache projects have the same limits)?
>
> For limits, I don't think there is any significant limit, especially since
> the Apache organization has 900 donated runners used by its projects, and
> there is an initiative from the Infra team to add self-hosted runners
> running on Kubernetes (document).
>
> > Where should the artifacts be stored?
>
> Usually, we use Maven for jars, DockerHub for Docker images, and Github
> cache for workflow cache. But we can use Github artifacts to store any kind
> of package (even Docker images in the ghcr), which is fully accepted by
> Apache policies. Also if the project has a cloud account (AWS, GCP, Azure,
> ...), a bucket can be used to store some of the packages.
>
>
>  > Who should be permitted to sign a version - and what is the process for
> that?
>
> The Apache documentation is clear about this, by default only PMC members
> can be release managers, but we can contact the infra team to add one of
> the committers as a release manager (document). The process of creating a
> new version is described in this document.
>
>
> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek  wrote:
>
>> Following the conversation started with Spark 4.0.0 release, this is a
>> thread to discuss improvements to our release processes.
>>
>> I'll Start by raising some questions that probably should have answers to
>> start the discussion:
>>
>>
>>1. What is currently running in GitHub Actions?
>>2. Who currently has permissions for Github actions? Is there a
>>specific owner for that today or a different volunteer each time?
>>3. What are the current limits of GitHub Actions, who set them - and
>>what is the process to change those (if possible at all, but I presume not
>>all Apache projects have the same limits)?
>>4. What versions should we support as an output for the build?
>>5. Where should the artifacts be stored?
>>6. What should be the output? only tar or also a docker image
>>published somewhere?
>>7. Do we want to have a release on fixed dates or a manual release
>>upon request?
>>8. Who should be permitted to sign a version - and what is the
>>process for that?
>>
>>
>> Thanks!
>> Nimrod
>>
>


[VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview1.

The vote is open until May 16 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v4.0.0-preview1-rc1 (commit
7dcf77c739c3854260464d732dbfb9a0f54706e7):
https://github.com/apache/spark/tree/v4.0.0-preview1-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1454/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install
the current RC, and see if anything important breaks; in Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Wenchen Fan
Thanks for leading this project! Let's move forward.

On Fri, May 10, 2024 at 10:31 AM L. C. Hsieh  wrote:

> Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and
> others if I miss those who are participating in the discussion.
>
> I suppose we have reached a consensus or close to being in the design.
>
> If you have some more comments, please let us know.
>
> If not, I will go to start a vote soon after a few days.
>
> Thank you.
>
> On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi 
> wrote:
> >
> > Thanks to everyone who commented on the design doc. I updated the
> proposal and it is ready for another look. I hope we can converge and move
> forward with this effort!
> >
> > - Anton
> >
> > пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi 
> пише:
> >>
> >> Hi folks,
> >>
> >> I'd like to start a discussion on SPARK-44167 that aims to enable
> catalogs to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
> >>
> >> SPIP [1] contains proposed API changes and parser extensions. Any
> feedback is more than welcome!
> >>
> >> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
> >>
> >> Liang-Chi was kind enough to shepherd this effort. Thanks!
> >>
> >> - Anton
> >>
> >> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> >> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
> >>
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE:

I've successfully uploaded the release packages:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
(I skipped SparkR as I was not able to fix the errors, I'll get back to it
later)

However, there is a new issue with doc building:
https://github.com/apache/spark/pull/44628#discussion_r1595718574

I'll continue after the issue is fixed.

On Fri, May 10, 2024 at 12:29 AM Dongjoon Hyun 
wrote:

> Please re-try to upload, Wenchen. ASF Infra team bumped up our upload
> limit based on our request.
>
> > Your upload limit has been increased to 650MB
>
> Dongjoon.
>
>
>
> On Thu, May 9, 2024 at 8:12 AM Wenchen Fan  wrote:
>
>> I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776
>>
>> On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
>> wrote:
>>
>>> In addition, FYI, I was the latest release manager with Apache Spark
>>> 3.4.3 (2024-04-15 Vote)
>>>
>>> According to my work log, I uploaded the following binaries to SVN from
>>> EC2 (us-west-2) without any issues.
>>>
>>> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
>>> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
>>> spark-3.4.3-bin-hadoop3-scala2.13.tgz
>>> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
>>> spark-3.4.3-bin-hadoop3.tgz
>>> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
>>> spark-3.4.3-bin-without-hadoop.tgz
>>> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
>>> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>>>
>>> Since Apache Spark 4.0.0-preview doesn't have Scala 2.12 combination,
>>> the total size should be smaller than 3.4.3 binaires.
>>>
>>> Given that, if there is any INFRA change, that could happen after 4/15.
>>>
>>> Dongjoon.
>>>
>>> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Could you file an INFRA JIRA issue with the error message and context
>>>> first, Wenchen?
>>>>
>>>> As you know, if we see something, we had better file a JIRA issue
>>>> because it could be not only an Apache Spark project issue but also all ASF
>>>> project issues.
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan 
>>>> wrote:
>>>>
>>>>> UPDATE:
>>>>>
>>>>> After resolving a few issues in the release scripts, I can finally
>>>>> build the release packages. However, I can't upload them to the staging 
>>>>> SVN
>>>>> repo due to a transmitting error, and it seems like a limitation from the
>>>>> server side. I tried it on both my local laptop and remote AWS instance,
>>>>> but neither works. These package binaries are like 300-400 MBs, and we 
>>>>> just
>>>>> did a release last month. Not sure if this is a new limitation due to cost
>>>>> saving.
>>>>>
>>>>> While I'm looking for help to get unblocked, I'm wondering if we can
>>>>> upload release packages to a public git repo instead, under the Apache
>>>>> account?
>>>>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776

On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
wrote:

> In addition, FYI, I was the latest release manager with Apache Spark 3.4.3
> (2024-04-15 Vote)
>
> According to my work log, I uploaded the following binaries to SVN from
> EC2 (us-west-2) without any issues.
>
> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
> spark-3.4.3-bin-hadoop3-scala2.13.tgz
> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
> spark-3.4.3-bin-hadoop3.tgz
> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
> spark-3.4.3-bin-without-hadoop.tgz
> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>
> Since Apache Spark 4.0.0-preview doesn't have Scala 2.12 combination, the
> total size should be smaller than 3.4.3 binaires.
>
> Given that, if there is any INFRA change, that could happen after 4/15.
>
> Dongjoon.
>
> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
> wrote:
>
>> Could you file an INFRA JIRA issue with the error message and context
>> first, Wenchen?
>>
>> As you know, if we see something, we had better file a JIRA issue because
>> it could be not only an Apache Spark project issue but also an issue for
>> all ASF projects.
>>
>> Dongjoon.
>>
>>
>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>>
>>> UPDATE:
>>>
>>> After resolving a few issues in the release scripts, I can finally build
>>> the release packages. However, I can't upload them to the staging SVN repo
>>> due to a transmitting error, and it seems like a limitation from the server
>>> side. I tried it on both my local laptop and remote AWS instance, but
>>> neither works. These package binaries are like 300-400 MBs, and we just did
>>> a release last month. Not sure if this is a new limitation due to cost
>>> saving.
>>>
>>> While I'm looking for help to get unblocked, I'm wondering if we can
>>> upload release packages to a public git repo instead, under the Apache
>>> account?
>>>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color, we should at
least add a test job to make sure the release script can produce the
packages correctly. Today it's kind of being manually tested by the
release manager each time, which slows down the release process. It's
better if we can automate it entirely, so that making a release is a simple
click by authorized people.

On Thu, May 9, 2024 at 9:48 PM Nimrod Ofek  wrote:

> Following the conversation started with Spark 4.0.0 release, this is a
> thread to discuss improvements to our release processes.
>
> I'll Start by raising some questions that probably should have answers to
> start the discussion:
>
>
>1. What is currently running in GitHub Actions?
>2. Who currently has permissions for Github actions? Is there a
>specific owner for that today or a different volunteer each time?
>3. What are the current limits of GitHub Actions, who set them - and
>what is the process to change those (if possible at all, but I presume not
>all Apache projects have the same limits)?
>4. What versions should we support as an output for the build?
>5. Where should the artifacts be stored?
>6. What should be the output? only tar or also a docker image
>published somewhere?
>7. Do we want to have a release on fixed dates or a manual release
>upon request?
>8. Who should be permitted to sign a version - and what is the process
>for that?
>
>
> Thanks!
> Nimrod
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE:

After resolving a few issues in the release scripts, I can finally build
the release packages. However, I can't upload them to the staging SVN repo
due to a transmitting error, and it seems like a limitation from the server
side. I tried it on both my local laptop and remote AWS instance, but
neither works. These package binaries are like 300-400 MBs, and we just did
a release last month. Not sure if this is a new limitation due to cost
saving.

While I'm looking for help to get unblocked, I'm wondering if we can upload
release packages to a public git repo instead, under the Apache account?

On Thu, May 9, 2024 at 12:39 AM Holden Karau  wrote:

> That looks cool, maybe let’s split off a thread on how to improve our
> release processes?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:
>
>> On that note, GitHub recently released (public preview) a new feature
>> called Artifact Attestions which may be relevant/useful here: Introducing
>> Artifact Attestations–now in public beta - The GitHub Blog
>> <https://github.blog/2024-05-02-introducing-artifact-attestations-now-in-public-beta/>
>>
>> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:
>>
>>> I have no permissions so I can't do it myself, but I'm happy to help
>>> (although I am more familiar with GitLab CI/CD than GitHub Actions).
>>> Is there some point of contact who can provide me with the needed context
>>> and permissions?
>>> I'd also love to see why the costs are high and see how we can reduce
>>> them...
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
>>> wrote:
>>>
>>>> I think signing the artifacts produced from a secure CI sounds like a
>>>> good idea. I know we’ve been asked to reduce our GitHub action usage but
>>>> perhaps someone interested could volunteer to set that up.
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>>
>>>> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> Thanks for the reply.
>>>>>
>>>>> From my experience, a build on a build server would be much more
>>>>> predictable and less error-prone than building on some laptop, and of
>>>>> course much faster for producing builds: snapshots, early preview
>>>>> releases, release candidates, or final releases.
>>>>> It will enable us to have a preview version with the current changes (a
>>>>> snapshot version), either automatically every day or, if we need to save
>>>>> costs (although the build is really not expensive), with the click of a button.
>>>>>
>>>>> Regarding keys for signing - that's what vaults are for; all across the
>>>>> industry we use vaults (such as HashiCorp Vault). But if the build is
>>>>> automated and the only manual step is signing the release, for security
>>>>> reasons, that would be reasonable.
>>>>>
>>>>> Thanks,
>>>>> Nimrod
>>>>>
>>>>>
>>>>> On Wed, May 8, 2024 at 00:54, Holden Karau <
>>>>> holden.ka...@gmail.com> wrote:
>>>>>
>>>>>> Indeed. We could conceivably build the release in CI/CD but the final
>>>>>> verification / signing should be done locally to keep the keys safe 
>>>>>> (there
>>>>>> was some concern from earlier release processes).
>>>>>>
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>>
>>>>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Sorry for the novice question, Wen

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Wenchen Fan
UPDATE:

Unfortunately, it took me quite some time to set up my laptop and get it
ready for the release process (Docker Desktop doesn't work anymore, my PGP
key is lost, etc.). I'll start the RC process tomorrow my time. Thanks for
your patience!

Wenchen

On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:

> +1
>
>
>
> *From: *Jungtaek Lim 
> *Date: *Thursday, 2 May 2024 at 10:21
> *To: *Holden Karau 
> *Cc: *Chao Sun , Xiao Li ,
> Tathagata Das , Wenchen Fan <
> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
> Cheng Pan , Spark dev list ,
> Anish Shrigondekar 
> *Subject: *Re: [DISCUSS] Spark 4.0.0 release
>
>
>
> +1 love to see it!
>
>
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
> wrote:
>
> +1 :) yay previews
>
>
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
> +1
>
>
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
> +1 for next Monday.
>
>
>
> We can do more previews when the other features are ready for preview.
>
>
>
> On Wed, May 1, 2024 at 08:46, Tathagata Das  wrote:
>
> Next week sounds great! Thank you Wenchen!
>
>
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
> Yea I think a preview release won't hurt (without a branch cut). We don't
> need to wait for all the ongoing projects to be ready. How about we do a
> 4.0 preview release based on the current master branch next Monday?
>
>
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
> wrote:
>
> Hey all,
>
>
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard to
> do that without a Preview release. So the sooner we make a Preview release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
>
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
>
>
> Thanks!
>
>
>
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>
> Thank you all for the replies!
>
>
>
> To @Nicholas Chammas  : Thanks for cleaning
> up the error terminology and documentation! I've merged the first PR and
> let's finish others before the 4.0 release.
>
> To @Dongjoon Hyun  : Thanks for driving the ANSI
> on by default effort! Now the vote has passed, let's flip the config and
> finish the DataFrame error context feature before 4.0.
>
> To @Jungtaek Lim  : Ack. We can treat the
> Streaming state store data source as completed for 4.0 then.
>
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
>
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
> will we have a preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my understanding
> is that we want to release the feature to 4.0.0, but there are several
> remaining works to be done. While the tentative timeline for releasing is
> June 2024, what would be the tentative timeline for the RC cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June 2024),
> and I think it's time to prepare for it and discuss the ongoing projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I would
> like to volunteer as the release manager for Apache Spark 4.0.0 if there is
> no objection. Thank you all for the great work that fills Spark 4.0!
> >
> > Wenchen Fan
>
>
>
>
> --
>
> Twitter: https://twitter.com/holdenkarau
>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>


Re: ASF board report draft for May

2024-05-06 Thread Wenchen Fan
The preview release also needs a vote. I'll try my best to cut the RC on
Monday, but the actual release may take some time. Hopefully, we can get it
out this week but if the vote fails, it will take longer as we need more
RCs.

On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun 
wrote:

> +1 for Holden's comment. Yes, it would be great to mention `it` as "soon".
> (If Wenchen releases it on Monday, we can simply mention the release.)
>
> In addition, Apache Spark PMC received an official notice from ASF Infra
> team.
>
> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
> projects
>
> To track and comply with the new ASF Infra Policy as much as possible, we
> opened a blocker-level JIRA issue and have been working on it.
> - https://infra.apache.org/github-actions-policy.html
>
> Please include a sentence that the Apache Spark PMC is working on this under
> the following umbrella JIRA issue.
>
> https://issues.apache.org/jira/browse/SPARK-48094
> > Reduce GitHub Action usage according to ASF project allowance
>
> Thanks,
> Dongjoon.
>
>
> On Sun, May 5, 2024 at 3:45 PM Holden Karau 
> wrote:
>
>> Do we want to include that we’re planning on having a preview release of
>> Spark 4 so folks can see the APIs “soon”?
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
>> wrote:
>>
>>> It’s time for our quarterly ASF board report on Apache Spark this
>>> Wednesday. Here’s a draft, feel free to suggest changes.
>>>
>>> 
>>>
>>> Description:
>>>
>>> Apache Spark is a fast and general purpose engine for large-scale data
>>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
>>> well as a rich set of libraries including stream processing, machine
>>> learning, and graph analytics.
>>>
>>> Issues for the board:
>>>
>>> - None
>>>
>>> Project status:
>>>
>>> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and
>>> Spark 3.4.2 on April 18, 2024.
>>> - The votes on "SPIP: Structured Logging Framework for Apache Spark" and
>>> "Pure Python Package in PyPI (Spark Connect)" have passed.
>>> - The votes for two behavior changes have passed: "SPARK-4: Use ANSI
>>> SQL mode by default" and "SPARK-46122: Set
>>> spark.sql.legacy.createHiveTableByDefault to false".
>>> - The community decided that upcoming Spark 4.0 release will drop
>>> support for Python 3.8.
>>> - We started a discussion about the definition of behavior changes that
>>> is critical for version upgrades and user experience.
>>> - We've opened a dedicated repository for the Spark Kubernetes Operator
>>> at https://github.com/apache/spark-kubernetes-operator. We added a new
>>> version in Apache Spark JIRA for versioning of the Spark operator based on
>>> a vote result.
>>>
>>> Trademarks:
>>>
>>> - No changes since the last report.
>>>
>>> Latest releases:
>>> - Spark 3.4.3 was released on April 18, 2024
>>> - Spark 3.5.1 was released on February 28, 2024
>>> - Spark 3.3.4 was released on December 16, 2023
>>>
>>> Committers and PMC:
>>>
>>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
>>> Yikun Jiang).
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Hi Erik,

Thanks for sharing your thoughts! Note: developer APIs are also public APIs
(such as Data Source V2 API, Spark Listener API, etc.), so breaking changes
should be avoided as much as we can and new APIs should be mentioned in the
release notes. Breaking binary compatibility is also a "functional change"
and should be treated as a behavior change.

BTW, AFAIK some downstream libraries use private APIs such as Catalyst
Expression and LogicalPlan. It's too much work to track all the changes to
private APIs and I think it's the downstream library's responsibility to
check such changes in new Spark versions, or avoid using private APIs.
Exceptions can happen if certain private APIs are used too widely and we
should avoid breaking them.

Thanks,
Wenchen

On Wed, May 1, 2024 at 11:51 PM Erik Krogen  wrote:

> Thanks for raising this important discussion Wenchen! Two points I would
> like to raise, though I'm fully supportive of any improvements in this
> regard, my points below notwithstanding -- I am not intending to let
> perfect be the enemy of good here.
>
> On a similar note as Santosh's comment, we should consider how this
> relates to developer APIs. Let's say I am an end user relying on some
> library like frameless <https://github.com/typelevel/frameless>, which
> relies on developer APIs in Spark. When we make a change to Spark's
> developer APIs that requires a corresponding change in frameless, I don't
> directly see that change as an end user, but it *does* impact me, because
> now I have to upgrade to a new version of frameless that supports those new
> changes. This can have ripple effects across the ecosystem. Should we call
> out such changes so that end users understand the potential impact to
> libraries they use?
>
> Second point, what about binary compatibility? Currently our versioning
> policy says "Link-level compatibility is something we’ll try to guarantee
> in future releases." (FWIW, it has said this since at least 2016
> <https://web.archive.org/web/20161127193643/https://spark.apache.org/versioning-policy.html>...)
> One step towards this would be to clearly call out any binary-incompatible
> changes in our release notes, to help users understand if they may be
> impacted. Similar to my first point, this has ripple effects across the
> ecosystem -- if I just use Spark itself, recompiling is probably not a big
> deal, but if I use N libraries that each depend on Spark, then after a
> binary-incompatible change is made I have to wait for all N libraries to
> publish new compatible versions before I can upgrade myself, presenting a
> nontrivial barrier to adoption.
>
> On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
>  wrote:
>
>> Thanks Wenchen for starting this!
>>
>> How do we define "the user" for spark?
>> 1. End users: There are some users that use spark as a service from a
>> provider
>> 2. Providers/Operators: There are some users that provide spark as a
>> service for their internal(on-prem setup with yarn/k8s)/external(Something
>> like EMR) customers
>> 3. ?
>>
>> Perhaps we need to consider infrastructure behavior changes as well to
>> accommodate the second group of users.
>>
>> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>>
>> Hi all,
>>
>> It's exciting to see innovations keep happening in the Spark community
>> and Spark keeps evolving itself. To make these innovations available to
>> more users, it's important to help users upgrade to newer Spark versions
>> easily. We've done a good job on it: the PR template requires the author to
>> write down user-facing behavior changes, and the migration guide contains
>> behavior changes that need attention from users. Sometimes behavior changes
>> come with a legacy config to restore the old behavior. However, we still
>> lack a clear definition of behavior changes and I propose the following
>> definition:
>>
>> Behavior changes mean user-visible functional changes in a new release
>> via public APIs. This means new features, and even bug fixes that eliminate
>> NPE or correct query results, are behavior changes. Things like performance
>> improvement, code refactoring, and changes to unreleased APIs/features are
>> not. All behavior changes should be called out in the PR description. We
>> need to write an item in the migration guide (and probably legacy config)
>> for those that may break users when upgrading:
>>
>>- Bug fixes that change query results. Users may need to do backfill
>>to correct the existing data and must know about these correctness fixes.
>>- Bug fixes that change query schema. Users may need to update the
>>

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Good point, Santosh!

I was originally targeting end users who write queries with Spark, as this
is probably the largest user base. But we should definitely consider other
users who deploy and manage Spark clusters. Those users are usually more
tolerant of behavior changes and I think it should be sufficient to put
behavior changes in this area in the release notes.

On Wed, May 1, 2024 at 11:18 PM Santosh Pingale 
wrote:

> Thanks Wenchen for starting this!
>
> How do we define "the user" for spark?
> 1. End users: There are some users that use spark as a service from a
> provider
> 2. Providers/Operators: There are some users that provide spark as a
> service for their internal(on-prem setup with yarn/k8s)/external(Something
> like EMR) customers
> 3. ?
>
> Perhaps we need to consider infrastructure behavior changes as well to
> accommodate the second group of users.
>
> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>
> Hi all,
>
> It's exciting to see innovations keep happening in the Spark community and
> Spark keeps evolving itself. To make these innovations available to more
> users, it's important to help users upgrade to newer Spark versions easily.
> We've done a good job on it: the PR template requires the author to write
> down user-facing behavior changes, and the migration guide contains
> behavior changes that need attention from users. Sometimes behavior changes
> come with a legacy config to restore the old behavior. However, we still
> lack a clear definition of behavior changes and I propose the following
> definition:
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. This means new features, and even bug fixes that eliminate NPE
> or correct query results, are behavior changes. Things like performance
> improvement, code refactoring, and changes to unreleased APIs/features are
> not. All behavior changes should be called out in the PR description. We
> need to write an item in the migration guide (and probably legacy config)
> for those that may break users when upgrading:
>
>- Bug fixes that change query results. Users may need to do backfill
>to correct the existing data and must know about these correctness fixes.
>- Bug fixes that change query schema. Users may need to update the
>schema of the tables in their data pipelines and must know about these
>changes.
>- Remove configs
>- Rename error class/condition
>- Any change to the public Python/SQL/Scala/Java/R APIs: rename
>function, remove parameters, add parameters, rename parameters, change
>    parameter default values, etc. These changes should be avoided in general,
>    or done in a compatible way, like deprecating and adding a new function
>    instead of renaming.
>
> Once we reach a conclusion, I'll document it in
> https://spark.apache.org/versioning-policy.html .
>
> Thanks,
> Wenchen
>
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Wenchen Fan
Yea I think a preview release won't hurt (without a branch cut). We don't
need to wait for all the ongoing projects to be ready. How about we do a
4.0 preview release based on the current master branch next Monday?

On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
wrote:

> Hey all,
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard to
> do that without a Preview release. So the sooner we make a Preview release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
> Thanks!
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>
>> Thank you all for the replies!
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>>> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>>> and I think it's time to prepare for it and discuss the ongoing projects:
>>> > •
>>> > ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new data type VARIANT
>>> > • STRING collation support
>>> > • Spark k8s operator versioning
>>> > Please help to add more items to this list that are missed here. I
>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>> there is no objection. Thank you all for the great work that fills Spark
>>> 4.0!
>>> >
>>> > Wenchen Fan
>>>
>>>


[DISCUSS] clarify the definition of behavior changes

2024-04-30 Thread Wenchen Fan
Hi all,

It's exciting to see innovations keep happening in the Spark community and
Spark keeps evolving itself. To make these innovations available to more
users, it's important to help users upgrade to newer Spark versions easily.
We've done a good job on it: the PR template requires the author to write
down user-facing behavior changes, and the migration guide contains
behavior changes that need attention from users. Sometimes behavior changes
come with a legacy config to restore the old behavior. However, we still
lack a clear definition of behavior changes and I propose the following
definition:

Behavior changes mean user-visible functional changes in a new release via
public APIs. This means new features, and even bug fixes that eliminate NPE
or correct query results, are behavior changes. Things like performance
improvement, code refactoring, and changes to unreleased APIs/features are
not. All behavior changes should be called out in the PR description. We
need to write an item in the migration guide (and probably legacy config)
for those that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Remove configs
   - Rename error class/condition
   - Any change to the public Python/SQL/Scala/Java/R APIs: rename
   function, remove parameters, add parameters, rename parameters, change
   parameter default values, etc. These changes should be avoided in general,
   or done in a compatible way, like deprecating and adding a new function
   instead of renaming (see the sketch below).
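
For example, a rename can be shipped compatibly by keeping the old name as a
deprecated forwarder (a minimal Scala sketch; the object and method names are
illustrative, not real Spark APIs):

  object MetricUtils {
    // New, preferred name.
    def computeSummary(values: Seq[Double]): Double = values.sum

    // Old name kept as a deprecated forwarder so existing callers keep
    // compiling and linking until they migrate.
    @deprecated("Use computeSummary instead", "4.0.0")
    def summary(values: Seq[Double]): Double = computeSummary(values)
  }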

Once we reach a conclusion, I'll document it in
https://spark.apache.org/versioning-policy.html .

Thanks,
Wenchen


Re: Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Wenchen Fan
Yes, Spark has a shim layer to support all Hive versions. It shouldn't be
an issue as many users create native Spark data source tables already
today, by explicitly putting the `USING` clause in the CREATE TABLE
statement.
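
For reference, that explicit form is just (table name illustrative):

  // The USING clause makes this a Spark native data source table regardless
  // of the legacy createHiveTableByDefault setting.
  spark.sql("CREATE TABLE sales (id BIGINT, amount DOUBLE) USING parquet")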

On Wed, May 1, 2024 at 12:56 AM Mich Talebzadeh 
wrote:

> @Wenchen Fan Got your explanation, thanks!
>
> My understanding is that even if we create Spark tables using Spark's
> native data sources, by default, the metadata about these tables will
> be stored in the Hive metastore. As a consequence, a Hive upgrade can
> potentially affect Spark tables. For example, depending on the
> severity of the changes, the Hive metastore schema might change, which
> could require Spark code to be updated to handle these changes in how
> table metadata is represented. Is this assertion correct?
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Wenchen Fan
To add more color:

Spark data source table and Hive Serde table are both stored in the Hive
metastore and keep the data files in the table directory. The only
difference is they have different "table provider", which means Spark will
use different reader/writer. Ideally the Spark native data source
reader/writer is faster than the Hive Serde ones.

What's more, the default format of Hive Serde is text. I don't think people
want to use text format tables in production. Most people will add `STORED
AS parquet` or `USING parquet` explicitly. By setting this config to false,
we have a more reasonable default behavior: creating Parquet tables (or
whatever is specified by `spark.sql.sources.default`).
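
To make the difference concrete, a rough sketch (table name illustrative;
assumes the legacy conf is set in the session as usual):

  // Old default (true): CREATE TABLE without USING/STORED AS becomes a
  // Hive text-serde table. New default (false): Spark creates a native
  // table using spark.sql.sources.default (Parquet unless overridden).
  spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
  spark.sql("CREATE TABLE events (id BIGINT, name STRING)")
  spark.sql("DESCRIBE TABLE EXTENDED events").show(truncate = false) // Provider: parquet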

On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan  wrote:

> @Mich Talebzadeh  there seems to be a
> misunderstanding here. The Spark native data source table is still stored
> in the Hive metastore, it's just that Spark will use a different (and
> faster) reader/writer for it. `hive-site.xml` should work as it is today.
>
> On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon  wrote:
>
>> +1
>>
>> It's a legacy conf that we should eventually remove. Spark should
>> create Spark tables by default, not Hive tables.
>>
>> Mich, for your workload, you can simply switch that conf off if it
>> concerns you. We also enabled ANSI (which you agreed on). It's a bit
>> awkward to stop in the middle for this compatibility reason while making
>> Spark sound. The compatibility has been tested in production for a long
>> time, so I don't see any particular issue with the compatibility case you
>> mentioned.
>>
>> On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> Hi @Wenchen Fan 
>>>
>>> Thanks for your response. I believe we have not had enough time to
>>> "DISCUSS" this matter.
>>>
>>> Currently in order to make Spark take advantage of Hive, I create a soft
>>> link in $SPARK_HOME/conf. FYI, my spark version is 3.4.0 and Hive is 3.1.1
>>>
>>>  /opt/spark/conf/hive-site.xml ->
>>> /data6/hduser/hive-3.1.1/conf/hive-site.xml
>>>
>>> This works fine for me in my lab. So in the future if we opt to use the
>>> setting "spark.sql.legacy.createHiveTableByDefault" to False, there will
>>> not be a need for this logical link?
>>> On the face of it, this looks fine but in real life it may require a
>>> number of changes to the old scripts. Hence my concern.
>>> As a matter of interest has anyone liaised with the Hive team to ensure
>>> they have introduced the additional changes you outlined?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>
>>>
>>> On Sun, 28 Apr 2024 at 09:34, Wenchen Fan  wrote:
>>>
>>>> @Mich Talebzadeh  thanks for sharing your
>>>> concern!
>>>>
>>>> Note: creating Spark native data source tables is usually Hive
>>>> compatible as well, unless we use features that Hive does not support
>>>> (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to
>>>> create a Spark native table in this case, instead of creating a Hive table
>>>> and failing.
>>>>
>>>> On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan  wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau 
>>>>> wrote:
>>>>> >
>>>>> > +1
>>>>> >
>>>>> > Twitter: https://twitter.com/holdenkarau
>>>>> > Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Wenchen Fan
@Mich Talebzadeh  there seems to be a
misunderstanding here. The Spark native data source table is still stored
in the Hive metastore, it's just that Spark will use a different (and
faster) reader/writer for it. `hive-site.xml` should work as it is today.

On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon  wrote:

> +1
>
> It's a legacy conf that we should eventually remove. Spark should
> create Spark tables by default, not Hive tables.
>
> Mich, for your workload, you can simply switch that conf off if it
> concerns you. We also enabled ANSI (which you agreed on). It's a bit
> awkward to stop in the middle for this compatibility reason while making
> Spark sound. The compatibility has been tested in production for a long
> time, so I don't see any particular issue with the compatibility case you
> mentioned.
>
> On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh 
> wrote:
>
>>
>> Hi @Wenchen Fan 
>>
>> Thanks for your response. I believe we have not had enough time to
>> "DISCUSS" this matter.
>>
>> Currently in order to make Spark take advantage of Hive, I create a soft
>> link in $SPARK_HOME/conf. FYI, my spark version is 3.4.0 and Hive is 3.1.1
>>
>>  /opt/spark/conf/hive-site.xml ->
>> /data6/hduser/hive-3.1.1/conf/hive-site.xml
>>
>> This works fine for me in my lab. So in the future if we opt to use the
>> setting "spark.sql.legacy.createHiveTableByDefault" to False, there will
>> not be a need for this logical link?
>> On the face of it, this looks fine but in real life it may require a
>> number of changes to the old scripts. Hence my concern.
>> As a matter of interest has anyone liaised with the Hive team to ensure
>> they have introduced the additional changes you outlined?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Sun, 28 Apr 2024 at 09:34, Wenchen Fan  wrote:
>>
>>> @Mich Talebzadeh  thanks for sharing your
>>> concern!
>>>
>>> Note: creating Spark native data source tables is usually Hive
>>> compatible as well, unless we use features that Hive does not support
>>> (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to
>>> create a Spark native table in this case, instead of creating a Hive table
>>> and failing.
>>>
>>> On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan  wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau 
>>>> wrote:
>>>> >
>>>> > +1
>>>> >
>>>> > Twitter: https://twitter.com/holdenkarau
>>>> > Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> >
>>>> >
>>>> > On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh 
>>>> wrote:
>>>> >>
>>>> >> +1
>>>> >>
>>>> >> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun 
>>>> wrote:
>>>> >> >
>>>> >> > I'll start with my +1.
>>>> >> >
>>>> >> > Dongjoon.
>>>> >> >
>>>> >> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>>>> >> > > Please vote on SPARK-46122 to set
>>>> spark.sql.legacy.createHiveTableByDefault
>>>> >> > > to `false` by default. The technical scope is defined in the
>>>> following PR.
>>>> >> > >
>>>> >> > > - DISCUSSION:
>>>> >> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>>>> >> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>>>> >> > > - PR: https://github.com/apache/spark/pull/46207
>>>> >> > >
>>>> >> > > The vote is open until April 30th 1AM (PST) and passes
>>>> >> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1
>>>> votes.
>>>> >> > >
>>>> >> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by
>>>> default
>>>> >> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault
>>>> because ...
>>>> >> > >
>>>> >> > > Thank you in advance.
>>>> >> > >
>>>> >> > > Dongjoon
>>>> >> > >
>>>> >> >
>>>> >> >
>>>> -
>>>> >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>> >> >
>>>> >>
>>>> >> -
>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>> >>
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-28 Thread Wenchen Fan
@Mich Talebzadeh  thanks for sharing your
concern!

Note: creating Spark native data source tables is usually Hive compatible
as well, unless we use features that Hive does not support (TIMESTAMP NTZ,
ANSI INTERVAL, etc.). I think it's a better default to create a Spark native
table in this case, instead of creating a Hive table and failing.
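
For instance, a table that uses a type Hive cannot read has to be a Spark
native table anyway (sketch; table name illustrative):

  // TIMESTAMP_NTZ is supported by Spark's native sources but not by Hive.
  spark.sql("CREATE TABLE event_log (id BIGINT, ts TIMESTAMP_NTZ) USING parquet")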

On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan  wrote:

> +1 (non-binding)
>
> Thanks,
> Cheng Pan
>
> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau 
> wrote:
> >
> > +1
> >
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> >
> > On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:
> >>
> >> +1
> >>
> >> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun 
> wrote:
> >> >
> >> > I'll start with my +1.
> >> >
> >> > Dongjoon.
> >> >
> >> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
> >> > > Please vote on SPARK-46122 to set
> spark.sql.legacy.createHiveTableByDefault
> >> > > to `false` by default. The technical scope is defined in the
> following PR.
> >> > >
> >> > > - DISCUSSION:
> >> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
> >> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
> >> > > - PR: https://github.com/apache/spark/pull/46207
> >> > >
> >> > > The vote is open until April 30th 1AM (PST) and passes
> >> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >> > >
> >> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by
> default
> >> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault
> because ...
> >> > >
> >> > > Thank you in advance.
> >> > >
> >> > > Dongjoon
> >> > >
> >> >
> >> > -
> >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
It's for the data source. For example, Spark's built-in Parquet
reader/writer is faster than the Hive serde Parquet reader/writer.
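
For existing Hive-serde Parquet tables, Spark can already substitute its
native reader/writer; a small sketch of the knob that controls this (table
name illustrative):

  // Default is true: Hive metastore Parquet tables are read/written with
  // Spark's built-in Parquet code path instead of the Hive SerDe.
  spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
  spark.sql("SELECT count(*) FROM hive_parquet_table").show()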

On Thu, Apr 25, 2024 at 9:55 PM Mich Talebzadeh 
wrote:

> I see a statement made as below  and I quote
>
> "The proposal of SPARK-46122 is to switch the default value of this
> configuration from `true` to `false` to use Spark native tables because
> we support better."
>
> Can you please elaborate on the above specifically with regard to the
> phrase ".. because
> we support better."
>
> Are you referring to the performance of Spark catalog (I believe it is
> internal) or integration with Spark?
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan  wrote:
>
>> +1
>>
>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao  wrote:
>>
>>> +1
>>>
>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>
>>> Thanks,
>>> Kent Yao
>>>
>>> Dongjoon Hyun  wrote on Thu, Apr 25, 2024 at 14:39:
>>> >
>>> > Hi, All.
>>> >
>>> > It's great to see community activities to polish 4.0.0 more and more.
>>> > Thank you all.
>>> >
>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the
>>> subtasks
> >> > of SPARK-44444 (Prepare Apache Spark 4.0.0),
>>> >
>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>> >Set `spark.sql.legacy.createHiveTableByDefault` to `false` by
>>> default
>>> >
>>> > This legacy configuration is about `CREATE TABLE` SQL syntax without
>>> > `USING` and `STORED AS`, which is currently mapped to `Hive` table.
>>> > The proposal of SPARK-46122 is to switch the default value of this
>>> > configuration from `true` to `false` to use Spark native tables because
>>> > we support better.
>>> >
>>> > In other words, Spark will use the value of `spark.sql.sources.default`
>>> > as the table provider instead of `Hive` like the other Spark APIs. Of
>>> course,
>>> > the users can get all the legacy behavior by setting back to `true`.
>>> >
>>> > Historically, this behavior change was merged once at Apache Spark
>>> 3.0.0
>>> > preparation via SPARK-30098 already, but reverted during the 3.0.0 RC
>>> period.
>>> >
>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE
>>> TABLE
>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as
>>> > provider for CREATE TABLE command
>>> >
>>> > At Apache Spark 3.1.0, we had another discussion about this and
>>> defined it
>>> > as one of legacy behavior via this configuration via reused ID,
>>> SPARK-30098.
>>> >
>>> > 2020-12-01:
>>> https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource
>>> as
>>> > provider for CREATE TABLE command
>>> >
>>> > Last year, we received two additional requests twice to switch this
>>> because
>>> > Apache Spark 4.0.0 is a good time to make a decision for the future
>>> direction.
>>> >
>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>> > 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea
>>> >
>>> >
>>> > WDYT? The technical scope is defined in the following PR which is one
>>> line of main
>>> > code, one line of migration guide, and a few lines of test code.
>>> >
>>> > - https://github.com/apache/spark/pull/46207
>>> >
>>> > Dongjoon.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
+1

On Thu, Apr 25, 2024 at 2:46 PM Kent Yao  wrote:

> +1
>
> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>
> Thanks,
> Kent Yao
>
> Dongjoon Hyun  wrote on Thu, Apr 25, 2024 at 14:39:
> >
> > Hi, All.
> >
> > It's great to see community activities to polish 4.0.0 more and more.
> > Thank you all.
> >
> > I'd like to bring SPARK-46122 (another SQL topic) to you from the
> subtasks
> > of SPARK-44444 (Prepare Apache Spark 4.0.0),
> >
> > - https://issues.apache.org/jira/browse/SPARK-46122
> >Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
> >
> > This legacy configuration is about `CREATE TABLE` SQL syntax without
> > `USING` and `STORED AS`, which is currently mapped to `Hive` table.
> > The proposal of SPARK-46122 is to switch the default value of this
> > configuration from `true` to `false` to use Spark native tables because
> > we support better.
> >
> > In other words, Spark will use the value of `spark.sql.sources.default`
> > as the table provider instead of `Hive` like the other Spark APIs. Of
> course,
> > the users can get all the legacy behavior by setting back to `true`.
> >
> > Historically, this behavior change was merged once at Apache Spark 3.0.0
> > preparation via SPARK-30098 already, but reverted during the 3.0.0 RC
> period.
> >
> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE
> TABLE
> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as
> > provider for CREATE TABLE command
> >
> > At Apache Spark 3.1.0, we had another discussion about this and defined
> it
> > as one of legacy behavior via this configuration via reused ID,
> SPARK-30098.
> >
> > 2020-12-01:
> https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as
> > provider for CREATE TABLE command
> >
> > Last year, we received two additional requests twice to switch this
> because
> > Apache Spark 4.0.0 is a good time to make a decision for the future
> direction.
> >
> > 2023-02-27: SPARK-42603 as an independent idea.
> > 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea
> >
> >
> > WDYT? The technical scope is defined in the following PR which is one
> line of main
> > code, one line of migration guide, and a few lines of test code.
> >
> > - https://github.com/apache/spark/pull/46207
> >
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-04-17 Thread Wenchen Fan
Thank you all for the replies!

To @Nicholas Chammas  : Thanks for cleaning up
the error terminology and documentation! I've merged the first PR and let's
finish others before the 4.0 release.
To @Dongjoon Hyun  : Thanks for driving the ANSI
on by default effort! Now the vote has passed, let's flip the config and
finish the DataFrame error context feature before 4.0.
To @Jungtaek Lim  : Ack. We can treat the
Streaming state store data source as completed for 4.0 then.
To @Cheng Pan  : Yea we definitely should have a
preview release. Let's collect more feedback on the ongoing projects and
then we can propose a date for the preview release.

On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my understanding
> is that we want to release the feature to 4.0.0, but there are several
> remaining works to be done. While the tentative timeline for releasing is
> June 2024, what would be the tentative timeline for the RC cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June 2024),
> and I think it's time to prepare for it and discuss the ongoing projects:
> > •
> > ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I would
> like to volunteer as the release manager for Apache Spark 4.0.0 if there is
> no objection. Thank you all for the great work that fills Spark 4.0!
> >
> > Wenchen Fan
>
>


Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Wenchen Fan
+1

On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:

> I'll start with my +1.
>
> - Checked checksum and signature
> - Checked Scala/Java/R/Python/SQL Document's Spark version
> - Checked published Maven artifacts
> - All CIs passed.
>
> Thanks,
> Dongjoon.
>
> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 3.4.3.
> >
> > The vote is open until April 18th 1AM (PDT) and passes if a majority +1
> PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.4.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> > The tag to be voted on is v3.4.3-rc2 (commit
> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
> > https://github.com/apache/spark/tree/v3.4.3-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1453/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
> >
> > The list of bug fixes going into 3.4.3 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
> >
> > This release is using the release script of the tag v3.4.3-rc2.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.4.3?
> > ===
> >
> > The current list of open tickets targeted at 3.4.3 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 3.4.3
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread Wenchen Fan
+1

On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun  wrote:

> I'll start from my +1.
>
> Dongjoon.
>
> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
> > The technical scope is defined in the following PR which is
> > one line of code change and one line of migration guide.
> >
> > - DISCUSSION:
> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
> > - PR: https://github.com/apache/spark/pull/46013
> >
> > The vote is open until April 17th 1AM (PST) and passes
> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Use ANSI SQL mode by default
> > [ ] -1 Do not use ANSI SQL mode by default because ...
> >
> > Thank you in advance.
> >
> > Dongjoon
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread Wenchen Fan
+1, the existing "NULL on error" behavior is terrible for data quality.
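
A small sketch of the difference the mode makes:

  // With ANSI mode off (the old default), an invalid cast silently yields NULL:
  spark.conf.set("spark.sql.ansi.enabled", "false")
  spark.sql("SELECT CAST('abc' AS INT) AS v").show()  // v = null

  // With ANSI mode on, the same query fails fast (CAST_INVALID_INPUT) instead
  // of silently corrupting results; TRY_CAST('abc' AS INT) remains the
  // explicit opt-in for NULL-on-error.
  spark.conf.set("spark.sql.ansi.enabled", "true")
  spark.sql("SELECT CAST('abc' AS INT) AS v").show()  // throws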

I have one concern about error reporting with DataFrame APIs. Query
execution is lazy and where the error happens can be far away from where
the dataframe/column was created. We are improving it (PR
) but it's not fully done yet.

On Fri, Apr 12, 2024 at 2:21 PM L. C. Hsieh  wrote:

> +1
>
> I believe ANSI mode is well developed after many releases. No doubt it
> could be used.
> Since it is very easy to disable it to restore to current behavior, I
> guess the impact could be limited.
> Do we have known the possible impacts such as what are the major
> changes (e.g., what kind of queries/expressions will fail)? We can
> describe them in the release note.
>
> On Thu, Apr 11, 2024 at 10:29 PM Gengliang Wang  wrote:
> >
> >
> > +1, enabling Spark's ANSI SQL mode in version 4.0 will significantly
> enhance data quality and integrity. I fully support this initiative.
> >
> > > In other words, the current Spark ANSI SQL implementation becomes the
> first implementation for Spark SQL users to face at first while providing
> > > `spark.sql.ansi.enabled=false` in the same way without losing any
> > > capability.
> >
> > BTW, the try_* functions and SQL Error Attribution Framework will also
> be beneficial in migrating to ANSI SQL mode.
> >
> >
> > Gengliang
> >
> >
> > On Thu, Apr 11, 2024 at 7:56 PM Dongjoon Hyun 
> wrote:
> >>
> >> Hi, All.
> >>
> >> Thanks to you, we've been achieving many things and have on-going SPIPs.
> >> I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more
> narrowly
> >> by asking your opinions about Apache Spark's ANSI SQL mode.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-44111
> >> Prepare Apache Spark 4.0.0
> >>
> >> SPARK-44444 was proposed last year (on 15/Jul/23) as one of the desirable
> >> items for 4.0.0 because it's a big behavior change.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-44444
> >> Use ANSI SQL mode by default
> >>
> >> Historically, spark.sql.ansi.enabled was added at Apache Spark 3.0.0
> and has
> >> been aiming to provide a better Spark SQL compatibility in a standard
> way.
> >> We also have a daily CI to protect the behavior too.
> >>
> >> https://github.com/apache/spark/actions/workflows/build_ansi.yml
> >>
> >> However, it's still behind the configuration with several known issues,
> e.g.,
> >>
> >> SPARK-41794 Reenable ANSI mode in test_connect_column
> >> SPARK-41547 Reenable ANSI mode in test_connect_functions
> >> SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard
> >>
> >> To be clear, we know that many DBMSes have their own implementations of
> >> SQL standard and not the same. Like them, SPARK-44444 aims to enable
> >> only the existing Spark's configuration, `spark.sql.ansi.enabled=true`.
> >> There is nothing more than that.
> >>
> >> In other words, the current Spark ANSI SQL implementation becomes the
> first
> >> implementation for Spark SQL users to face at first while providing
> >> `spark.sql.ansi.enabled=false` in the same way without losing any
> capability.
> >>
> >> If we don't want this change for some reason, we can simply exclude
> >> SPARK-44444 from SPARK-44111 as a part of the Apache Spark 4.0.0 preparation.
> >> It's time just to make a go/no-go decision for this item for the global
> optimization
> >> for Apache Spark 4.0.0 release. After 4.0.0, it's unlikely for us to aim
> >> for this again for the next four years until 2028.
> >>
> >> WDYT?
> >>
> >> Bests,
> >> Dongjoon
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[DISCUSS] Spark 4.0.0 release

2024-04-12 Thread Wenchen Fan
Hi all,

It's close to the previously proposed 4.0.0 release date (June 2024), and I
think it's time to prepare for it and discuss the ongoing projects:

   - ANSI by default
   - Spark Connect GA
   - Structured Logging
   - Streaming state store data source
   - new data type VARIANT
   - STRING collation support
   - Spark k8s operator versioning

Please help to add more items to this list that are missed here. I would
like to volunteer as the release manager for Apache Spark 4.0.0 if there is
no objection. Thank you all for the great work that fills Spark 4.0!

Wenchen Fan


Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Wenchen Fan
It's good to reduce duplication between different native accelerators of
Spark, and AFAIK there is already a project trying to solve it:
https://substrait.io/

I'm not sure why we need to do this inside Spark, instead of doing
the unification for a wider scope (for all engines, not only Spark).


On Wed, Apr 10, 2024 at 10:11 AM Holden Karau 
wrote:

> I like the idea of improving flexibility of Sparks physical plans and
> really anything that might reduce code duplication among the ~4 or so
> different accelerators.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for sharing, Jia.
>>
>> I have the same questions like the previous Weiting's thread.
>>
>> Do you think you can share the future milestone of Apache Gluten?
>> I'm wondering when the first stable release will come and how we can
>> coordinate across the ASF communities.
>>
>> > This project is still under active development now, and doesn't have a
>> stable release.
>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>>
>> In the Apache Spark community, Apache Spark 3.2 and 3.3 is the end of
>> support.
>> And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release) is
>> scheduled in October.
>>
>> For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if
>> there is something we need to do from Spark side.
>>
> +1 I think any changes need to target 4.0
>
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
>>
>>> Apache Spark currently lacks an official mechanism to support
>>> cross-platform execution of physical plans. The Gluten project offers a
>>> mechanism that utilizes the Substrait standard to convert and optimize
>>> Spark's physical plans. By introducing Gluten's plan conversion,
>>> validation, and fallback mechanisms into Spark, we can significantly
>>> enhance the portability and interoperability of Spark's physical plans,
>>> enabling them to operate across a broader spectrum of execution
>>> environments without requiring users to migrate, while also improving
>>> Spark's execution efficiency through the utilization of Gluten's advanced
>>> optimization techniques. And the integration of Gluten into Spark has
>>> already shown significant performance improvements with ClickHouse and
>>> Velox backends and has been successfully deployed in production by several
>>> customers.
>>>
>>> References:
>>> JIRA Ticket 
>>> SPIP Doc
>>> 
>>>
>>> Your feedback and comments are welcome and appreciated.  Thanks.
>>>
>>> Thanks,
>>> Jia Ke
>>>
>>


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Wenchen Fan
+1

On Mon, Mar 11, 2024 at 5:26 PM Hyukjin Kwon  wrote:

> +1
>
> On Mon, 11 Mar 2024 at 18:11, yangjie01 
> wrote:
>
>> +1
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From: *Haejoon Lee 
>> *Date: *Monday, 11 March 2024 at 17:09
>> *To: *Gengliang Wang 
>> *Cc: *dev 
>> *Subject: *Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark
>>
>>
>>
>> +1
>>
>>
>>
>> On Mon, Mar 11, 2024 at 10:36 AM Gengliang Wang  wrote:
>>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Structured Logging Framework for
>> Apache Spark
>>
>>
>> References:
>>
>>- JIRA ticket
>>
>> 
>>- SPIP doc
>>
>> 
>>- Discussion thread
>>
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>>
>> Gengliang Wang
>>
>>


Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-19 Thread Wenchen Fan
+1, thanks for making the release!

On Sat, Feb 17, 2024 at 3:54 AM Sean Owen  wrote:

> Yeah let's get that fix in, but it seems to be a minor test only issue so
> should not block release.
>
> On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
>
>> Very sorry. When I was fixing SPARK-45242
>> (https://github.com/apache/spark/pull/43594), I noticed that the
>> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
>> didn't realize that it had also been merged into branch-3.5, so I didn't
>> advocate for SPARK-45357 to be backported to branch-3.5.
>>
>>
>>
>> As far as I know, the condition to trigger this test failure is: when
>> using Maven to test the `connect` module, if  `sparkTestRelation` in
>> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
>> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
>> is indeed related to the order in which Maven executes the test cases in
>> the `connect` module.
>>
>>
>>
>> I have submitted a backport PR
>>  to branch-3.5, and if
>> necessary, we can merge it to fix this test issue.
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From: *Jungtaek Lim 
>> *Date: *Friday, 16 February 2024 at 22:15
>> *To: *Sean Owen , Rui Wang 
>> *Cc: *dev 
>> *Subject: *Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>>
>>
>>
>> I traced back relevant changes and got a sense of what happened.
>>
>>
>>
>> Yangjie figured out the issue via link
>> .
>> It's a tricky issue according to the comments from Yangjie - the test is
>> dependent on ordering of execution for test suites. He said it does not
>> fail in sbt, hence CI build couldn't catch it.
>>
>> He fixed it via link
>> ,
>> but we missed that the offending commit was also ported back to 3.5 as
>> well, hence the fix wasn't ported back to 3.5.
>>
>>
>>
>> Surprisingly, I can't reproduce locally even with maven. In my attempt to
>> reproduce, SparkConnectProtoSuite was executed third:
>> SparkConnectStreamingQueryCacheSuite and ExecuteEventsManagerSuite ran
>> first, and then SparkConnectProtoSuite. Maybe it's very specific to the environment,
>> not just maven? My env: MBP M1 pro chip, MacOS 14.3.1, Openjdk 17.0.9. I
>> used build/mvn (Maven 3.8.8).
>>
>>
>>
>> I'm not 100% sure this is something we should fail the release as it's a
>> test only and sounds very environment dependent, but I'll respect your call
>> on vote.
>>
>>
>>
>> Btw, looks like Rui also made a relevant fix via link
>> 
>>  (not
>> to fix the failing test but to fix other issues), but this also wasn't
>> ported back to 3.5. @Rui Wang  Do you think this
>> is a regression issue and warrants a new RC?
>>
>>
>>
>>
>>
>> On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
>>
>> Is anyone seeing this Spark Connect test failure? then again, I have some
>> weird issue with this env that always fails 1 or 2 tests that nobody else
>> can replicate.
>>
>>
>>
>> - Test observe *** FAILED ***
>>   == FAIL: Plans do not match ===
>>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
>> max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric,
>> [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L],
>> 44
>>+- LocalRelation , [id#0, name#0]
>>   +- LocalRelation , [id#0, name#0]
>> (PlanTest.scala:179)
>>
>>
>>
>> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>> DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately figured
>> out doc generation issue after tagging RC1.
>>
>>
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.5.1.
>>
>> The vote is open until February 18th 9AM (PST) and passes if a majority
>> +1 PMC votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>> 
>>
>> The tag to be voted on is v3.5.1-rc2 (commit
>> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
>> https://github.com/apache/spark/tree/v3.5.1-rc2
>> 
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
>> 

Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-10 Thread Wenchen Fan
+1

On Thu, Jan 11, 2024 at 9:32 AM L. C. Hsieh  wrote:

> +1
>
> On Wed, Jan 10, 2024 at 9:06 AM Bhuwan Sahni
>  wrote:
>
>> +1. This is a good addition.
>>
>> 
>> *Bhuwan Sahni*
>> Staff Software Engineer
>>
>> bhuwan.sa...@databricks.com
>> 500 108th Ave. NE
>> Bellevue, WA 98004
>> USA
>>
>>
>> On Wed, Jan 10, 2024 at 9:00 AM Burak Yavuz  wrote:
>>
>>> +1. Excited to see more stateful workloads with Structured Streaming!
>>>
>>>
>>> Best,
>>> Burak
>>>
>>> On Wed, Jan 10, 2024 at 8:21 AM Praveen Gattu
>>>  wrote:
>>>
 +1. This brings Structured Streaming a good solution for
 customers wanting to build stateful stream processing applications.

 On Wed, Jan 10, 2024 at 7:30 AM Bartosz Konieczny <
 bartkoniec...@gmail.com> wrote:

> +1 :)
>
> On Wed, Jan 10, 2024 at 9:57 AM Shixiong Zhu 
> wrote:
>
>> +1 (binding)
>>
>> Best Regards,
>> Shixiong Zhu
>>
>>
>> On Tue, Jan 9, 2024 at 6:47 PM 刘唯  wrote:
>>
>>> This is a good addition! +1
>>>
>>> Raghu Angadi  wrote on Tue, Jan 9, 2024 at 13:17:
>>>
 +1. This is a major improvement to the state API.

 Raghu.

 On Tue, Jan 9, 2024 at 1:42 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> +1 for me as well
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility
> for any loss, damage or destruction of data or any other property 
> which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary 
> damages
> arising from such loss, damage or destruction.
>
>
>
>
> On Tue, 9 Jan 2024 at 03:24, Anish Shrigondekar
>  wrote:
>
>> Thanks Jungtaek for creating the Vote thread.
>>
>> +1 (non-binding) from my side too.
>>
>> Thanks,
>> Anish
>>
>> On Tue, Jan 9, 2024 at 6:09 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Starting with my +1 (non-binding). Thanks!
>>>
>>> On Tue, Jan 9, 2024 at 9:37 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Hi all,

 I'd like to start the vote for SPIP: Structured Streaming -
 Arbitrary State API v2.

 References:

- JIRA ticket

- SPIP doc

 
- Discussion thread

 

 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks!
 Jungtaek Lim (HeartSaVioR)

>>>
>
> --
> Bartosz Konieczny
> freelance data engineer
> https://www.waitingforcode.com
> https://github.com/bartosz25/
> https://twitter.com/waitingforcode
>
>


Re: [DISCUSS] SPIP: Testing Framework for Spark UI Javascript files

2023-11-21 Thread Wenchen Fan
+1, very useful!

On Wed, Nov 22, 2023 at 10:29 AM Dongjoon Hyun 
wrote:

> Thank you for proposing a new UI test framework for Apache Spark 4.0.
>
> It looks very useful.
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Nov 21, 2023 at 1:51 AM Kent Yao  wrote:
>
>> Hi Spark Dev,
>>
>> This is a call to discuss a new SPIP: Testing Framework for
>> Spark UI Javascript files [1]. The SPIP aims to improve the test
>> coverage and develop experience for Spark UI-related javascript
>> codes.
>> The Jest [2], a JavaScript Testing Framework licensed under MIT, will
>> be used to build this dev and test-only module.
>> There is also a W.I.P. pull request [3] to show what it would be like.
>>
>> This thread will be open for at least the next 72 hours. Suggestions
>> are welcome.If there is no veto found, I will close this thread after
>> 2023-11-24 18:00(+08:00) and raise a new thread for voting.
>>
>> Thanks,
>> Kent Yao
>>
>> [1]
>> https://docs.google.com/document/d/1hWl5Q2CNNOjN5Ubyoa28XmpJtDyD9BtGtiEG2TT94rg/edit?usp=sharing
>> [2] https://github.com/jestjs/jest
>> [3] https://github.com/apache/spark/pull/43903
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPIP: State Data Source - Reader

2023-10-23 Thread Wenchen Fan
+1

On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim 
wrote:

> Starting with my +1 (non-binding). Thanks!
>
> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: State Data Source - Reader.
>>
>> The high level summary of the SPIP is that we propose a new data source
>> which enables a read ability for state store in the checkpoint, via batch
>> query. This would enable two major use cases 1) constructing tests with
>> verifying state store 2) inspecting values in state store in the scenario
>> of incident.
>>
>> References:
>>
>>- JIRA ticket 
>>- SPIP doc
>>
>> 
>>- Discussion thread
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>> Jungtaek Lim (HeartSaVioR)
>>
>


Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-03 Thread Wenchen Fan
Congrats!

On Wed, Oct 4, 2023 at 8:25 AM Hyukjin Kwon  wrote:

> Woohoo!
>
> On Tue, 3 Oct 2023 at 22:47, Hussein Awala  wrote:
>
>> Congrats to all of you!
>>
>> On Tue 3 Oct 2023 at 08:15, Rui Wang  wrote:
>>
>>> Congratulations! Well deserved!
>>>
>>> -Rui
>>>
>>>
>>> On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang  wrote:
>>>
 Congratulations to all! Well deserved!

 On Mon, Oct 2, 2023 at 10:16 PM Xiao Li  wrote:

> Hi all,
>
> The Spark PMC is delighted to announce that we have voted to add one
> new committer and two new PMC members. These individuals have consistently
> contributed to the project and have clearly demonstrated their expertise.
>
> New Committer:
> - Jiaan Geng (focusing on Spark Connect and Spark SQL)
>
> New PMCs:
> - Yuanjian Li
> - Yikun Jiang
>
> Please join us in extending a warm welcome to them in their new roles!
>
> Sincerely,
> The Spark PMC
>



Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Wenchen Fan
+1

On Tue, Sep 12, 2023 at 9:00 AM Yuanjian Li  wrote:

> +1 (non-binding)
>
> Yuanjian Li  wrote on Mon, Sep 11, 2023 at 09:36:
>
>> @Peter Toth  I've looked into the details of this
>> issue, and it appears that it's neither a regression in version 3.5.0 nor a
>> correctness issue. It's a bug related to a new feature. I think we can fix
>> this in 3.5.1 and list it as a known issue of the Scala client of Spark
>> Connect in 3.5.0.
>>
>> Mridul Muralidharan  于2023年9月10日周日 04:12写道:
>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li 
>>> wrote:
>>>
 Please vote on releasing the following candidate(RC5) as Apache Spark
 version 3.5.0.

 The vote is open until 11:59pm Pacific time Sep 11th and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.5.0

 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.5.0-rc5 (commit
 ce5ddad990373636e94071e7cef2f31021add07b):

 https://github.com/apache/spark/tree/v3.5.0-rc5

 The release files, including signatures, digests, etc. can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/

 Signatures used for Spark RCs can be found in this file:

 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1449

 The documentation corresponding to this release can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/

 The list of bug fixes going into 3.5.0 can be found at the following
 URL:

 https://issues.apache.org/jira/projects/SPARK/versions/12352848

 This release is using the release script of the tag v3.5.0-rc5.


 FAQ

 =

 How can I help test this release?

 =

 If you are a Spark user, you can help us test this release by taking

 an existing Spark workload and running on this release candidate, then

 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install

 the current RC and see if anything important breaks, in the Java/Scala

 you can add the staging repository to your projects resolvers and test

 with the RC (make sure to clean up the artifact cache before/after so

 you don't end up building with an out of date RC going forward).

 ===

 What should happen to JIRA tickets still targeting 3.5.0?

 ===

 The current list of open tickets targeted at 3.5.0 can be found at:

 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.5.0

 Committers should look at those and triage. Extremely important bug

 fixes, documentation, and API tweaks that impact compatibility should

 be worked on immediately. Everything else please retarget to an

 appropriate release.

 ==

 But my bug isn't fixed?

 ==

 In order to make timely releases, we will typically not hold the

 release unless the bug in question is a regression from the previous

 release. That being said, if there is something which is a regression

 that has not been correctly targeted please ping me or a committer to

 help target the issue.

 Thanks,

 Yuanjian Li

>>>


Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-31 Thread Wenchen Fan
Sorry for the last-minute bug report, but we found a regression in 3.5: the
SQL INSERT command without a column list fills missing columns with NULL,
while Spark 3.4 does not allow it. According to the SQL standard, this
shouldn't be allowed, so it is a regression in 3.5.
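
For illustration, a minimal sketch of the behavior described above (the table
and column names are made up, and the exact error message depends on the
version):

```
// Assume a two-column table:
spark.sql("CREATE TABLE t (a INT, b STRING) USING parquet")

// INSERT without a column list, supplying fewer values than columns:
spark.sql("INSERT INTO t VALUES (1)")
// Spark 3.4: fails because the number of values does not match the columns.
// Spark 3.5 before the fix: silently fills column `b` with NULL.
```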

The fix has been merged but one day after the RC3 cut:
https://github.com/apache/spark/pull/42393 . I'm -1 and let's include this
fix in 3.5.

Thanks,
Wenchen

On Thu, Aug 31, 2023 at 9:09 PM Ian Manning  wrote:

> +1 (non-binding)
>
> Using Spark Core, Spark SQL, Structured Streaming.
>
> On Tue, Aug 29, 2023 at 8:12 PM Yuanjian Li 
> wrote:
>
>> Please vote on releasing the following candidate(RC3) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc3 (commit
>> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1447
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc3.
>>
>>
>> FAQ
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the Java/Scala
>>
>> you can add the staging repository to your projects resolvers and test
>>
>> with the RC (make sure to clean up the artifact cache before/after so
>>
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open tickets targeted at 3.5.0 can be found at:
>>
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.0
>>
>> Committers should look at those and triage. Extremely important bug
>>
>> fixes, documentation, and API tweaks that impact compatibility should
>>
>> be worked on immediately. Everything else please retarget to an
>>
>> appropriate release.
>>
>> ==
>>
>> But my bug isn't fixed?
>>
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>>
>> release unless the bug in question is a regression from the previous
>>
>> release. That being said, if there is something which is a regression
>>
>> that has not been correctly targeted please ping me or a committer to
>>
>> help target the issue.
>>
>> Thanks,
>>
>> Yuanjian Li
>>
>


Re: Spark writing API

2023-08-17 Thread Wenchen Fan
I'm not quite sure if this hint is useful. People usually keep a buffer and
flush it when it's full, so that they can control the batch size of the
write no matter how many inputs they will get. E.g. if Spark hints to
you that there will be 1 GB of data, are you going to allocate a 1 GB buffer
for it?

Also note that there is not always a shuffle right before the write: if the
plan is shuffle read -> filter -> data source write, we don't know the final
output data size unless we know the selectivity of the filter.
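
A rough sketch of the buffer-and-flush pattern described above (the `Sink`
trait and the batch size here are illustrative only, not part of any Spark
API):

```
import scala.collection.mutable.ArrayBuffer

// Illustrative sink; a real writer would target a file, socket, or service.
trait Sink { def writeBatch(rows: Seq[String]): Unit }

// Flush-when-full: the writer controls its own batch size, so it does not
// need to know the total number of rows up front.
class BufferedWriter(sink: Sink, batchSize: Int = 10000) {
  private val buffer = new ArrayBuffer[String](batchSize)

  def write(row: String): Unit = {
    buffer += row
    if (buffer.size >= batchSize) flush()
  }

  def flush(): Unit = if (buffer.nonEmpty) {
    sink.writeBatch(buffer.toSeq)
    buffer.clear()
  }

  def close(): Unit = flush()
}
```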

On Thu, Aug 17, 2023 at 1:02 PM Andrew Melo  wrote:

> Hello Wenchen,
>
> On Wed, Aug 16, 2023 at 23:33 Wenchen Fan  wrote:
>
>> > is there a way to hint to the downstream users on the number of rows
>> expected to write?
>>
>> It will be very hard to do. Spark pipelines the execution (within shuffle
>> boundaries) and we can't predict the number of final output rows.
>>
>
> Perhaps I don't understand -- even in the case of multiple shuffles, you
> can assume that there is exactly one shuffle boundary before the write
> operation, and that shuffle boundary knows the number of input rows for
> that shuffle. That number of rows has to be, by construction, the upper
> bound on the number of rows that will be passed to the writer.
>
> If the writer can be hinted that bound then it can do something smart with
> allocating (memory or disk). By comparison, the current API just gives
> rows/batches one at a time, and in the case of off-heap allocation (like
> with arrow's off-heap storage), it's crazy inefficient to try and do the
> equivalent of realloc() to grow the buffer size.
>
> Thanks
> Andrew
>
>
>
>> On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran 
>> wrote:
>>
>>>
>>>
>>> On Thu, 1 Jun 2023 at 00:58, Andrew Melo  wrote:
>>>
>>>> Hi all
>>>>
>>>> I've been developing for some time a Spark DSv2 plugin "Laurelin" (
>>>> https://github.com/spark-root/laurelin
>>>> ) to read the ROOT (https://root.cern) file format (which is used in
>>>> high energy physics). I've recently presented my work in a conference (
>>>> https://indico.jlab.org/event/459/contributions/11603/).
>>>>
>>>>
>>> nice paper given the esoteric nature of HEP file formats.
>>>
>>> All of that to say,
>>>>
>>>> A) is there no reason that the builtin (eg parquet) data sources can't
>>>> consume the external APIs? It's hard to write a plugin that has to use a
>>>> specific API when you're competing with another source who gets access to
>>>> the internals directly.
>>>>
>>>> B) What is the Spark-approved API to code against for writes? There
>>>> is a mess of *ColumnWriter classes in the Java namespace, and while there
>>>> is no documentation, it's unclear which is preferred by the core (maybe
>>>> ArrowWriterColumnVector?). We can give a zero copy write if the API
>>>> describes it
>>>>
>>>
>>> There's a dangerous tendency for things that libraries need to be tagged
>>> private [spark], normally worked around by people putting their code into
>>> org.apache.spark packages. Really everyone who does that should try to get
>>> a longer term fix in, as well as that quick-and-effective workaround.
>>> Knowing where problems lie would be a good first step. spark sub-modules
>>> are probably a place to get insight into where those low-level internal
>>> operations are considered important, although many uses may be for historic
>>> "we wrote it that way a long time ago" reasons
>>>
>>>
>>>>
>>>> C) Putting aside everything above, is there a way to hint to the
>>>> downstream users on the number of rows expected to write? Any smart writer
>>>> will use off-heap memory to write to disk/memory, so the current API that
>>>> shoves rows in doesn't do the trick. You don't want to keep reallocating
>>>> buffers constantly
>>>>
>>>> D) what is Spark's plan to use arrow-based columnar data
>>>> representations? I see that there a lot of external efforts whose only
>>>> option is to inject themselves in the CLASSPATH. The regular DSv2 api is
>>>> already crippled for reads and for writes it's even worse. Is there a
>>>> commitment from the spark core to bring the API to parity? Or is instead is
>>>> it just a YMMV commitment
>>>>
>>>
>>> No idea, I'm afraid. I do think arrow makes a good format for
>>> processing, and it'd be interesting to see how well it actually works as a
>>> wire format to replace other things (e.g hive's protocol), especially on
>>> RDMA networks and the like. I'm not up to date with ongoing work there -if
>>> anyone has pointers that'd be interesting.
>>>
>>>>
>>>> Thanks!
>>>> Andrew
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> It's dark in this basement.
>>>>
>>> --
> It's dark in this basement.
>


Re: Spark writing API

2023-08-16 Thread Wenchen Fan
> is there a way to hint to the downstream users on the number of rows
expected to write?

It will be very hard to do. Spark pipelines the execution (within shuffle
boundaries) and we can't predict the number of final output rows.

On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran 
wrote:

>
>
> On Thu, 1 Jun 2023 at 00:58, Andrew Melo  wrote:
>
>> Hi all
>>
>> I've been developing for some time a Spark DSv2 plugin "Laurelin" (
>> https://github.com/spark-root/laurelin
>> ) to read the ROOT (https://root.cern) file format (which is used in
>> high energy physics). I've recently presented my work in a conference (
>> https://indico.jlab.org/event/459/contributions/11603/).
>>
>>
> nice paper given the esoteric nature of HEP file formats.
>
> All of that to say,
>>
>> A) is there no reason that the builtin (eg parquet) data sources can't
>> consume the external APIs? It's hard to write a plugin that has to use a
>> specific API when you're competing with another source who gets access to
>> the internals directly.
>>
>> B) What is the Spark-approved API to code against for writes? There is
>> a mess of *ColumnWriter classes in the Java namespace, and while there is
>> no documentation, it's unclear which is preferred by the core (maybe
>> ArrowWriterColumnVector?). We can give a zero copy write if the API
>> describes it
>>
>
> There's a dangerous tendency for things that libraries need to be tagged
> private [spark], normally worked around by people putting their code into
> org.apache.spark packages. Really everyone who does that should try to get
> a longer term fix in, as well as that quick-and-effective workaround.
> Knowing where problems lie would be a good first step. spark sub-modules
> are probably a place to get insight into where those low-level internal
> operations are considered important, although many uses may be for historic
> "we wrote it that way a long time ago" reasons
>
>
>>
>> C) Putting aside everything above, is there a way to hint to the
>> downstream users on the number of rows expected to write? Any smart writer
>> will use off-heap memory to write to disk/memory, so the current API that
>> shoves rows in doesn't do the trick. You don't want to keep reallocating
>> buffers constantly
>>
>> D) what is Spark's plan to use arrow-based columnar data representations?
>> I see that there a lot of external efforts whose only option is to inject
>> themselves in the CLASSPATH. The regular DSv2 api is already crippled for
>> reads and for writes it's even worse. Is there a commitment from the spark
>> core to bring the API to parity? Or is instead is it just a YMMV commitment
>>
>
> No idea, I'm afraid. I do think arrow makes a good format for processing,
> and it'd be interesting to see how well it actually works as a wire format
> to replace other things (e.g hive's protocol), especially on RDMA networks
> and the like. I'm not up to date with ongoing work there -if anyone has
> pointers that'd be interesting.
>
>>
>> Thanks!
>> Andrew
>>
>>
>>
>>
>>
>> --
>> It's dark in this basement.
>>
>


Re: What else could be removed in Spark 4?

2023-08-07 Thread Wenchen Fan
I think the principle is we should remove things that block us from
supporting new things like Java 21, or come with a significant
maintenance cost. If there is no benefit to removing deprecated APIs (just
to keep the codebase clean?), I'd prefer to leave them there and not bother.

On Tue, Aug 8, 2023 at 9:00 AM Jia Fan  wrote:

> Thanks Sean for opening this discussion.
>
> 1. I think dropping Scala 2.12 is a good option.
>
> 2. Personally, I think we should remove most methods that have been deprecated
> since 2.x/1.x unless there is no good replacement. The 3.x line has already
> served as a buffer, and I don't think it is good practice to use methods
> deprecated in 2.x on 4.x.
>
> 3. For Mesos, I think we should remove it from the docs first.
> 
>
> Jia Fan
>
>
>
> On Aug 8, 2023, at 05:47, Sean Owen  wrote:
>
> While we're noodling on the topic, what else might be worth removing in
> Spark 4?
>
> For example, looks like we're finally hitting problems supporting Java 8
> through 21 all at once, related to Scala 2.13.x updates. It would be
> reasonable to require Java 11, or even 17, as a baseline for the multi-year
> lifecycle of Spark 4.
>
> Dare I ask: drop Scala 2.12? supporting 2.12 / 2.13 / 3.0 might get hard
> otherwise.
>
> There was a good discussion about whether old deprecated methods should be
> removed. They can't be removed at other times, but that doesn't mean they all
> *should* be. createExternalTable was brought up as a first example. What
> deprecated methods are worth removing?
>
> There's Mesos support, long since deprecated, which seems like something
> to prune.
>
> Are there old Hive/Hadoop version combos we should just stop supporting?
>
>
>


Welcome two new Apache Spark committers

2023-08-06 Thread Wenchen Fan
Hi all,

The Spark PMC recently voted to add two new committers. Please join me in
welcoming them to their new role!

- Peter Toth (Spark SQL)
- Xiduo You (Spark SQL)

They have consistently contributed to the project and clearly demonstrated
their expertise. We are very excited to have them join as committers.


Re: [DISCUSS] SPIP: Python Data Source API

2023-06-20 Thread Wenchen Fan
In an ideal world, every data source you want to connect to already has a
Spark data source implementation (either v1 or v2), in which case this Python
API would be useless. But I feel it's common that people want to do quick data
exploration, and the target data system is not popular enough to have an
existing Spark data source implementation. It will be useful if people can
quickly implement a Spark data source in their favorite language, Python.

I'm +1 to this proposal, assuming that we will keep it simple and won't
copy all the complicated features we built in DS v2 to this new Python API.

On Tue, Jun 20, 2023 at 2:11 PM Maciej  wrote:

> Similarly to Jacek, I feel it fails to document an actual community need
> for such a feature.
>
> Currently, any data source implementation has the potential to benefit
> Spark users across all supported and third-party clients.  For generally
> available sources, this is advantageous for the whole Spark community and
> avoids creating 1st and 2nd-tier citizens. This is even more important with
> new officially supported languages being added through connect.
>
> Instead, we might rather document in detail the process of implementing a
> new source using current APIs and work towards easily extensible or
> customizable sources, in case there is such a need.
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>
> On 6/20/23 05:19, Hyukjin Kwon wrote:
>
> Actually I support this idea in a way that Python developers don't have to
> learn Scala to write their own source (and separate packaging).
> This is more crucial especially when you want to write a simple data
> source that interacts with the Python ecosystem.
>
> On Tue, 20 Jun 2023 at 03:08, Denny Lee  wrote:
>
>> Slightly biased, but per my conversations - this would be awesome to
>> have!
>>
>> On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari 
>> wrote:
>>
>>> I would definitely use it - is it's available :)
>>>
>>> On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,  wrote:
>>>
 Hi Allison and devs,

 Although I was against this idea at first sight (probably because I'm a
 Scala dev), I think it could work as long as there are people who'd be
 interested in such an API. Were there any? I'm just curious. I've seen no
 emails requesting it.

 I also doubt that Python devs would like to work on new data sources
 but support their wishes wholeheartedly :)

 Pozdrawiam,
 Jacek Laskowski
 
 "The Internals Of" Online Books 
 Follow me on https://twitter.com/jaceklaskowski

 


 On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
 
  wrote:

> Hi everyone,
>
> I would like to start a discussion on “Python Data Source API”.
>
> This proposal aims to introduce a simple API in Python for Data
> Sources. The idea is to enable Python developers to create data sources
> without having to learn Scala or deal with the complexities of the current
> data source APIs. The goal is to make a Python-based API that is simple 
> and
> easy to use, thus making Spark more accessible to the wider Python
> developer community. This proposed approach is based on the recently
> introduced Python user-defined table functions with extensions to support
> data sources.
>
> *SPIP Doc*:
> https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
>
> *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076
>
> Looking forward to your feedback.
>
> Thanks,
> Allison
>

>
>


Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
A DataFrame view stores the logical plan, while a SQL view stores the SQL
text. I don't think we can support this feature until we have a reliable way
to materialize logical plans.
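
For context, a short sketch of the two paths being contrasted (the view and
column names are arbitrary; this assumes a spark-shell style session):

```
import org.apache.spark.sql.functions.col

val df = spark.range(10).withColumn("doubled", col("id") * 2)

// Temporary view from a DataFrame: backed by the in-memory logical plan,
// gone when the session ends.
df.createOrReplaceTempView("tmp_doubled")

// Permanent view: the catalog stores SQL text, so today it must be
// expressed as a SQL statement.
spark.sql(
  "CREATE OR REPLACE VIEW doubled_v AS SELECT id, id * 2 AS doubled FROM range(10)")
```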

On Sun, Jun 4, 2023 at 10:31 PM Mich Talebzadeh 
wrote:

> Try sending it to dev@spark.apache.org (and join that group)
>
> You need to raise a JIRA for this request plus a related doc
>
>
> Example JIRA
>
> https://issues.apache.org/jira/browse/SPARK-42485
>
> and the related *Spark project improvement proposals (SPIP) *to be filled
> in
>
> https://spark.apache.org/improvement-proposals.html
>
>
> HTH
>
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 4 Jun 2023 at 12:38, keen  wrote:
>
>> Do Spark **devs** read this mailing list?
>> Is there another/a better way to make feature requests?
>> I tried in the past to write a mail to the dev mailing list but it did
>> not show at all.
>>
>> Cheers
>>
>> keen  schrieb am Do., 1. Juni 2023, 07:11:
>>
>>> Hi all,
>>> currently only *temporary* Spark Views can be created from a DataFrame
>>> (df.createOrReplaceTempView or df.createOrReplaceGlobalTempView).
>>>
>>> When I want a *permanent* Spark View I need to specify it via Spark SQL
>>> (CREATE VIEW AS SELECT ...).
>>>
>>> Sometimes it is easier to specify the desired logic of the View through
>>> Spark/PySpark DataFrame API.
>>> Therefore, I'd like to suggest implementing a new PySpark method that
>>> allows creating a *permanent* Spark View from a DataFrame
>>> (df.createOrReplaceView).
>>>
>>> see also:
>>>
>>> https://community.databricks.com/s/question/0D53f1PANVgCAP/is-there-a-way-to-create-a-nontemporary-spark-view-with-pyspark
>>>
>>> Regards
>>> Martin
>>>
>>


Re: Apache Spark 3.4.1 Release?

2023-06-09 Thread Wenchen Fan
+1

On Fri, Jun 9, 2023 at 8:52 PM Xinrong Meng  wrote:

> +1. Thank you Dongjoon!
>
> Thanks,
>
> Xinrong Meng
>
> On Fri, Jun 9, 2023 at 5:22 AM, Mridul Muralidharan wrote:
>
>>
>> +1, thanks Dongjoon !
>>
>> Regards,
>> Mridul
>>
>> On Thu, Jun 8, 2023 at 7:16 PM Jia Fan 
>> wrote:
>>
>>> +1
>>>
>>> 
>>>
>>>
>>> Jia Fan
>>>
>>>
>>>
>>> On Jun 9, 2023, at 08:00, Yuming Wang  wrote:
>>>
>>> +1.
>>>
>>> On Fri, Jun 9, 2023 at 7:14 AM Chao Sun  wrote:
>>>
 +1 too

 On Thu, Jun 8, 2023 at 2:34 PM kazuyuki tanimura
  wrote:
 >
 > +1 (non-binding), Thank you Dongjoon
 >
 > Kazu
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>


Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-11 Thread Wenchen Fan
+1

On Tue, Apr 11, 2023 at 9:57 AM Yuming Wang  wrote:

> +1.
>
> On Tue, Apr 11, 2023 at 9:14 AM Yikun Jiang  wrote:
>
>> +1 (non-binding)
>>
>> Also ran the docker image related test (signatures/standalone/k8s) with
>> rc7: https://github.com/apache/spark-docker/pull/32
>>
>> Regards,
>> Yikun
>>
>>
>> On Tue, Apr 11, 2023 at 4:44 AM Jacek Laskowski  wrote:
>>
>>> +1
>>>
>>> * Built fine with Scala 2.13
>>> and -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano
>>> * Ran some demos on Java 17
>>> * Mac mini / Apple M2 Pro / Ventura 13.3.1
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> "The Internals Of" Online Books 
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>> 
>>>
>>>
>>> On Sat, Apr 8, 2023 at 1:30 AM Xinrong Meng 
>>> wrote:
>>>
 Please vote on releasing the following candidate(RC7) as Apache Spark
 version 3.4.0.

 The vote is open until 11:59pm Pacific time *April 12th* and passes if
 a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.4.0-rc7 (commit
 87a5442f7ed96b11051d8a9333476d080054e5a0):
 https://github.com/apache/spark/tree/v3.4.0-rc7

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1441

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/

 The list of bug fixes going into 3.4.0 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12351465

 This release is using the release script of the tag v3.4.0-rc7.


 FAQ

 =
 How can I help test this release?
 =
 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks, in the Java/Scala
 you can add the staging repository to your projects resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out of date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 3.4.0?
 ===
 The current list of open tickets targeted at 3.4.0 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.4.0

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==
 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted please ping me or a committer to
 help target the issue.

 Thanks,
 Xinrong Meng

>>>


Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-11 Thread Wenchen Fan
+1

On Tue, Apr 11, 2023 at 10:09 AM Hyukjin Kwon  wrote:

> +1
>
> On Tue, 11 Apr 2023 at 11:04, Ruifeng Zheng  wrote:
>
>> +1 (non-binding)
>>
>> Thank you for driving this release!
>>
>> --
>> Ruifeng  Zheng
>> ruife...@foxmail.com
>>
>> 
>>
>>
>>
>> -- Original --
>> *From:* "Yuming Wang" ;
>> *Date:* Tue, Apr 11, 2023 09:56 AM
>> *To:* "Mridul Muralidharan";
>> *Cc:* "huaxin gao";"Chao Sun"> >;"yangjie01";"Dongjoon Hyun";"Sean
>> Owen";"dev@spark.apache.org";
>> *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)
>>
>> +1.
>>
>> On Tue, Apr 11, 2023 at 12:17 AM Mridul Muralidharan 
>> wrote:
>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Mon, Apr 10, 2023 at 10:34 AM huaxin gao 
>>> wrote:
>>>
 +1

 On Mon, Apr 10, 2023 at 8:17 AM Chao Sun  wrote:

> +1 (non-binding)
>
> On Mon, Apr 10, 2023 at 7:07 AM yangjie01  wrote:
>
>> +1 (non-binding)
>>
>>
>>
>> *From:* Sean Owen 
>> *Date:* Monday, April 10, 2023, 21:19
>> *To:* Dongjoon Hyun 
>> *Cc:* "dev@spark.apache.org" 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)
>>
>>
>>
>> +1 from me
>>
>>
>>
>> On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun 
>> wrote:
>>
>> I'll start with my +1.
>>
>> I verified the checksum, signatures of the artifacts, and
>> documentations.
>> Also, ran the tests with YARN and K8s modules.
>>
>> Dongjoon.
>>
>> On 2023/04/09 23:46:10 Dongjoon Hyun wrote:
>> > Please vote on releasing the following candidate as Apache Spark
>> version
>> > 3.2.4.
>> >
>> > The vote is open until April 13th 1AM (PST) and passes if a
>> majority +1 PMC
>> > votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.2.4
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> https://spark.apache.org/
>> 
>> >
>> > The tag to be voted on is v3.2.4-rc1 (commit
>> > 0ae10ac18298d1792828f1d59b652ef17462d76e)
>> > https://github.com/apache/spark/tree/v3.2.4-rc1
>> 
>> >
>> > The release files, including signatures, digests, etc. can be found
>> at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-bin/
>> 
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>> >
>> > The staging repository for this release can be found at:
>> >
>> https://repository.apache.org/content/repositories/orgapachespark-1442/
>> 
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-docs/
>> 
>> >
>> > The list of bug fixes going into 3.2.4 can be found at the
>> following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12352607
>> 
>> >
>> > This release is using the release script of the tag v3.2.4-rc1.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate,
>> then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and
>> install
>> > the current RC and see if anything important breaks, in the

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-03 Thread Wenchen Fan
Sorry for the last-minute change, but we found two wrong behaviors and want
to fix them before the release:

https://github.com/apache/spark/pull/40641
We missed a corner case when the input index for `array_insert` is 0. It
should fail as 0 is an invalid index.
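
For illustration, the corner case in question (a sketch; the exact error class
and message depend on the final fix):

```
// array_insert indices are 1-based (negative values count from the end),
// so 0 is not a valid position.
spark.sql("SELECT array_insert(array(1, 2, 3), 0, 99)").show()
// Before the fix this did not fail as it should; after the fix it raises an
// "invalid index" error.
```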

https://github.com/apache/spark/pull/40623
We found some usability issues with a new API and need to change the API to
fix it. If people have concerns we can also remove the new API entirely.

Thus I'm -1 on this RC. I'll merge these 2 PRs today if there are no objections.

Thanks,
Wenchen

On Tue, Apr 4, 2023 at 3:47 AM L. C. Hsieh  wrote:

> +1
>
> Thanks Xinrong.
>
> On Mon, Apr 3, 2023 at 12:35 PM Dongjoon Hyun 
> wrote:
> >
> > +1
> >
> > I also verified that RC5 has SBOM artifacts.
> >
> >
> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.12/3.4.0/spark-core_2.12-3.4.0-cyclonedx.json
> >
> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.13/3.4.0/spark-core_2.13-3.4.0-cyclonedx.json
> >
> > Thanks,
> > Dongjoon.
> >
> >
> >
> > On Mon, Apr 3, 2023 at 1:57 AM yangjie01  wrote:
> >>
> >> +1, checked Java 17 + Scala 2.13 + Python 3.10.10.
> >>
> >>
> >>
> >> From: Herman van Hovell 
> >> Date: Friday, March 31, 2023, 12:12
> >> To: Sean Owen 
> >> Cc: Xinrong Meng , dev 
> >> Subject: Re: [VOTE] Release Apache Spark 3.4.0 (RC5)
> >>
> >>
> >>
> >> +1
> >>
> >>
> >>
> >> On Thu, Mar 30, 2023 at 11:05 PM Sean Owen  wrote:
> >>
> >> +1 same result from me as last time.
> >>
> >>
> >>
> >> On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng 
> wrote:
> >>
> >> Please vote on releasing the following candidate(RC5) as Apache Spark
> version 3.4.0.
> >>
> >> The vote is open until 11:59pm Pacific time April 4th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 3.4.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v3.4.0-rc5 (commit
> f39ad617d32a671e120464e4a75986241d72c487):
> >> https://github.com/apache/spark/tree/v3.4.0-rc5
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1439
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
> >>
> >> The list of bug fixes going into 3.4.0 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12351465
> >>
> >> This release is using the release script of the tag v3.4.0-rc5.
> >>
> >>
> >>
> >>
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks, in the Java/Scala
> >> you can add the staging repository to your projects resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out of date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 3.4.0?
> >> ===
> >> The current list of open tickets targeted at 3.4.0 can be found at:
> >> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Xinrong Meng
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Time for release v3.3.2

2023-01-31 Thread Wenchen Fan
+1, thanks!

On Tue, Jan 31, 2023 at 3:17 PM Maxim Gekk
 wrote:

> +1
>
> On Tue, Jan 31, 2023 at 10:12 AM John Zhuge  wrote:
>
>> +1 Thanks Liang-Chi for driving the release!
>>
>> On Mon, Jan 30, 2023 at 10:26 PM Yuming Wang  wrote:
>>
>>> +1
>>>
>>> On Tue, Jan 31, 2023 at 12:18 PM yangjie01  wrote:
>>>
 +1 Thanks Liang-Chi!



 YangJie



 *From:* huaxin gao 
 *Date:* Tuesday, January 31, 2023, 10:03
 *To:* Dongjoon Hyun 
 *Cc:* Hyukjin Kwon , Chao Sun <
 sunc...@apache.org>, "L. C. Hsieh" , Spark dev list <
 dev@spark.apache.org>
 *Subject:* Re: Time for release v3.3.2



 +1 Thanks Liang-Chi!



 On Mon, Jan 30, 2023 at 6:01 PM Dongjoon Hyun 
 wrote:

 +1



 Thank you so much, Liang-Chi.

 3.3.2 release will help 3.4.0 release too because they share many bug
 fixes.



 Dongjoon





 On Mon, Jan 30, 2023 at 5:56 PM Hyukjin Kwon 
 wrote:

 +100!



 On Tue, 31 Jan 2023 at 10:54, Chao Sun  wrote:

 +1, thanks Liang-Chi for volunteering!

 Chao

 On Mon, Jan 30, 2023 at 5:51 PM L. C. Hsieh  wrote:
 >
 > Hi Spark devs,
 >
 > As you know, it has been 4 months since Spark 3.3.1 was released on
 > 2022/10, it seems a good time to think about next maintenance release,
 > i.e. Spark 3.3.2.
 >
 > I'm thinking of the release of Spark 3.3.2 this Feb (2023/02).
 >
 > What do you think?
 >
 > I am willing to volunteer for Spark 3.3.2 if there is consensus about
 > this maintenance release.
 >
 > Thank you.
 >
 > -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>
>> --
>> John Zhuge
>>
>


Re: [VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-12-01 Thread Wenchen Fan
+1

On Thu, Dec 1, 2022 at 12:31 PM Shixiong Zhu  wrote:

> +1
>
>
> On Wed, Nov 30, 2022 at 8:04 PM Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Thu, 1 Dec 2022 at 12:39, Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Wed, Nov 30, 2022 at 8:55 PM Xingbo Jiang 
>>> wrote:
>>>
 +1

 On Wed, Nov 30, 2022 at 5:59 PM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Starting with +1 from me.
>
> On Thu, Dec 1, 2022 at 10:54 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Asynchronous Offset Management
>> in Structured Streaming.
>>
>> The high level summary of the SPIP is that we propose a couple of
>> improvements to offset management in microbatch execution to lower
>> processing latency, which would help certain types of workloads.
>>
>> References:
>>
>>- JIRA ticket 
>>- SPIP doc
>>
>> 
>>- Discussion thread
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>> Jungtaek Lim (HeartSaVioR)
>>
>


Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Wenchen Fan
+1 to improve the widely used micro-batch mode first.

On Thu, Dec 1, 2022 at 8:49 AM Hyukjin Kwon  wrote:

> +1
>
> On Thu, 1 Dec 2022 at 08:10, Shixiong Zhu  wrote:
>
>> +1
>>
>> This is exciting. I agree with Jerry that this SPIP and continuous
>> processing are orthogonal. This SPIP itself would be a great improvement
>> and impact most Structured Streaming users.
>>
>> Best Regards,
>> Shixiong
>>
>>
>> On Wed, Nov 30, 2022 at 6:57 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Thanks for all the clarifications and details Jerry, Jungtaek :-)
>>> This looks like an exciting improvement to Structured Streaming -
>>> looking forward to it becoming part of Apache Spark !
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Mon, Nov 28, 2022 at 8:40 PM Jerry Peng 
>>> wrote:
>>>
 Hi all,

 I will add my two cents.  Improving the Microbatch execution engine
 does not prevent us from working/improving on the continuous execution
 engine in the future.  These are orthogonal issues.  This new mode I am
 proposing in the microbatch execution engine intends to lower latency of
 this execution engine that most people use today.  We can view it as an
 incremental improvement on the existing engine. I see the continuous
 execution engine as a partially completed rewrite of Spark Streaming that
 may serve as the "future" engine powering Spark Streaming. Improving the
 "current" engine does not mean we cannot work on a "future" engine.  These
 two are not mutually exclusive. I would like to focus the discussion on the
 merits of this feature in regards to the current micro-batch execution
 engine and not a discussion on the future of continuous execution engine.

 Best,

 Jerry


 On Wed, Nov 23, 2022 at 3:17 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Hi Mridul,
>
> I'd like to make clear to avoid any misunderstanding - the decision
> was not led by me. (I'm just a one of engineers in the team. Not even TL.)
> As you see the direction, there was an internal consensus to not revisit
> the continuous mode. There are various reasons, which I think we know
> already. You seem to remember I have raised concerns about continuous 
> mode,
> but have you indicated that it was even over 2 years ago? I still see no
> traction around the project. The main reason I abandoned the discussion 
> was
> due to promising effort on integrating push based shuffle into continuous
> mode to achieve shuffle, but no effort has been made so far.
>
> The goal of this SPIP is to have an alternative approach dealing with
> same workload, given that we no longer have confidence of success of
> continuous mode. But I also want to make clear that deprecating and
> eventually retiring continuous mode is not a goal of this project. If that
> happens eventually, that would be a side-effect. Someone may have concerns
> that we have two different projects aiming for similar thing, but I'd
> rather see both projects having competition. If anyone willing to improve
> continuous mode can start making the effort right now. This SPIP does not
> block it.
>
>
> On Wed, Nov 23, 2022 at 5:29 PM Mridul Muralidharan 
> wrote:
>
>>
>> Hi Jungtaek,
>>
>>   Given the goal of the SPIP is reducing latency for stateless apps,
>> and should reasonably fit continuous mode design goals, it feels odd to 
>> not
>> support it fin the proposal.
>>
>> I know you have raised concerns about continuous mode in past as well
>> in dev@ list, and we are further ignoring it in this proposal (and
>> possibly other enhancements in past few releases).
>>
>> Do you want to revisit the discussion to support it and propose a
>> vote on that ? And move it to deprecated ?
>>
>> I am much more comfortable not supporting this SPIP for CM if it was
>> deprecated.
>>
>> Thoughts ?
>>
>> Regards,
>> Mridul
>>
>>
>>
>>
>> On Wed, Nov 23, 2022 at 1:16 AM Jerry Peng <
>> jerry.boyang.p...@gmail.com> wrote:
>>
>>> Jungtaek,
>>>
>>> Thanks for taking up the role to shepard this SPIP!  Thank you for
>>> also chiming in on your thoughts concerning the continuous mode!
>>>
>>> Best,
>>>
>>> Jerry
>>>
>>> On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Just FYI, I'm shepherding this SPIP project.

 I think the major meta question would be, "why don't we spend
 effort on continuous mode rather than initiating another feature 
 aiming for
 the same workload?". Jerry already updated the doc to answer the 
 question,
 but I can also share my thoughts about it.

 I feel like the current "continuous mode" is a niche 

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Wenchen Fan
Thanks, Chao!

On Wed, Nov 30, 2022 at 1:33 AM Chao Sun  wrote:

> We are happy to announce the availability of Apache Spark 3.2.3!
>
> Spark 3.2.3 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend all 3.2 users to upgrade to this stable release.
>
> To download Spark 3.2.3, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-3.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Chao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Wenchen Fan
+1, I'm looking forward to it!

On Thu, Nov 17, 2022 at 9:44 AM Ye Zhou  wrote:

> +1 (non-binding)
> Thanks for proposing this improvement to SHS, it resolves the main
> performance issue within SHS.
>
> On Wed, Nov 16, 2022 at 1:15 PM Jungtaek Lim 
> wrote:
>
>> +1
>>
>> Nice to see the chance for the driver to reduce resource usage and increase
>> stability, especially given that the driver is a SPOF. It's even promising
>> to have a future plan to pre-bake the kvstore for SHS from the driver.
>>
>> Thanks for driving the effort, Gengliang!
>>
>> On Thu, Nov 17, 2022 at 5:32 AM Chris Nauroth 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Gengliang, thank you for the SPIP.
>>>
>>> Chris Nauroth
>>>
>>>
>>> On Wed, Nov 16, 2022 at 4:27 AM Maciej  wrote:
>>>
 +1

 On 11/16/22 13:19, Yuming Wang wrote:
 > +1, non-binding
 >
 > On Wed, Nov 16, 2022 at 8:12 PM Yang,Jie(INF) wrote:
 >
 > +1, non-binding
 >
 >
 > Yang Jie
 >
 >
 > *From:* Mridul Muralidharan 
 > *Date:* Wednesday, November 16, 2022, 17:35
 > *To:* Kent Yao 
 > *Cc:* Gengliang Wang , dev 
 > *Subject:* Re: [VOTE][SPIP] Better Spark UI scalability and Driver
 > stability for large applications
 >
 >
 >
 > +1
 >
 >
 > Would be great to see history server performance improvements and
 > lower resource utilization at driver !
 >
 >
 > Regards,
 >
 > Mridul 
 >
 >
 > On Wed, Nov 16, 2022 at 2:38 AM Kent Yao  wrote:
 >
 > +1, non-binding
 >
 > On Wed, Nov 16, 2022 at 16:36, Gengliang Wang  wrote:
 > >
 > > Hi all,
 > >
 > > I’d like to start a vote for SPIP: "Better Spark UI
 scalability and Driver stability for large applications"
 > >
 > > The goal of the SPIP is to improve the Driver's stability
 by supporting storing Spark's UI data on RocksDB. Furthermore, to fasten
 the read and write operations on RocksDB, it introduces a new Protobuf
 serializer.
 > >
 > > Please also refer to the following:
 > >
 > > Previous discussion in the dev mailing list: [DISCUSS]
 SPIP: Better Spark UI scalability and Driver stability for large
 applications
 > > Design Doc: Better Spark UI scalability and Driver
 stability for large applications
 > > JIRA: SPARK-41053
 > >
 > >
 > > Please vote on the SPIP for the next 72 hours:
 > >
 > > [ ] +1: Accept the proposal as an official SPIP
 > > [ ] +0
 > > [ ] -1: I don’t think this is a good idea because …
 > >
 > > Kind Regards,
 > > Gengliang
 >
 >
  -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 > 
 >

 --
 Best regards,
 Maciej Szymkiewicz

 Web: https://zero323.net
 PGP: A30CEF0C31A501EC


>
> --
>
> *Zhou, Ye  **周晔*
>


Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-16 Thread Wenchen Fan
+1

On Thu, Nov 17, 2022 at 10:20 AM Yang,Jie(INF)  wrote:

> +1,non-binding
>
>
>
> The test combination of Java 11 + Scala 2.12 and Java 11 + Scala 2.13 has
> passed.
>
>
>
> Yang Jie
>
>
>
> *From:* Chris Nauroth 
> *Date:* Thursday, November 17, 2022, 04:27
> *To:* Yuming Wang 
> *Cc:* "Yang,Jie(INF)" , Dongjoon Hyun <
> dongjoon.h...@gmail.com>, huaxin gao , "L. C.
> Hsieh" , Chao Sun , dev <
> dev@spark.apache.org>
> *Subject:* Re: [VOTE] Release Spark 3.2.3 (RC1)
>
>
>
> +1 (non-binding)
>
> * Verified all checksums.
> * Verified all signatures.
> * Built from source, with multiple profiles, to full success, for Java 11
> and Scala 2.12:
> * build/mvn -Phadoop-3.2 -Phadoop-cloud -Phive-2.3 -Phive-thriftserver
> -Pkubernetes -Pscala-2.12 -Psparkr -Pyarn -DskipTests clean package
> * Tests passed.
> * Ran several examples successfully:
> * bin/spark-submit --class org.apache.spark.examples.SparkPi
> examples/jars/spark-examples_2.12-3.2.3.jar
> * bin/spark-submit --class
> org.apache.spark.examples.sql.hive.SparkHiveExample
> examples/jars/spark-examples_2.12-3.2.3.jar
> * bin/spark-submit
> examples/src/main/python/streaming/network_wordcount.py localhost 
>
>
>
> Chao, thank you for preparing the release.
>
>
>
> Chris Nauroth
>
>
>
>
>
> On Wed, Nov 16, 2022 at 5:22 AM Yuming Wang  wrote:
>
> +1
>
>
>
> On Wed, Nov 16, 2022 at 2:28 PM Yang,Jie(INF)  wrote:
>
> I switched from Scala 2.13 to Scala 2.12 today. The test is still in progress
> and it has not hung.
>
>
>
> Yang Jie
>
>
>
> *From:* Dongjoon Hyun 
> *Date:* Wednesday, November 16, 2022, 01:17
> *To:* "Yang,Jie(INF)" 
> *Cc:* huaxin gao , "L. C. Hsieh" <
> vii...@gmail.com>, Chao Sun , dev <
> dev@spark.apache.org>
> *Subject:* Re: [VOTE] Release Spark 3.2.3 (RC1)
>
>
>
> Did you hit that in Scala 2.12, too?
>
>
>
> Dongjoon.
>
>
>
> On Tue, Nov 15, 2022 at 4:36 AM Yang,Jie(INF)  wrote:
>
> Hi, all
>
>
>
> I test v3.2.3 with following command:
>
>
>
> ```
>
> dev/change-scala-version.sh 2.13
>
> build/mvn clean install -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> -Pscala-2.13 -fn
>
> ```
>
>
>
> The testing environment is:
>
>
>
> OS: CentOS 6u3 Final
>
> Java: zulu 11.0.17
>
> Python: 3.9.7
>
> Scala: 2.13
>
>
>
> The above test command has been executed twice, and both times it hung at the
> following stack:
>
>
>
> ```
>
> "ScalaTest-main-running-JoinSuite" #1 prio=5 os_prio=0 cpu=312870.06ms
> elapsed=1552.65s tid=0x7f2ddc02d000 nid=0x7132 waiting on condition
> [0x7f2de3929000]
>
>java.lang.Thread.State: WAITING (parking)
>
>at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
>
>- parking to wait for  <0x000790d00050> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>
>at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17
> /LockSupport.java:194)
>
>at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.17
> /AbstractQueuedSynchronizer.java:2081)
>
>at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.17
> /LinkedBlockingQueue.java:433)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:275)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$Lambda$9429/0x000802269840.apply(Unknown
> Source)
>
>at
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:228)
>
>- locked <0x000790d00208> (a java.lang.Object)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:370)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecute(AdaptiveSparkPlanExec.scala:355)
>
>at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>
>at
> org.apache.spark.sql.execution.SparkPlan$$Lambda$8573/0x000801f99c40.apply(Unknown
> Source)
>
>at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>
>at
> org.apache.spark.sql.execution.SparkPlan$$Lambda$8574/0x000801f9a040.apply(Unknown
> Source)
>
>at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>
>at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
>
>at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
>
>at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:172)
>
>- locked <0x000790d00218> (a
> org.apache.spark.sql.execution.QueryExecution)
>
>at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:171)
>
>

Re: [DISCUSS] SPIP: Better Spark UI scalability and Driver stability for large applications

2022-11-15 Thread Wenchen Fan
This looks great! UI stability/scalability has been a pain point for a long
time.

On Sat, Nov 12, 2022 at 5:24 AM Gengliang Wang  wrote:

> Hi Everyone,
>
> I want to discuss the "Better Spark UI scalability and Driver stability
> for large applications" proposal. Please find the links below:
>
> *JIRA* - https://issues.apache.org/jira/browse/SPARK-41053
> *SPIP Document* -
> https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing
>
> *Excerpt from the document: *
>
> After SPARK-18085 ,
> the Spark history server (SHS) becomes more scalable for processing large
> applications by supporting a persistent KV-store (LevelDB/RocksDB) as the
> storage layer.
>
> As for the live Spark UI, all the data is still stored in memory, which
> can bring memory pressures to the Spark driver for large applications.
>
> For better Spark UI scalability and Driver stability, I propose to
>
>-
>
>Support storing all the UI data in a persistent KV store.
>RocksDB/LevelDB provides low memory overhead. Their write/read performance
>is fast enough to serve the workloads of live UI. Spark UI can retain more
>data with the new backend, while SHS can leverage it to fasten its startup.
>- Support a new Protobuf serializer for all the UI data. The new
>serializer is supposed to be faster, according to benchmarks. It will be
>the default serializer for the persistent KV store of live UI.
>
>
>
>
> I appreciate any suggestions you can provide,
> Gengliang
>


Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-18 Thread Wenchen Fan
+1

On Wed, Oct 19, 2022 at 4:59 AM Chao Sun  wrote:

> +1. Thanks Yuming!
>
> Chao
>
> On Tue, Oct 18, 2022 at 1:18 PM Thomas graves  wrote:
> >
> > +1. Ran internal test suite.
> >
> > Tom
> >
> > On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang  wrote:
> > >
> > > Please vote on releasing the following candidate as Apache Spark
> version 3.3.1.
> > >
> > > The vote is open until 11:59pm Pacific time October 21th and passes if
> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> > >
> > > [ ] +1 Release this package as Apache Spark 3.3.1
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see https://spark.apache.org
> > >
> > > The tag to be voted on is v3.3.1-rc4 (commit
> fbbcf9434ac070dd4ced4fb9efe32899c6db12a9):
> > > https://github.com/apache/spark/tree/v3.3.1-rc4
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin
> > >
> > > Signatures used for Spark RCs can be found in this file:
> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >
> > > The staging repository for this release can be found at:
> > > https://repository.apache.org/content/repositories/orgapachespark-1430
> > >
> > > The documentation corresponding to this release can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-docs
> > >
> > > The list of bug fixes going into 3.3.1 can be found at the following
> URL:
> > > https://s.apache.org/ttgz6
> > >
> > > This release is using the release script of the tag v3.3.1-rc4.
> > >
> > >
> > > FAQ
> > >
> > > ==
> > > What happened to v3.3.1-rc3?
> > > ==
> > > A performance regression(SPARK-40703) was found after tagging
> v3.3.1-rc3, which the Iceberg community hopes Spark 3.3.1 could fix.
> > > So we skipped the vote on v3.3.1-rc3.
> > >
> > > =
> > > How can I help test this release?
> > > =
> > > If you are a Spark user, you can help us test this release by taking
> > > an existing Spark workload and running on this release candidate, then
> > > reporting any regressions.
> > >
> > > If you're working in PySpark you can set up a virtual env and install
> > > the current RC and see if anything important breaks, in the Java/Scala
> > > you can add the staging repository to your projects resolvers and test
> > > with the RC (make sure to clean up the artifact cache before/after so
> > > you don't end up building with an out of date RC going forward).
> > >
> > > ===
> > > What should happen to JIRA tickets still targeting 3.3.1?
> > > ===
> > > The current list of open tickets targeted at 3.3.1 can be found at:
> > > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.1
> > >
> > > Committers should look at those and triage. Extremely important bug
> > > fixes, documentation, and API tweaks that impact compatibility should
> > > be worked on immediately. Everything else please retarget to an
> > > appropriate release.
> > >
> > > ==
> > > But my bug isn't fixed?
> > > ==
> > > In order to make timely releases, we will typically not hold the
> > > release unless the bug in question is a regression from the previous
> > > release. That being said, if there is something which is a regression
> > > that has not been correctly targeted please ping me or a committer to
> > > help target the issue.
> > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-19 Thread Wenchen Fan
+1

On Mon, Sep 19, 2022 at 2:59 PM Yang,Jie(INF)  wrote:

> +1 (non-binding)
>
>
>
> Yang Jie
> --
> *From:* Yikun Jiang 
> *Sent:* Monday, September 19, 2022, 14:23:14
> *To:* Denny Lee
> *Cc:* bo zhaobo; Yuming Wang; Kent Yao; Gengliang Wang; Hyukjin Kwon;
> dev; zrf
> *Subject:* Re: [DISCUSS] SPIP: Support Docker Official Image for Spark
>
> Thanks for your support!  @all
>
> > Count me in to help as well, eh?! :)
>
> @Denny Sure, It would be great to have your help! I'm going to create a
> JIRA and TASKS if the SPIP vote passes.
>
>
> On Mon, Sep 19, 2022 at 10:34 AM Denny Lee  wrote:
>
>> +1 (non-binding).
>>
>> This is a great idea and we should definitely do this.  Count me in to
>> help as well, eh?! :)
>>
>> On Sun, Sep 18, 2022 at 7:24 PM bo zhaobo 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> This will bring the good experience to customers. So excited about this.
>>> ;-)
>>>
>>> On Mon, Sep 19, 2022 at 10:18, Yuming Wang  wrote:
>>>
 +1.

 On Mon, Sep 19, 2022 at 9:44 AM Kent Yao  wrote:

> +1
>
> On Mon, Sep 19, 2022 at 09:23, Gengliang Wang  wrote:
> >
> > +1, thanks for the work!
> >
> > On Sun, Sep 18, 2022 at 6:20 PM Hyukjin Kwon 
> wrote:
> >>
> >> +1
> >>
> >> On Mon, 19 Sept 2022 at 09:15, Yikun Jiang 
> wrote:
> >>>
> >>> Hi, all
> >>>
> >>>
> >>> I would like to start the discussion for supporting Docker
> Official Image for Spark.
> >>>
> >>>
> >>> This SPIP proposes adding a Docker Official Image (DOI) to ensure
> the Spark Docker images meet the quality standards for Docker images, and to
> provide these Docker images for users who want to use Apache Spark via a
> Docker image.
> >>>
> >>>
> >>> There are also several Apache projects that release Docker
> Official Images, such as Flink, Storm, Solr, ZooKeeper, and httpd (with 50M+
> to 1B+ downloads each). The huge download statistics show the real demand
> from users, and the support of other Apache projects suggests that we should
> be able to do it as well.
> >>>
> >>>
> >>> After support:
> >>>
> >>> The Dockerfile will still be maintained by the Apache Spark
> community and reviewed by Docker.
> >>>
> >>> The images will be maintained by the Docker community to ensure
> they meet the Docker community's quality standards for Docker images.
> >>>
> >>>
> >>> It will also reduce the Apache Spark community's extra Docker image
> maintenance effort (such as frequent rebuilding and image security
> updates).
> >>>
> >>>
> >>> See more in SPIP DOC:
> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o
> 
> >>>
> >>>
> >>> cc: Ruifeng (co-author) and Hyukjin (shepherd)
> >>>
> >>>
> >>> Regards,
> >>> Yikun
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Non-deterministic function duplicated in final Spark plan

2022-08-01 Thread Wenchen Fan
This is a hard one. Spark duplicates the join child plan if it's a
self-join because Spark does not support diamond-shaped query plans. It seems
the only option here is to write the join child's output to a Parquet table
(or materialize it via a shuffle) and read it back.
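
A rough sketch of that workaround in Scala (the path and the generated column
name below are placeholders, not taken from the original plan):

import org.apache.spark.sql.functions.rand

// Materialize the child plan that contains the non-deterministic column ...
val withId = spark.read.parquet("input.parquet").withColumn("nd_id", rand())
withId.write.mode("overwrite").parquet("/tmp/materialized")

// ... then read it back and self-join, so both sides see identical values.
val materialized = spark.read.parquet("/tmp/materialized")
val joined = materialized.as("l").join(materialized.as("r"), Seq("nd_id"))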

On Mon, Aug 1, 2022 at 4:46 PM Enrico Minack  wrote:

> Hi all,
>
> this holds for any non-deterministic function, e.g. rand(), uuid(),
> spark_partition_id(), ... Details below.
>
> To my knowledge, marking a function non-deterministic is to prevent it
> from being called multiple times per row, which is violated here.
>
>    read parquet             read parquet
>         |                        |
>  NON-DETERMINIST-FUNC     NON-DETERMINIST-FUNC
>         |                        |
>          \                      /
>           \                    /
>         join on non-deterministic id
>                    |
>                   ...
>
> What I would expect is
>
>           read parquet
>                |
>       NON-DETERMINIST-FUNC
>             /     \
>            /       \
>            \       /
>             \     /
>     join on non-deterministic id
>                |
>               ...
>
>
> Any thoughts?
>
> Enrico
>
>
>
> On 27.06.22 at 11:21, Enrico Minack wrote:
>
> Hi devs,
>
> SQL function spark_partition_id() provides the partition number for each
> row. Using a Dataset that is enriched with that number in a self join may
> produce an unexpected result.
>
> What I am doing:
>
> After calling spark_partition_id() I am counting rows per partition, then
> joining those counts back to the dataset on the partition id.
>
> *Adding some downstream operations, the join has some mismatching
> partition ids:*
>
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.functions._
>
> val cols = 10
> spark.range(1, 100, 1,
> 100).write.mode(SaveMode.Overwrite).parquet("example-data.parquet")
> val df1 = spark.read.parquet("example-data.parquet")
> val df2 = df1.select($"id" +: 1.to(cols).map(column =>
> rand().as(column.toString)): _*)
> val df3 = df2.orderBy($"id").withColumn("partition_id",
> spark_partition_id())
> val df4 = df3.groupBy($"partition_id").count
> val df5 = df3.join(df4, Seq("partition_id"), "full_outer")
> val df6 = df5.groupBy($"partition_id").agg(
>   sum($"count").as("rows in partition"),
>   1.to(cols).map(column => sum(col(column.toString))).reduce(_ + _).as("aggregates"))
> df6.orderBy($"partition_id").show
>
> +------------+-----------------+------------------+
> |partition_id|rows in partition|        aggregates|
> +------------+-----------------+------------------+
> |           0|      11937639870|298035.81962027296|
> |           1|      11986710088| 300833.1871222975|
> |           2|      12060904752| 301299.8924913869|
> |           3|      12049668900| 298595.2881827633|
> |           4|      11837825400|298083.34705855395|
> |           5|             null|  301525.597300101|
> |           6|             null|  293885.554873316|
> ...
> +------------+-----------------+------------------+
>
> Some partition ids do not have counts.
>
> The spark_partition_id() function is planned twice in the final Spark
> plan, where each instance has a different partitioning scheme, resulting in
> different numbers of partition ids.
>
> Joining those together produces mismatches:
>
>    read parquet            read parquet
>         |                       |
>      exchange                exchange
>    17 partitions           12 partitions
>         |                       |
>  spark_partition_id      spark_partition_id
>         |                       |
>         |              count per partition id
>          \                     /
>           \                   /
>          join on partition id
>                    |
>                   ...
>
> I would have expected the Dataset with partition id to have been reused:
>
>        read parquet
>             |
>     spark_partition_id
>           /    \
>          /      \
>         /    count per partition id
>         \         /
>          \       /
>      join on partition id
>             |
>            ...
>
>
> I understand that spark_partition_id is non-deterministic, because it
> depends on the existing partitioning. But being scheduled twice and
> therefore being sensitive to query planning was surprising.
>
> *Questions:*
>
> Is this expected? Is this behavior fully covered by "@note This is
> non-deterministic because it depends on data partitioning and task
> scheduling."?
>
> Is there a way to "materialize" the partition ids (other than caching or
> check-pointing) so that downstream operations do not see different values
> for this column?
>
> Is there a way to tell Catalyst to stop optimizing across
> spark_partition_id (optimize the plan before it separately from the plan
> after it), like the AnalysisBarrier used to do?
>
>
> Cheers,
> Enrico
>
>
> --
> Dr.-Ing. Enrico Minack
> Freiberuflicher Software Ingenieur
>
> e-mail: m...@enrico.minack.dev
> Teams: te...@enrico.minack.dev
> Skype: sk...@enrico.minack.dev

Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-14 Thread Wenchen Fan
+1

On Wed, Jul 13, 2022 at 7:29 PM Yikun Jiang  wrote:

> +1 (non-binding)
>
> Checked out tag and built from source on Linux aarch64 and ran some basic
> test.
>
>
> Regards,
> Yikun
>
>
> On Wed, Jul 13, 2022 at 5:54 AM Mridul Muralidharan 
> wrote:
>
>>
>> +1
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with "-Pyarn -Pmesos -Pkubernetes"
>>
>> As always, the test "SPARK-33084: Add jar support Ivy URI in SQL" in
>> sql.SQLQuerySuite fails in my env; but other than that, the rest looks good.
>>
>> Regards,
>> Mridul
>>
>>
>> On Tue, Jul 12, 2022 at 3:17 AM Maxim Gekk
>>  wrote:
>>
>>> +1
>>>
>>> On Tue, Jul 12, 2022 at 11:05 AM Yang,Jie(INF) 
>>> wrote:
>>>
 +1 (non-binding)



 Yang Jie





 *From**: *Dongjoon Hyun 
 *Date**: *Tuesday, July 12, 2022 16:03
 *To**: *dev 
 *Cc**: *Cheng Su , "Yang,Jie(INF)" <
 yangji...@baidu.com>, Sean Owen 
 *Subject**: *Re: [VOTE] Release Spark 3.2.2 (RC1)



 +1



 Dongjoon.



 On Mon, Jul 11, 2022 at 11:34 PM Cheng Su  wrote:

 +1 (non-binding). Built from source, and ran some scala unit tests on
 M1 mac, with OpenJDK 8 and Scala 2.12.



 Thanks,

 Cheng Su



 On Mon, Jul 11, 2022 at 10:31 PM Yang,Jie(INF) 
 wrote:

 Does this happen when running all UTs? I ran this suite several times
 alone using OpenJDK(zulu) 8u322-b06 on my Mac, but no similar error
 occurred



 *From**: *Sean Owen 
 *Date**: *Tuesday, July 12, 2022 10:45
 *To**: *Dongjoon Hyun 
 *Cc**: *dev 
 *Subject**: *Re: [VOTE] Release Spark 3.2.2 (RC1)



 Is anyone seeing this error? I'm on OpenJDK 8 on a Mac:



 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x000101ca8ace, pid=11962,
 tid=0x1603
 #
 # JRE version: OpenJDK Runtime Environment (8.0_322) (build
 1.8.0_322-bre_2022_02_28_15_01-b00)
 # Java VM: OpenJDK 64-Bit Server VM (25.322-b00 mixed mode bsd-amd64
 compressed oops)
 # Problematic frame:
 # V  [libjvm.dylib+0x549ace]
 #
 # Failed to write core dump. Core dumps have been disabled. To enable
 core dumping, try "ulimit -c unlimited" before starting Java again
 #
 # An error report file with more information is saved as:
 # /private/tmp/spark-3.2.2/sql/core/hs_err_pid11962.log
 ColumnVectorSuite:
 - boolean
 - byte
 Compiled method (nm)  885897 75403 n 0
 sun.misc.Unsafe::putShort (native)
  total in heap  [0x000102fdaa10,0x000102fdad48] = 824
  relocation [0x000102fdab38,0x000102fdab78] = 64
  main code  [0x000102fdab80,0x000102fdad48] = 456
 Compiled method (nm)  885897 75403 n 0
 sun.misc.Unsafe::putShort (native)
  total in heap  [0x000102fdaa10,0x000102fdad48] = 824
  relocation [0x000102fdab38,0x000102fdab78] = 64
  main code  [0x000102fdab80,0x000102fdad48] = 456



 On Mon, Jul 11, 2022 at 4:58 PM Dongjoon Hyun 
 wrote:

 Please vote on releasing the following candidate as Apache Spark
 version 3.2.2.

 The vote is open until July 15th 1AM (PST) and passes if a majority +1
 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.2.2
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see https://spark.apache.org/
 

 The tag to be voted on is v3.2.2-rc1 (commit
 78a5825fe266c0884d2dd18cbca9625fa258d7f7):
 https://github.com/apache/spark/tree/v3.2.2-rc1
 

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/
 

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS
 

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1409/
 

 The documentation corresponding to this release can be found at:
 

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-08 Thread Wenchen Fan
It's better to keep all APIs working. But in this case, I really have no
idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
String, tableName: String) currently checks if table "dbName.tableName"
exists in the Hive metastore, and does not work with v2 catalogs at all.
It's not only a "not needed" API, but also a confusing API. We need a
mechanism to move users away from confusing APIs.
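
For illustration only, a sketch of the migration this deprecation points to
(assuming the 3-layer-namespace support from SPARK-39235; "my_catalog" below
is a placeholder for a configured v2 catalog):

spark.catalog.tableExists("db1", "tbl1")          // deprecated two-argument form
spark.catalog.tableExists("db1.tbl1")             // same behavior via the session catalog
spark.catalog.tableExists("my_catalog.db1.tbl1")  // also reaches v2 catalogs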

I agree that we should not abuse deprecation. I think a general principle
for using deprecation is that you intend to remove the API eventually, which
is exactly the case here. We should remove these 4 APIs once most users
have moved away.

Thanks,
Wenchen

On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun 
wrote:

> Thank you for starting the official discussion, Rui.
>
> 'Unneeded API' doesn't sound like a good frame for this discussion
> because it ignores the existing users and code completely.
> Technically, the above mentioned reasons look irrelevant to any
> specific existing bugs or future maintenance cost saving. Instead, the
> deprecation already causes costs to the community (your PR, the future
> migration guide, and the communication with the customers like Q)
> and to the users for the actual migration to new API and validations.
> Given that, for now, the goal of this proposal looks like a pure
> educational purpose to advertise new APIs to Apache Spark 3.4+ users.
>
> Can we be more conservative at Apache Spark deprecation and allow
> users to use both APIs freely without any concern of uncertain
> insupportability? I simply want to avoid the situation where the pure
> educational deprecation itself becomes `Unneeded Deprecation` in the
> community.
>
> Dongjoon.
>
> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
> >
> > I want to highlight in case I missed this in the original email:
> >
> > The 4 API will not be deleted. They will just be marked as deprecated
> annotations and we encourage users to use their alternatives.
> >
> >
> > -Rui
> >
> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
> >>
> >> Hi Community,
> >>
> >> Proposal:
> >> I want to discuss a proposal to deprecate the following Catalog API:
> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
> >> def getTable(dbName: String, tableName: String): Table
> >> def getFunction(dbName: String, functionName: String): Function
> >> def tableExists(dbName: String, tableName: String): Boolean
> >>
> >>
> >> Context:
> >> We have been adding table identifier with catalog name (aka 3 layer
> namespace) support to Catalog API in
> https://issues.apache.org/jira/browse/SPARK-39235.
> >> The basic idea is, if an API accepts:
> >> 1. only tableName: String, we allow it to accept "a.b.c", which goes
> through the analyzer, treating a as the catalog name, b as the namespace
> name, and c as the table name.
> >> 2. only dbName: String, we allow it to accept "a.b", which goes through
> the analyzer, treating a as the catalog name and b as the namespace name.
> >> Meanwhile, we still maintain backwards compatibility for such APIs to
> make sure past behavior remains the same. E.g., if you only use tableName, it
> is still recognized by the session catalog.
> >>
> >> With this effort ongoing, the above 4 APIs become not fully compatible
> with the 3 layer namespace.
> >>
> >> Take tableExists(dbName: String, tableName: String) as an example: it
> takes two parameters but leaves no room for the extra catalog name. Also, if
> we want to reuse the two parameters, which one would be the one that takes
> more than one name part?
> >>
> >>
> >> How?
> >> So how to improve the above 4 API? There are two options:
> >> a. Expand those four API to let those API accept catalog names. For
> example, tableExists(catalogName: String, dbName: String, tableName:
> String).
> >> b. mark those API as `deprecated`.
> >>
> >> I am proposing to follow option B which does API deprecation.
> >>
> >> Why?
> >> 1. Reduce unneeded APIs. The existing API can support the same behavior
> given SPARK-39235. For example, tableExists(dbName, tableName) can be
> replaced with tableExists("dbName.tableName").
> >> 2. Reduce incomplete APIs. The APIs proposed for deprecation do not
> support the 3 layer namespace now, and it is hard to make them do so (where
> would the 3-part names go?).
> >> 3. Deprecation encourages users to migrate their usage of the API.
> >> 4. There is existing precedent: we deprecated the CreateExternalTable
> API when adding the CreateTable API:
> https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
> >>
> >>
> >> What do you think?
> >>
> >> Thanks,
> >> Rui Wang
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread Wenchen Fan
+1

On Thu, Jul 7, 2022 at 10:41 AM Xinrong Meng
 wrote:

> +1
>
> Thanks!
>
>
> Xinrong Meng
>
> Software Engineer
>
> Databricks
>
>
> On Wed, Jul 6, 2022 at 7:25 PM Xiao Li  wrote:
>
>> +1
>>
>> Xiao
>>
>> On Wed, Jul 6, 2022 at 7:16 PM Cheng Su  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks,
>>> Cheng Su
>>>
>>> On Wed, Jul 6, 2022 at 6:01 PM Yuming Wang  wrote:
>>>
 +1

 On Thu, Jul 7, 2022 at 5:53 AM Maxim Gekk
  wrote:

> +1
>
> On Thu, Jul 7, 2022 at 12:26 AM John Zhuge  wrote:
>
>> +1  Thanks for the effort!
>>
>> On Wed, Jul 6, 2022 at 2:23 PM Bjørn Jørgensen <
>> bjornjorgen...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Wed, Jul 6, 2022 at 23:05, Hyukjin Kwon  wrote:
>>>
 Yeah +1

 On Thu, Jul 7, 2022 at 5:40 AM Dongjoon Hyun <
 dongjoon.h...@gmail.com> wrote:

> Hi, All.
>
> Since the Apache Spark 3.2.1 tag creation (Jan 19), 197 new patches
> including 11 correctness patches have arrived at branch-3.2.
>
> Shall we make a new release, Apache Spark 3.2.2, as the third release
> in the 3.2 line? I'd like to volunteer as the release manager for Apache
> Spark 3.2.2. I'm thinking about starting the first RC next week.
>
> $ git log --oneline v3.2.1..HEAD | wc -l
>  197
>
> # Correctness issues
>
> SPARK-38075 Hive script transform with order by and limit will
> return fake rows
> SPARK-38204 All state operators are at a risk of inconsistency
> between state partitioning and operator partitioning
> SPARK-38309 SHS has incorrect percentiles for shuffle read bytes
> and shuffle total blocks metrics
> SPARK-38320 (flat)MapGroupsWithState can timeout groups which just
> received inputs in the same microbatch
> SPARK-38614 After Spark update, df.show() shows incorrect
> F.percent_rank results
> SPARK-38655 OffsetWindowFunctionFrameBase cannot find the offset
> row whose input is not null
> SPARK-38684 Stream-stream outer join has a possible correctness
> issue due to weakly read consistent on outer iterators
> SPARK-39061 Incorrect results or NPE when using Inline function
> against an array of dynamically created structs
> SPARK-39107 Silent change in regexp_replace's handling of
> empty strings
> SPARK-39259 Timestamps returned by now() and equivalent functions
> are not consistent in subqueries
> SPARK-39293 The accumulator of ArrayAggregate should copy the
> intermediate result if string, struct, array, or map
>
> Best,
> Dongjoon.
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
>> John Zhuge
>>
>


Re: [VOTE][SPIP] Spark Connect

2022-06-14 Thread Wenchen Fan
+1

On Tue, Jun 14, 2022 at 9:38 AM Ruifeng Zheng  wrote:

> +1
>
>
> -- Original Message --
> *From:* "huaxin gao" ;
> *Sent:* Tuesday, June 14, 2022 8:47 AM
> *To:* "L. C. Hsieh";
> *Cc:* "Spark dev list";
> *Subject:* Re: [VOTE][SPIP] Spark Connect
>
> +1
>
> On Mon, Jun 13, 2022 at 5:42 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Mon, Jun 13, 2022 at 5:41 PM Chao Sun  wrote:
>> >
>> > +1 (non-binding)
>> >
>> > On Mon, Jun 13, 2022 at 5:11 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> On Tue, 14 Jun 2022 at 08:50, Yuming Wang  wrote:
>> >>>
>> >>> +1.
>> >>>
>> >>> On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia <
>> matei.zaha...@gmail.com> wrote:
>> 
>>  +1, very excited about this direction.
>> 
>>  Matei
>> 
>>  On Jun 13, 2022, at 11:07 AM, Herman van Hovell
>>  wrote:
>> 
>>  Let me kick off the voting...
>> 
>>  +1
>> 
>>  On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell <
>> her...@databricks.com> wrote:
>> >
>> > Hi all,
>> >
>> > I’d like to start a vote for SPIP: "Spark Connect"
>> >
>> > The goal of the SPIP is to introduce a Dataframe based
>> client/server API for Spark
>> >
>> > Please also refer to:
>> >
>> > - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark
>> Connect - A client and server interface for Apache Spark.
>> > - Design doc: Spark Connect - A client and server interface for
>> Apache Spark.
>> > - JIRA: SPARK-39375
>> >
>> > Please vote on the SPIP for the next 72 hours:
>> >
>> > [ ] +1: Accept the proposal as an official SPIP
>> > [ ] +0
>> > [ ] -1: I don’t think this is a good idea because …
>> >
>> > Kind Regards,
>> > Herman
>> 
>> 
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Wenchen Fan
+1, tests are all green and there are no more blocker issues AFAIK.

On Fri, Jun 10, 2022 at 12:27 PM Maxim Gekk
 wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time June 14th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc6 (commit
> f74867bddfbcdd4d08076db36851e88b15e66556):
> https://github.com/apache/spark/tree/v3.3.0-rc6
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1407
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc6.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>


Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-19 Thread Wenchen Fan
I think it should have been fixed by
https://github.com/apache/spark/commit/0fdb6757946e2a0991256a3b73c0c09d6e764eed
. Maybe the fix is not complete...

On Thu, May 19, 2022 at 2:16 PM Kent Yao  wrote:

> Thanks, Maxim.
>
> Leave my -1 for this release candidate.
>
> Unfortunately, I don't know which PR fixed this.
> Does anyone happen to know?
>
> BR,
> Kent Yao
>
> On Thu, May 19, 2022 at 1:42 PM Maxim Gekk  wrote:
> >
> > Hi Kent,
> >
> > > Shall we backport the fix from the master to 3.3 too?
> >
> > Yes, we shall.
> >
> > Maxim Gekk
> >
> > Software Engineer
> >
> > Databricks, Inc.
> >
> >
> >
> > On Thu, May 19, 2022 at 6:44 AM Kent Yao  wrote:
> >>
> >> Hi,
> >>
> >> I verified the simple case below with the binary release, and it looks
> >> like a bug to me.
> >>
> >> bin/spark-sql -e "select date '2018-11-17' > 1"
> >>
> >> Error in query: Invalid call to toAttribute on unresolved object;
> >> 'Project [unresolvedalias((2018-11-17 > 1), None)]
> >> +- OneRowRelation
> >>
> >> Both 3.2 releases and the master branch work fine with correct errors
> >> -  'due to data type mismatch'.
> >>
> >> Shall we backport the fix from the master to 3.3 too?
> >>
> >> Bests
> >>
> >> Kent Yao
> >>
> >>
> >> > On Wed, May 18, 2022 at 7:04 PM Yuming Wang  wrote:
> >> >
> >> > -1. There is a regression: https://github.com/apache/spark/pull/36595
> >> >
> >> > On Wed, May 18, 2022 at 4:11 PM Martin Grigorov 
> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> [X] +1 Release this package as Apache Spark 3.3.0
> >> >>
> >> >> Tested:
> >> >> - make local distribution from sources (with
> ./dev/make-distribution.sh --tgz --name with-volcano
> -Pkubernetes,volcano,hadoop-3)
> >> >> - create a Docker image (with JDK 11)
> >> >> - run Pi example on
> >> >> -- local
> >> >> -- Kubernetes with default scheduler
> >> >> -- Kubernetes with Volcano scheduler
> >> >>
> >> >> On both x86_64 and aarch64 !
> >> >>
> >> >> Regards,
> >> >> Martin
> >> >>
> >> >>
> >> >> On Mon, May 16, 2022 at 3:44 PM Maxim Gekk <
> maxim.g...@databricks.com.invalid> wrote:
> >> >>>
> >> >>> Please vote on releasing the following candidate as Apache Spark
> version 3.3.0.
> >> >>>
> >> >>> The vote is open until 11:59pm Pacific time May 19th and passes if
> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >> >>>
> >> >>> [ ] +1 Release this package as Apache Spark 3.3.0
> >> >>> [ ] -1 Do not release this package because ...
> >> >>>
> >> >>> To learn more about Apache Spark, please see
> http://spark.apache.org/
> >> >>>
> >> >>> The tag to be voted on is v3.3.0-rc2 (commit
> c8c657b922ac8fd8dcf9553113e11a80079db059):
> >> >>> https://github.com/apache/spark/tree/v3.3.0-rc2
> >> >>>
> >> >>> The release files, including signatures, digests, etc. can be found
> at:
> >> >>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
> >> >>>
> >> >>> Signatures used for Spark RCs can be found in this file:
> >> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >>>
> >> >>> The staging repository for this release can be found at:
> >> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1403
> >> >>>
> >> >>> The documentation corresponding to this release can be found at:
> >> >>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
> >> >>>
> >> >>> The list of bug fixes going into 3.3.0 can be found at the
> following URL:
> >> >>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
> >> >>>
> >> >>> This release is using the release script of the tag v3.3.0-rc2.
> >> >>>
> >> >>>
> >> >>> FAQ
> >> >>>
> >> >>> =
> >> >>> How can I help test this release?
> >> >>> =
> >> >>> If you are a Spark user, you can help us test this release by taking
> >> >>> an existing Spark workload and running it on this release candidate,
> >> >>> then reporting any regressions.
> >> >>>
> >> >>> If you're working in PySpark, you can set up a virtual env, install
> >> >>> the current RC, and see if anything important breaks. In Java/Scala,
> >> >>> you can add the staging repository to your project's resolvers and
> >> >>> test with the RC (make sure to clean up the artifact cache
> >> >>> before/after so you don't end up building with an out-of-date RC
> >> >>> going forward).
> >> >>>
> >> >>> ===
> >> >>> What should happen to JIRA tickets still targeting 3.3.0?
> >> >>> ===
> >> >>> The current list of open tickets targeted at 3.3.0 can be found at:
> >> >>> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.3.0
> >> >>>
> >> >>> Committers should look at those and triage. Extremely important bug
> >> >>> fixes, documentation, and API tweaks that impact compatibility
> should
> >> >>> be worked on immediately. Everything else please retarget to an
> >> >>> appropriate release.
> >> >>>
> >> >>> ==
> >> >>> But my bug isn't fixed?
> >> >>> ==

Re: Unable to create view due to up cast error when migrating from Hive to Spark

2022-05-18 Thread Wenchen Fan
A view is essentially a SQL query. It's fragile to share views between
Spark and Hive because different systems have different SQL dialects. They
may interpret the view SQL query differently and introduce unexpected
behaviors.

In this case, Spark returns decimal type for gender * 0.3 - 0.1 but Hive
returns double type. The view schema was determined during creation by
Hive, which does not match the view SQL query when we use Spark to read the
view. We need to re-create this view using Spark. Actually I think we need
to do the same for every Hive view if we need to use it in Spark.
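
A minimal sketch of re-creating the view from Spark (table and column names
come from the example quoted below; the explicit cast that preserves the old
double type is my assumption, not something the original SQL contains):

spark.sql("""
  CREATE OR REPLACE VIEW test_db.my_view AS
  SELECT
    CASE WHEN age > 12 THEN CAST(gender * 0.3 - 0.1 AS DOUBLE) END AS TT,
    gender, age, careers, education
  FROM test_db.my_table
""")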

On Wed, May 18, 2022 at 7:03 PM beliefer  wrote:

> During the migration from Hive to Spark, there was a problem with the SQL
> used to create views in Hive. The problem is that SQL which legally
> creates a view in Hive raises an error when executed in Spark SQL.
>
> The SQL is as follows:
>
> CREATE VIEW test_db.my_view AS
> select
> case
> when age > 12 then gender * 0.3 - 0.1
> end AS TT,
> gender,
> age,
> careers,
> education
> from
> test_db.my_table;
>
> The error message is as follows:
>
> Cannot up cast TT from decimal(13, 1) to double.
> The type path of the target object is:
>
> You can either add an explicit cast to the input data or choose a higher
> precision type of the field in the target object
>
> *How should we solve this problem?*
>
>
>
>


Re: SIGMOD System Award for Apache Spark

2022-05-13 Thread Wenchen Fan
Great! Congratulations to everyone!

On Fri, May 13, 2022 at 10:38 AM Gengliang Wang  wrote:

> Congratulations to the whole spark community!
>
> On Fri, May 13, 2022 at 10:14 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Congrats Spark community!
>>
>> On Fri, May 13, 2022 at 10:40 AM Qian Sun  wrote:
>>
>>> Congratulations !!!
>>>
>>> On May 13, 2022, at 3:44 AM, Matei Zaharia  wrote:
>>>
>>> Hi all,
>>>
>>> We recently found out that Apache Spark received
>>>  the SIGMOD System Award this
>>> year, given by SIGMOD (the ACM’s data management research organization) to
>>> impactful real-world and research systems. This puts Spark in good company
>>> with some very impressive previous recipients
>>> . This award is
>>> really an achievement by the whole community, so I wanted to say congrats
>>> to everyone who contributes to Spark, whether through code, issue reports,
>>> docs, or other means.
>>>
>>> Matei
>>>
>>>
>>>


Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Wenchen Fan
I'd like to see an RC2 as well. There is a correctness-related bug that was
fixed after RC1 was cut: https://github.com/apache/spark/pull/36468
Users may hit this bug much more frequently if they enable ANSI mode. It's
not a regression, so I'd vote -0.

On Wed, May 11, 2022 at 5:24 AM Thomas graves  wrote:

> Is there going to be an rc2? I thought there were a couple of issue
> mentioned in the thread.
>
> On Tue, May 10, 2022 at 11:53 AM Maxim Gekk
>  wrote:
> >
> > Hi All,
> >
> > Today is the last day for voting. Please, test the RC1 and vote.
> >
> > Maxim Gekk
> >
> > Software Engineer
> >
> > Databricks, Inc.
> >
> >
> >
> > On Sat, May 7, 2022 at 10:58 AM beliefer  wrote:
> >>
> >>
> >>  @Maxim Gekk  Glad to hear that!
> >>
> >> But there is a bug https://github.com/apache/spark/pull/36457
> >> I think we should merge it into 3.3.0
> >>
> >>
> >> At 2022-05-05 19:00:27, "Maxim Gekk" 
> wrote:
> >>
> >> Please vote on releasing the following candidate as Apache Spark
> version 3.3.0.
> >>
> >> The vote is open until 11:59pm Pacific time May 10th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 3.3.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v3.3.0-rc1 (commit
> 482b7d54b522c4d1e25f3e84eabbc78126f22a3d):
> >> https://github.com/apache/spark/tree/v3.3.0-rc1
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1402
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-docs/
> >>
> >> The list of bug fixes going into 3.3.0 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12350369
> >>
> >> This release is using the release script of the tag v3.3.0-rc1.
> >>
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running it on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark, you can set up a virtual env, install
> >> the current RC, and see if anything important breaks. In Java/Scala,
> >> you can add the staging repository to your project's resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out-of-date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 3.3.0?
> >> ===
> >> The current list of open tickets targeted at 3.3.0 can be found at:
> >> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >> Maxim Gekk
> >>
> >> Software Engineer
> >>
> >> Databricks, Inc.
> >>
> >>
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: PR builder not working now

2022-04-19 Thread Wenchen Fan
Thank you, Hyukjin!

On Wed, Apr 20, 2022 at 7:48 AM Dongjoon Hyun 
wrote:

> It's great! Thank you. :)
>
> On Tue, Apr 19, 2022 at 4:42 PM Hyukjin Kwon  wrote:
>
>> It's fixed now.
>>
>> On Tue, 19 Apr 2022 at 08:33, Hyukjin Kwon  wrote:
>>
>>> It's still persistent. I will send an email to GitHub support today
>>>
>>> On Wed, 13 Apr 2022 at 11:04, Dongjoon Hyun 
>>> wrote:
>>>
 Thank you for sharing that information!

 Bests
 Dongjoon.


 On Mon, Apr 11, 2022 at 10:29 PM Hyukjin Kwon 
 wrote:

> Hi all,
>
> There is a bug in GitHub Actions' RESTful API (see
> https://github.com/HyukjinKwon/spark/actions?query=branch%3Adebug-ga-detection
> as an example).
> So, currently OSS PR builder doesn't work properly with showing a
> screen such as
> https://github.com/apache/spark/pull/36157/checks?check_run_id=5984075130
> because we rely on that.
>
> To check the PR builder's status, we should manually find the workflow
> run in PR author's repository for now by going to:
> https://github.com/[PR AUTHOR
> ID]/spark/actions/workflows/build_and_test.yml
>



Re: bazel and external/

2022-03-21 Thread Wenchen Fan
How about renaming it to `connectors` if docker is the only exception and
will be moved out?

On Sat, Mar 19, 2022 at 6:18 PM Alkis Evlogimenos
 wrote:

> It looks like renaming the directory and moving components can be separate
> steps. If there is consensus that connectors will move out, should the
> directory be named misc for everything else until there is some direction
> for the remaining modules?
>
> On Fri, 18 Mar 2022 at 03:03 Jungtaek Lim 
> wrote:
>
>> The Avro reader is technically a connector. We eventually called data source
>> implementations "connectors" as well; the package name in catalyst
>> reflects it.
>>
>> Docker is something I'm not sure fits with the name "external". It
>> probably deserves a top-level directory now, since we are starting to
>> release an official Docker image. That does not seem to be an experimental
>> one.
>>
>> Except Docker, all modules in the external directory are "sort of"
>> connectors. Ganglia metric sink is an exception, but it is still a kind of
>> connector for Dropwizard.
>> (It might be interesting to see how many users are still using
>> kinesis-asl and ganglia-lgpl modules. We have had almost no updates for
>> DStream for several years.)
>>
>> If we agree with my proposal for Docker, what remains is effectively a
>> rename. I don't have a strong opinion; I just wanted to avoid the external
>> directory becoming/remaining a miscellaneous one.
>>
>> On Fri, Mar 18, 2022 at 10:04 AM Sean Owen  wrote:
>>
>>> I sympathize, but it might be less change to just rename the dir. There is
>>> more in there, like the Avro reader; it's kind of miscellaneous. I think we
>>> might want fewer rather than more top-level dirs.
>>>
>>> On Thu, Mar 17, 2022 at 7:33 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 We seem to just focus on how to avoid the conflict with the name
 "external" used in bazel. Since we consider the possibility of renaming,
 why not revisit the modules "external" contains?

 It looks like the kinds of modules the external directory contains are 1)
 Docker, 2) connectors, and 3) sinks for Dropwizard metrics (only Ganglia
 here, and it seems it is there only because Ganglia is LGPL).

 Would it make sense for each kind to have its own top-level directory? We
 can probably give better generalized names, and as a side effect we will no
 longer have "external".

 On Fri, Mar 18, 2022 at 5:45 AM Dongjoon Hyun 
 wrote:

> Thank you for posting this, Alkis.
>
> Before the question (1) and (2), I'm curious if the Apache Spark
> community has other downstreams using Bazel.
>
> To All. If there are some Bazel users with Apache Spark code, could
> you share your practice? If you are using renaming, what is your renamed
> directory name?
>
> Dongjoon.
>
>
> On Thu, Mar 17, 2022 at 11:56 AM Alkis Evlogimenos
>  wrote:
>
>> AFAIK there is not. `external` has been baked in bazel since the
>> beginning and there is no plan from bazel devs to attempt to fix this
>> 
>> .
>>
>> On Thu, Mar 17, 2022 at 7:52 PM Sean Owen  wrote:
>>
>>> Just checking - there is no way to tell bazel to look somewhere else
>>> for whatever 'external' means to it?
>>> It's a kinda big ugly change but it's not a functional change. If
>>> anything it might break some downstream builds that rely on the current
>>> structure too. But such is life for developers? I don't have a strong
>>> reason we can't.
>>>
>>> On Thu, Mar 17, 2022 at 1:47 PM Alkis Evlogimenos
>>>  wrote:
>>>
 Hi Spark devs.

 The Apache Spark repo has a top-level external/ directory. This is
 a reserved name for the bazel build system, and it causes all sorts of
 problems: some can be worked around and some cannot (for details on one
 that cannot, see
 https://github.com/hedronvision/bazel-compile-commands-extractor/issues/30
 ).

 Some forks of Apache Spark use bazel as a build system. It would be
 nice if we can make this change in Apache Spark without resorting to
 complex renames/merges whenever changes are pulled from upstream.

 As such I proposed to rename the external/ directory to something else
 [SPARK-38569 ]. I also sent a tentative [PR-35874 ]
 that renames external/ to vendor/.

 My questions to you are:
 1. Are there any objections to renaming external to X?
 2. Is vendor a good new name for external?

 Cheers,

>>>


Re: Apache Spark 3.3 Release

2022-03-21 Thread Wenchen Fan
Just checked the release calendar; the planned RC cut date is in April.
Let's revisit after 2 weeks then?

On Mon, Mar 21, 2022 at 2:47 PM Wenchen Fan  wrote:

> Shall we revisit this list after a week? Ideally, they should be either
> merged or rejected for 3.3, so that we can cut rc1. We can still discuss
> them case by case at that time if there are exceptions.
>
> On Sat, Mar 19, 2022 at 5:27 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for your summarization.
>>
>> I believe we need to have a discussion in order to evaluate each PR's
>> readiness.
>>
>> BTW, `branch-3.3` is still open for bug fixes including minor dependency
>> changes like the following.
>>
>> (Backported)
>> [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4
>> Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"
>> [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5
>>
>> (Upcoming)
>> [SPARK-38544][BUILD] Upgrade log4j2 to 2.17.2 from 2.17.1
>> [SPARK-38602][BUILD] Upgrade Kafka to 3.1.1 from 3.1.0
>>
>> Dongjoon.
>>
>>
>>
>> On Thu, Mar 17, 2022 at 11:22 PM Maxim Gekk 
>> wrote:
>>
>>> Hi All,
>>>
>>> Here is the allow list which I built based on your requests in this
>>> thread:
>>>
>>>1. SPARK-37396: Inline type hint files for files in
>>>python/pyspark/mllib
>>>2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>>>3. SPARK-37093: Inline type hints python/pyspark/streaming
>>>4. SPARK-37377: Refactor V2 Partitioning interface and remove
>>>deprecated usage of Distribution
>>>5. SPARK-38085: DataSource V2: Handle DELETE commands for
>>>group-based sources
>>>6. SPARK-32268: Bloom Filter Join
>>>7. SPARK-38548: New SQL function: try_sum
>>>8. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
>>>9. SPARK-38063: Support SQL split_part function
>>>10. SPARK-28516: Data Type Formatting Functions: `to_char`
>>>11. SPARK-38432: Refactor framework so as JDBC dialect could compile
>>>filter by self way
>>>12. SPARK-34863: Support nested column in Spark Parquet vectorized
>>>readers
>>>13. SPARK-38194: Make Yarn memory overhead factor configurable
>>>14. SPARK-37618: Support cleaning up shuffle blocks from external
>>>shuffle service
>>>15. SPARK-37831: Add task partition id in metrics
>>>16. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and
>>>DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
>>>17. SPARK-36664: Log time spent waiting for cluster resources
>>>18. SPARK-34659: Web UI does not correctly get appId
>>>19. SPARK-37650: Tell spark-env.sh the python interpreter
>>>20. SPARK-38589: New SQL function: try_avg
>>>21. SPARK-38590: New SQL function: try_to_binary
>>>22. SPARK-34079: Improvement CTE table scan
>>>
>>> Best regards,
>>> Max Gekk
>>>
>>>
>>> On Thu, Mar 17, 2022 at 4:59 PM Tom Graves  wrote:
>>>
>>>> Is the feature freeze target date March 22nd then? I saw a few dates
>>>> thrown around and want to confirm what we landed on.
>>>>
>>>> I am trying to get the following improvements through review and merged;
>>>> if there are concerns with either, let me know:
>>>> - [SPARK-34079][SQL] Merge non-correlated scalar subqueries
>>>> <https://github.com/apache/spark/pull/32298#>
>>>> - [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service
>>>> for released executors <https://github.com/apache/spark/pull/35085#>
>>>>
>>>> Tom
>>>>
>>>>
>>>> On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang <
>>>> ltn...@gmail.com> wrote:
>>>>
>>>>
>>>> I'd like to add the following new SQL functions in the 3.3 release.
>>>> These functions are useful when overflow or encoding errors occur:
>>>>
>>>>- [SPARK-38548][SQL] New SQL function: try_sum
>>>><https://github.com/apache/spark/pull/35848>
>>>>- [SPARK-38589][SQL] New SQL function: try_avg
>>>><https://github.com/apache/spark/pull/35896>
>>>>- [SPARK-38590][SQL] New SQL function: try_to_binary
>>>><https://github.com/apache/spark/pull/35897>
>>>>
>>>> Gengliang
>>>>
>>>

Re: Apache Spark 3.3 Release

2022-03-21 Thread Wenchen Fan
Shall we revisit this list after a week? Ideally, they should be either
merged or rejected for 3.3, so that we can cut rc1. We can still discuss
them case by case at that time if there are exceptions.

On Sat, Mar 19, 2022 at 5:27 AM Dongjoon Hyun 
wrote:

> Thank you for your summarization.
>
> I believe we need to have a discussion in order to evaluate each PR's
> readiness.
>
> BTW, `branch-3.3` is still open for bug fixes including minor dependency
> changes like the following.
>
> (Backported)
> [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4
> Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"
> [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5
>
> (Upcoming)
> [SPARK-38544][BUILD] Upgrade log4j2 to 2.17.2 from 2.17.1
> [SPARK-38602][BUILD] Upgrade Kafka to 3.1.1 from 3.1.0
>
> Dongjoon.
>
>
>
> On Thu, Mar 17, 2022 at 11:22 PM Maxim Gekk 
> wrote:
>
>> Hi All,
>>
>> Here is the allow list which I built based on your requests in this
>> thread:
>>
>>1. SPARK-37396: Inline type hint files for files in
>>python/pyspark/mllib
>>2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>>3. SPARK-37093: Inline type hints python/pyspark/streaming
>>4. SPARK-37377: Refactor V2 Partitioning interface and remove
>>deprecated usage of Distribution
>>5. SPARK-38085: DataSource V2: Handle DELETE commands for group-based
>>sources
>>6. SPARK-32268: Bloom Filter Join
>>7. SPARK-38548: New SQL function: try_sum
>>8. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
>>9. SPARK-38063: Support SQL split_part function
>>10. SPARK-28516: Data Type Formatting Functions: `to_char`
>>11. SPARK-38432: Refactor framework so as JDBC dialect could compile
>>filter by self way
>>12. SPARK-34863: Support nested column in Spark Parquet vectorized
>>readers
>>13. SPARK-38194: Make Yarn memory overhead factor configurable
>>14. SPARK-37618: Support cleaning up shuffle blocks from external
>>shuffle service
>>15. SPARK-37831: Add task partition id in metrics
>>16. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and
>>DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
>>17. SPARK-36664: Log time spent waiting for cluster resources
>>18. SPARK-34659: Web UI does not correctly get appId
>>19. SPARK-37650: Tell spark-env.sh the python interpreter
>>20. SPARK-38589: New SQL function: try_avg
>>21. SPARK-38590: New SQL function: try_to_binary
>>22. SPARK-34079: Improvement CTE table scan
>>
>> Best regards,
>> Max Gekk
>>
>>
>> On Thu, Mar 17, 2022 at 4:59 PM Tom Graves  wrote:
>>
>>> Is the feature freeze target date March 22nd then? I saw a few dates
>>> thrown around and want to confirm what we landed on.
>>>
>>> I am trying to get the following improvements through review and merged;
>>> if there are concerns with either, let me know:
>>> - [SPARK-34079][SQL] Merge non-correlated scalar subqueries
>>> 
>>> - [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service
>>> for released executors 
>>>
>>> Tom
>>>
>>>
>>> On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang <
>>> ltn...@gmail.com> wrote:
>>>
>>>
>>> I'd like to add the following new SQL functions in the 3.3 release.
>>> These functions are useful when overflow or encoding errors occur:
>>>
>>>- [SPARK-38548][SQL] New SQL function: try_sum
>>>
>>>- [SPARK-38589][SQL] New SQL function: try_avg
>>>
>>>- [SPARK-38590][SQL] New SQL function: try_to_binary
>>>
>>>
>>> Gengliang
>>>
>>> On Thu, Mar 17, 2022 at 7:59 AM Andrew Melo 
>>> wrote:
>>>
>>> Hello,
>>>
>>> I've been trying for a bit to get the following two PRs merged and
>>> into a release, and I'm having some difficulty moving them forward:
>>>
>>> https://github.com/apache/spark/pull/34903 - This passes the current
>>> python interpreter to spark-env.sh to allow some currently-unavailable
>>> customization to happen
>>> https://github.com/apache/spark/pull/31774 - This fixes a bug in the
>>> SparkUI reverse proxy-handling code where it does a greedy match for
>>> "proxy" in the URL, and will mistakenly replace the App-ID in the
>>> wrong place.
>>>
>>> I'm not exactly sure of how to get attention of PRs that have been
>>> sitting around for a while, but these are really important to our
>>> use-cases, and it would be nice to have them merged in.
>>>
>>> Cheers
>>> Andrew
>>>
>>> On Wed, Mar 16, 2022 at 6:21 PM Holden Karau 
>>> wrote:
>>> >
>>> > I'd like to add/backport the logging in
>>> https://github.com/apache/spark/pull/35881 PR so that when users submit
>>> issues with dynamic allocation we can better debug what's going on.
>>> >
>>> > On Wed, Mar 16, 2022 at 3:45 PM Chao Sun  wrote:
>>> >>
>>> >> There is one item on 

Re: Apache Spark 3.3 Release

2022-03-16 Thread Wenchen Fan
+1 to define an allowlist of features that we want to backport to branch
3.3. I also have a few in my mind:
- complex type support in vectorized parquet reader:
  https://github.com/apache/spark/pull/34659
- refine the DS v2 filter API for JDBC v2:
  https://github.com/apache/spark/pull/35768
- a few new SQL functions that have been in development for a while: to_char,
  split_part, percentile_disc, try_sum, etc.

On Wed, Mar 16, 2022 at 2:41 PM Maxim Gekk
 wrote:

> Hi All,
>
> I have created the branch for Spark 3.3:
> https://github.com/apache/spark/commits/branch-3.3
>
> Please, backport important fixes to it, and if you have some doubts, ping
> me in the PR. Regarding new features, we are still building the allow list
> for branch-3.3.
>
> Best regards,
> Max Gekk
>
>
> On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun 
> wrote:
>
>> Yes, I agree with you for your whitelist approach for backporting. :)
>> Thank you for summarizing.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:
>>
>>> I think I finally got your point. What you want to keep unchanged is the
>>> branch cut date of Spark 3.3. Today? or this Friday? This is not a big
>>> deal.
>>>
>>> My major concern is whether we should keep merging the feature work or
>>> the dependency upgrade after the branch cut. To make our release time more
>>> predictable, I am suggesting we should finalize the exception PR list
>>> first, instead of merging them in an ad hoc way. In the past, we spent a
>>> lot of time on the revert of the PRs that were merged after the branch cut.
>>> I hope we can minimize unnecessary arguments in this release. Do you agree,
>>> Dongjoon?
>>>
>>>
>>>
>>> On Tue, Mar 15, 2022 at 3:55 PM Dongjoon Hyun  wrote:
>>>
 That is not totally fine, Xiao. It sounds like you are asking for a change
 of plan without a proper reason.

 Although we cut the branch today according to our plan, you can still
 collect the list and make a list of exceptions. I'm not blocking what you
 want to do.

 Please let the community start to ramp down as we agreed before.

 Dongjoon



 On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:

> Please do not get me wrong. If we don't cut a branch, we are allowing
> all patches to land in Apache Spark 3.3. That is totally fine. After we cut
> the branch, we should avoid merging feature work. In the next three
> days, let us collect the actively developed PRs that we want to make an
> exception for (i.e., merge to 3.3 after the upcoming branch cut). Does that
> make sense?
>
> On Tue, Mar 15, 2022 at 2:54 PM Dongjoon Hyun  wrote:
>
>> Xiao. You are working against what you are saying.
>> If you don't cut a branch, it means you are allowing all patches to
>> land in Apache Spark 3.3. No?
>>
>> > we need to avoid backporting the feature work that are not being
>> well discussed.
>>
>>
>>
>> On Tue, Mar 15, 2022 at 12:12 PM Xiao Li 
>> wrote:
>>
>>> Cutting the branch is simple, but we need to avoid backporting
>>> feature work that has not been well discussed. Not all the members are
>>> actively following the dev list. I think we should wait 3 more days to
>>> collect the PR list before cutting the branch.
>>>
>>> BTW, there are very few 3.4-only feature work that will be affected.
>>>
>>> Xiao
>>>
>>> On Tue, Mar 15, 2022 at 11:49 AM Dongjoon Hyun  wrote:
>>>
 Hi, Max, Chao, Xiao, Holden and all.

 I have a different idea.

 Given the situation and small patch list, I don't think we need to
 postpone the branch cut for those patches. It's easier to cut a 
 branch-3.3
 and allow backporting.

 As of today, we already have an obvious Apache Spark 3.4 patch in
 the branch. This situation only becomes worse and worse because
 there is no way to block other patches from landing unintentionally if
 we don't cut a branch.

 [SPARK-38335][SQL] Implement parser support for DEFAULT column
 values

 Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.

 Best,
 Dongjoon.


 On Tue, Mar 15, 2022 at 10:17 AM Chao Sun 
 wrote:

> Cool, thanks for clarifying!
>
> On Tue, Mar 15, 2022 at 10:11 AM Xiao Li 
> wrote:
> >>
> >> For the following list:
> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
> vectorized reader
> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> >> Do you mean we should include them, or exclude them from 3.3?
> >
> >
> > If possible, I hope these features can be shipped with Spark 3.3.
> >
> >

Re: Data correctness issue with Repartition + FetchFailure

2022-03-16 Thread Wenchen Fan
It's great if you can help with it! Basically, we need to propagate the
column-level determinism information and sort the inputs if the partition
key's lineage has a nondeterministic part.
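
In the meantime, a hedged user-level sketch of how to sidestep the issue
described in SPARK-38388 (the checkpoint directory is a placeholder):
materialize the non-deterministic values once before repartitioning, so a
retried task cannot re-generate different rows.

import org.apache.spark.sql.functions.rand

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val df = spark.range(0, 1000000).withColumn("rnd", rand())
val stable = df.checkpoint()   // eager by default; freezes the random values
val repartitioned = stable.repartition(100)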

On Wed, Mar 16, 2022 at 5:28 AM Jason Xu  wrote:

> Hi Wenchen, thanks for the insight. Agreed, the previous fix for
> repartition works for deterministic data. With non-deterministic data, I
> didn't find an API to pass the DeterministicLevel to the underlying RDD.
> Do you plan to continue working on the integration with SQL operators? If
> not, I'm available to take a stab.
>
> On Mon, Mar 14, 2022 at 7:00 PM Wenchen Fan  wrote:
>
>> We fixed the repartition correctness bug before, by sorting the data
>> before doing round-robin partitioning. But the issue is that we need to
>> propagate the isDeterministic property through SQL operators.
>>
>> On Tue, Mar 15, 2022 at 1:50 AM Jason Xu  wrote:
>>
>>> Hi Reynold, do you suggest removing RoundRobinPartitioning in
>>> repartition(numPartitions: Int) API implementation? If that's the direction
>>> we're considering, before we have a new implementation, should we suggest
>>> users avoid using the repartition(numPartitions: Int) API?
>>>
>>> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:
>>>
>>>> This is why RoundRobinPartitioning shouldn't be used ...
>>>>
>>>>
>>>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu 
>>>> wrote:
>>>>
>>>>> Hi Spark community,
>>>>>
>>>>> I reported a data correctness issue in
>>>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>>>> non-deterministic data + Repartition + FetchFailure could result in
>>>>> incorrect data, this is an issue we run into in production pipelines, I
>>>>> have an example to reproduce the bug in the ticket.
>>>>>
>>>>> I report here to bring more attention, could you help confirm it's a
>>>>> bug and worth effort to further investigate and fix, thank you in advance
>>>>> for help!
>>>>>
>>>>> Thanks,
>>>>> Jason Xu
>>>>>
>>>>
>>>>


Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Wenchen Fan
We fixed the repartition correctness bug before, by sorting the data before
doing round-robin partitioning. But the issue is that we need to propagate
the isDeterministic property through SQL operators.
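
For context, a sketch of where that earlier mitigation is controlled (the
config name is given to the best of my knowledge; treat it as an assumption):
the sort happens inside each map task before the round-robin shuffle, so a
retried task reproduces the same output order as long as its input rows are
themselves deterministic, which is exactly the gap being discussed here.

// Internal SQL config for the sort-before-round-robin mitigation.
spark.conf.get("spark.sql.execution.sortBeforeRepartition")   // "true" by default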

On Tue, Mar 15, 2022 at 1:50 AM Jason Xu  wrote:

> Hi Reynold, do you suggest removing RoundRobinPartitioning in
> repartition(numPartitions: Int) API implementation? If that's the direction
> we're considering, before we have a new implementation, should we suggest
> users avoid using the repartition(numPartitions: Int) API?
>
> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:
>
>> This is why RoundRobinPartitioning shouldn't be used ...
>>
>>
>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu 
>> wrote:
>>
>>> Hi Spark community,
>>>
>>> I reported a data correctness issue in
>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>> non-deterministic data + Repartition + FetchFailure could result in
>>> incorrect data, this is an issue we run into in production pipelines, I
>>> have an example to reproduce the bug in the ticket.
>>>
>>> I report here to bring more attention, could you help confirm it's a bug
>>> and worth effort to further investigate and fix, thank you in advance for
>>> help!
>>>
>>> Thanks,
>>> Jason Xu
>>>
>>
>>


Re: [VOTE] Spark 3.1.3 RC4

2022-02-15 Thread Wenchen Fan
+1

On Tue, Feb 15, 2022 at 3:59 PM Yuming Wang  wrote:

> +1 (non-binding).
>
> On Tue, Feb 15, 2022 at 10:22 AM Ruifeng Zheng 
> wrote:
>
>> +1 (non-binding)
>>
>> checked the release script issue Dongjoon mentioned:
>>
>> curl -s
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/spark-3.1.3-bin-hadoop2.7.tgz
>> | tar tz | grep hadoop-common
>> spark-3.1.3-bin-hadoop2.7/jars/hadoop-common-2.7.4
>>
>>
>> -- Original Message --
>> *From:* "Sean Owen" ;
>> *Sent:* Tuesday, February 15, 2022 10:01 AM
>> *To:* "Holden Karau";
>> *Cc:* "dev";
>> *Subject:* Re: [VOTE] Spark 3.1.3 RC4
>>
>> Looks good to me, same results as last RC, +1
>>
>> On Mon, Feb 14, 2022 at 2:55 PM Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.1.3.
>>>
>>> The vote is open until Feb. 18th at 1 PM pacific (9 PM GMT) and passes
>>> if a majority
>>> +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.3
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> There are currently no open issues targeting 3.1.3 in Spark's JIRA
>>> https://issues.apache.org/jira/browse
>>> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
>>> (Open, Reopened, "In Progress"))
>>> at https://s.apache.org/n79dw
>>>
>>>
>>>
>>> The tag to be voted on is v3.1.3-rc4 (commit
>>> d1f8a503a26bcfb4e466d9accc5fa241a7933667):
>>> https://github.com/apache/spark/tree/v3.1.3-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at
>>> https://repository.apache.org/content/repositories/orgapachespark-1401
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-docs/
>>>
>>> The list of bug fixes going into 3.1.3 can be found at the following URL:
>>> https://s.apache.org/x0q9b
>>>
>>> This release is using the release script from 3.1.3
>>> The release docker container was rebuilt since the previous version
>>> didn't have the necessary components to build the R documentation.
>>>
>>> FAQ
>>>
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.1.3?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.1.3 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.1.3
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something that is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Note: I added an extra day to the vote since I know some folks are
>>> likely busy on the 14th with partner(s).
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: [VOTE] Spark 3.1.3 RC3

2022-02-07 Thread Wenchen Fan
Shall we use the release scripts of branch 3.1 to release 3.1?

On Fri, Feb 4, 2022 at 4:57 AM Holden Karau  wrote:

> Good catch Dongjoon :)
>
> This release candidate fails, but feel free to keep testing for any other
> potential blockers.
>
> I’ll roll RC4 next week with the older release scripts (but the more
> modern image since the legacy image didn’t have a good time with the R doc
> packaging).
>
> On Thu, Feb 3, 2022 at 3:53 PM Dongjoon Hyun 
> wrote:
>
>> Unfortunately, -1 for 3.1.3 RC3 due to the packaging issue.
>>
>> It seems that the master branch release script didn't work properly for
>> Hadoop 2 binary distribution, Holden.
>>
>> $ curl -s
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/spark-3.1.3-bin-hadoop2.tgz
>> | tar tz | grep hadoop-common
>> spark-3.1.3-bin-hadoop2/jars/hadoop-common-3.2.0.jar
>>
>> Apache Spark didn't drop Apache Hadoop 2 based binary distribution yet.
>>
>> Dongjoon
>>
>>
>> On Wed, Feb 2, 2022 at 3:38 PM Mridul Muralidharan 
>> wrote:
>>
>>> Hi,
>>>
>>>   Minor nit: the tag mentioned under [1] looks like a typo - I used
>>> "v3.1.3-rc3"  for my vote (3.2.1 is mentioned in a couple of places, treat
>>> them as 3.1.3 instead)
>>>
>>> +1
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>>
>>> Regards,
>>> Mridul
>>>
>>> [1] "The tag to be voted on is v3.2.1-rc1" - the commit hash and git url
>>> are correct.
>>>
>>>
>>> On Wed, Feb 2, 2022 at 9:30 AM Mridul Muralidharan 
>>> wrote:
>>>

 Thanks Tom !
 I missed [1] (or probably forgot) the 3.1 part of the discussion given
 it centered around 3.2 ...


 Regards,
 Mridul

 [1] https://www.mail-archive.com/dev@spark.apache.org/msg28484.html

 On Wed, Feb 2, 2022 at 8:55 AM Thomas Graves 
 wrote:

> It was discussed doing all the maintenance lines back at beginning of
> December (Dec 6) when we were talking about release 3.2.1.
>
> Tom
>
> On Wed, Feb 2, 2022 at 2:07 AM Mridul Muralidharan 
> wrote:
> >
> > Hi Holden,
> >
> >   Not that I am against releasing 3.1.3 (given the fixes that have
> already gone in), but did we discuss releasing it ? I might have missed 
> the
> thread ...
> >
> > Regards,
> > Mridul
> >
> > On Tue, Feb 1, 2022 at 7:12 PM Holden Karau 
> wrote:
> >>
> >> Please vote on releasing the following candidate as Apache Spark
> version 3.1.3.
> >>
> >> The vote is open until Feb. 4th at 5 PM PST (1 AM UTC + 1 day) and
> passes if a majority
> >> +1 PMC votes are cast, with a minimum of 3 + 1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 3.1.3
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see
> http://spark.apache.org/
> >>
> >> There are currently no open issues targeting 3.1.3 in Spark's JIRA
> https://issues.apache.org/jira/browse
> >> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
> (Open, Reopened, "In Progress"))
> >> at https://s.apache.org/n79dw
> >>
> >>
> >>
> >> The tag to be voted on is v3.2.1-rc1 (commit
> >> b8c0799a8cef22c56132d94033759c9f82b0cc86):
> >> https://github.com/apache/spark/tree/v3.1.3-rc3
> >>
> >> The release files, including signatures, digests, etc. can be found
> at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at
> >> :
> https://repository.apache.org/content/repositories/orgapachespark-1400/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-docs/
> >>
> >> The list of bug fixes going into 3.1.3 can be found at the
> following URL:
> >> https://s.apache.org/x0q9b
> >>
> >> This release is using the release script in master as of
> ddc77fb906cb3ce1567d277c2d0850104c89ac25
> >> The release docker container was rebuilt since the previous version
> didn't have the necessary components to build the R documentation.
> >>
> >> FAQ
> >>
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate,
> then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and
> install
> >> the current RC and see if anything important breaks, in the
> Java/Scala
> >> you can add the staging repository to 

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-07 Thread Wenchen Fan
+1 (binding)

On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee  wrote:

> +1 (non-binding). Thanks John!
> It's great to see ViewCatalog moving on, it's a nice feature.
>
> On Sat, Feb 5, 2022 at 11:57 AM Terry Kim  wrote:
>
>> +1 (non-binding). Thanks John!
>>
>> Terry
>>
>> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu  wrote:
>>
>>> +1 (non-binding)
>>> Best,
>>>
>>> Yufei
>>>
>>> `This is not a contribution`
>>>
>>>
>>> On Fri, Feb 4, 2022 at 11:54 AM huaxin gao 
>>> wrote:
>>>
 +1 (non-binding)

 On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh  wrote:

> +1
>
> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun  wrote:
> >
> > +1 (non-binding). Looking forward to this feature!
> >
> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue  wrote:
> >>
> >> +1 for the SPIP. I think it's well designed and it has worked quite
> well at Netflix for a long time.
> >>
> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge 
> wrote:
> >>>
> >>> Hi Spark community,
> >>>
> >>> I’d like to restart the vote for the ViewCatalog design proposal
> (SPIP).
> >>>
> >>> The proposal is to add a ViewCatalog interface that can be used to
> load, create, alter, and drop views in DataSourceV2.
> >>>
> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
> >>>
> >>> [ ] +1: Accept the proposal as an official SPIP
> >>> [ ] +0
> >>> [ ] -1: I don’t think this is a good idea because …
> >>>
> >>> Thanks!
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Tabular
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-24 Thread Wenchen Fan
+1

On Tue, Jan 25, 2022 at 10:13 AM Ruifeng Zheng  wrote:

> +1 (non-binding)
>
>
> -- Original Message --
> *From:* "Kent Yao" ;
> *Sent:* Tuesday, January 25, 2022, 10:09 AM
> *To:* "John Zhuge";
> *Cc:* "dev";
> *Subject:* Re: [VOTE] Release Spark 3.2.1 (RC2)
>
> +1, non-binding
>
> On Tue, Jan 25, 2022 at 6:56 AM John Zhuge  wrote:
>
>> +1 (non-binding)
>>
>> On Mon, Jan 24, 2022 at 2:28 PM Cheng Su  wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> Cheng Su
>>>
>>>
>>>
>>> *From: *Chao Sun 
>>> *Date: *Monday, January 24, 2022 at 2:10 PM
>>> *To: *Michael Heuer 
>>> *Cc: *dev 
>>> *Subject: *Re: [VOTE] Release Spark 3.2.1 (RC2)
>>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 6:32 AM Michael Heuer  wrote:
>>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>>michael
>>>
>>>
>>>
>>>
>>>
>>> On Jan 24, 2022, at 7:30 AM, Gengliang Wang  wrote:
>>>
>>>
>>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 6:26 PM Dongjoon Hyun 
>>> wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> On Sat, Jan 22, 2022 at 7:19 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> Signatures, digests, etc check out fine.
>>>
>>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>>
>>>
>>>
>>> Regards,
>>>
>>> Mridul
>>>
>>>
>>>
>>> On Fri, Jan 21, 2022 at 9:01 PM Sean Owen  wrote:
>>>
>>> +1 with same result as last time.
>>>
>>>
>>>
>>> On Thu, Jan 20, 2022 at 9:59 PM huaxin gao 
>>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.2.1. The vote is open until 8:00pm Pacific time January 25 and passes if
>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.2.1-rc2 (commit
>>> 4f25b3f71238a00508a356591553f2dfa89f8290):
>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1398/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>>>
>>> The list of bug fixes going into 3.2.1 can be found at the following URL:
>>> https://s.apache.org/yu0cy
>>>
>>> This release is using the release script of the tag v3.2.1-rc2.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install the
>>> current RC and see if anything important breaks, in the Java/Scala you can
>>> add the staging repository to your projects resolvers and test with the RC
>>> (make sure to clean up the artifact cache before/after so you don't end up
>>> building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.2.1?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.2.1 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.2.1
>>>
>>> Committers should look at those and triage. Extremely important bug fixes,
>>> documentation, and API tweaks that impact compatibility should be worked on
>>> immediately. Everything else please retarget to an appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from the previous release. That
>>> being said, if there is something which is a regression that has not been
>>> correctly targeted please ping me or a committer to help target the issue.
>>>
>>>
>>>
>>>
>>
>> --
>> John Zhuge
>>
>


Re: Difference in behavior for Spark 3.0 vs Spark 3.1 "create database "

2022-01-11 Thread Wenchen Fan
Hopefully, this StackOverflow answer can solve your problem:
https://stackoverflow.com/questions/47523037/how-do-i-configure-pyspark-to-write-to-hdfs-by-default

Spark doesn't control the behavior of qualifying paths. It's decided by
certain configs and/or config files.
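
For example (the host name and port below are made up), spelling out the file
system scheme removes any dependence on fs.defaultFS or the Hadoop config
files on the driver:

// In spark-shell (`spark` is the active SparkSession).
// Unqualified: resolved against whatever fs.defaultFS the deployment provides.
spark.sql("CREATE DATABASE test_unqualified LOCATION '/user/hive'")
// Fully qualified: always lands on HDFS, regardless of the default file system.
spark.sql("CREATE DATABASE test_qualified LOCATION 'hdfs://namenode:8020/user/hive'")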

On Tue, Jan 11, 2022 at 3:03 AM Pablo Langa Blanco  wrote:

> Hi Pralabh,
>
> If it helps, it is probably related to this change
> https://github.com/apache/spark/pull/28527
>
> Regards
>
> On Mon, Jan 10, 2022 at 10:42 AM Pralabh Kumar 
> wrote:
>
>> Hi Spark Team
>>
>> When creating a database via Spark 3.0 on Hive
>>
>> 1) spark.sql("create database test location '/user/hive'").  It creates
>> the database location on hdfs . As expected
>>
>> 2) When running the same command on 3.1, the database is created on the
>> local file system by default. I have to prefix the path with hdfs:// to
>> create the database on HDFS.
>>
>> Why is there a difference in the behavior? Can you please point me to the
>> JIRA which causes this change?
>>
>> Note : spark.sql.warehouse.dir and hive.metastore.warehouse.dir both are
>> having default values(not explicitly set)
>>
>> Regards
>> Pralabh Kumar
>>
>


Re: Time for Spark 3.2.1?

2021-12-06 Thread Wenchen Fan
+1 to make new maintenance releases for all 3.x branches.

On Tue, Dec 7, 2021 at 8:57 AM Sean Owen  wrote:

> Always fine by me if someone wants to roll a release.
>
> It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a new
> release of those wouldn't hurt either, if any of our release managers have
> the time or inclination. 3.0.x is reaching unofficial end-of-life around
> now anyway.
>
>
> On Mon, Dec 6, 2021 at 6:55 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> It's been two months since Spark 3.2.0 release, and we have resolved many
>> bug fixes and regressions. What do you guys think about rolling Spark 3.2.1
>> release?
>>
>> cc @huaxin gao  FYI who I happened to overhear
>> that is interested in rolling the maintenance release :-).
>>
>


Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Wenchen Fan
Thanks, Shane! Really appreciate it!

Wenchen

On Tue, Dec 7, 2021 at 12:38 PM Xiao Li  wrote:

> Hi, Shane,
>
> Thank you for your work on it!
>
> Xiao
>
>
>
>
> On Mon, Dec 6, 2021 at 6:20 PM L. C. Hsieh  wrote:
>
>> Thank you, Shane.
>>
>> On Mon, Dec 6, 2021 at 4:27 PM Holden Karau  wrote:
>> >
>> > Shane you kick ass thank you for everything you’ve done for us :) Keep
>> on rocking :)
>> >
>> > On Mon, Dec 6, 2021 at 4:24 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> Thanks, Shane.
>> >>
>> >> On Tue, 7 Dec 2021 at 09:19, Dongjoon Hyun 
>> wrote:
>> >>>
>> >>> I really want to thank you for all your help.
>> >>> You've done so many things for the Apache Spark community.
>> >>>
>> >>> Sincerely,
>> >>> Dongjoon
>> >>>
>> >>>
>> >>> On Mon, Dec 6, 2021 at 12:02 PM shane knapp ☠ 
>> wrote:
>> 
>>  hey everyone!
>> 
>>  after a marathon run of nearly a decade, we're finally going to be
>> shutting down {amp|rise}lab jenkins at the end of this month...
>> 
>>  the earliest snapshot i could find is from 2013 with builds for
>> spark 0.7:
>> 
>> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>> 
>>  it's been a hell of a run, and i'm gonna miss randomly tweaking the
>> build system, but technology has moved on and running a dedicated set of
>> servers for just one open source project is just too expensive for us here
>> at uc berkeley.
>> 
>>  if there's interest, i'll fire up a zoom session and all y'alls can
>> watch me type the final command:
>> 
>>  systemctl stop jenkins
>> 
>>  feeling bittersweet,
>> 
>>  shane
>>  --
>>  Shane Knapp
>>  Computer Guy / Voice of Reason
>>  UC Berkeley EECS Research / RISELab Staff Technical Lead
>>  https://rise.cs.berkeley.edu
>> >
>> > --
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
>
>


Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Wenchen Fan
It's useful to have a SQL API to specify table options, similar to the
DataFrameReader API. However, I share the same concern as @Hyukjin Kwon
and am not very comfortable with using hints to do it.

In the PR, someone mentioned TVF. I think it's better than hints, but still
has problems. For example, shall we support `FROM read(t AS VERSION OF 1,
options...)`?

We probably should investigate if there are similar SQL syntaxes in other
databases first.
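
For comparison, this is roughly what is already possible through the
DataFrameReader API today (the option key below comes from the snapshot
example in this thread and is only illustrative); the open question is how to
express the same per-scan options without leaving SQL:

// In spark-shell (`spark` is the active SparkSession). Options set on the
// reader before .table() are passed through to the v2 table scan.
val df = spark.read
  .option("snapshot-id", "10963874102873")
  .table("t")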

On Wed, Nov 17, 2021 at 2:39 AM Mich Talebzadeh 
wrote:

> This concept is explained here
> 
> somehow. If this is true why cannot we just use
>
> SELECT * FROM  VERSION AS OF 
>
>
>   view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 16 Nov 2021 at 17:49, Ryan Blue  wrote:
>
>> Mich, time travel will use the newly added VERSION AS OF or TIMESTAMP AS
>> OF syntax.
>>
>> On Tue, Nov 16, 2021 at 12:40 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> As I stated before, hints are designed to direct the optimizer to
>>> choose a certain query execution plan based on the specific criteria.
>>>
>>>
>>> -- time travel
>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>
>>>
>>> The alternative would be to specify time travel by creating a snapshot
>>> based on CURRENT_DATE() range which encapsulates time travel for
>>> 'snapshot-id'='10963874102873L'
>>>
>>>
>>> CREATE SNAPSHOT t_snap
>>>
>>>   START WITH CURRENT_DATE() - 30
>>>
>>>   NEXT CURRENT_DATE()
>>>
>>>   AS SELECT * FROM t
>>>
>>>
>>> SELECT * FROM t_snap
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon  wrote:
>>>
 My biggest concern with the syntax in hints is that Spark SQL's options
 can change results (e.g., CSV's header options) whereas hints are generally
 not designed to affect the external results if I am not mistaken. This is
 counterintuitive.
 I left the comment in the PR, but what's the real benefit over
 leveraging SET conf and RESET conf? We can extract options from runtime
 session configurations, e.g., via SessionConfigSupport.

 On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> Side note about time travel: There is a PR
>  to add VERSION/TIMESTAMP
> AS OF syntax to Spark SQL.
>
> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:
>
>> I want to note that I wouldn't recommend time traveling this way by
>> using the hint for `snapshot-id`. Instead, we want to add the standard 
>> SQL
>> syntax for that in a separate PR. This is useful for other options that
>> help a table scan perform better, like specifying the target split size.
>>
>> You're right that this isn't a typical optimizer hint, but I'm not
>> sure what other syntax is possible for this use case. How else would we
>> send custom properties through to the scan?
>>
>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I am looking at the hint and it appears to me (I stand corrected),
>>> it is a single table hint as below:
>>>
>>> -- time travel
>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>
>>> My assumption is that any view on this table will also benefit from
>>> this hint. This is not a hint to optimizer in a classical sense. Only a
>>> snapshot hint. Normally, a hint is an instruction to the optimizer.
>>> When writing SQL, one may know information about the data unknown to the
>>> optimizer. Hints enable one to make decisions normally made by the
>>> optimizer, sometimes causing the optimizer to select a plan that it 
>>> sees as
>>> higher cost.
>>>
>>>
>>> So far as this case is concerned, it looks OK and I concur it should
>>> be extended to write as well.
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> 

Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3

2021-11-16 Thread Wenchen Fan
Great job!

On Sat, Nov 13, 2021 at 11:18 AM Hyukjin Kwon  wrote:

> Awesome!
>
> On Sat, Nov 13, 2021 at 12:04 PM Xiao Li  wrote:
>
>> Thank you! Great job!
>>
>> Xiao
>>
>>
>> On Fri, Nov 12, 2021 at 7:02 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Nice job !
>>> There are some nice API's which should be interesting to explore with
>>> JDK 17 :-)
>>>
>>> Regards.
>>> Mridul
>>>
>>> On Fri, Nov 12, 2021 at 7:08 PM Yuming Wang  wrote:
>>>
 Cool, thank you Dongjoon.

 On Sat, Nov 13, 2021 at 4:09 AM shane knapp ☠ 
 wrote:

> woot!  nice work everyone!  :)
>
> On Fri, Nov 12, 2021 at 11:37 AM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> Hi, All.
>>
>> Apache Spark community has been working on Java 17 support under the
>> following JIRA.
>>
>> https://issues.apache.org/jira/browse/SPARK-33772
>>
>> As of today, Apache Spark starts to have daily Java 17 test coverage
>> via GitHub Action jobs for Apache Spark 3.3.
>>
>>
>> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L38-L39
>>
>> Today's successful run is here.
>>
>> https://github.com/apache/spark/actions/runs/1453788012
>>
>> Please note that we are still working on some new Java 17 features
>> like
>>
>> JEP 391: macOS/AArch64 Port
>> https://bugs.openjdk.java.net/browse/JDK-8251280
>>
>> For example, Oracle Java, Azul Zulu, and Eclipse Temurin Java 17
>> already support Apple Silicon natively, but some 3rd party libraries like
>> RocksDB/LevelDB are not ready yet. Since Mac is one of the popular dev
>> environments, we are going to keep monitoring and improving gradually for
>> Apache Spark 3.3.
>>
>> Please test Java 17 and let us know your feedback.
>>
>> Thanks,
>> Dongjoon.
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>

>>
>> --
>>
>>


Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-16 Thread Wenchen Fan
+1

On Mon, Nov 15, 2021 at 2:54 AM John Zhuge  wrote:

> +1 (non-binding)
>
> On Sun, Nov 14, 2021 at 10:33 AM Chao Sun  wrote:
>
>> +1 (non-binding). Thanks Anton for the work!
>>
>> On Sun, Nov 14, 2021 at 10:01 AM Ryan Blue  wrote:
>>
>>> +1
>>>
>>> Thanks to Anton for all this great work!
>>>
>>> On Sat, Nov 13, 2021 at 8:24 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 +1 non-binding



view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Sat, 13 Nov 2021 at 15:07, Russell Spitzer <
 russell.spit...@gmail.com> wrote:

> +1 (never binding)
>
> On Sat, Nov 13, 2021 at 1:10 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> On Fri, Nov 12, 2021 at 6:58 PM huaxin gao 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Nov 12, 2021 at 6:44 PM Yufei Gu 
>>> wrote:
>>>
 +1

 > On Nov 12, 2021, at 6:25 PM, L. C. Hsieh 
 wrote:
 >
 > Hi all,
 >
 > I’d like to start a vote for SPIP: Row-level operations in Data
 Source V2.
 >
 > The proposal is to add support for executing row-level operations
 > such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
 > execution should be the same across data sources and the best way
 to do
 > that is to implement it in Spark.
 >
 > Right now, Spark can only parse and to some extent analyze
 DELETE, UPDATE,
 > MERGE commands. Data sources that support row-level changes have
 to build
 > custom Spark extensions to execute such statements. The goal of
 this effort
 > is to come up with a flexible and easy-to-use API that will work
 across
 > data sources.
 >
 > Please also refer to:
 >
 >   - Previous discussion in dev mailing list: [DISCUSS] SPIP:
 > Row-level operations in Data Source V2
 >   <
 https://lists.apache.org/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv>
 >
 >   - JIRA: SPARK-35801 <
 https://issues.apache.org/jira/browse/SPARK-35801>
 >   - PR for handling DELETE statements:
 > 
 >
 >   - Design doc
 > <
 https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
 >
 >
 > Please vote on the SPIP for the next 72 hours:
 >
 > [ ] +1: Accept the proposal as an official SPIP
 > [ ] +0
 > [ ] -1: I don’t think this is a good idea because …
 >
 >
 -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >



 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>> --
> John Zhuge
>


Re: Issue Upgrading to 3.2

2021-11-01 Thread Wenchen Fan
To confirm: Does the error happen during view creation, or when we read the
view later?

On Mon, Nov 1, 2021 at 11:28 PM Adam Binford  wrote:

> I don't have a minimal reproduction right now but here's more relevant
> code snippets.
>
> Stacktrace:
>  org.apache.spark.sql.AnalysisException: Undefined function:
> 'ST_PolygonFromEnvelope'. This function is neither a registered temporary
> function nor a permanent function registered in the database 'default'.;
> line 2 pos 50
>   at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:1562)
>   at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1660)
>   at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1677)
>   at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$27$$anonfun$applyOrElse$114.$anonfun$applyOrElse$116(Analyzer.scala:2150)
>   at
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60)
>   at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$27$$anonfun$applyOrElse$114.applyOrElse(Analyzer.scala:2150)
>   at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$27$$anonfun$applyOrElse$114.applyOrElse(Analyzer.scala:2137)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>
> Expression definition:
> case class ST_PolygonFromEnvelope(inputExpressions: Seq[Expression])
> extends Expression with CodegenFallback with UserDataGeneratator {
>   override def nullable: Boolean = false
>
>   override def eval(input: InternalRow): Any = {
> val minX = inputExpressions(0).eval(input) match {
>   case a: Double => a
>   case b: Decimal => b.toDouble
> }
>
> val minY = inputExpressions(1).eval(input) match {
>   case a: Double => a
>   case b: Decimal => b.toDouble
> }
>
> val maxX = inputExpressions(2).eval(input) match {
>   case a: Double => a
>   case b: Decimal => b.toDouble
> }
>
> val maxY = inputExpressions(3).eval(input) match {
>   case a: Double => a
>   case b: Decimal => b.toDouble
> }
>
> var coordinates = new Array[Coordinate](5)
> coordinates(0) = new Coordinate(minX, minY)
> coordinates(1) = new Coordinate(minX, maxY)
> coordinates(2) = new Coordinate(maxX, maxY)
> coordinates(3) = new Coordinate(maxX, minY)
> coordinates(4) = coordinates(0)
> val geometryFactory = new GeometryFactory()
> val polygon = geometryFactory.createPolygon(coordinates)
> new GenericArrayData(GeometrySerializer.serialize(polygon))
>   }
>
>   override def dataType: DataType = GeometryUDT
>
>   override def children: Seq[Expression] = inputExpressions
> }
>
> Function registration:
> Catalog.expressions.foreach(f => {
>   val functionIdentifier =
> FunctionIdentifier(f.getClass.getSimpleName.dropRight(1))
>   val expressionInfo = new ExpressionInfo(
> f.getClass.getCanonicalName,
> functionIdentifier.database.orNull,
> functionIdentifier.funcName)
>   sparkSession.sessionState.functionRegistry.registerFunction(
> functionIdentifier,
> expressionInfo,
> f
>   )
> })
>
> On Mon, Nov 1, 2021 at 10:43 AM Wenchen Fan  wrote:
>
>> Hi Adam,
>>
>> Thanks for reporting this issue! Do you have the full stacktrace or a
>> code snippet to reproduce the issue at Spark side? It looks like a bug, but
>> it's not obvious to me how this bug can happen.
>>
>> Thanks,
>> Wenchen
>>
>> On Sat, Oct 30, 2021 at 1:08 AM Adam Binford  wrote:
>>
>>> Hi devs,
>>>
>>> I'm working on getting Apache Sedona upgraded to work with Spark 3.2,
>>> and ran into a weird issue I wanted to get some feedback on. The PR and
>>> current discussion can be found here:
>>> https://github.com/apache/incubator-sedona/pull/557
>>>
>>> To try to sum up in a quick way, this library defines custom expressions
>>> and registers the expressions using
>>> sparkSession.sessionState.functionRegistry.registerFunction. One of the
>>> unit tests is now failing because the function can't be found when a
>>> temporary view using that function is created in pure SQL.
>>>
>>> Examples

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-11-01 Thread Wenchen Fan
The general idea looks great. This is indeed a complicated API and we
probably need more time to evaluate the API design. It's better to commit
this work earlier so that we have more time to verify it before the 3.3
release. Maybe we can commit the group-based API first, then the
delta-based one, as the delta-based API is significantly more convoluted.

On Thu, Oct 28, 2021 at 12:53 AM L. C. Hsieh  wrote:

>
> Thanks for the initial feedback.
>
> I think previously the community is busy on the works related to Spark 3.2
> release.
> As 3.2 release was done, I'd like to bring this up to the surface again
> and seek for more discussion and feedback.
>
> Thanks.
>
> On 2021/06/25 15:49:49, huaxin gao  wrote:
> > I took a quick look at the PR and it looks like a great feature to have.
> It
> > provides unified APIs for data sources to perform the commonly used
> > operations easily and efficiently, so users don't have to implement
> > customer extensions on their own. Thanks Anton for the work!
> >
> > On Thu, Jun 24, 2021 at 9:42 PM L. C. Hsieh  wrote:
> >
> > > Thanks Anton. I'm voluntarily to be the shepherd of the SPIP. This is
> also
> > > my first time to shepherd a SPIP, so please let me know if anything I
> can
> > > improve.
> > >
> > > This looks great features and the rationale claimed by the proposal
> makes
> > > sense. These operations are getting more common and more important in
> big
> > > data workloads. Instead of building custom extensions by individual
> data
> > > sources, it makes more sense to support the API from Spark.
> > >
> > > Please provide your thoughts about the proposal and the design.
> Appreciate
> > > your feedback. Thank you!
> > >
> > > On 2021/06/24 23:53:32, Anton Okolnychyi 
> wrote:
> > > > Hey everyone,
> > > >
> > > > I'd like to start a discussion on adding support for executing
> row-level
> > > > operations such as DELETE, UPDATE, MERGE for v2 tables
> (SPARK-35801). The
> > > > execution should be the same across data sources and the best way to
> do
> > > > that is to implement it in Spark.
> > > >
> > > > Right now, Spark can only parse and to some extent analyze DELETE,
> > > UPDATE,
> > > > MERGE commands. Data sources that support row-level changes have to
> build
> > > > custom Spark extensions to execute such statements. The goal of this
> > > effort
> > > > is to come up with a flexible and easy-to-use API that will work
> across
> > > > data sources.
> > > >
> > > > Design doc:
> > > >
> > >
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
> > > >
> > > > PR for handling DELETE statements:
> > > > https://github.com/apache/spark/pull/33008
> > > >
> > > > Any feedback is more than welcome.
> > > >
> > > > Liang-Chi was kind enough to shepherd this effort. Thanks!
> > > >
> > > > - Anton
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Issue Upgrading to 3.2

2021-11-01 Thread Wenchen Fan
Hi Adam,

Thanks for reporting this issue! Do you have the full stacktrace or a code
snippet to reproduce the issue at Spark side? It looks like a bug, but it's
not obvious to me how this bug can happen.

Thanks,
Wenchen

On Sat, Oct 30, 2021 at 1:08 AM Adam Binford  wrote:

> Hi devs,
>
> I'm working on getting Apache Sedona upgraded to work with Spark 3.2, and
> ran into a weird issue I wanted to get some feedback on. The PR and current
> discussion can be found here:
> https://github.com/apache/incubator-sedona/pull/557
>
> To try to sum up in a quick way, this library defines custom expressions
> and registers the expressions using
> sparkSession.sessionState.functionRegistry.registerFunction. One of the
> unit tests is now failing because the function can't be found when a
> temporary view using that function is created in pure SQL.
>
> Examples:
> This fails with Undefined function: 'ST_PolygonFromEnvelope'. This
> function is neither a registered temporary function nor a permanent
> function registered in the database 'default'.:
>
>  spark.sql(
> """
>   |CREATE OR REPLACE TEMP VIEW pixels AS
>   |SELECT pixel, shape FROM pointtable
>   |LATERAL VIEW EXPLODE(ST_Pixelize(shape, 1000, 1000, 
> ST_PolygonFromEnvelope(-126.790180,24.863836,-64.630926,50.000))) AS pixel
> """.stripMargin)
>
>   // Test visualization partitioner
>   val zoomLevel = 2
>   val newDf = VizPartitioner(spark.table("pixels"), zoomLevel, "pixel", 
> new Envelope(0, 1000, 0, 1000))
>
>
> But both of these work fine:
>
>  val table = spark.sql(
>"""
>  |SELECT pixel, shape FROM pointtable
>  |LATERAL VIEW EXPLODE(ST_Pixelize(shape, 1000, 1000, 
> ST_PolygonFromEnvelope(-126.790180,24.863836,-64.630926,50.000))) AS pixel
> """.stripMargin)
>
>   // Test visualization partitioner
>   val zoomLevel = 2
>   val newDf = VizPartitioner(table, zoomLevel, "pixel", new Envelope(0, 
> 1000, 0, 1000))
>
> val table = spark.sql(
>"""
>  |SELECT pixel, shape FROM pointtable
>  |LATERAL VIEW EXPLODE(ST_Pixelize(shape, 1000, 1000, 
> ST_PolygonFromEnvelope(-126.790180,24.863836,-64.630926,50.000))) AS pixel
> """.stripMargin)
>   table.createOrReplaceTempView("pixels")
>
>   // Test visualization partitioner
>   val zoomLevel = 2
>   val newDf = VizPartitioner(spark.table("pixels"), zoomLevel, "pixel", 
> new Envelope(0, 1000, 0, 1000))
>
>
> So the main question is, is this a feature or a bug?
>
> --
> Adam Binford
>


Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-31 Thread Wenchen Fan
+1

On Sat, Oct 30, 2021 at 8:58 AM Cheng Su  wrote:

> +1
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
> *From: *Holden Karau 
> *Date: *Friday, October 29, 2021 at 12:41 PM
> *To: *DB Tsai 
> *Cc: *Dongjoon Hyun , Ryan Blue ,
> dev , huaxin gao 
> *Subject: *Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2
>
> +1
>
>
>
> On Fri, Oct 29, 2021 at 3:07 PM DB Tsai  wrote:
>
> +1
>
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
>
>
>
>
> On Fri, Oct 29, 2021 at 11:42 AM Ryan Blue  wrote:
>
> +1
>
>
>
> On Fri, Oct 29, 2021 at 11:06 AM huaxin gao 
> wrote:
>
> +1
>
>
>
> On Fri, Oct 29, 2021 at 10:59 AM Dongjoon Hyun 
> wrote:
>
> +1
>
> Dongjoon
>
> On 2021/10/29 17:48:59, Russell Spitzer 
> wrote:
> > +1 This is a great idea, (I have no Apache Spark voting points)
> >
> > On Fri, Oct 29, 2021 at 12:41 PM L. C. Hsieh  wrote:
> >
> > >
> > > I'll start with my +1.
> > >
> > > On 2021/10/29 17:30:03, L. C. Hsieh  wrote:
> > > > Hi all,
> > > >
> > > > I’d like to start a vote for SPIP: Storage Partitioned Join for Data
> > > Source V2.
> > > >
> > > > The proposal is to support a new type of join: storage partitioned
> join
> > > which
> > > > covers bucket join support for DataSourceV2 but is more general. The
> goal
> > > > is to let Spark leverage distribution properties reported by data
> > > sources and
> > > > eliminate shuffle whenever possible.
> > > >
> > > > Please also refer to:
> > > >
> > > >- Previous discussion in dev mailing list: [DISCUSS] SPIP: Storage
> > > Partitioned Join for Data Source V2
> > > ><
> > >
> https://lists.apache.org/thread.html/r7dc67c3db280a8b2e65855cb0b1c86b524d4e6ae1ed9db9ca12cb2e6%40%3Cdev.spark.apache.org%3E
> 
> > > >
> > > >.
> > > >- JIRA: SPARK-37166 <
> > > https://issues.apache.org/jira/browse/SPARK-37166>
> > > >- Design doc <
> > >
> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> >
> > >
> > > >
> > > > Please vote on the SPIP for the next 72 hours:
> > > >
> > > > [ ] +1: Accept the proposal as an official SPIP
> > > > [ ] +0
> > > > [ ] -1: I don’t think this is a good idea because …
> > > >
> > > > -
> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
> --
>
> Ryan Blue
>
> Tabular
>
> --
>
> Twitter: https://twitter.com/holdenkarau
>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-28 Thread Wenchen Fan
Thanks for the explanation! It makes sense to always resolve the logical
transforms to concrete implementations, and check the concrete
implementations to decide compatible partitions. We can discuss more
details in the PR later.
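
A minimal sketch of the compatibility check being discussed, assuming the
transforms on both sides have already been resolved to BoundFunction
implementations (the helper name below is made up):

import org.apache.spark.sql.connector.catalog.functions.BoundFunction

// Two sources' bucket transforms can only be treated as producing compatible
// partitions when the resolved implementations report the same canonical name,
// i.e. the same underlying hash function.
def compatibleBucketFunctions(left: BoundFunction, right: BoundFunction): Boolean =
  left.canonicalName() == right.canonicalName()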

On Thu, Oct 28, 2021 at 4:14 AM Ryan Blue  wrote:

> The transform expressions in v2 are logical, not concrete implementations.
> Even days may have different implementations -- the only expectation is
> that the partitions are day-sized. For example, you could use a transform
> that splits days at UTC 00:00, or uses some other day boundary.
>
> Because the expressions are logical, we need to resolve them to
> implementations at some point, like Chao outlines. We can do that using a
> FunctionCatalog, although I think it's worth considering adding an
> interface so that a transform from a Table can be converted into a
> `BoundFunction` directly. That is easier than defining a way for Spark to
> query the function catalog.
>
> In any case, I'm sure it's easy to understand how this works once you get
> a concrete implementation.
>
> On Wed, Oct 27, 2021 at 9:35 AM Wenchen Fan  wrote:
>
>> `BucketTransform` is a builtin partition transform in Spark, instead of a
>> UDF from `FunctionCatalog`. Will Iceberg use UDF from `FunctionCatalog` to
>> represent its bucket transform, or use the Spark builtin `BucketTransform`?
>> I'm asking this because other v2 sources may also use the builtin
>> `BucketTransform` but use a different bucket hash function. Or we can
>> clearly define the bucket hash function of the builtin `BucketTransform` in
>> the doc.
>>
>> On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue  wrote:
>>
>>> Two v2 sources may return different bucket IDs for the same value, and
>>> this breaks the phase 1 split-wise join.
>>>
>>> This is why the FunctionCatalog included a canonicalName method (docs
>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java#L81-L96>).
>>> That method returns an identifier that can be used to compare whether two
>>> bucket function instances are the same.
>>>
>>>
>>>1. Can we apply this idea to partitioned file source tables
>>>(non-bucketed) as well?
>>>
>>> What do you mean here? The design doc discusses transforms like days(ts)
>>> that can be supported in the future. Is that what you’re asking about? Or
>>> are you referring to v1 file sources? I think the goal is to support v2,
>>> since v1 doesn’t have reliable behavior.
>>>
>>> Note that the initial implementation goal is to support bucketing since
>>> that’s an easier case because both sides have the same number of
>>> partitions. More complex storage-partitioned joins can be implemented later.
>>>
>>>
>>>1. What if the table has many partitions? Shall we apply certain
>>>join algorithms in the phase 1 split-wise join as well? Or even launch a
>>>Spark job to do so?
>>>
>>> I think that this proposal opens up a lot of possibilities, like what
>>> you’re suggesting here. It is a bit like AQE. We’ll need to come up with
>>> heuristics for choosing how and when to use storage partitioning in joins.
>>> As I said above, bucketing is a great way to get started because it fills
>>> an existing gap. More complex use cases can be supported over time.
>>>
>>> Ryan
>>>
>>> On Wed, Oct 27, 2021 at 9:08 AM Wenchen Fan  wrote:
>>>
>>>> IIUC, the general idea is to let each input split report its partition
>>>> value, and Spark can perform the join in two phases:
>>>> 1. join the input splits from left and right tables according to their
>>>> partitions values and join keys, at the driver side.
>>>> 2. for each joined input splits pair (or a group of splits), launch a
>>>> Spark task to join the rows.
>>>>
>>>> My major concern is about how to define "compatible partitions". Things
>>>> like `days(ts)` are straightforward: the same timestamp value always
>>>> results in the same partition value, in whatever v2 sources. `bucket(col,
>>>> num)` is tricky, as Spark doesn't define the bucket hash function. Two v2
>>>> sources may return different bucket IDs for the same value, and this breaks
>>>> the phase 1 split-wise join.
>>>>
>>>> And two questions for further improvements:
>>>> 1. Can we apply this idea to partitioned file source tables
>>>> (non-buckete

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
`BucketTransform` is a builtin partition transform in Spark, instead of a
UDF from `FunctionCatalog`. Will Iceberg use UDF from `FunctionCatalog` to
represent its bucket transform, or use the Spark builtin `BucketTransform`?
I'm asking this because other v2 sources may also use the builtin
`BucketTransform` but use a different bucket hash function. Or we can
clearly define the bucket hash function of the builtin `BucketTransform` in
the doc.

On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue  wrote:

> Two v2 sources may return different bucket IDs for the same value, and
> this breaks the phase 1 split-wise join.
>
> This is why the FunctionCatalog included a canonicalName method (docs
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java#L81-L96>).
> That method returns an identifier that can be used to compare whether two
> bucket function instances are the same.
>
>
>1. Can we apply this idea to partitioned file source tables
>(non-bucketed) as well?
>
> What do you mean here? The design doc discusses transforms like days(ts)
> that can be supported in the future. Is that what you’re asking about? Or
> are you referring to v1 file sources? I think the goal is to support v2,
> since v1 doesn’t have reliable behavior.
>
> Note that the initial implementation goal is to support bucketing since
> that’s an easier case because both sides have the same number of
> partitions. More complex storage-partitioned joins can be implemented later.
>
>
>1. What if the table has many partitions? Shall we apply certain join
>algorithms in the phase 1 split-wise join as well? Or even launch a Spark
>job to do so?
>
> I think that this proposal opens up a lot of possibilities, like what
> you’re suggesting here. It is a bit like AQE. We’ll need to come up with
> heuristics for choosing how and when to use storage partitioning in joins.
> As I said above, bucketing is a great way to get started because it fills
> an existing gap. More complex use cases can be supported over time.
>
> Ryan
>
> On Wed, Oct 27, 2021 at 9:08 AM Wenchen Fan  wrote:
>
>> IIUC, the general idea is to let each input split report its partition
>> value, and Spark can perform the join in two phases:
>> 1. join the input splits from left and right tables according to their
>> partitions values and join keys, at the driver side.
>> 2. for each joined input splits pair (or a group of splits), launch a
>> Spark task to join the rows.
>>
>> My major concern is about how to define "compatible partitions". Things
>> like `days(ts)` are straightforward: the same timestamp value always
>> results in the same partition value, in whatever v2 sources. `bucket(col,
>> num)` is tricky, as Spark doesn't define the bucket hash function. Two v2
>> sources may return different bucket IDs for the same value, and this breaks
>> the phase 1 split-wise join.
>>
>> And two questions for further improvements:
>> 1. Can we apply this idea to partitioned file source tables
>> (non-bucketed) as well?
>> 2. What if the table has many partitions? Shall we apply certain join
>> algorithms in the phase 1 split-wise join as well? Or even launch a Spark
>> job to do so?
>>
>> Thanks,
>> Wenchen
>>
>> On Wed, Oct 27, 2021 at 3:08 AM Chao Sun  wrote:
>>
>>> Thanks Cheng for the comments.
>>>
>>> > Is migrating Hive table read path to data source v2, being a
>>> prerequisite of this SPIP
>>>
>>> Yes, this SPIP only aims at DataSourceV2, so obviously it will help if
>>> Hive eventually moves to use V2 API. With that said, I think some of the
>>> ideas could be useful for V1 Hive support as well. For instance, with the
>>> newly proposed logic to compare whether output partitionings from both
>>> sides of a join operator are compatible, we can have HiveTableScanExec to
>>> report a different partitioning other than HashPartitioning, and
>>> EnsureRequirements could potentially recognize that and therefore avoid
>>> shuffle if both sides report the same compatible partitioning. In addition,
>>> SPARK-35703, which is part of the SPIP, is also useful in that it relaxes
>>> the constraint for V1 bucket join so that the join keys do not necessarily
>>> be identical to the bucket keys.
>>>
>>> > Would aggregate work automatically after the SPIP?
>>>
>>> Yes it will work as before. This case is already supported by
>>> DataSourcePartitioning in V2 (see SPARK-22389).
>>>
>>> > Any major use cases in

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
IIUC, the general idea is to let each input split report its partition
value, and Spark can perform the join in two phases:
1. join the input splits from left and right tables according to their
partitions values and join keys, at the driver side.
2. for each joined input splits pair (or a group of splits), launch a Spark
task to join the rows.
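
As a purely illustrative sketch of phase 1 (Split and its fields are invented;
this is not Spark code), the driver groups input splits by their reported
partition values and pairs them up; each resulting pair then becomes one
phase-2 join task:

// Illustrative only: inner-join style pairing of input splits by partition value.
case class Split(partitionValue: String, splitId: Int)

def planSplitWiseJoin(
    left: Seq[Split],
    right: Seq[Split]): Seq[(Seq[Split], Seq[Split])] = {
  val rightByValue = right.groupBy(_.partitionValue)
  left.groupBy(_.partitionValue).toSeq.collect {
    case (value, leftSplits) if rightByValue.contains(value) =>
      (leftSplits, rightByValue(value))
  }
}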

My major concern is about how to define "compatible partitions". Things
like `days(ts)` are straightforward: the same timestamp value always
results in the same partition value, in whatever v2 sources. `bucket(col,
num)` is tricky, as Spark doesn't define the bucket hash function. Two v2
sources may return different bucket IDs for the same value, and this breaks
the phase 1 split-wise join.

And two questions for further improvements:
1. Can we apply this idea to partitioned file source tables (non-bucketed)
as well?
2. What if the table has many partitions? Shall we apply certain join
algorithms in the phase 1 split-wise join as well? Or even launch a Spark
job to do so?

Thanks,
Wenchen

On Wed, Oct 27, 2021 at 3:08 AM Chao Sun  wrote:

> Thanks Cheng for the comments.
>
> > Is migrating Hive table read path to data source v2, being a
> prerequisite of this SPIP
>
> Yes, this SPIP only aims at DataSourceV2, so obviously it will help if
> Hive eventually moves to use V2 API. With that said, I think some of the
> ideas could be useful for V1 Hive support as well. For instance, with the
> newly proposed logic to compare whether output partitionings from both
> sides of a join operator are compatible, we can have HiveTableScanExec to
> report a different partitioning other than HashPartitioning, and
> EnsureRequirements could potentially recognize that and therefore avoid
> shuffle if both sides report the same compatible partitioning. In addition,
> SPARK-35703, which is part of the SPIP, is also useful in that it relaxes
> the constraint for V1 bucket join so that the join keys do not necessarily
> be identical to the bucket keys.
>
> > Would aggregate work automatically after the SPIP?
>
> Yes it will work as before. This case is already supported by
> DataSourcePartitioning in V2 (see SPARK-22389).
>
> > Any major use cases in mind except Hive bucketed table?
>
> Our first use case is Apache Iceberg. In addition to that we also want to
> add the support for Spark's built-in file data sources.
>
> Thanks,
> Chao
>
> On Tue, Oct 26, 2021 at 10:34 AM Cheng Su  wrote:
>
>> +1 for this. This is exciting movement to efficiently read bucketed table
>> from other systems (Hive, Trino & Presto)!
>>
>>
>>
>> Still looking at the details but having some early questions:
>>
>>
>>
>>1. Is migrating Hive table read path to data source v2, being a
>>prerequisite of this SPIP?
>>
>>
>>
>> Hive table read path is currently a mix of data source v1 (for Parquet &
>> ORC file format only), and legacy Hive code path (HiveTableScanExec). In
>> the SPIP, I am seeing we only make change for data source v2, so wondering
>> how this would work with existing Hive table read path. In addition, just
>> FYI, supporting writing Hive bucketed table is merged in master recently (
>> SPARK-19256 <https://issues.apache.org/jira/browse/SPARK-19256> has
>> details).
>>
>>
>>
>>1. Would aggregate work automatically after the SPIP?
>>
>>
>>
>> Another major benefit for having bucketed table, is to avoid shuffle
>> before aggregate. Just want to bring to our attention that it would be
>> great to consider aggregate as well when doing this proposal.
>>
>>
>>
>>1. Any major use cases in mind except Hive bucketed table?
>>
>>
>>
>> Just curious if there’s any other use cases we are targeting as part of
>> SPIP.
>>
>>
>>
>> Thanks,
>>
>> Cheng Su
>>
>>
>>
>>
>>
>>
>>
>> *From: *Ryan Blue 
>> *Date: *Tuesday, October 26, 2021 at 9:39 AM
>> *To: *John Zhuge 
>> *Cc: *Chao Sun , Wenchen Fan ,
>> Cheng Su , DB Tsai , Dongjoon Hyun <
>> dongjoon.h...@gmail.com>, Hyukjin Kwon , Wenchen
>> Fan , angers zhu , dev <
>> dev@spark.apache.org>, huaxin gao 
>> *Subject: *Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source
>> V2
>>
>> Instead of commenting on the doc, could we keep discussion here on the
>> dev list please? That way more people can follow it and there is more room
>> for discussion. Comment threads have a very small area and easily become
>> hard to follow.
>>
>>
>>
>> Ryan
>>
>>
>>
>> On T

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Wenchen Fan
+1 to this SPIP and nice writeup of the design doc!

Can we open comment permission in the doc so that we can discuss details
there?

On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon  wrote:

> Seems making sense to me.
>
> Would be great to have some feedback from people such as @Wenchen Fan
>  @Cheng Su  @angers zhu
> .
>
>
> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun 
> wrote:
>
>> +1 for this SPIP.
>>
>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao 
>> wrote:
>>
>>> +1. Thanks for lifting the current restrictions on bucket join and
>>> making this more generalized.
>>>
>>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
>>>
>>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>>> point!
>>>>
>>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>>>>
>>>>> +1 on this SPIP.
>>>>>
>>>>> This is a more generalized version of bucketed tables and bucketed
>>>>> joins which can eliminate very expensive data shuffles when joins, and
>>>>> many users in the Apache Spark community have wanted this feature for
>>>>> a long time!
>>>>>
>>>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>>>> it as a new feature in Spark 3.3
>>>>>
>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>
>>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>>> storage partitioned join which covers bucket join support for DataSourceV2
>>>>> but is more general. The goal is to let Spark leverage distribution
>>>>> properties reported by data sources and eliminate shuffle whenever 
>>>>> possible.
>>>>> >
>>>>> > Design doc:
>>>>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>>> (includes a POC link at the end)
>>>>> >
>>>>> > We'd like to start a discussion on the doc and any feedback is
>>>>> welcome!
>>>>> >
>>>>> > Thanks,
>>>>> > Chao
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Wenchen Fan
Yeah, the file naming is a bit confusing; we can fix it in the next release.
"3.2" actually means 3.2 or higher, so it's not a big deal, I think.

Congrats and thanks!

On Wed, Oct 20, 2021 at 3:44 AM Jungtaek Lim 
wrote:

> Thanks to Gengliang for driving this huge release!
>
> On Wed, Oct 20, 2021 at 1:50 AM Dongjoon Hyun 
> wrote:
>
>> Thank you so much, Gengliang and all!
>>
>> Dongjoon.
>>
>> On Tue, Oct 19, 2021 at 8:48 AM Xiao Li  wrote:
>>
>>> Thank you, Gengliang!
>>>
>>> Congrats to our community and all the contributors!
>>>
>>> Xiao
>>>
>>> On Tue, Oct 19, 2021 at 8:26 AM Henrik Peng  wrote:
>>>
 Congrats and thanks!


 On Tue, Oct 19, 2021 at 10:16 PM Gengliang Wang wrote:

> Hi all,
>
> Apache Spark 3.2.0 is the third release of the 3.x line. With
> tremendous contribution from the open-source community, this release
> managed to resolve in excess of 1,700 Jira tickets.
>
> We'd like to thank our contributors and users for their contributions
> and early feedback to this release. This release would not have been
> possible without you.
>
> To download Spark 3.2.0, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-0.html
>



Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-10 Thread Wenchen Fan
+1

On Sat, Oct 9, 2021 at 2:36 PM angers zhu  wrote:

> +1 (non-binding)
>
> On Sat, Oct 9, 2021 at 2:06 PM Cheng Pan  wrote:
>
>> +1 (non-binding)
>>
>> Integration test passed[1] with my project[2].
>>
>> [1]
>> https://github.com/housepower/spark-clickhouse-connector/runs/3834335017
>> [2] https://github.com/housepower/spark-clickhouse-connector
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On Sat, Oct 9, 2021 at 2:01 PM Ye Zhou  wrote:
>>
>>> +1 (non-binding).
>>>
>>> Run Maven build, tested within our YARN cluster, in client or cluster
>>> mode, with push-based shuffle enabled/disabled, and shuffling a large
>>> amount of data. Applications ran successfully with expected shuffle
>>> behavior.
>>>
>>> On Fri, Oct 8, 2021 at 10:06 PM sarutak  wrote:
>>>
 +1

 I think no critical issue left.
 Thank you Gengliang.

 Kousuke

 > +1
 >
 > Looks good.
 >
 > Liang-Chi
 >
 > On 2021/10/08 16:16:12, Kent Yao  wrote:
 >> +1 (non-binding)
 >>
 >> Kent Yao
 >> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
 >> a spark enthusiast
 >> kyuubi (https://github.com/yaooqinn/kyuubi) is a unified multi-tenant JDBC
 >> interface for large-scale data processing and analytics, built on top of
 >> Apache Spark (http://spark.apache.org/).
 >> spark-authorizer (https://github.com/yaooqinn/spark-authorizer): A Spark
 >> SQL extension which provides SQL Standard Authorization for Apache Spark.
 >> spark-postgres (https://github.com/yaooqinn/spark-postgres): A library for
 >> reading data from and transferring data to Postgres / Greenplum with Spark
 >> SQL and

Re: [SQL] When SQLConf vals gets own accessor defs?

2021-09-06 Thread Wenchen Fan
I think SQLConf doesn't need defs anymore. In the beginning, SQLConf lived
in sql/core, so we had to add defs for code in sql/catalyst to be able to
access configs. Now that SQLConf is in sql/catalyst (it was moved a few
years ago), a def is only needed when there is some special logic beyond
simply reading the config.
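
To make the distinction concrete, here is a rough sketch of the two pieces
(the config name is invented, not a real entry):

  // In object SQLConf: define the entry; code in sql/catalyst can read it
  // directly with conf.getConf(SOME_FLAG), no def needed.
  val SOME_FLAG = buildConf("spark.sql.someFlag")
    .booleanConf
    .createWithDefault(false)

  // In class SQLConf: only add a def when it does more than a plain read,
  // e.g. normalizing or combining values. A trivial accessor like this is
  // what we no longer need:
  def someFlag: Boolean = getConf(SOME_FLAG)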

On Fri, Sep 3, 2021 at 6:54 PM Jacek Laskowski  wrote:

> Hi,
>
> Just found something I'd consider an inconsistency in how SQLConf
> constants (vals) get their own accessor method for the current value (defs).
>
> I thought that internal config prop vals might not have defs (and that
> would make sense) but
> DYNAMIC_PARTITION_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO [1]
> (with SQLConf.dynamicPartitionPruningPruningSideExtraFilterRatio [2]) seems
> to contradict it.
>
> On the other hand, ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD [3], which is new
> in 3.2.0 and seems important, does not have a corresponding def to access
> the current value. Why?
>
> [1]
> https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L344
> [2]
> https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3532
> [3]
> https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L638
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>

