Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Chao Sun
+1

On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:

> Hi all,
>
> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>
> Please also refer to:
>
>- Discussion thread:
> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
> Thank you!
>
> Liang-Chi Hsieh
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Chao Sun
+1

On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:

> +1 for next Monday.
>
> We can do more previews when the other features are ready for preview.
>
> On Wed, May 1, 2024 at 08:46, Tathagata Das  wrote:
>
>> Next week sounds great! Thank you Wenchen!
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Hey all,

 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.

 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.

 Thanks!


 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config
> and finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can treat
> the Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>> 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are 
>> demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my
>> understanding is that we want to release the feature to 4.0.0, but there
>> are several remaining works to be done. While the tentative timeline for
>> releasing is June 2024, what would be the tentative timeline for the RC 
>> cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June
>> 2024), and I think it's time to prepare for it and discuss the ongoing
>> projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Chao Sun
+1.

This feature is very helpful for guarding against correctness issues, such
as null results due to invalid input or math overflows. It’s been there for
a while now and it’s a good time to enable it by default as Spark enters
the next major release.
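
To make that concrete, here is a minimal sketch (not from the original
thread) of the config this vote flips, using the spark.sql.ansi.enabled
setting; the exact error text varies by Spark version:

$ bin/spark-sql --conf spark.sql.ansi.enabled=false -e "SELECT 1/0"
NULL
$ bin/spark-sql --conf spark.sql.ansi.enabled=true -e "SELECT 1/0"
[DIVIDE_BY_ZERO] Division by zero. ...

With ANSI mode off, the invalid division silently returns NULL; with it on,
the query fails fast instead of quietly producing a null result.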

On Sat, Apr 13, 2024 at 3:27 PM Dongjoon Hyun  wrote:

> I'll start from my +1.
>
> Dongjoon.
>
> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
> > The technical scope is defined in the following PR which is
> > one line of code change and one line of migration guide.
> >
> > - DISCUSSION:
> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
> > - PR: https://github.com/apache/spark/pull/46013
> >
> > The vote is open until April 17th 1AM (PST) and passes
> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Use ANSI SQL mode by default
> > [ ] -1 Do not use ANSI SQL mode by default because ...
> >
> > Thank you in advance.
> >
> > Dongjoon
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread Chao Sun
+1

On Fri, Apr 12, 2024 at 4:23 PM Xiao Li 
wrote:

> +1
>
>
>
>
> On Fri, Apr 12, 2024 at 14:30 bo yang  wrote:
>
>> +1
>>
>
>> On Fri, Apr 12, 2024 at 12:34 PM huaxin gao 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1

 Thank you!

 I hope we can customize `dev/merge_spark_pr.py` script per repository
 after this PR.

 Dongjoon.

 On 2024/04/12 03:28:36 "L. C. Hsieh" wrote:
 > Hi all,
 >
 > Thanks for all discussions in the thread of "Versioning of Spark
 > Operator":
 https://lists.apache.org/thread/zhc7nb2sxm8jjxdppq8qjcmlf4rcsthh
 >
 > I would like to create this vote to get the consensus for versioning
 > of the Spark Kubernetes Operator.
 >
 > The proposal is to use an independent versioning for the Spark
 > Kubernetes Operator.
 >
 > Please vote on adding new `Versions` in Apache Spark JIRA which can be
 > used for places like "Fix Version/s" in the JIRA tickets of the
 > operator.
 >
 > The new `Versions` will be `kubernetes-operator-` prefix, for example
 > `kubernetes-operator-0.1.0`.
 >
 > The vote is open until April 15th 1AM (PST) and passes if a majority
 > +1 PMC votes are cast, with a minimum of 3 +1 votes.
 >
 > [ ] +1 Adding the new `Versions` for Spark Kubernetes Operator in
 > Apache Spark JIRA
 > [ ] -1 Do not add the new `Versions` because ...
 >
 > Thank you.
 >
 >
 > Note that this is not a SPIP vote and also not a release vote. I don't
 > find similar votes in previous threads. This is made similarly like a
 > SPIP or a release vote. So I think it should be okay. Please correct
 > me if this vote format is not good for you.
 >
 > -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Chao Sun
+1

On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon  wrote:

> Oh I didn't send the discussion thread out as it's pretty simple,
> non-invasive and the discussion was sort of done as part of the Spark
> Connect initial discussion ..
>
> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
> wrote:
>
>>
>> Can you point me to the SPIP’s discussion thread please ?
>> I was not able to find it, but I was on vacation, and so might have
>> missed this …
>>
>>
>> Regards,
>> Mridul
>>
>
>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>  wrote:
>>
>>> +1
>>>
>>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>>> wrote:
>>>
 Hi all,

 I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
 Connect)

 JIRA 
 Prototype 
 SPIP doc
 

 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks.

>>>


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-12 Thread Chao Sun
+1

On Tue, Mar 12, 2024 at 8:03 AM Xiao Li 
wrote:

> +1
>
> On Tue, Mar 12, 2024 at 6:09 AM Holden Karau 
> wrote:
>
>> +1
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>>
>> On Mon, Mar 11, 2024 at 7:44 PM Reynold Xin 
>> wrote:
>>
>>> +1
>>>
>>>
>>> On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 +1 (non-binding), thanks Gengliang!

 On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang 
 wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Structured Logging Framework for
> Apache Spark
>
> References:
>
>- JIRA ticket 
>- SPIP doc
>
> 
>- Discussion thread
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Gengliang Wang
>

>
> --
>
>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich,

> Also have you got some benchmark results from your tests that you can
possibly share?

We only have some partial benchmark results internally so far. Once shuffle
and better memory management have been introduced, we plan to publish the
benchmark results (at least TPC-H) in the repo.

> Compared to standard Spark, what kind of performance gains can be
expected with Comet?

Currently, users could benefit from Comet in a few areas:
- Parquet read: a few improvements have been made against reading from S3
in particular, so users can expect better scan performance in this scenario
- Hash aggregation
- Columnar shuffle
- Decimals (Java's BigDecimal is pretty slow)
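
For anyone who wants to experiment, a minimal sketch of wiring the plugin
into a plain spark-shell; the jar path is a placeholder and the config keys
follow the repo README at the time of writing, so double-check them there:

$ spark-shell \
    --jars $COMET_JAR \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true

Here $COMET_JAR stands for the locally built Comet jar.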

> Can one use Comet on k8s in conjunction with something like a Volcano
addon?

I think so. Comet is mostly orthogonal to the Spark scheduler framework.

Chao






On Fri, Feb 16, 2024 at 5:39 AM Mich Talebzadeh 
wrote:

> Hi Chao,
>
> As a cool feature
>
>
>- Compared to standard Spark, what kind of performance gains can be
>expected with Comet?
>-  Can one use Comet on k8s in conjunction with something like a
>Volcano addon?
>
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge, sourced from both personal expertise and other resources but of
> course cannot be guaranteed. It is essential to note that, as with any
> advice, one verified and tested result holds more weight than a thousand
> expert opinions.
>
>
> On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:
>
>> Hi all,
>>
>> We are very happy to announce that Project Comet, a plugin to
>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>> has now been open sourced under the Apache Arrow umbrella. Please
>> check the project repo
>> https://github.com/apache/arrow-datafusion-comet for more details if
>> you are interested. We'd love to collaborate with people from the open
>> source community who share similar goals.
>>
>> Thanks,
>> Chao
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
Hi Praveen,

We will add a "Getting Started" section in the README soon, but basically
comet-spark-shell
<https://github.com/apache/arrow-datafusion-comet/blob/main/bin/comet-spark-shell>
in
the repo should provide a basic tool to build Comet and launch a Spark
shell with it.
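
Until that section lands, a hypothetical end-to-end flow, assuming the
script works from a fresh checkout:

$ git clone https://github.com/apache/arrow-datafusion-comet
$ cd arrow-datafusion-comet
$ ./bin/comet-spark-shell    # builds Comet as needed, then launches spark-shell with it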

Note that we haven't open sourced several features yet including shuffle
support, which the aggregate operation depends on. Please stay tuned!

Chao


On Wed, Feb 14, 2024 at 2:44 PM praveen sinha 
wrote:

> Hi Chao,
>
> Is there any example app/gist/repo which can help me use this plugin. I
> wanted to try out some realtime aggregate performance on top of parquet and
> spark dataframes.
>
> Thanks and Regards
> Praveen
>
>
> On Wed, Feb 14, 2024 at 9:20 AM Chao Sun  wrote:
>
>> > Out of interest what are the differences in the approach between this
>> and Gluten?
>>
>> Overall they are similar, although Gluten supports multiple backends
>> including Velox and Clickhouse. One major difference is (obviously)
>> Comet is based on DataFusion and Arrow, and written in Rust, while
>> Gluten is mostly C++.
>> I haven't looked very deep into Gluten yet, but there could be other
>> differences such as how strictly the engine follows Spark's semantics,
>> table format support (Iceberg, Delta, etc), fallback mechanism
>> (coarse-grained fallback on stage level or more fine-grained fallback
>> within stages), UDF support (Comet hasn't started on this yet),
>> shuffle support, memory management, etc.
>>
>> Both engines are backed by very strong and vibrant open source
>> communities (Velox, Clickhouse, Arrow & DataFusion) so it's very
>> exciting to see how the projects will grow in future.
>>
>> Best,
>> Chao
>>
>> On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
>> >
>> > Congratulations! Excellent work!
>> >
>> > On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>> >>
>> >> Absolutely thrilled to see the project going open-source! Huge
>> congrats to Chao and the entire team on this milestone!
>> >>
>> >> Yufei
>> >>
>> >>
>> >> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> We are very happy to announce that Project Comet, a plugin to
>> >>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>> >>> has now been open sourced under the Apache Arrow umbrella. Please
>> >>> check the project repo
>> >>> https://github.com/apache/arrow-datafusion-comet for more details if
>> >>> you are interested. We'd love to collaborate with people from the open
>> >>> source community who share similar goals.
>> >>>
>> >>> Thanks,
>> >>> Chao
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>> >
>> >
>> > --
>> > John Zhuge
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
> Out of interest what are the differences in the approach between this and 
> Gluten?

Overall they are similar, although Gluten supports multiple backends
including Velox and Clickhouse. One major difference is (obviously)
Comet is based on DataFusion and Arrow, and written in Rust, while
Gluten is mostly C++.
I haven't looked very deep into Gluten yet, but there could be other
differences such as how strictly the engine follows Spark's semantics,
table format support (Iceberg, Delta, etc), fallback mechanism
(coarse-grained fallback on stage level or more fine-grained fallback
within stages), UDF support (Comet hasn't started on this yet),
shuffle support, memory management, etc.

Both engines are backed by very strong and vibrant open source
communities (Velox, Clickhouse, Arrow & DataFusion) so it's very
exciting to see how the projects will grow in future.

Best,
Chao

On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
>
> Congratulations! Excellent work!
>
> On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>>
>> Absolutely thrilled to see the project going open-source! Huge congrats to 
>> Chao and the entire team on this milestone!
>>
>> Yufei
>>
>>
>> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>>>
>>> Hi all,
>>>
>>> We are very happy to announce that Project Comet, a plugin to
>>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>>> has now been open sourced under the Apache Arrow umbrella. Please
>>> check the project repo
>>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> you are interested. We'd love to collaborate with people from the open
>>> source community who share similar goals.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>
>
> --
> John Zhuge

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Chao Sun
Hi all,

We are very happy to announce that Project Comet, a plugin to
accelerate Spark query execution via leveraging DataFusion and Arrow,
has now been open sourced under the Apache Arrow umbrella. Please
check the project repo
https://github.com/apache/arrow-datafusion-comet for more details if
you are interested. We'd love to collaborate with people from the open
source community who share similar goals.

Thanks,
Chao

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread Chao Sun
+1

On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh  wrote:
>
> +1
>
> On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
> >
> > +1 (non-binding)
> >
> > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh  wrote:
> >>
> >> Hi all,
> >>
> >> I’d like to start a vote for SPIP: An Official Kubernetes Operator for
> >> Apache Spark.
> >>
> >> The proposal is to develop an official Java-based Kubernetes operator
> >> for Apache Spark to automate the deployment and simplify the lifecycle
> >> management and orchestration of Spark applications and Spark clusters
> >> on k8s at prod scale.
> >>
> >> This aims to reduce the learning curve and operation overhead for
> >> Spark users so they can concentrate on core Spark logic.
> >>
> >> Please also refer to:
> >>
> >>- Discussion thread:
> >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
> >>- SPIP doc: 
> >> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> >>
> >>
> >> Please vote on the SPIP for the next 72 hours:
> >>
> >> [ ] +1: Accept the proposal as an official SPIP
> >> [ ] +0
> >> [ ] -1: I don’t think this is a good idea because …
> >>
> >>
> >> Thank you!
> >>
> >> Liang-Chi Hsieh
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> > --
> >
> > Zhou, Ye  周晔
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Chao Sun
+1


On Thu, Nov 9, 2023 at 6:36 PM Xiao Li  wrote:
>
> +1
>
> On Thu, Nov 9, 2023 at 16:53, huaxin gao  wrote:
>>
>> +1
>>
>> On Thu, Nov 9, 2023 at 3:14 PM DB Tsai  wrote:
>>>
>>> +1
>>>
>>> To be completely transparent, I am employed in the same department as Zhou 
>>> at Apple.
>>>
>>> I support this proposal, provided that we witness community adoption 
>>> following the release of the Flink Kubernetes operator, streamlining Flink 
>>> deployment on Kubernetes.
>>>
>>> A well-maintained official Spark Kubernetes operator is essential for our 
>>> Spark community as well.
>>>
>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>
>>> On Nov 9, 2023, at 12:05 PM, Zhou Jiang  wrote:
>>>
>>> Hi Spark community,
>>>
>>> I'm reaching out to initiate a conversation about the possibility of 
>>> developing a Java-based Kubernetes operator for Apache Spark. Following the 
>>> operator pattern 
>>> (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark 
>>> users may manage applications and related components seamlessly using 
>>> native tools like kubectl. The primary goal is to simplify the Spark user 
>>> experience on Kubernetes, minimizing the learning curve and operational 
>>> complexities and therefore enable users to focus on the Spark application 
>>> development.
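>>>
>>> As a rough sketch of that user experience (the resource kind, file name,
>>> and fields below are hypothetical placeholders, since the actual CRD
>>> schema is defined in the SPIP doc rather than in this message):
>>>
>>> $ kubectl apply -f spark-pi.yaml         # submit a hypothetical SparkApplication resource
>>> $ kubectl get sparkapplications          # check status like any other k8s resource
>>> $ kubectl delete sparkapplication spark-pi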
>>> Although there are several open-source Spark on Kubernetes operators 
>>> available, none of them are officially integrated into the Apache Spark 
>>> project. As a result, these operators may lack active support and 
>>> development for new features. Within this proposal, our aim is to introduce 
>>> a Java-based Spark operator as an integral component of the Apache Spark 
>>> project. This solution has been employed internally at Apple for multiple 
>>> years, operating millions of executors in real production environments. The 
>>> use of Java in this solution is intended to accommodate a wider user and 
>>> contributor audience, especially those who are familiar with Scala.
>>> Ideally, this operator should have its dedicated repository, similar to 
>>> Spark Connect Golang or Spark Docker, allowing it to maintain a loose 
>>> connection with the Spark release cycle. This model is also followed by the 
>>> Apache Flink Kubernetes operator.
>>> We believe that this project holds the potential to evolve into a thriving 
>>> community project over the long run. A comparison can be drawn with the 
>>> Flink Kubernetes Operator: Apple has open-sourced internal Flink Kubernetes 
>>> operator, making it a part of the Apache Flink project 
>>> (https://github.com/apache/flink-kubernetes-operator). This move has gained 
>>> wide industry adoption and contributions from the community. In a mere 
>>> year, the Flink operator has garnered more than 600 stars and has attracted 
>>> contributions from over 80 contributors. This showcases the level of 
>>> community interest and collaborative momentum that can be achieved in 
>>> similar scenarios.
>>> More details can be found at SPIP doc : Spark Kubernetes Operator 
>>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>>
>>> Thanks,
>>>
>>> --
>>> Zhou JIANG
>>>
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-04 Thread Chao Sun
Congratulations!

On Wed, Oct 4, 2023 at 5:11 AM Jungtaek Lim 
wrote:

> Congrats!
>
> On Wed, Oct 4, 2023 at 5:04 PM, yangjie01  wrote:
>
>> Congratulations!
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From:* Dongjoon Hyun 
>> *Date:* Wednesday, October 4, 2023 13:04
>> *To:* Hyukjin Kwon 
>> *Cc:* Hussein Awala , Rui Wang ,
>> Gengliang Wang , Xiao Li , "
>> dev@spark.apache.org" 
>> *Subject:* Re: Welcome to Our New Apache Spark Committer and PMCs
>>
>>
>>
>> Congratulations!
>>
>>
>>
>> Dongjoon.
>>
>>
>>
>> On Tue, Oct 3, 2023 at 5:25 PM Hyukjin Kwon  wrote:
>>
>> Woohoo!
>>
>>
>>
>> On Tue, 3 Oct 2023 at 22:47, Hussein Awala  wrote:
>>
>> Congrats to all of you!
>>
>>
>>
>> On Tue 3 Oct 2023 at 08:15, Rui Wang  wrote:
>>
>> Congratulations! Well deserved!
>>
>>
>>
>> -Rui
>>
>>
>>
>>
>>
>> On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang  wrote:
>>
>> Congratulations to all! Well deserved!
>>
>>
>>
>> On Mon, Oct 2, 2023 at 10:16 PM Xiao Li  wrote:
>>
>> Hi all,
>>
>> The Spark PMC is delighted to announce that we have voted to add one new
>> committer and two new PMC members. These individuals have consistently
>> contributed to the project and have clearly demonstrated their expertise.
>>
>> New Committer:
>> - Jiaan Geng (focusing on Spark Connect and Spark SQL)
>>
>> New PMCs:
>> - Yuanjian Li
>> - Yikun Jiang
>>
>> Please join us in extending a warm welcome to them in their new roles!
>>
>> Sincerely,
>> The Spark PMC
>>
>>


Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-22 Thread Chao Sun
+1

On Thu, Jun 22, 2023 at 6:52 AM Yuming Wang  wrote:
>
> +1.
>
> On Thu, Jun 22, 2023 at 4:41 PM Jacek Laskowski  wrote:
>>
>> +1
>>
>> Builds and runs fine on Java 17, macOS.
>>
>> $ ./dev/change-scala-version.sh 2.13
>> $ mvn \
>> -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano,connect \
>> -DskipTests \
>> clean install
>>
>> $ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session 
>> SparkSession.sql'
>> ...
>> Tests passed in 28 seconds
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> "The Internals Of" Online Books
>> Follow me on https://twitter.com/jaceklaskowski
>>
>>
>>
>> On Tue, Jun 20, 2023 at 4:41 AM Dongjoon Hyun  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 3.4.1.
>>>
>>> The vote is open until June 23rd 1AM (PST) and passes if a majority +1 PMC 
>>> votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.4.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v3.4.1-rc1 (commit 
>>> 6b1ff22dde1ead51cbf370be6e48a802daae58b6)
>>> https://github.com/apache/spark/tree/v3.4.1-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1443/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
>>>
>>> The list of bug fixes going into 3.4.1 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12352874
>>>
>>> This release is using the release script of the tag v3.4.1-rc1.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.4.1?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.4.1 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>>> Version/s" = 3.4.1
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Chao Sun
+1

On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
 wrote:

> +1 (non-binding)
>
> Thank you!
> Kazu
>
>
> On Jun 12, 2023, at 11:32 AM, Holden Karau  wrote:
>
> -0
>
> I'd like to see more of a doc around what we're planning on for a 4.0
> before we pick a target release date etc. (feels like cart before the
> horse).
>
> But it's a weak preference.
>
> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:
>
>> Thanks for starting the vote.
>>
>> I do have a concern about the target release date of Spark 4.0.
>>
>>> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh  wrote:
>>
>>> +1
>>>
>>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> Dongjoon
>>> >>
>>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>>> >> >
>>> >> > The vote is open until June 16th 1AM (PST) and passes if a majority
>>> +1 PMC
>>> >> > votes are cast, with a minimum of 3 +1 votes.
>>> >> >
>>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>>> >> >
>>> >> > ===
>>> >> > Apache Spark 4.0.0 Release Plan
>>> >> > ===
>>> >> >
>>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
>>> branch.
>>> >> >
>>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>>> >> >
>>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>>> >> >
>>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>


Re: Apache Spark 3.4.1 Release?

2023-06-08 Thread Chao Sun
+1 too

On Thu, Jun 8, 2023 at 2:34 PM kazuyuki tanimura
 wrote:
>
> +1 (non-binding), Thank you Dongjoon
>
> Kazu
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [CONNECT] New Clients for Go and Rust

2023-05-25 Thread Chao Sun
+1 on separate repo too

On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun  wrote:
>
> +1 for starting on a separate repo.
>
> Dongjoon.
>
> On Thu, May 25, 2023 at 9:53 AM yangjie01  wrote:
>>
>> +1 on starting this with a separate repo.
>>
>> Which new clients can be placed in the main repo should be discussed after
>> they are mature enough.
>>
>>
>>
>> Yang Jie
>>
>>
>>
>> From: Denny Lee 
>> Date: Wednesday, May 24, 2023 21:31
>> To: Hyukjin Kwon 
>> Cc: Maciej , "dev@spark.apache.org" 
>> 
>> Subject: Re: [CONNECT] New Clients for Go and Rust
>>
>>
>>
>> +1 on separate repo allowing different APIs to run at different speeds and 
>> ensuring they get community support.
>>
>>
>>
>> On Wed, May 24, 2023 at 00:37 Hyukjin Kwon  wrote:
>>
>> I think we can just start this with a separate repo.
>> I am fine with the second option too but in this case we would have to 
>> triage which language to add into the main repo.
>>
>>
>>
>> On Fri, 19 May 2023 at 22:28, Maciej  wrote:
>>
>> Hi,
>>
>>
>>
>> Personally, I'm strongly against the second option and have some preference 
>> towards the third one (or maybe a mix of the first one and the third one).
>>
>>
>>
>> The project is already pretty large as-is and, with an extremely 
>> conservative approach towards removal of APIs, it only tends to grow over 
>> time. Making it even larger is not going to make things more maintainable 
>> and is likely to create an entry barrier for new contributors (that's 
>> similar to Jia's arguments).
>>
>>
>>
>> Moreover, we've seen quite a few different language clients over the years 
>> and all but one or two survived while none is particularly active, as far as 
>> I'm aware.  Taking responsibility for more clients, without being sure that 
>> we have resources to maintain them and there is enough community around them 
>> to make such effort worthwhile, doesn't seem like a good idea.
>>
>>
>>
>> --
>>
>> Best regards,
>>
>> Maciej Szymkiewicz
>>
>>
>>
>> Web: https://zero323.net
>>
>> PGP: A30CEF0C31A501EC
>>
>>
>>
>>
>>
>> On 5/19/23 14:57, Jia Fan wrote:
>>
>> Hi,
>>
>>
>>
>> Thanks for contribution!
>>
>> I prefer (1). There are some reason:
>>
>>
>>
>> 1. Separate repositories can maintain independent versions, different release
>> times, and faster bug-fix releases.
>>
>>
>>
>> 2. Different languages have different build tools. Putting them in one 
>> repository will make the main repository more and more complicated, and it 
>> will become extremely difficult to perform a complete build in the main 
>> repository.
>>
>>
>>
>> 3. Separate repositories make CI configuration and execution easier, and
>> the PR and commit lists will be clearer.
>>
>>
>>
>> 4. Other projects also govern different clients in separate repositories;
>> ClickHouse, for example, uses separate repositories for JDBC, ODBC, and C++. Please refer to:
>>
>> https://github.com/ClickHouse/clickhouse-java
>>
>> https://github.com/ClickHouse/clickhouse-odbc
>>
>> https://github.com/ClickHouse/clickhouse-cpp
>>
>>
>>
>> PS: I'm looking forward to the javascript connect client!
>>
>>
>>
>> Thanks Regards
>>
>> Jia Fan
>>
>>
>>
>> On Fri, May 19, 2023 at 20:03, Martin Grund  wrote:
>>
>> Hi folks,
>>
>>
>>
>> When Bo (thanks for the time and contribution) started the work on 
>> https://github.com/apache/spark/pull/41036 he started the Go client directly 
>> in the Spark repository. In the meantime, I was approached by other 
>> engineers who are willing to contribute to working on a Rust client for 
>> Spark Connect.
>>
>>
>>
>> Now one of the key questions is where should these connectors live and how 
>> we manage expectations most effectively.
>>
>>
>>
>> At the high level, there are two approaches:
>>
>>
>>
>> (1) "3rd party" (non-JVM / Python) clients should live in separate 
>> repositories owned and governed by the Apache Spark community.
>>
>>
>>
>> (2) All clients should live in the main Apache Spark repository in the 
>> `connector/connect/client` directory.
>>
>>
>>
>> (3) Non-native (Python, JVM) Spark Connect clients should not be part of the 
>> Apache Spark repository and governance rules.
>>
>>
>>
>> Before we iron out exactly how we mark these clients as experimental and
>> how we align their release process etc. with Spark, my suggestion would be to
>> get a consensus on this first question.
>>
>>
>>
>> Personally, I'm fine with (1) and (2) with a preference for (2).
>>
>>
>>
>> Would love to get feedback from other members of the community!
>>
>>
>>
>> Thanks
>>
>> Martin
>>
>>
>>
>>
>>
>>
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



hadoop-2 profile to be removed in 3.5.0

2023-04-14 Thread Chao Sun
Hi all,

Just a heads up that the `hadoop-2` profile is going to be removed in
Apache Spark 3.5.0. This has been discussed previously through this
email thread: https://lists.apache.org/thread/z4jdy9959b6zj9t726zl0zcrk4hzs0xs
and is now realized via
https://issues.apache.org/jira/browse/SPARK-42452
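
Concretely, a build that selects Hadoop 2 via the profile, e.g. something
like the first command below (a sketch for illustration), will need to move
to the default Hadoop 3 profile from 3.5.0 onward:

$ ./build/mvn -Phadoop-2 -DskipTests clean package   # goes away in 3.5.0
$ ./build/mvn -Phadoop-3 -DskipTests clean package   # the supported path going forward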

Feel free to comment if you still have any concerns.

Thanks.
Chao

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-10 Thread Chao Sun
+1 (non-binding)

On Mon, Apr 10, 2023 at 12:41 AM Ruifeng Zheng  wrote:

> +1 (non-binding)
>
> --
> Ruifeng  Zheng
> ruife...@foxmail.com
>
> 
>
>
>
> -- Original --
> *From:* "Kent Yao" ;
> *Date:* Mon, Apr 10, 2023 03:33 PM
> *To:* "Gengliang Wang";
> *Cc:* "Dongjoon Hyun";"Mridul Muralidharan"<
> mri...@gmail.com>;"L. C. Hsieh";"yangjie01"<
> yangji...@baidu.com>;"Sean Owen";"Xinrong Meng"<
> xinrong.apa...@gmail.com>;"dev";
> *Subject:* Re: [VOTE] Release Apache Spark 3.4.0 (RC7)
>
> +1 (non-binding)
>
> On Mon, Apr 10, 2023 at 15:27, Gengliang Wang  wrote:
> >
> > +1
> >
> > On Sun, Apr 9, 2023 at 3:17 PM Dongjoon Hyun 
> wrote:
> >>
> >> +1
> >>
> >> I verified the same steps like previous RCs.
> >>
> >> Dongjoon.
> >>
> >>
> >> On Sat, Apr 8, 2023 at 7:47 PM Mridul Muralidharan 
> wrote:
> >>>
> >>>
> >>> +1
> >>>
> >>> Signatures, digests, etc check out fine.
> >>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos
> -Pkubernetes
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>>
> >>> On Sat, Apr 8, 2023 at 12:13 PM L. C. Hsieh  wrote:
> 
>  +1
> 
>  Thanks Xinrong.
> 
>  On Sat, Apr 8, 2023 at 8:23 AM yangjie01  wrote:
>  >
>  > +1
>  >
>  >
>  >
 > From: Sean Owen 
 > Date: Saturday, April 8, 2023 20:27
 > To: Xinrong Meng 
 > Cc: dev 
 > Subject: Re: [VOTE] Release Apache Spark 3.4.0 (RC7)
>  >
>  >
>  >
>  > +1 form me, same result as last time.
>  >
>  >
>  >
>  > On Fri, Apr 7, 2023 at 6:30 PM Xinrong Meng <
> xinrong.apa...@gmail.com> wrote:
>  >
>  > Please vote on releasing the following candidate(RC7) as Apache
> Spark version 3.4.0.
>  >
>  > The vote is open until 11:59pm Pacific time April 12th and passes
> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>  >
>  > [ ] +1 Release this package as Apache Spark 3.4.0
>  > [ ] -1 Do not release this package because ...
>  >
>  > To learn more about Apache Spark, please see
> http://spark.apache.org/
>  >
>  > The tag to be voted on is v3.4.0-rc7 (commit
> 87a5442f7ed96b11051d8a9333476d080054e5a0):
>  > https://github.com/apache/spark/tree/v3.4.0-rc7
>  >
>  > The release files, including signatures, digests, etc. can be found
> at:
>  > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>  >
>  > Signatures used for Spark RCs can be found in this file:
>  > https://dist.apache.org/repos/dist/dev/spark/KEYS
>  >
>  > The staging repository for this release can be found at:
>  >
> https://repository.apache.org/content/repositories/orgapachespark-1441
>  >
>  > The documentation corresponding to this release can be found at:
>  > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>  >
>  > The list of bug fixes going into 3.4.0 can be found at the
> following URL:
>  > https://issues.apache.org/jira/projects/SPARK/versions/12351465
>  >
>  > This release is using the release script of the tag v3.4.0-rc7.
>  >
>  >
>  > FAQ
>  >
>  > =
>  > How can I help test this release?
>  > =
>  > If you are a Spark user, you can help us test this release by taking
>  > an existing Spark workload and running on this release candidate,
> then
>  > reporting any regressions.
>  >
>  > If you're working in PySpark you can set up a virtual env and
> install
>  > the current RC and see if anything important breaks, in the
> Java/Scala
>  > you can add the staging repository to your projects resolvers and
> test
>  > with the RC (make sure to clean up the artifact cache before/after
> so
>  > you don't end up building with an out of date RC going forward).
>  >
>  > ===
>  > What should happen to JIRA tickets still targeting 3.4.0?
>  > ===
>  > The current list of open tickets targeted at 3.4.0 can be found at:
>  > https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.4.0
>  >
>  > Committers should look at those and triage. Extremely important bug
>  > fixes, documentation, and API tweaks that impact compatibility
> should
>  > be worked on immediately. Everything else please retarget to an
>  > appropriate release.
>  >
>  > ==
>  > But my bug isn't fixed?
>  > ==
>  > In order to make timely releases, we will typically not hold the
 > release unless the bug in question is a regression from the previous
 > release. That being said, if there is something which is a regression
 > that has not been correctly targeted please ping me or a committer to
 > help target the issue.

Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Chao Sun
+1 (non-binding)

On Mon, Apr 10, 2023 at 7:07 AM yangjie01  wrote:

> +1 (non-binding)
>
>
>
> *From:* Sean Owen 
> *Date:* Monday, April 10, 2023 21:19
> *To:* Dongjoon Hyun 
> *Cc:* "dev@spark.apache.org" 
> *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)
>
>
>
> +1 from me
>
>
>
> On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun  wrote:
>
> I'll start with my +1.
>
> I verified the checksum, signatures of the artifacts, and documentation.
> Also, ran the tests with YARN and K8s modules.
>
> Dongjoon.
>
> On 2023/04/09 23:46:10 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 3.2.4.
> >
> > The vote is open until April 13th 1AM (PST) and passes if a majority +1
> PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.2.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> 
> >
> > The tag to be voted on is v3.2.4-rc1 (commit
> > 0ae10ac18298d1792828f1d59b652ef17462d76e)
> > https://github.com/apache/spark/tree/v3.2.4-rc1
> 
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-bin/
> 
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1442/
> 
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-docs/
> 
> >
> > The list of bug fixes going into 3.2.4 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12352607
> 
> >
> > This release is using the release script of the tag v3.2.4-rc1.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.2.4?
> > ===
> >
> > The current list of open tickets targeted at 3.2.4 can be found at:
> > https://issues.apache.org/jira/projects/SPARK
> 
> and search for "Target
> > Version/s" = 3.2.4
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Apache Spark 3.2.4 EOL Release?

2023-04-04 Thread Chao Sun
+1

On Tue, Apr 4, 2023 at 11:12 AM Holden Karau  wrote:

> +1
>
> On Tue, Apr 4, 2023 at 11:04 AM L. C. Hsieh  wrote:
>
>> +1
>>
>> Sounds good and thanks Dongjoon for driving this.
>>
>> On 2023/04/04 17:24:54 Dongjoon Hyun wrote:
>> > Hi, All.
>> >
>> > Since Apache Spark 3.2.0 passed RC7 vote on October 12, 2021, branch-3.2
>> > has been maintained and served well until now.
>> >
>> > - https://github.com/apache/spark/releases/tag/v3.2.0 (tagged on Oct 6,
>> > 2021)
>> > - https://lists.apache.org/thread/jslhkh9sb5czvdsn7nz4t40xoyvznlc7
>> >
>> > As of today, branch-3.2 has 62 additional patches after v3.2.3 and
>> reaches
>> > the end-of-life this month according to the Apache Spark release
>> cadence. (
>> > https://spark.apache.org/versioning-policy.html)
>> >
>> > $ git log --oneline v3.2.3..HEAD | wc -l
>> > 62
>> >
>> > With the upcoming Apache Spark 3.4, I hope the users can get a chance to
>> > have these last bits of Apache Spark 3.2.x, and I'd like to propose to
>> have
>> > Apache Spark 3.2.4 EOL Release next week and volunteer as the release
>> > manager. WDYT? Please let me know if you need more patches on
>> branch-3.2.
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [ANNOUNCE] Apache Spark 3.3.2 released

2023-02-17 Thread Chao Sun
Thanks Liang-Chi!

On Fri, Feb 17, 2023 at 1:28 AM kazuyuki tanimura
 wrote:

> Great, Thank you Liang-Chi
>
> Kazu
>
> On Feb 17, 2023, at 1:02 AM, Wanqiang Ji  wrote:
>
> Congratulations!
>
> On Fri, Feb 17, 2023 at 4:59 PM L. C. Hsieh  wrote:
>
>
> We are happy to announce the availability of Apache Spark 3.3.2!
>
> Spark 3.3.2 is a maintenance release containing stability fixes. This
> release is based on the branch-3.3 maintenance branch of Spark. We strongly
> recommend all 3.3 users to upgrade to this stable release.
>
> To download Spark 3.3.2, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-3-2.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
>
>
>


Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Chao Sun
+1

On Mon, Feb 13, 2023 at 9:20 AM L. C. Hsieh  wrote:
>
> If it is not supported in Spark 3.3.x, it looks like an improvement for
> Spark 3.4.
> For such cases we usually do not back-port. I think this is also why
> the PR was not back-ported when it was merged.
>
> I'm okay if there is consensus to back port it.
>
> On Mon, Feb 13, 2023 at 9:08 AM Sean Owen  wrote:
> >
> > Does that change change the result for Spark 3.3.x?
> > It looks like we do not support Python 3.11 in Spark 3.3.x, which is one 
> > answer to whether this should be changed now.
> > But if that's the only change that matters for Python 3.11 and makes it 
> > work, sure I think we should back-port. It doesn't necessarily block a 
> > release but if that's the case, it seems OK to me to include in a next RC.
> >
> > On Mon, Feb 13, 2023 at 10:53 AM Bjørn Jørgensen  
> > wrote:
> >>
> >> There is a fix for Python 3.11: https://github.com/apache/spark/pull/38987
> >> We should have this in more branches.
> >>
> >> On Mon, Feb 13, 2023 at 09:39, Bjørn Jørgensen
> >>  wrote:
> >>>
> >>> On Manjaro it is Python 3.10.9
> >>>
> >>> On Ubuntu it is Python 3.11.1
> >>>
> >>> On Mon, Feb 13, 2023 at 03:24, yangjie01  wrote:
> 
>  Which Python version do you use for testing? When I use the latest 
>  Python 3.11, I can reproduce similar test failures (43 tests of sql 
 module fail), but when I use Python 3.10, they succeed.
> 
> 
> 
>  YangJie
> 
> 
> 
 From: Bjørn Jørgensen 
 Date: Monday, February 13, 2023 05:09
 To: Sean Owen 
 Cc: "L. C. Hsieh" , Spark dev list 
 
 Subject: Re: [VOTE] Release Spark 3.3.2 (RC1)
> 
> 
> 
>  Tried it one more time and the same result.
> 
> 
> 
>  On another box with Manjaro
> 
>  
>  [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>  [INFO]
>  [INFO] Spark Project Parent POM ... SUCCESS 
>  [01:50 min]
>  [INFO] Spark Project Tags . SUCCESS [ 
>  17.359 s]
>  [INFO] Spark Project Sketch ... SUCCESS [ 
>  12.517 s]
>  [INFO] Spark Project Local DB . SUCCESS [ 
>  14.463 s]
>  [INFO] Spark Project Networking ... SUCCESS 
>  [01:07 min]
>  [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  
>  9.013 s]
>  [INFO] Spark Project Unsafe ... SUCCESS [  
>  8.184 s]
>  [INFO] Spark Project Launcher . SUCCESS [ 
>  10.454 s]
>  [INFO] Spark Project Core . SUCCESS 
>  [23:58 min]
>  [INFO] Spark Project ML Local Library . SUCCESS [ 
>  21.218 s]
>  [INFO] Spark Project GraphX ... SUCCESS 
>  [01:24 min]
>  [INFO] Spark Project Streaming  SUCCESS 
>  [04:57 min]
>  [INFO] Spark Project Catalyst . SUCCESS 
>  [08:00 min]
>  [INFO] Spark Project SQL .. SUCCESS [  
>  01:02 h]
>  [INFO] Spark Project ML Library ... SUCCESS 
>  [14:38 min]
>  [INFO] Spark Project Tools  SUCCESS [  
>  4.394 s]
>  [INFO] Spark Project Hive . SUCCESS 
>  [53:43 min]
>  [INFO] Spark Project REPL . SUCCESS 
>  [01:16 min]
>  [INFO] Spark Project Assembly . SUCCESS [  
>  2.186 s]
>  [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 
>  16.150 s]
>  [INFO] Spark Integration for Kafka 0.10 ... SUCCESS 
>  [01:34 min]
>  [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS 
>  [32:55 min]
>  [INFO] Spark Project Examples . SUCCESS [ 
>  23.800 s]
>  [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  
>  7.301 s]
>  [INFO] Spark Avro . SUCCESS 
>  [01:19 min]
>  [INFO] 
>  
>  [INFO] BUILD SUCCESS
>  [INFO] 
>  
>  [INFO] Total time:  03:31 h
>  [INFO] Finished at: 2023-02-12T21:54:20+01:00
>  [INFO] 
>  
>  [bjorn@amd7g spark-3.3.2]$  java -version
>  openjdk version "17.0.6" 2023-01-17
>  OpenJDK Runtime Environment (build 17.0.6+10)
>  OpenJDK 64-Bit Server VM (build 17.0.6+10, mixed mode)
> 
> 
> 
> 
> 
>  :)
> 
> 

Re: Time for release v3.3.2

2023-01-30 Thread Chao Sun
+1, thanks Liang-Chi for volunteering!

Chao

On Mon, Jan 30, 2023 at 5:51 PM L. C. Hsieh  wrote:
>
> Hi Spark devs,
>
> As you know, it has been 4 months since Spark 3.3.1 was released on
> 2022/10, it seems a good time to think about next maintenance release,
> i.e. Spark 3.3.2.
>
> I'm thinking of the release of Spark 3.3.2 this Feb (2023/02).
>
> What do you think?
>
> I am willing to volunteer for Spark 3.3.2 if there is consensus about
> this maintenance release.
>
> Thank you.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time for Spark 3.4.0 release?

2023-01-04 Thread Chao Sun
+1, thanks!

Chao

On Wed, Jan 4, 2023 at 1:56 AM Mridul Muralidharan  wrote:

>
> +1, Thanks !
>
> Regards,
> Mridul
>
> On Wed, Jan 4, 2023 at 2:20 AM Gengliang Wang  wrote:
>
>> +1, thanks for driving the release!
>>
>>
>> Gengliang
>>
>> On Tue, Jan 3, 2023 at 10:55 PM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Thank you!
>>>
>>> Dongjoon
>>>
>>> On Tue, Jan 3, 2023 at 9:44 PM Rui Wang  wrote:
>>>
 +1 to cut the branch starting from a workday!

 Great to see this is happening!

 Thanks Xinrong!

 -Rui

 On Tue, Jan 3, 2023 at 9:21 PM 416161...@qq.com 
 wrote:

> +1, thank you Xinrong for driving this release!
>
> --
> Ruifeng Zheng
> ruife...@foxmail.com
>
> 
>
>
>
> -- Original --
> *From:* "Hyukjin Kwon" ;
> *Date:* Wed, Jan 4, 2023 01:15 PM
> *To:* "Xinrong Meng";
> *Cc:* "dev";
> *Subject:* Re: Time for Spark 3.4.0 release?
>
> SGTM +1
>
> On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng 
> wrote:
>
>> Hi All,
>>
>> Shall we cut *branch-3.4* on *January 16th, 2023*? We proposed
>> January 15th per
>> https://spark.apache.org/versioning-policy.html, but I would suggest
>> we postpone one day since January 15th is a Sunday.
>>
>> I would like to volunteer as the release manager for *Apache Spark
>> 3.4.0*.
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>>


[ANNOUNCE] Apache Spark 3.2.3 released

2022-11-29 Thread Chao Sun
We are happy to announce the availability of Apache Spark 3.2.3!

Spark 3.2.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend all 3.2 users to upgrade to this stable release.

To download Spark 3.2.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Chao

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][RESULT] Release Spark 3.2.3, RC1

2022-11-18 Thread Chao Sun
CORRECTED:

The vote passes with 12 +1s (6 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Dongjoon Hyun (*)
- L. C. Hsieh (*)
- Huaxin Gao (*)
- Sean Owen (*)
- Kazuyuki Tanimura
- Mridul Muralidharan (*)
- Yuming Wang
- Chris Nauroth
- Yang Jie
- Wenchen Fan (*)
- Ruifeng Zheng
- Chao Sun

+0: None

-1: None

On Fri, Nov 18, 2022 at 10:35 AM Chao Sun  wrote:
>
> Oops, sorry! I thought he voted but for some reason I didn't see his
> vote in the email thread. Strange. Now I found it in here:
> https://lists.apache.org/thread/gh2oktrndxopqnyxbsvp2p0k6jk1n9fs
>
> On Fri, Nov 18, 2022 at 10:33 AM Mridul Muralidharan  wrote:
> >
> >
> > This vote result is missing Sean Owen's vote.
> >
> > - Mridul
> >
> >
> >
> > On Fri, Nov 18, 2022 at 11:51 AM Chao Sun  wrote:
> >>
> >> The vote passes with 11 +1s (5 binding +1s).
> >> Thanks to all who helped with the release!
> >>
> >> (* = binding)
> >> +1:
> >> - Dongjoon Hyun (*)
> >> - L. C. Hsieh (*)
> >> - Huaxin Gao (*)
> >> - Kazuyuki Tanimura
> >> - Mridul Muralidharan (*)
> >> - Yuming Wang
> >> - Chris Nauroth
> >> - Yang Jie
> >> - Wenchen Fan (*)
> >> - Ruifeng Zheng
> >> - Chao Sun
> >>
> >> +0: None
> >>
> >> -1: None
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][RESULT] Release Spark 3.2.3, RC1

2022-11-18 Thread Chao Sun
Oops, sorry! I thought he voted but for some reason I didn't see his
vote in the email thread. Strange. Now I found it in here:
https://lists.apache.org/thread/gh2oktrndxopqnyxbsvp2p0k6jk1n9fs

On Fri, Nov 18, 2022 at 10:33 AM Mridul Muralidharan  wrote:
>
>
> This vote result is missing Sean Owen's vote.
>
> - Mridul
>
>
>
> On Fri, Nov 18, 2022 at 11:51 AM Chao Sun  wrote:
>>
>> The vote passes with 11 +1s (5 binding +1s).
>> Thanks to all who helped with the release!
>>
>> (* = binding)
>> +1:
>> - Dongjoon Hyun (*)
>> - L. C. Hsieh (*)
>> - Huaxin Gao (*)
>> - Kazuyuki Tanimura
>> - Mridul Muralidharan (*)
>> - Yuming Wang
>> - Chris Nauroth
>> - Yang Jie
>> - Wenchen Fan (*)
>> - Ruifeng Zheng
>> - Chao Sun
>>
>> +0: None
>>
>> -1: None
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE][RESULT] Release Spark 3.2.3, RC1

2022-11-18 Thread Chao Sun
The vote passes with 11 +1s (5 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Dongjoon Hyun (*)
- L. C. Hsieh (*)
- Huaxin Gao (*)
- Kazuyuki Tanimura
- Mridul Muralidharan (*)
- Yuming Wang
- Chris Nauroth
- Yang Jie
- Wenchen Fan (*)
- Ruifeng Zheng
- Chao Sun

+0: None

-1: None

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-18 Thread Chao Sun
+1 (non-binding) myself. Thanks everyone for voting!

On Wed, Nov 16, 2022 at 9:22 PM 416161...@qq.com 
wrote:

> +1
>
> --
> Ruifeng Zheng
> ruife...@foxmail.com
>
>
>
>
> -- Original --
> *From:* "Wenchen Fan" ;
> *Date:* Thu, Nov 17, 2022 10:26 AM
> *To:* "Yang,Jie(INF)";
> *Cc:* "Chris Nauroth";"Yuming 
> Wang";"Dongjoon
> Hyun";"huaxin gao";"L.
> C. Hsieh";"Chao Sun";"dev"<
> dev@spark.apache.org>;
> *Subject:* Re: [VOTE] Release Spark 3.2.3 (RC1)
>
> +1
>
> On Thu, Nov 17, 2022 at 10:20 AM Yang,Jie(INF) 
> wrote:
>
>> +1,non-binding
>>
>>
>>
>> The test combination of Java 11 + Scala 2.12 and Java 11 + Scala 2.13 has
>> passed.
>>
>>
>>
>> Yang Jie
>>
>>
>>
>> *From:* Chris Nauroth 
>> *Date:* Thursday, November 17, 2022 04:27
>> *To:* Yuming Wang 
>> *Cc:* "Yang,Jie(INF)" , Dongjoon Hyun <
>> dongjoon.h...@gmail.com>, huaxin gao , "L. C.
>> Hsieh" , Chao Sun , dev <
>> dev@spark.apache.org>
>> *Subject:* Re: [VOTE] Release Spark 3.2.3 (RC1)
>>
>>
>>
>> +1 (non-binding)
>>
>> * Verified all checksums.
>> * Verified all signatures.
>> * Built from source, with multiple profiles, to full success, for Java 11
>> and Scala 2.12:
>> * build/mvn -Phadoop-3.2 -Phadoop-cloud -Phive-2.3
>> -Phive-thriftserver -Pkubernetes -Pscala-2.12 -Psparkr -Pyarn -DskipTests
>> clean package
>> * Tests passed.
>> * Ran several examples successfully:
>> * bin/spark-submit --class org.apache.spark.examples.SparkPi
>> examples/jars/spark-examples_2.12-3.2.3.jar
>> * bin/spark-submit --class
>> org.apache.spark.examples.sql.hive.SparkHiveExample
>> examples/jars/spark-examples_2.12-3.2.3.jar
>> * bin/spark-submit
>> examples/src/main/python/streaming/network_wordcount.py localhost 
>>
>>
>>
>> Chao, thank you for preparing the release.
>>
>>
>>
>> Chris Nauroth
>>
>>
>>
>>
>>
>> On Wed, Nov 16, 2022 at 5:22 AM Yuming Wang  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, Nov 16, 2022 at 2:28 PM Yang,Jie(INF) 
>> wrote:
>>
>> I switched Scala 2.13 to Scala 2.12 today. The test is still in progress
>> and it has not been hung.
>>
>>
>>
>> Yang Jie
>>
>>
>>
>> *From:* Dongjoon Hyun 
>> *Date:* Wednesday, November 16, 2022 01:17
>> *To:* "Yang,Jie(INF)" 
>> *Cc:* huaxin gao , "L. C. Hsieh" <
>> vii...@gmail.com>, Chao Sun , dev <
>> dev@spark.apache.org>
>> *主题**: *Re: [VOTE] Release Spark 3.2.3 (RC1)
>>
>>
>>
>> Did you hit that in Scala 2.12, too?
>>
>>
>>
>> Dongjoon.
>>
>>
>>
>> On Tue, Nov 15, 2022 at 4:36 AM Yang,Jie(INF) 
>> wrote:
>>
>> Hi, all
>>
>>
>>
>> I test v3.2.3 with following command:
>>
>>
>>
>> ```
>>
>> dev/change-scala-version.sh 2.13
>>
>> build/mvn clean install -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn
>> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
>> -Pscala-2.13 -fn
>>
>> ```
>>
>>
>>
>> The testing environment is:
>>
>>
>>
>> OS: CentOS 6u3 Final
>>
>> Java: zulu 11.0.17
>>
>> Python: 3.9.7
>>
>> Scala: 2.13
>>
>>
>>
>> The above test command has been executed twice, and it hung both times at the
>> following stack:
>>
>>
>>
>> ```
>>
>> "ScalaTest-main-running-JoinSuite" #1 prio=5 os_prio=0 cpu=312870.06ms
>> elapsed=1552.65s tid=0x7f2ddc02d000 nid=0x7132 waiting on condition
>> [0x7f2de3929000]
>>
>>java.lang.Thread.State: WAITING (parking)
>>
>>at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
>>
>>- parking to wait for  <0x000790d00050> (a
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>
>>at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17
>> /LockSupport.java:194)
>>
>>at

[VOTE] Release Spark 3.2.3 (RC1)

2022-11-14 Thread Chao Sun
Please vote on releasing the following candidate as Apache Spark version 3.2.3.

The vote is open until 11:59pm Pacific time Nov 17th and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.2.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.2.3-rc1 (commit
b53c341e0fefbb33d115ab630369a18765b7763d):
https://github.com/apache/spark/tree/v3.2.3-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.3-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1431/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.3-rc1-docs/

The list of bug fixes going into 3.2.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12352105

This release is using the release script of the tag v3.2.3-rc1.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
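
For the Java/Scala route, a minimal sketch of what adding the staging
repository to a project's resolvers can look like with sbt (the staging
URL is the orgapachespark-1431 repository listed above; spark-sql is
just an example module to depend on):

```
// build.sbt -- sketch for pulling RC artifacts from the staging repository.
resolvers += "Spark 3.2.3 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1431/"

// Any Spark module works; spark-sql is used here as an example.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.3"
```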

===
What should happen to JIRA tickets still targeting 3.2.3?
===
The current list of open tickets targeted at 3.2.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.2.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Chao Sun
Congrats everyone! and thanks Yuming for driving the release!

On Wed, Oct 26, 2022 at 7:37 AM beliefer  wrote:
>
> Congratulations everyone have contributed to this release.
>
>
> At 2022-10-26 14:21:36, "Yuming Wang"  wrote:
>
> We are happy to announce the availability of Apache Spark 3.3.1!
>
> Spark 3.3.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.3 maintenance branch of Spark. We strongly
> recommend all 3.3 users to upgrade to this stable release.
>
> To download Spark 3.3.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-3-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-18 Thread Chao Sun
+1. Thanks Yuming!

Chao

On Tue, Oct 18, 2022 at 1:18 PM Thomas graves  wrote:
>
> +1. Ran internal test suite.
>
> Tom
>
> On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version 
> > 3.3.1.
> >
> > The vote is open until 11:59pm Pacific time October 21th and passes if a 
> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.3.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org
> >
> > The tag to be voted on is v3.3.1-rc4 (commit 
> > fbbcf9434ac070dd4ced4fb9efe32899c6db12a9):
> > https://github.com/apache/spark/tree/v3.3.1-rc4
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1430
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-docs
> >
> > The list of bug fixes going into 3.3.1 can be found at the following URL:
> > https://s.apache.org/ttgz6
> >
> > This release is using the release script of the tag v3.3.1-rc4.
> >
> >
> > FAQ
> >
> > ==
> > What happened to v3.3.1-rc3?
> > ==
> > A performance regression (SPARK-40703) was found after tagging v3.3.1-rc3, 
> > which the Iceberg community hopes Spark 3.3.1 could fix.
> > So we skipped the vote on v3.3.1-rc3.
> >
> > =
> > How can I help test this release?
> > =
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.3.1?
> > ===
> > The current list of open tickets targeted at 3.3.1 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target 
> > Version/s" = 3.3.1
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Apache Spark 3.2.3 Release?

2022-10-18 Thread Chao Sun
Hi All,

It's been more than 3 months since 3.2.2 (tagged on Jul 11) was
released. There are now 66 patches accumulated in branch-3.2, including
2 correctness issues.

Is it a good time to start a new release? If there's no objection, I'd
like to volunteer as the release manager for the 3.2.3 release, and
start preparing the first RC next week.

# Correctness issues

SPARK-39833: Filtered Parquet DataFrame count() and show() produce
inconsistent results when spark.sql.parquet.filterPushdown is true (a
quick way to check exposure is sketched below)
SPARK-40002: Limit improperly pushed down through window using the ntile function
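
For SPARK-39833, a minimal spark-shell sketch of the kind of check an
affected user can run (the table path and predicate are hypothetical;
the config key is the one named in the issue):

```
// Compare counts with Parquet filter pushdown enabled and disabled; on an
// affected release the two runs can disagree. Path/predicate are examples.
val df = spark.read.parquet("/path/to/table").filter("id > 100")

spark.conf.set("spark.sql.parquet.filterPushdown", "true")
val withPushdown = df.count()

spark.conf.set("spark.sql.parquet.filterPushdown", "false")
val withoutPushdown = df.count()

// The two counts should match on a correct build.
println(s"pushdown=$withPushdown, no-pushdown=$withoutPushdown")
```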

Best,
Chao

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcome Yikun Jiang as a Spark committer

2022-10-09 Thread Chao Sun
Congratulations Yikun!

On Sun, Oct 9, 2022 at 11:14 AM vaquar khan  wrote:

> Congratulations.
>
> Regards,
> Vaquar khan
>
> On Sun, Oct 9, 2022, 6:46 AM 叶先进  wrote:
>
>> Congrats
>>
>> On Oct 9, 2022, at 16:44, XiDuo You  wrote:
>>
>> Congratulations, Yikun !
>>
>> Maxim Gekk  于2022年10月9日周日 15:59写道:
>>
>>> Keep up the great work, Yikun!
>>>
>>> On Sun, Oct 9, 2022 at 10:52 AM Gengliang Wang  wrote:
>>>
 Congratulations, Yikun!

 On Sun, Oct 9, 2022 at 12:33 AM 416161...@qq.com 
 wrote:

> Congrats, Yikun!
>
> --
> Ruifeng Zheng
> ruife...@foxmail.com
>
> 
>
>
>
> -- Original --
> *From:* "Martin Grigorov" ;
> *Date:* Sun, Oct 9, 2022 05:01 AM
> *To:* "Hyukjin Kwon";
> *Cc:* "dev";"Yikun Jiang";
> *Subject:* Re: Welcome Yikun Jiang as a Spark committer
>
> Congratulations, Yikun!
>
> On Sat, Oct 8, 2022 at 7:41 AM Hyukjin Kwon 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently added Yikun Jiang as a committer on the
>> project.
>> Yikun is the major contributor of the infrastructure and GitHub
>> Actions in Apache Spark as well as Kubernetes and PySpark.
>> He has put a lot of effort into stabilizing and optimizing the builds
>> so we all can work together in Apache Spark more
>> efficiently and effectively. He's also driving the SPIP for the Docker
>> official image in Apache Spark as well, for users and developers.
>> Please join me in welcoming Yikun!
>>
>>
>>


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Chao Sun
+1

> and specifically may allow us to finally move off of the ancient version
> of Guava (?)

I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.

On Wed, Oct 5, 2022 at 1:55 PM Xinrong Meng 
wrote:

> +1.
>
> On Wed, Oct 5, 2022 at 1:53 PM Xiao Li 
> wrote:
>
>> +1.
>>
>> Xiao
>>
>> On Wed, Oct 5, 2022 at 12:49 PM Sean Owen  wrote:
>>
>>> I'm OK with this. It simplifies maintenance a bit, and specifically may
>>> allow us to finally move off of the ancient version of Guava (?)
>>>
>>> On Mon, Oct 3, 2022 at 10:16 PM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 I'm wondering if the following Apache Spark Hadoop2 Binary Distribution
 is still used by someone in the community or not. If it's not used or
 not useful,
 we may remove it from Apache Spark 3.4.0 release.


 https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz

 Here is the background of this question.
 Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
 Spark community has been building and releasing with Java 8 only.
 I believe that user applications also use Java 8+ these days.
 Recently, I received the following message from the Hadoop PMC.

   > "if you really want to claim hadoop 2.x compatibility, then you
 have to
   > be building against java 7". Otherwise a lot of people with hadoop
 2.x
   > clusters won't be able to run your code. If your projects are java8+
   > only, then they are implicitly hadoop 3.1+, no matter what you use
   > in your build. Hence: no need for branch-2 branches except
   > to complicate your build/test/release processes [1]

 If Hadoop2 binary distribution is no longer used as of today,
 or incomplete somewhere due to Java 8 building, the following three
 existing alternative Hadoop 3 binary distributions could be
 the better official solution for old Hadoop 2 clusters.

 1) Scala 2.12 and without-hadoop distribution
 2) Scala 2.12 and Hadoop 3 distribution
 3) Scala 2.13 and Hadoop 3 distribution

 In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2
 Binary distribution?

 Dongjoon

 [1]
 https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247

>>>
>>
>> --
>>
>>


Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-20 Thread Chao Sun
+1 (non-binding)

On Mon, Sep 19, 2022 at 10:17 PM Wenchen Fan  wrote:
>
> +1
>
> On Mon, Sep 19, 2022 at 2:59 PM Yang,Jie(INF)  wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>> Yang Jie
>>
>> 
>> 发件人: Yikun Jiang 
>> 发送时间: 2022年9月19日 14:23:14
>> 收件人: Denny Lee
>> 抄送: bo zhaobo; Yuming Wang; Kent Yao; Gengliang Wang; Hyukjin Kwon; dev; zrf
>> 主题: Re: [DISCUSS] SPIP: Support Docker Official Image for Spark
>>
>> Thanks for your support!  @all
>>
>> > Count me in to help as well, eh?! :)
>>
>> @Denny Sure, It would be great to have your help! I'm going to create a JIRA 
>> and TASKS if the SPIP vote passes.
>>
>>
>> On Mon, Sep 19, 2022 at 10:34 AM Denny Lee  wrote:
>>>
>>> +1 (non-binding).
>>>
>>> This is a great idea and we should definitely do this.  Count me in to help 
>>> as well, eh?! :)
>>>
>>> On Sun, Sep 18, 2022 at 7:24 PM bo zhaobo  
>>> wrote:

 +1 (non-binding)

 This will bring the good experience to customers. So excited about this. 
 ;-)

 Yuming Wang  于2022年9月19日周一 10:18写道:
>
> +1.
>
> On Mon, Sep 19, 2022 at 9:44 AM Kent Yao  wrote:
>>
>> +1
>>
>> Gengliang Wang  于2022年9月19日周一 09:23写道:
>> >
>> > +1, thanks for the work!
>> >
>> > On Sun, Sep 18, 2022 at 6:20 PM Hyukjin Kwon  
>> > wrote:
>> >>
>> >> +1
>> >>
>> >> On Mon, 19 Sept 2022 at 09:15, Yikun Jiang  
>> >> wrote:
>> >>>
>> >>> Hi, all
>> >>>
>> >>>
>> >>> I would like to start the discussion for supporting Docker Official 
>> >>> Image for Spark.
>> >>>
>> >>>
>> >>> This SPIP proposes to add a Docker Official Image (DOI) to ensure 
>> >>> the Spark Docker images meet the quality standards for Docker 
>> >>> images, and to provide these images for users who want to use 
>> >>> Apache Spark via a Docker image.
>> >>>
>> >>>
>> >>> There are also several Apache projects that release Docker 
>> >>> Official Images, such as flink, storm, solr, zookeeper, and httpd (with 
>> >>> 50M+ to 1B+ downloads each). From these huge download statistics, 
>> >>> we can see the real demand from users, and from the support of other 
>> >>> Apache projects, we should also be able to do it.
>> >>>
>> >>>
>> >>> After support:
>> >>>
>> >>> The Dockerfile will still be maintained by the Apache Spark 
>> >>> community and reviewed by Docker.
>> >>>
>> >>> The images will be maintained by the Docker community to ensure the 
>> >>> quality standards for Docker images of the Docker community.
>> >>>
>> >>>
>> >>> It will also reduce the extra Docker image maintenance effort (such 
>> >>> as frequent rebuilds and image security updates) for the Apache Spark 
>> >>> community.
>> >>>
>> >>>
>> >>> See more in SPIP DOC: 
>> >>> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o
>> >>>
>> >>>
>> >>> cc: Ruifeng (co-author) and Hyukjin (shepherd)
>> >>>
>> >>>
>> >>> Regards,
>> >>> Yikun
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.3.1 (RC1)

2022-09-18 Thread Chao Sun
It'd be really nice if we could include
https://issues.apache.org/jira/browse/SPARK-40169 in this release,
since otherwise it'll introduce a perf regression with the Parquet
column index disabled.
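
For anyone trying to reproduce the affected code path, a rough sketch;
note the Hadoop-level key `parquet.filter.columnindex.enabled` is a
parquet-hadoop setting and an assumption on my side, not something
stated in this thread:

```
// Scan with Parquet column-index filtering turned off at the parquet-hadoop
// level -- the code path where the regression shows up. Path and predicate
// are hypothetical; reader options are merged into the Hadoop conf.
val df = spark.read
  .option("parquet.filter.columnindex.enabled", "false")
  .parquet("/path/to/table")
  .filter("ts >= '2022-09-01'")

df.count()
```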

On Sat, Sep 17, 2022 at 2:08 PM Sean Owen  wrote:
>
> +1 LGTM. I tested Scala 2.13 + Java 11 on Ubuntu 22.04. I get the same 
> results as usual.
>
> On Sat, Sep 17, 2022 at 2:42 AM Yuming Wang  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.3.1.
>>
>> The vote is open until 11:59pm Pacific time September 22th and passes if a 
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org
>>
>> The tag to be voted on is v3.3.1-rc1 (commit 
>> ea1a426a889626f1ee1933e3befaa975a2f0a072):
>> https://github.com/apache/spark/tree/v3.3.1-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc1-bin
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1418
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc1-docs
>>
>> The list of bug fixes going into 3.3.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>>
>> This release is using the release script of the tag v3.3.1-rc1.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.1?
>> ===
>> The current list of open tickets targeted at 3.3.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 3.3.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark 3.3.0/3.2.2: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15

2022-09-01 Thread Chao Sun
Hi Fengyu,

Do you still have the Parquet file that caused the error? could you
open a JIRA and attach the file to it? I can take a look.
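
In the meantime, a small sketch of how one could narrow down which file
in a directory triggers the failure (the directory path is hypothetical):

```
import scala.util.Try
import org.apache.hadoop.fs.Path

// Try to scan each Parquet file individually and report the ones whose
// page headers fail to deserialize.
val dir = new Path("/warehouse/suspect_table")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)

fs.listStatus(dir)
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))
  .foreach { file =>
    if (Try(spark.read.parquet(file).count()).isFailure)
      println(s"FAILED to read: $file")
  }
```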

Chao

On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao  wrote:
>
> I'm trying to upgrade our Spark (3.2.1 now),
>
> but with Spark 3.3.0 and Spark 3.2.2, we hit an error with a specific Parquet file.
>
> Is anyone else having the same problem as me? Or do I need to provide any
> information to the devs?
>
> ```
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 
> (TID 7) (10.113.39.118 executor 1): java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: don't know what type: 15
> at org.apache.parquet.format.Util.read(Util.java:365)
> at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1382)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1429)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:972)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:338)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)
> at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> don't know what type: 15
> at 
> shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:894)
> at 
> shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:560)
> at 
> org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
> at 
> shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> at 
> shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
> at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1100)
> at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
> at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
> at org.apache.parquet.format.Util.read(Util.java:362)
> ... 32 more
>
>
> ```
>
> similar to https://issues.apache.org/jira/browse/SPARK-11844, but we 

Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Chao Sun
Congratulations!

On Tue, Aug 9, 2022 at 1:00 PM huaxin gao  wrote:
>
> Congratulations!
>
> On Tue, Aug 9, 2022 at 12:47 PM Dongjoon Hyun  wrote:
>>
>> Congrat! :)
>>
>> Dongjoon.
>>
>> On Tue, Aug 9, 2022 at 10:40 AM Takuya UESHIN  wrote:
>> >
>> > Congratulations, Xinrong!
>> >
>> > On Tue, Aug 9, 2022 at 10:07 AM Gengliang Wang  wrote:
>> >>
>> >> Congratulations, Xinrong! Well deserved.
>> >>
>> >>
>> >> On Tue, Aug 9, 2022 at 7:09 AM Yi Wu  wrote:
>> >>>
>> >>> Congrats Xinrong!!
>> >>>
>> >>>
>> >>> On Tue, Aug 9, 2022 at 7:07 PM Maxim Gekk 
>> >>>  wrote:
>> 
>>  Congratulations, Xinrong!
>> 
>>  Maxim Gekk
>> 
>>  Software Engineer
>> 
>>  Databricks, Inc.
>> 
>> 
>> 
>>  On Tue, Aug 9, 2022 at 3:15 PM Weichen Xu 
>>   wrote:
>> >
>> > Congrats!
>> >
>> > On Tue, Aug 9, 2022 at 5:55 PM Jungtaek Lim 
>> >  wrote:
>> >>
>> >> Congrats Xinrong! Well deserved.
>> >>
>> >> 2022년 8월 9일 (화) 오후 5:13, Hyukjin Kwon 님이 작성:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> The Spark PMC recently added Xinrong Meng as a committer on the 
>> >>> project. Xinrong is the major contributor of PySpark especially 
>> >>> Pandas API on Spark. She has guided a lot of new contributors 
>> >>> enthusiastically. Please join me in welcoming Xinrong!
>> >>>
>> >
>> >
>> > --
>> > Takuya UESHIN
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming three new PMC members

2022-08-09 Thread Chao Sun
Congrats everyone!

On Tue, Aug 9, 2022 at 5:36 PM Dongjoon Hyun  wrote:
>
> Congrat to all!
>
> Dongjoon.
>
> On Tue, Aug 9, 2022 at 5:13 PM Takuya UESHIN  wrote:
> >
> > Congratulations!
> >
> > On Tue, Aug 9, 2022 at 4:57 PM Hyukjin Kwon  wrote:
> >>
> >> Congrats everybody!
> >>
> >> On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan  wrote:
> >>>
> >>>
> >>> Congratulations !
> >>> Great to have you join the PMC !!
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan  wrote:
> 
>  Congratulations
> 
>  On Tue, Aug 9, 2022, 11:40 AM Xiao Li  wrote:
> >
> > Hi all,
> >
> > The Spark PMC recently voted to add three new PMC members. Join me in 
> > welcoming them to their new roles!
> >
> > New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk
> >
> > The Spark PMC
> >
> >
> >
> > --
> > Takuya UESHIN
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Update Spark 3.4 Release Window?

2022-07-21 Thread Chao Sun
+1 for Jan 2023 (Code freeze) and Feb 2023 (RC).

Chao

On Thu, Jul 21, 2022 at 11:43 AM L. C. Hsieh  wrote:
>
> I'm also +1 for Feb. 2023 (RC) and Jan. 2023 (Code freeze).
>
> Liang-Chi
>
> On Wed, Jul 20, 2022 at 2:02 PM Dongjoon Hyun  wrote:
> >
> > I fixed typos :)
> >
> > +1 for February 2023 (Release Candidate) and January 2023 (Code freeze).
> >
> > On 2022/07/20 20:59:30 Dongjoon Hyun wrote:
> > > Thank you for initiating this discussion, Xinrong. I also agree with Sean.
> > >
> > > +1 for February 2023 (Release Candidate) and January 2021 (Code freeze).
> > >
> > > Dongjoon.
> > >
> > > On Wed, Jul 20, 2022 at 1:42 PM Sean Owen  wrote:
> > > >
> > > > I don't know any better than others when it will actually happen, 
> > > > though historically, it's more like 7-8 months between minor releases. 
> > > > I might therefore expect a release more like February 2023, and work 
> > > > backwards from there. Doesn't really matter, this is just a public 
> > > > guess and can be changed.
> > > >
> > > > On Wed, Jul 20, 2022 at 3:27 PM Xinrong Meng  
> > > > wrote:
> > > >>
> > > >> Hi All,
> > > >>
> > > >> Since Spark 3.3.0 was released on June 16, 2022, shall we update the 
> > > >> release window https://spark.apache.org/versioning-policy.html for 
> > > >> Spark 3.4?
> > > >>
> > > >> A proposal is as follows:
> > > >>
> > > >> | October 15th 2022 | Code freeze. Release branch cut.
> > > >> | Late October 2022 | QA period. Focus on bug fixes, tests, stability 
> > > >> and docs. Generally, no new features merged.
> > > >> | November 2022 | Release candidates (RC), voting, etc. until 
> > > >> final release passes
> > > >>
> > > >> Thanks!
> > > >>
> > > >> Xinrong Meng
> > > >>
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-14 Thread Chao Sun
+1 (non-binding)

On Thu, Jul 14, 2022 at 12:40 AM Wenchen Fan  wrote:
>
> +1
>
> On Wed, Jul 13, 2022 at 7:29 PM Yikun Jiang  wrote:
>>
>> +1 (non-binding)
>>
>> Checked out tag and built from source on Linux aarch64 and ran some basic 
>> test.
>>
>>
>> Regards,
>> Yikun
>>
>>
>> On Wed, Jul 13, 2022 at 5:54 AM Mridul Muralidharan  wrote:
>>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with "-Pyarn -Pmesos -Pkubernetes"
>>>
>>> As always, the test "SPARK-33084: Add jar support Ivy URI in SQL" in 
>>> sql.SQLQuerySuite fails in my env; but other than that, the rest looks good.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Tue, Jul 12, 2022 at 3:17 AM Maxim Gekk 
>>>  wrote:

 +1

 On Tue, Jul 12, 2022 at 11:05 AM Yang,Jie(INF)  wrote:
>
> +1 (non-binding)
>
>
>
> Yang Jie
>
>
>
>
>
> 发件人: Dongjoon Hyun 
> 日期: 2022年7月12日 星期二 16:03
> 收件人: dev 
> 抄送: Cheng Su , "Yang,Jie(INF)" , 
> Sean Owen 
> 主题: Re: [VOTE] Release Spark 3.2.2 (RC1)
>
>
>
> +1
>
>
>
> Dongjoon.
>
>
>
> On Mon, Jul 11, 2022 at 11:34 PM Cheng Su  wrote:
>
> +1 (non-binding). Built from source, and ran some scala unit tests on M1 
> mac, with OpenJDK 8 and Scala 2.12.
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
> On Mon, Jul 11, 2022 at 10:31 PM Yang,Jie(INF)  
> wrote:
>
> Does this happen when running all UTs? I ran this suite several times
> alone using OpenJDK (zulu) 8u322-b06 on my Mac, but no similar error
> occurred.
>
>
>
> 发件人: Sean Owen 
> 日期: 2022年7月12日 星期二 10:45
> 收件人: Dongjoon Hyun 
> 抄送: dev 
> 主题: Re: [VOTE] Release Spark 3.2.2 (RC1)
>
>
>
> Is anyone seeing this error? I'm on OpenJDK 8 on a Mac:
>
>
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x000101ca8ace, pid=11962, 
> tid=0x1603
> #
> # JRE version: OpenJDK Runtime Environment (8.0_322) (build 
> 1.8.0_322-bre_2022_02_28_15_01-b00)
> # Java VM: OpenJDK 64-Bit Server VM (25.322-b00 mixed mode bsd-amd64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.dylib+0x549ace]
> #
> # Failed to write core dump. Core dumps have been disabled. To enable 
> core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /private/tmp/spark-3.2.2/sql/core/hs_err_pid11962.log
> ColumnVectorSuite:
> - boolean
> - byte
> Compiled method (nm)  885897 75403 n 0   
> sun.misc.Unsafe::putShort (native)
>  total in heap  [0x000102fdaa10,0x000102fdad48] = 824
>  relocation [0x000102fdab38,0x000102fdab78] = 64
>  main code  [0x000102fdab80,0x000102fdad48] = 456
> Compiled method (nm)  885897 75403 n 0   
> sun.misc.Unsafe::putShort (native)
>  total in heap  [0x000102fdaa10,0x000102fdad48] = 824
>  relocation [0x000102fdab38,0x000102fdab78] = 64
>  main code  [0x000102fdab80,0x000102fdad48] = 456
>
>
>
> On Mon, Jul 11, 2022 at 4:58 PM Dongjoon Hyun  
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 3.2.2.
>
> The vote is open until July 15th 1AM (PST) and passes if a majority +1 
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.2.2-rc1 (commit 
> 78a5825fe266c0884d2dd18cbca9625fa258d7f7):
> https://github.com/apache/spark/tree/v3.2.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1409/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/
>
> The list of bug fixes going into 3.2.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351232
>
> This release is using the release script of the tag v3.2.2-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by 

Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Chao Sun
+1 (non-binding)

Thanks,
Chao

On Mon, Jun 13, 2022 at 5:37 PM Cheng Su  wrote:

> +1 (non-binding).
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
> *From: *L. C. Hsieh 
> *Date: *Monday, June 13, 2022 at 5:13 PM
> *To: *dev 
> *Subject: *Re: [VOTE] Release Spark 3.3.0 (RC6)
>
> +1
>
> On Mon, Jun 13, 2022 at 5:07 PM Holden Karau  wrote:
> >
> > +1
> >
> > On Mon, Jun 13, 2022 at 4:51 PM Yuming Wang  wrote:
> >>
> >> +1 (non-binding)
> >>
> >> On Tue, Jun 14, 2022 at 7:41 AM Dongjoon Hyun 
> wrote:
> >>>
> >>> +1
> >>>
> >>> Thanks,
> >>> Dongjoon.
> >>>
> >>> On Mon, Jun 13, 2022 at 3:54 PM Chris Nauroth 
> wrote:
> 
>  +1 (non-binding)
> 
>  I repeated all checks I described for RC5:
> 
>  https://lists.apache.org/thread/ksoxmozgz7q728mnxl6c2z7ncmo87vls
> 
>  Maxim, thank you for your dedication on these release candidates.
> 
>  Chris Nauroth
> 
> 
>  On Mon, Jun 13, 2022 at 3:21 PM Mridul Muralidharan 
> wrote:
> >
> >
> > +1
> >
> > Signatures, digests, etc check out fine.
> > Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
> >
> > The test "SPARK-33084: Add jar support Ivy URI in SQL" in
> sql.SQLQuerySuite fails; but other than that, rest looks good.
> >
> > Regards,
> > Mridul
> >
> >
> >
> > On Mon, Jun 13, 2022 at 4:25 PM Tom Graves
>  wrote:
> >>
> >> +1
> >>
> >> Tom
> >>
> >> On Thursday, June 9, 2022, 11:27:50 PM CDT, Maxim Gekk
>  wrote:
> >>
> >>
> >> Please vote on releasing the following candidate as Apache Spark
> >> version 3.3.0.
> >>
> >> The vote is open until 11:59pm Pacific time June 14th and passes if a
> >> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 3.3.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v3.3.0-rc6 (commit
> >> f74867bddfbcdd4d08076db36851e88b15e66556):
> >> https://github.com/apache/spark/tree/v3.3.0-rc6
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1407
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/
> >>
> >> The list of bug fixes going into 3.3.0 can be found at the following URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12350369
> >>
> >> This release is using the release script of the tag v3.3.0-rc6.
> >>
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks, in the Java/Scala
> >> you can add the staging repository to your project's resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out-of-date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 3.3.0?
> >> ===
> >> The current list of open tickets targeted at 3.3.0 can be found at:
> >> https://issues.apache.org/jira/projects/SPARK and search for
> >> "Target Version/s" = 3.3.0
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >> Maxim Gekk
> >>
> >> Software Engineer
> >>
> >> Databricks, Inc.
> >
> >
> >
> > --
> > Twitter: 

Re: [VOTE][SPIP] Spark Connect

2022-06-13 Thread Chao Sun
+1 (non-binding)

On Mon, Jun 13, 2022 at 5:11 PM Hyukjin Kwon  wrote:

> +1
>
> On Tue, 14 Jun 2022 at 08:50, Yuming Wang  wrote:
>
>> +1.
>>
>> On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia 
>> wrote:
>>
>>> +1, very excited about this direction.
>>>
>>> Matei
>>>
>>> On Jun 13, 2022, at 11:07 AM, Herman van Hovell <
>>> her...@databricks.com.INVALID> wrote:
>>>
>>> Let me kick off the voting...
>>>
>>> +1
>>>
>>> On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell 
>>> wrote:
>>>
 Hi all,

 I’d like to start a vote for SPIP: "Spark Connect"

 The goal of the SPIP is to introduce a Dataframe based client/server
 API for Spark

 Please also refer to:

 - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark
 Connect - A client and server interface for Apache Spark.
 
 - Design doc: Spark Connect - A client and server interface for Apache
 Spark.
 
 - JIRA: SPARK-39375 

 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Kind Regards,
 Herman

>>>
>>>


Re: SIGMOD System Award for Apache Spark

2022-05-13 Thread Chao Sun
Huge congrats to the whole community!

On Fri, May 13, 2022 at 1:56 AM Wenchen Fan  wrote:

> Great! Congratulations to everyone!
>
> On Fri, May 13, 2022 at 10:38 AM Gengliang Wang  wrote:
>
>> Congratulations to the whole spark community!
>>
>> On Fri, May 13, 2022 at 10:14 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Congrats Spark community!
>>>
>>> On Fri, May 13, 2022 at 10:40 AM Qian Sun 
>>> wrote:
>>>
 Congratulations !!!

 2022年5月13日 上午3:44,Matei Zaharia  写道:

 Hi all,

 We recently found out that Apache Spark received
  the SIGMOD System Award this
 year, given by SIGMOD (the ACM’s data management research organization) to
 impactful real-world and research systems. This puts Spark in good company
 with some very impressive previous recipients
 . This award
 is really an achievement by the whole community, so I wanted to say
 congrats to everyone who contributes to Spark, whether through code, issue
 reports, docs, or other means.

 Matei





Re: Apache Spark 3.3 Release

2022-03-16 Thread Chao Sun
There is one item on our side that we want to backport to 3.3:
- vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for
Parquet V2 support (https://github.com/apache/spark/pull/35262)

It's already reviewed and approved.
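
For context, a small round-trip sketch that exercises these encodings;
writing V2 files through the parquet-hadoop `parquet.writer.version`
option is an assumption on my side (the key comes from parquet-hadoop,
not this thread):

```
// Write a file with the Parquet V2 writer -- which emits DELTA_BYTE_ARRAY /
// DELTA_LENGTH_BYTE_ARRAY for string columns -- then read it back with the
// vectorized reader. The output path is just an example.
spark.range(1000)
  .selectExpr("id", "concat('name_', id) AS name")
  .write
  .option("parquet.writer.version", "v2")
  .parquet("/tmp/parquet_v2_demo")

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.read.parquet("/tmp/parquet_v2_demo").show(5)
```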

On Wed, Mar 16, 2022 at 9:13 AM Tom Graves  wrote:
>
> It looks like the version hasn't been updated on master and still shows
> 3.3.0-SNAPSHOT; can you please update that?
>
> Tom
>
> On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk 
>  wrote:
>
>
> Hi All,
>
> I have created the branch for Spark 3.3:
> https://github.com/apache/spark/commits/branch-3.3
>
> Please, backport important fixes to it, and if you have some doubts, ping me 
> in the PR. Regarding new features, we are still building the allow list for 
> branch-3.3.
>
> Best regards,
> Max Gekk
>
>
> On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun  wrote:
>
> Yes, I agree with you for your whitelist approach for backporting. :)
> Thank you for summarizing.
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:
>
> I think I finally got your point. What you want to keep unchanged is the 
> branch cut date of Spark 3.3. Today? or this Friday? This is not a big deal.
>
> My major concern is whether we should keep merging the feature work or the 
> dependency upgrade after the branch cut. To make our release time more 
> predictable, I am suggesting we should finalize the exception PR list first, 
> instead of merging them in an ad hoc way. In the past, we spent a lot of time 
> on the revert of the PRs that were merged after the branch cut. I hope we can 
> minimize unnecessary arguments in this release. Do you agree, Dongjoon?
>
>
>
> Dongjoon Hyun  于2022年3月15日周二 15:55写道:
>
> That is not totally fine, Xiao. It sounds like you are asking for a change of
> plan without a proper reason.
>
> Although we cut the branch today according to our plan, you can still
> collect and maintain a list of exceptions. I'm not blocking what you want to do.
>
> Please let the community start to ramp down as we agreed before.
>
> Dongjoon
>
>
>
> On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:
>
> Please do not get me wrong. If we don't cut a branch, we are allowing all 
> patches to land Apache Spark 3.3. That is totally fine. After we cut the 
> branch, we should avoid merging the feature work. In the next three days, let 
> us collect the actively developed PRs that we want to make an exception 
> (i.e., merged to 3.3 after the upcoming branch cut). Does that make sense?
>
> Dongjoon Hyun  于2022年3月15日周二 14:54写道:
>
> Xiao. You are working against what you are saying.
> If you don't cut a branch, it means you are allowing all patches to land 
> Apache Spark 3.3. No?
>
> > we need to avoid backporting the feature work that are not being well 
> > discussed.
>
>
>
> On Tue, Mar 15, 2022 at 12:12 PM Xiao Li  wrote:
>
> Cutting the branch is simple, but we need to avoid backporting the feature 
> work that are not being well discussed. Not all the members are actively 
> following the dev list. I think we should wait 3 more days for collecting the 
> PR list before cutting the branch.
>
> BTW, there are very few 3.4-only feature work that will be affected.
>
> Xiao
>
> Dongjoon Hyun  于2022年3月15日周二 11:49写道:
>
> Hi, Max, Chao, Xiao, Holden and all.
>
> I have a different idea.
>
> Given the situation and small patch list, I don't think we need to postpone 
> the branch cut for those patches. It's easier to cut a branch-3.3 and allow 
> backporting.
>
> As of today, we already have an obvious Apache Spark 3.4 patch in the branch 
> together. This situation only becomes worse and worse because there is no way 
> to block the other patches from landing unintentionally if we don't cut a 
> branch.
>
> [SPARK-38335][SQL] Implement parser support for DEFAULT column values
>
> Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
>
> Best,
> Dongjoon.
>
>
> On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:
>
> Cool, thanks for clarifying!
>
> On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
> >>
> >> For the following list:
> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized 
> >> reader
> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> >> Do you mean we should include them, or exclude them from 3.3?
> >
> >
> > If possible, I hope these features can be shipped with Spark 3.3.
> >
> >
> >
> > Chao Sun  于2022年3月15日周二 10:06写道:
> >>
> >> Hi Xiao

Re: Apache Spark 3.3 Release

2022-03-15 Thread Chao Sun
Cool, thanks for clarifying!

On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
>>
>> For the following list:
>> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>> Do you mean we should include them, or exclude them from 3.3?
>
>
> If possible, I hope these features can be shipped with Spark 3.3.
>
>
>
> Chao Sun  于2022年3月15日周二 10:06写道:
>>
>> Hi Xiao,
>>
>> For the following list:
>>
>> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>
>> Do you mean we should include them, or exclude them from 3.3?
>>
>> Thanks,
>> Chao
>>
>> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun  
>> wrote:
>> >
>> > The following was tested and merged a few minutes ago. So, we can remove 
>> > it from the list.
>> >
>> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li  wrote:
>> >>
>> >> Let me clarify my above suggestion. Maybe we can wait 3 more days to 
>> >> collect the list of actively developed PRs that we want to merge to 3.3 
>> >> after the branch cut?
>> >>
>> >> Please do not rush to merge the PRs that are not fully reviewed. We can 
>> >> cut the branch this Friday and continue merging the PRs that have been 
>> >> discussed in this thread. Does that make sense?
>> >>
>> >> Xiao
>> >>
>> >>
>> >>
>> >> Holden Karau  于2022年3月15日周二 09:10写道:
>> >>>
>> >>> May I suggest we push out one week (22nd) just to give everyone a bit of 
>> >>> breathing space? Rushed software development more often results in bugs.
>> >>>
>> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang  wrote:
>> >>>>
>> >>>> > To make our release time more predictable, let us collect the PRs and 
>> >>>> > wait three more days before the branch cut?
>> >>>>
>> >>>> For SPIP: Support Customized Kubernetes Schedulers:
>> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>> >>>>
>> >>>> Three more days are OK for this from my view.
>> >>>>
>> >>>> Regards,
>> >>>> Yikun
>> >>>
>> >>> --
>> >>> Twitter: https://twitter.com/holdenkarau
>> >>> Books (Learning Spark, High Performance Spark, etc.): 
>> >>> https://amzn.to/2MaRAG9
>> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.3 Release

2022-03-15 Thread Chao Sun
Hi Xiao,

For the following list:

#35789 [SPARK-32268][SQL] Row-level Runtime Filtering
#34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
#35848 [SPARK-38548][SQL] New SQL function: try_sum

Do you mean we should include them, or exclude them from 3.3?

Thanks,
Chao

On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun  wrote:
>
> The following was tested and merged a few minutes ago. So, we can remove it 
> from the list.
>
> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>
> Thanks,
> Dongjoon.
>
> On Tue, Mar 15, 2022 at 9:48 AM Xiao Li  wrote:
>>
>> Let me clarify my above suggestion. Maybe we can wait 3 more days to collect 
>> the list of actively developed PRs that we want to merge to 3.3 after the 
>> branch cut?
>>
>> Please do not rush to merge the PRs that are not fully reviewed. We can cut 
>> the branch this Friday and continue merging the PRs that have been discussed 
>> in this thread. Does that make sense?
>>
>> Xiao
>>
>>
>>
>> Holden Karau  于2022年3月15日周二 09:10写道:
>>>
>>> May I suggest we push out one week (22nd) just to give everyone a bit of 
>>> breathing space? Rushed software development more often results in bugs.
>>>
>>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang  wrote:

 > To make our release time more predictable, let us collect the PRs and 
 > wait three more days before the branch cut?

 For SPIP: Support Customized Kubernetes Schedulers:
 #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1

 Three more days are OK for this from my view.

 Regards,
 Yikun
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): 
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.3 Release

2022-03-14 Thread Chao Sun
I mainly mean:

  - [SPARK-35801] Row-level operations in Data Source V2
  - [SPARK-37166] Storage Partitioned Join

For which the PR:

- https://github.com/apache/spark/pull/35395
- https://github.com/apache/spark/pull/35657

are actively being reviewed. It seems there are ongoing PRs for other
SPIPs as well but I'm not involved in those so not quite sure whether
they are intended for 3.3 release.

Chao

On Mon, Mar 14, 2022 at 8:53 PM Xiao Li  wrote:
>
> Could you please list which features we want to finish before the branch cut? 
> How long will they take?
>
> Xiao
>
> Chao Sun  于2022年3月14日周一 13:30写道:
>>
>> Hi Max,
>>
>> As there is still some ongoing work for the above-listed SPIPs, can we
>> still merge them after the branch cut?
>>
>> Thanks,
>> Chao
>>
>> On Mon, Mar 14, 2022 at 6:12 AM Maxim Gekk 
>>  wrote:
>>>
>>> Hi All,
>>>
>>> Since there are no actual blockers for Spark 3.3.0 and significant 
>>> objections, I am going to cut branch-3.3 after 15th March at 00:00 PST. 
>>> Please, let us know if you have any concerns about that.
>>>
>>> Best regards,
>>> Max Gekk
>>>
>>>
>>> On Thu, Mar 3, 2022 at 9:44 PM Maxim Gekk  wrote:
>>>>
>>>> Hello All,
>>>>
>>>> I would like to bring to the table the topic of the new Spark release
>>>> 3.3. According to the public schedule at
>>>> https://spark.apache.org/versioning-policy.html, we planned to start the
>>>> code freeze and release branch cut on March 15th, 2022. Since this date is
>>>> coming soon, I would like to draw your attention to the topic and gather
>>>> any objections that you might have.
>>>>
>>>> Below is the list of ongoing and active SPIPs:
>>>>
>>>> Spark SQL:
>>>> - [SPARK-31357] DataSourceV2: Catalog API for view metadata
>>>> - [SPARK-35801] Row-level operations in Data Source V2
>>>> - [SPARK-37166] Storage Partitioned Join
>>>>
>>>> Spark Core:
>>>> - [SPARK-20624] Add better handling for node shutdown
>>>> - [SPARK-25299] Use remote storage for persisting shuffle data
>>>>
>>>> PySpark:
>>>> - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark
>>>>
>>>> Kubernetes:
>>>> - [SPARK-36057] Support Customized Kubernetes Schedulers
>>>>
>>>> We should probably finish any remaining work for Spark 3.3,
>>>> switch to QA mode, cut a branch, and keep everything on track. I would
>>>> like to volunteer to help drive this process.
>>>>
>>>> Best regards,
>>>> Max Gekk

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.3 Release

2022-03-14 Thread Chao Sun
Hi Max,

As there is still some ongoing work for the above-listed SPIPs, can we
still merge them after the branch cut?

Thanks,
Chao

On Mon, Mar 14, 2022 at 6:12 AM Maxim Gekk
 wrote:

> Hi All,
>
> Since there are no actual blockers for Spark 3.3.0 and significant
> objections, I am going to cut branch-3.3 after 15th March at 00:00 PST.
> Please, let us know if you have any concerns about that.
>
> Best regards,
> Max Gekk
>
>
> On Thu, Mar 3, 2022 at 9:44 PM Maxim Gekk 
> wrote:
>
>> Hello All,
>>
>> I would like to bring to the table the topic of the new Spark release
>> 3.3. According to the public schedule at
>> https://spark.apache.org/versioning-policy.html, we planned to start the
>> code freeze and release branch cut on March 15th, 2022. Since this date is
>> coming soon, I would like to draw your attention to the topic and gather
>> any objections that you might have.
>>
>> Below is the list of ongoing and active SPIPs:
>>
>> Spark SQL:
>> - [SPARK-31357] DataSourceV2: Catalog API for view metadata
>> - [SPARK-35801] Row-level operations in Data Source V2
>> - [SPARK-37166] Storage Partitioned Join
>>
>> Spark Core:
>> - [SPARK-20624] Add better handling for node shutdown
>> - [SPARK-25299] Use remote storage for persisting shuffle data
>>
>> PySpark:
>> - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark
>>
>> Kubernetes:
>> - [SPARK-36057] Support Customized Kubernetes Schedulers
>>
>> We should probably finish any remaining work for Spark 3.3, switch to
>> QA mode, cut a branch, and keep everything on track. I would like to
>> volunteer to help drive this process.
>>
>> Best regards,
>> Max Gekk
>>
>


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Chao Sun
+1 (non-binding). Looking forward to this feature!
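
For readers skimming the thread, a purely illustrative sketch of the
shape such a catalog could take, based only on the operations the SPIP
names (load, create, alter, drop); every type and method name below is
an assumption, not the proposal's actual API:

```
// Illustrative only -- not the SPIP's API. Local types keep it self-contained.
case class ViewIdent(namespace: Seq[String], name: String)
case class ViewInfo(ident: ViewIdent, sql: String, comment: Option[String])

trait ViewCatalogSketch {
  def loadView(ident: ViewIdent): ViewInfo              // resolve a view's definition
  def createView(view: ViewInfo): Unit                  // register a new view
  def alterView(ident: ViewIdent, newSql: String): Unit // replace the view text
  def dropView(ident: ViewIdent): Boolean               // true if the view existed
}
```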

On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue  wrote:

> +1 for the SPIP. I think it's well designed and it has worked quite well
> at Netflix for a long time.
>
> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge  wrote:
>
>> Hi Spark community,
>>
>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>>
>> The proposal is to add a ViewCatalog interface that can be used to load,
>> create, alter, and drop views in DataSourceV2.
>>
>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>>
>
>
> --
> Ryan Blue
> Tabular
>


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Chao Sun
Thanks Huaxin for driving the release!

On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng  wrote:

> It's Great!
> Congrats and thanks, huaxin!
>
>
> -- 原始邮件 --
> *发件人:* "huaxin gao" ;
> *发送时间:* 2022年1月29日(星期六) 上午9:07
> *收件人:* "dev";"user";
> *主题:* [ANNOUNCE] Apache Spark 3.2.1 released
>
> We are happy to announce the availability of Spark 3.2.1!
>
> Spark 3.2.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend all 3.2 users to upgrade to this stable release.
>
> To download Spark 3.2.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Huaxin Gao
>


Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-24 Thread Chao Sun
+1 (non-binding)

On Mon, Jan 24, 2022 at 6:32 AM Michael Heuer  wrote:

> +1 (non-binding)
>
>michael
>
>
> On Jan 24, 2022, at 7:30 AM, Gengliang Wang  wrote:
>
> +1 (non-binding)
>
> On Mon, Jan 24, 2022 at 6:26 PM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Dongjoon.
>>
>> On Sat, Jan 22, 2022 at 7:19 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Fri, Jan 21, 2022 at 9:01 PM Sean Owen  wrote:
>>>
 +1 with same result as last time.

 On Thu, Jan 20, 2022 at 9:59 PM huaxin gao 
 wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 3.2.1. The vote is open until 8:00pm Pacific time January 25 and
> passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.1-rc2 (commit
> 4f25b3f71238a00508a356591553f2dfa89f8290):
> https://github.com/apache/spark/tree/v3.2.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1398/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>
> The list of bug fixes going into 3.2.1 can be found at the following URL:
> https://s.apache.org/yu0cy
>
> This release is using the release script of the tag v3.2.1-rc2.
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.1?
> ===
> The current list of open tickets targeted at 3.2.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>

>


Re: [VOTE] Release Spark 3.2.1 (RC1)

2022-01-12 Thread Chao Sun
+1 (non-binding). Thanks Huaxin for driving the release!

On Tue, Jan 11, 2022 at 11:56 PM Ruifeng Zheng  wrote:

> +1 (non-binding)
>
> Thanks, ruifeng zheng
>
> -- Original Message --
> *From:* "Cheng Su" ;
> *Date:* Wed, Jan 12, 2022 02:54 PM
> *To:* "Qian Sun";"huaxin gao"<
> huaxin.ga...@gmail.com>;
> *Cc:* "dev";
> *Subject:* Re: [VOTE] Release Spark 3.2.1 (RC1)
>
> +1 (non-binding). Checked commit history and ran some local tests.
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
> *From: *Qian Sun 
> *Date: *Tuesday, January 11, 2022 at 7:55 PM
> *To: *huaxin gao 
> *Cc: *dev 
> *Subject: *Re: [VOTE] Release Spark 3.2.1 (RC1)
>
> +1
>
>
>
> Looks good. All integration tests passed.
>
>
>
> Qian
>
>
>
> On Jan 11, 2022, at 2:09 AM, huaxin gao  wrote:
>
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.2.1.
>
>
> The vote is open until Jan. 13th at 12 PM PST (8 PM UTC) and passes if a
> majority
>
> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
> [ ] +1 Release this package as Apache Spark 3.2.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 3.2.1 (try project = SPARK AND
> "Target Version/s" = "3.2.1" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v3.2.1-rc1 (commit
> 2b0ee226f8dd17b278ad11139e62464433191653):
>
> https://github.com/apache/spark/tree/v3.2.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1395/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc1-docs/
>
> The list of bug fixes going into 3.2.1 can be found at the following URL:
> https://s.apache.org/7tzik
>
> This release is using the release script of the tag v3.2.1-rc1.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.1?
> ===
>
> The current list of open tickets targeted at 3.2.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
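
As a concrete illustration of the "add the staging repository to your
project's resolvers" step above, here is a minimal sbt sketch (the staging
URL is the one announced for this RC and changes with every candidate):

  // build.sbt -- testing an RC from the staging repository (sketch).
  resolvers += "Apache Spark RC staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1395/"
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1" % Provided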
>


Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-14 Thread Chao Sun
+1 (non-binding). Thanks Anton for the work!

On Sun, Nov 14, 2021 at 10:01 AM Ryan Blue  wrote:

> +1
>
> Thanks to Anton for all this great work!
>
> On Sat, Nov 13, 2021 at 8:24 AM Mich Talebzadeh 
> wrote:
>
>> +1 non-binding
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 13 Nov 2021 at 15:07, Russell Spitzer 
>> wrote:
>>
>>> +1 (never binding)
>>>
>>> On Sat, Nov 13, 2021 at 1:10 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1

 On Fri, Nov 12, 2021 at 6:58 PM huaxin gao 
 wrote:

> +1
>
> On Fri, Nov 12, 2021 at 6:44 PM Yufei Gu 
> wrote:
>
>> +1
>>
>> > On Nov 12, 2021, at 6:25 PM, L. C. Hsieh  wrote:
>> >
>> > Hi all,
>> >
>> > I’d like to start a vote for SPIP: Row-level operations in Data
>> Source V2.
>> >
>> > The proposal is to add support for executing row-level operations
>> > such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
>> > execution should be the same across data sources and the best way
>> to do
>> > that is to implement it in Spark.
>> >
>> > Right now, Spark can only parse and to some extent analyze DELETE,
>> UPDATE,
>> > MERGE commands. Data sources that support row-level changes have to
>> build
>> > custom Spark extensions to execute such statements. The goal of
>> this effort
>> > is to come up with a flexible and easy-to-use API that will work
>> across
>> > data sources.
>> >
>> > Please also refer to:
>> >
>> >   - Previous discussion in dev mailing list: [DISCUSS] SPIP:
>> > Row-level operations in Data Source V2
>> >   > >
>> >
>> >   - JIRA: SPARK-35801 <
>> https://issues.apache.org/jira/browse/SPARK-35801>
>> >   - PR for handling DELETE statements:
>> > 
>> >
>> >   - Design doc
>> > <
>> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>> >
>> >
>> > Please vote on the SPIP for the next 72 hours:
>> >
>> > [ ] +1: Accept the proposal as an official SPIP
>> > [ ] +0
>> > [ ] -1: I don’t think this is a good idea because …
>> >
>> >
>> -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Ryan Blue
> Tabular
>
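
To make the scope of the proposal concrete, these are the kinds of
statements it targets. The catalog, table, and column names below are
hypothetical, and the `updates` source is assumed to be a registered view:

  // Row-level operations the SPIP aims to execute natively on v2 tables.
  spark.sql("DELETE FROM cat.db.events WHERE event_date < '2021-01-01'")
  spark.sql("UPDATE cat.db.events SET status = 'archived' WHERE event_date < '2021-06-01'")
  spark.sql("""
    MERGE INTO cat.db.events AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
  """)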


Re: [VOTE][RESULT] SPIP: Storage Partitioned Join for Data Source V2

2021-11-02 Thread Chao Sun
Thanks all for voting on this proposal!

On Tue, Nov 2, 2021 at 9:39 AM Liang Chi Hsieh  wrote:

> Hi all,
>
> The vote passed with the following 9 +1 votes and no -1 or +0 votes:
> Liang-Chi Hsieh*
> Russell Spitzer
> Dongjoon Hyun*
> Huaxin Gao
> Ryan Blue
> DB Tsai*
> Holden Karau*
> Cheng Su
> Wenchen Fan*
>
> * = binding
>
> Thank you guys all for your feedback and votes.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Chao Sun
>>> ... "compatible partitions". Things
>>> like `days(ts)` are straightforward: the same timestamp value always
>>> results in the same partition value, in whatever v2 sources. `bucket(col,
>>> num)` is tricky, as Spark doesn't define the bucket hash function. Two v2
>>> sources may return different bucket IDs for the same value, and this breaks
>>> the phase 1 split-wise join.
>>>
>>> And two questions for further improvements:
>>> 1. Can we apply this idea to partitioned file source tables
>>> (non-bucketed) as well?
>>> 2. What if the table has many partitions? Shall we apply certain join
>>> algorithms in the phase 1 split-wise join as well? Or even launch a Spark
>>> job to do so?
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Wed, Oct 27, 2021 at 3:08 AM Chao Sun  wrote:
>>>
>>>> Thanks Cheng for the comments.
>>>>
>>>> > Is migrating the Hive table read path to data source v2 a
>>>> prerequisite of this SPIP
>>>>
>>>> Yes, this SPIP only aims at DataSourceV2, so obviously it will help if
>>>> Hive eventually moves to the V2 API. With that said, I think some of the
>>>> ideas could be useful for V1 Hive support as well. For instance, with the
>>>> newly proposed logic to compare whether the output partitionings from both
>>>> sides of a join operator are compatible, we can have HiveTableScanExec
>>>> report a partitioning other than HashPartitioning, and
>>>> EnsureRequirements could potentially recognize that and therefore avoid a
>>>> shuffle if both sides report the same compatible partitioning. In addition,
>>>> SPARK-35703, which is part of the SPIP, is also useful in that it relaxes
>>>> the constraint for V1 bucket joins so that the join keys need not
>>>> be identical to the bucket keys.
>>>>
>>>> > Would aggregate work automatically after the SPIP?
>>>>
>>>> Yes it will work as before. This case is already supported by
>>>> DataSourcePartitioning in V2 (see SPARK-22389).
>>>>
>>>> > Any major use cases in mind except Hive bucketed tables?
>>>>
>>>> Our first use case is Apache Iceberg. In addition to that, we also want
>>>> to add support for Spark's built-in file data sources.
>>>>
>>>> Thanks,
>>>> Chao
>>>>
>>>> On Tue, Oct 26, 2021 at 10:34 AM Cheng Su  wrote:
>>>>
>>>>> +1 for this. This is an exciting step toward efficiently reading
>>>>> bucketed tables from other systems (Hive, Trino & Presto)!
>>>>>
>>>>>
>>>>>
>>>>> Still looking at the details, but I have some early questions:
>>>>>
>>>>>
>>>>>
>>>>>1. Is migrating the Hive table read path to data source v2 a
>>>>>prerequisite of this SPIP?
>>>>>
>>>>>
>>>>>
>>>>> The Hive table read path is currently a mix of data source v1 (for the
>>>>> Parquet & ORC file formats only) and the legacy Hive code path
>>>>> (HiveTableScanExec). In the SPIP, I see we only make changes for data
>>>>> source v2, so I am wondering how this would work with the existing Hive
>>>>> table read path. In addition, just FYI, support for writing Hive bucketed
>>>>> tables was recently merged into master (SPARK-19256
>>>>> <https://issues.apache.org/jira/browse/SPARK-19256> has details).
>>>>>
>>>>>
>>>>>
>>>>>1. Would aggregate work automatically after the SPIP?
>>>>>
>>>>>
>>>>>
>>>>> Another major benefit of having bucketed tables is avoiding the shuffle
>>>>> before an aggregate. I just want to bring to our attention that it would
>>>>> be great to consider aggregates as well in this proposal.
>>>>>
>>>>>
>>>>>
>>>>>1. Any major use cases in mind except Hive bucketed tables?
>>>>>
>>>>>
>>>>>
>>>>> Just curious whether there are any other use cases we are targeting as
>>>>> part of the SPIP.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Cheng Su
>>>>>
>>>>>
>>>>>
>>>>>
>
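
As a concrete illustration of the "compatible partitions" question above:
two v2 tables bucketed the same way can, under the proposal, be joined
without shuffling either side, provided both sources agree on the bucket
transform and its hash function. The catalog and table names below are
hypothetical, using Iceberg-style DDL:

  // Two hypothetical v2 tables partitioned by the same bucket transform.
  spark.sql("""CREATE TABLE cat.db.orders (order_id BIGINT, customer_id BIGINT)
               USING iceberg PARTITIONED BY (bucket(16, customer_id))""")
  spark.sql("""CREATE TABLE cat.db.customers (customer_id BIGINT, name STRING)
               USING iceberg PARTITIONED BY (bucket(16, customer_id))""")

  // If both scans report the same partitioning, the join can be planned
  // without an Exchange on either side.
  val joined = spark.table("cat.db.orders")
    .join(spark.table("cat.db.customers"), "customer_id")
  joined.explain()  // check the plan for the absence of "Exchange hashpartitioning(...)"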

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Chao Sun
Thanks Cheng for the comments.

> Is migrating the Hive table read path to data source v2 a prerequisite
of this SPIP

Yes, this SPIP only aims at DataSourceV2, so obviously it will help if Hive
eventually moves to the V2 API. With that said, I think some of the ideas
could be useful for V1 Hive support as well. For instance, with the newly
proposed logic to compare whether the output partitionings from both sides of
a join operator are compatible, we can have HiveTableScanExec report a
partitioning other than HashPartitioning, and EnsureRequirements
could potentially recognize that and therefore avoid a shuffle if both sides
report the same compatible partitioning. In addition, SPARK-35703, which is
part of the SPIP, is also useful in that it relaxes the constraint for V1
bucket joins so that the join keys need not be identical to the
bucket keys.

> Would aggregate work automatically after the SPIP?

Yes it will work as before. This case is already supported by
DataSourcePartitioning in V2 (see SPARK-22389).

> Any major use cases in mind except Hive bucketed tables?

Our first use case is Apache Iceberg. In addition to that, we also want to
add support for Spark's built-in file data sources.

Thanks,
Chao

On Tue, Oct 26, 2021 at 10:34 AM Cheng Su  wrote:

> +1 for this. This is an exciting step toward efficiently reading bucketed
> tables from other systems (Hive, Trino & Presto)!
>
>
>
> Still looking at the details, but I have some early questions:
>
>
>
>1. Is migrating the Hive table read path to data source v2 a
>prerequisite of this SPIP?
>
>
>
> The Hive table read path is currently a mix of data source v1 (for the
> Parquet & ORC file formats only) and the legacy Hive code path
> (HiveTableScanExec). In the SPIP, I see we only make changes for data
> source v2, so I am wondering how this would work with the existing Hive
> table read path. In addition, just FYI, support for writing Hive bucketed
> tables was recently merged into master (SPARK-19256
> <https://issues.apache.org/jira/browse/SPARK-19256> has details).
>
>
>
>1. Would aggregate work automatically after the SPIP?
>
>
>
> Another major benefit of having bucketed tables is avoiding the shuffle
> before an aggregate. I just want to bring to our attention that it would be
> great to consider aggregates as well in this proposal.
>
>
>
>1. Any major use cases in mind except Hive bucketed tables?
>
>
>
> Just curious whether there are any other use cases we are targeting as part
> of the SPIP.
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
>
>
>
>
> *From: *Ryan Blue 
> *Date: *Tuesday, October 26, 2021 at 9:39 AM
> *To: *John Zhuge 
> *Cc: *Chao Sun , Wenchen Fan ,
> Cheng Su , DB Tsai , Dongjoon Hyun <
> dongjoon.h...@gmail.com>, Hyukjin Kwon , Wenchen Fan
> , angers zhu , dev <
> dev@spark.apache.org>, huaxin gao 
> *Subject: *Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2
>
> Instead of commenting on the doc, could we keep discussion here on the dev
> list please? That way more people can follow it and there is more room for
> discussion. Comment threads have a very small area and easily become hard
> to follow.
>
>
>
> Ryan
>
>
>
> On Tue, Oct 26, 2021 at 9:32 AM John Zhuge  wrote:
>
> +1  Nicely done!
>
>
>
> On Tue, Oct 26, 2021 at 8:08 AM Chao Sun  wrote:
>
> Oops, sorry. I just fixed the permission setting.
>
>
>
> Thanks everyone for the positive support!
>
>
>
> On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan  wrote:
>
> +1 to this SPIP and nice writeup of the design doc!
>
>
>
> Can we open comment permission in the doc so that we can discuss details
> there?
>
>
>
> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon  wrote:
>
> Seems to make sense to me.
>
> Would be great to have some feedback from people such as @Wenchen Fan
>  @Cheng Su  @angers zhu
> .
>
>
>
>
>
> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun 
> wrote:
>
> +1 for this SPIP.
>
>
>
> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao  wrote:
>
> +1. Thanks for lifting the current restrictions on bucket join and making
> this more generalized.
>
>
>
> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
>
> +1 from me as well. Thanks Chao for doing so much to get it to this point!
>
>
>
> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>
> +1 on this SPIP.
>
> This is a more generalized version of bucketed tables and bucketed
> joins, which can eliminate very expensive data shuffles in joins, and
> many users in the Apache Spark community have wanted this feature for
> a long time!
>
> Thank you, Ryan and Chao, 
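
For reference, the existing DataSourcePartitioning path mentioned above
(SPARK-22389) already lets a scan report how its data is clustered. A
minimal sketch against the Spark 3.2-era DataSource V2 API follows; the
bucket column "id" and the bucket count are hypothetical:

  import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
  import org.apache.spark.sql.connector.read.partitioning.{ClusteredDistribution, Distribution, Partitioning}
  import org.apache.spark.sql.types.{LongType, StructType}

  // A scan that tells Spark its rows are already clustered by "id", so
  // EnsureRequirements can skip the exchange for matching distributions.
  class BucketedScan(numBuckets: Int) extends Scan with SupportsReportPartitioning {
    override def readSchema(): StructType = new StructType().add("id", LongType)

    override def outputPartitioning(): Partitioning = new Partitioning {
      override def numPartitions(): Int = numBuckets  // one partition per bucket
      override def satisfy(distribution: Distribution): Boolean = distribution match {
        case c: ClusteredDistribution => c.clusteredColumns.sameElements(Array("id"))
        case _ => false
      }
    }
  }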

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Chao Sun
Oops, sorry. I just fixed the permission setting.

Thanks everyone for the positive support!

On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan  wrote:

> +1 to this SPIP and nice writeup of the design doc!
>
> Can we open comment permission in the doc so that we can discuss details
> there?
>
> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon  wrote:
>
>> Seems to make sense to me.
>>
>> Would be great to have some feedback from people such as @Wenchen Fan
>>  @Cheng Su  @angers zhu
>> .
>>
>>
>> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun 
>> wrote:
>>
>>> +1 for this SPIP.
>>>
>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao 
>>> wrote:
>>>
>>>> +1. Thanks for lifting the current restrictions on bucket join and
>>>> making this more generalized.
>>>>
>>>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
>>>>
>>>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>>>> point!
>>>>>
>>>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>>>>>
>>>>>> +1 on this SPIP.
>>>>>>
>>>>>> This is a more generalized version of bucketed tables and bucketed
>>>>>> joins, which can eliminate very expensive data shuffles in joins, and
>>>>>> many users in the Apache Spark community have wanted this feature for
>>>>>> a long time!
>>>>>>
>>>>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>>>>> it as a new feature in Spark 3.3
>>>>>>
>>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>>
>>>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>>>> storage partitioned join which covers bucket join support for 
>>>>>> DataSourceV2
>>>>>> but is more general. The goal is to let Spark leverage distribution
>>>>>> properties reported by data sources and eliminate shuffle whenever 
>>>>>> possible.
>>>>>> >
>>>>>> > Design doc:
>>>>>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>>>> (includes a POC link at the end)
>>>>>> >
>>>>>> > We'd like to start a discussion on the doc and any feedback is
>>>>>> welcome!
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Chao
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>


[DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-22 Thread Chao Sun
Hi,

Ryan and I drafted a design doc to support a new type of join: storage
partitioned join, which covers bucket join support for DataSourceV2 but is
more general. The goal is to let Spark leverage distribution properties
reported by data sources and eliminate shuffle whenever possible.

Design doc:
https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
(includes a POC link at the end)

We'd like to start a discussion on the doc and any feedback is welcome!

Thanks,
Chao


Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-08 Thread Chao Sun
+1 (non-binding)

On Fri, Oct 8, 2021 at 1:01 AM Maxim Gekk  wrote:

> +1 (non-binding)
>
> On Fri, Oct 8, 2021 at 10:44 AM Mich Talebzadeh 
> wrote:
>
>> +1 (non-binding)
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 8 Oct 2021 at 08:42, Peter Toth  wrote:
>>
>>> +1 (non-binding).
>>>
>>> Peter
>>>
>>>
>>> On Fri, Oct 8, 2021 at 9:16 AM Cheng Su  wrote:
>>>
 +1 (non-binding).



 Thanks,

 Cheng Su



 *From: *Reynold Xin 
 *Date: *Thursday, October 7, 2021 at 11:57 PM
 *To: *Yuming Wang 
 *Cc: *Dongjoon Hyun , 郑瑞峰 <
 ruife...@foxmail.com>, Sean Owen , Gengliang Wang <
 ltn...@gmail.com>, dev 
 *Subject: *Re: [VOTE] Release Spark 3.2.0 (RC7)

 +1






 On Thu, Oct 07, 2021 at 11:54 PM, Yuming Wang  wrote:

 +1 (non-binding).



 On Fri, Oct 8, 2021 at 1:02 PM Dongjoon Hyun 
 wrote:

 +1 for Apache Spark 3.2.0 RC7.



 It looks good to me. I tested with EKS 1.21 additionally.



 Cheers,

 Dongjoon.





 On Thu, Oct 7, 2021 at 7:46 PM 郑瑞峰  wrote:

 +1 (non-binding)





 -- Original Message --

 *From:* "Sean Owen" ;

 *Sent:* Thursday, October 7, 2021, 10:23 PM

 *To:* "Gengliang Wang";

 *Cc:* "dev";

 *Subject:* Re: [VOTE] Release Spark 3.2.0 (RC7)



 +1 again. Looks good in Scala 2.12, 2.13, and in Java 11.

 I note that the memory requirements for Java 11 tests seem to need to be
 increased, but we're handling that separately. It doesn't really affect
 users.



 On Wed, Oct 6, 2021 at 11:49 AM Gengliang Wang 
 wrote:

 Please vote on releasing the following candidate as
 Apache Spark version 3.2.0.



 The vote is open until 11:59pm Pacific time October 11 and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.



 [ ] +1 Release this package as Apache Spark 3.2.0

 [ ] -1 Do not release this package because ...



 To learn more about Apache Spark, please see http://spark.apache.org/



 The tag to be voted on is v3.2.0-rc7 (commit
 5d45a415f3a29898d92380380cfd82bfc7f579ea):

 https://github.com/apache/spark/tree/v3.2.0-rc7



 The release files, including signatures, digests, etc. can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-bin/



 Signatures used for Spark RCs can be found in this file:

 https://dist.apache.org/repos/dist/dev/spark/KEYS



 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1394



 The documentation corresponding to this release can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-docs/



 The list of bug fixes going into 3.2.0 can be found at the following
 URL:

 https://issues.apache.org/jira/projects/SPARK/versions/12349407



 This release is using the release script of the tag v3.2.0-rc7.





 FAQ



 =

 How can I help test this release?

 =

 If you are a Spark user, you can help us test this release by taking

 an existing Spark workload and running on this release candidate, then

 reporting any regressions.



 If you're working in PySpark you can set up a virtual env and install

 the current RC and see if anything important breaks; in Java/Scala

 you can add the staging repository to your project's resolvers and test

 with the RC (make sure to clean up the artifact cache before/after so

 you don't end up building with an out of date RC going forward).



 ===

 What should happen to JIRA tickets still targeting 3.2.0?

 ===

 The current list of open tickets targeted at 3.2.0 can be found at:

 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.2.0



 Committers should look at those and triage. Extremely important bug


Re: [VOTE] Release Spark 3.2.0 (RC5)

2021-09-28 Thread Chao Sun
Looks like it's related to https://github.com/apache/spark/pull/34085. I
filed https://issues.apache.org/jira/browse/SPARK-36873 to fix it.

On Mon, Sep 27, 2021 at 6:00 PM Chao Sun  wrote:

> Thanks. Trying it on my local machine now but it will probably take a
> while. I think https://github.com/apache/spark/pull/34085 is more likely
> to be relevant, but I don't yet have a clue how it could cause the issue.
> Spark CI also passed for these.
>
> On Mon, Sep 27, 2021 at 5:29 PM Sean Owen  wrote:
>
>> I'm building and testing with
>>
>> mvn -Phadoop-3.2 -Phive -Phive-2.3 -Phive-thriftserver -Pkinesis-asl
>> -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl
>> -Psparkr -Pyarn ...
>>
>> I did a '-DskipTests clean install' and then 'test'; the problem arises
>> only in 'test'.
>>
>> On Mon, Sep 27, 2021 at 6:58 PM Chao Sun  wrote:
>>
>>> Hmm it may be related to the commit. Sean: how do I reproduce this?
>>>
>>> On Mon, Sep 27, 2021 at 4:56 PM Sean Owen  wrote:
>>>
> Another "is anyone else seeing this?" in compiling common/network-yarn:
>>>>
>>>> [ERROR] [Error]
>>>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
>>>> package com.google.common.annotations does not exist
>>>> [ERROR] [Error]
>>>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
>>>> package com.google.common.base does not exist
>>>> [ERROR] [Error]
>>>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
>>>> package com.google.common.collect does not exist
>>>> ...
>>>>
>>>> I didn't see this in RC4, so, I wonder if a recent change affected
>>>> something, but there are barely any changes since RC4. Anything touching
>>>> YARN or Guava maybe, like:
>>>>
>>>> https://github.com/apache/spark/commit/540e45c3cc7c64e37aa5c1673c03a0f2d7462878
>>>> ?
>>>>
>>>>
>>>>
>>>> On Mon, Sep 27, 2021 at 7:56 AM Gengliang Wang 
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as
>>>>> Apache Spark version 3.2.0.
>>>>>
>>>>> The vote is open until 11:59pm Pacific time September 29 and passes if
>>>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v3.2.0-rc5 (commit
>>>>> 49aea14c5afd93ae1b9d19b661cc273a557853f5):
>>>>> https://github.com/apache/spark/tree/v3.2.0-rc5
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1392
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-docs/
>>>>>
>>>>> The list of bug fixes going into 3.2.0 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>>>>
>>>>> This release is using the release script of the tag v3.2.0-rc5.
>>>>>
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC

Re: [VOTE] Release Spark 3.2.0 (RC5)

2021-09-27 Thread Chao Sun
Thanks. Trying it on my local machine now but it will probably take a
while. I think https://github.com/apache/spark/pull/34085 is more likely to
be relevant, but I don't yet have a clue how it could cause the issue. Spark
CI also passed for these.

On Mon, Sep 27, 2021 at 5:29 PM Sean Owen  wrote:

> I'm building and testing with
>
> mvn -Phadoop-3.2 -Phive -Phive-2.3 -Phive-thriftserver -Pkinesis-asl
> -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl
> -Psparkr -Pyarn ...
>
> I did a '-DskipTests clean install' and then 'test'; the problem arises
> only in 'test'.
>
> On Mon, Sep 27, 2021 at 6:58 PM Chao Sun  wrote:
>
>> Hmm it may be related to the commit. Sean: how do I reproduce this?
>>
>> On Mon, Sep 27, 2021 at 4:56 PM Sean Owen  wrote:
>>
>>> Another "is anyone else seeing this?" in compiling common/network-yarn:
>>>
>>> [ERROR] [Error]
>>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
>>> package com.google.common.annotations does not exist
>>> [ERROR] [Error]
>>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
>>> package com.google.common.base does not exist
>>> [ERROR] [Error]
>>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
>>> package com.google.common.collect does not exist
>>> ...
>>>
>>> I didn't see this in RC4, so, I wonder if a recent change affected
>>> something, but there are barely any changes since RC4. Anything touching
>>> YARN or Guava maybe, like:
>>>
>>> https://github.com/apache/spark/commit/540e45c3cc7c64e37aa5c1673c03a0f2d7462878
>>> ?
>>>
>>>
>>>
>>> On Mon, Sep 27, 2021 at 7:56 AM Gengliang Wang  wrote:
>>>
>>>> Please vote on releasing the following candidate as
>>>> Apache Spark version 3.2.0.
>>>>
>>>> The vote is open until 11:59pm Pacific time September 29 and passes if
>>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v3.2.0-rc5 (commit
>>>> 49aea14c5afd93ae1b9d19b661cc273a557853f5):
>>>> https://github.com/apache/spark/tree/v3.2.0-rc5
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1392
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-docs/
>>>>
>>>> The list of bug fixes going into 3.2.0 can be found at the following
>>>> URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>>>
>>>> This release is using the release script of the tag v3.2.0-rc5.
>>>>
>>>>
>>>> FAQ
>>>>
>>>> =
>>>> How can I help test this release?
>>>> =
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC and see if anything important breaks; in Java/Scala
>>>> you can add the staging repository to your project's resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you don't end up building with an out of date RC going forward).
>>>>
>>>> ===
>>>> What should happen to JIRA tickets still targeting 3.2.0?
>>>> ===
>>>> The current list of open tickets targeted at 3.2.0 can be found at:
>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.2.0
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>> be worked on immediately. Everything else please retarget to an
>>>> appropriate release.
>>>>
>>>> ==
>>>> But my bug isn't fixed?
>>>> ==
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from the previous
>>>> release. That being said, if there is something which is a regression
>>>> that has not been correctly targeted please ping me or a committer to
>>>> help target the issue.
>>>>
>>>


Re: [VOTE] Release Spark 3.2.0 (RC5)

2021-09-27 Thread Chao Sun
Hmm it may be related to the commit. Sean: how do I reproduce this?

On Mon, Sep 27, 2021 at 4:56 PM Sean Owen  wrote:

> Another "is anyone else seeing this?" in compiling common/network-yarn:
>
> [ERROR] [Error]
> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
> package com.google.common.annotations does not exist
> [ERROR] [Error]
> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
> package com.google.common.base does not exist
> [ERROR] [Error]
> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
> package com.google.common.collect does not exist
> ...
>
> I didn't see this in RC4, so, I wonder if a recent change affected
> something, but there are barely any changes since RC4. Anything touching
> YARN or Guava maybe, like:
>
> https://github.com/apache/spark/commit/540e45c3cc7c64e37aa5c1673c03a0f2d7462878
> ?
>
>
>
> On Mon, Sep 27, 2021 at 7:56 AM Gengliang Wang  wrote:
>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.2.0.
>>
>> The vote is open until 11:59pm Pacific time September 29 and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.2.0-rc5 (commit
>> 49aea14c5afd93ae1b9d19b661cc273a557853f5):
>> https://github.com/apache/spark/tree/v3.2.0-rc5
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1392
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-docs/
>>
>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>
>> This release is using the release script of the tag v3.2.0-rc5.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.0?
>> ===
>> The current list of open tickets targeted at 3.2.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.2.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>


Re: [VOTE] Release Spark 3.2.0 (RC3)

2021-09-21 Thread Chao Sun
Hi Venkata, I'm not aware of the FileSuite test failures. In fact I just
tried it locally on the master branch and the tests are all passing. Could
you provide more details?

The reason we want to disable the LZ4 test is that it requires the
native LZ4 library when running with Hadoop 2.x, which the Spark CI doesn't
have.

On Tue, Sep 21, 2021 at 3:46 PM Venkatakrishnan Sowrirajan 
wrote:

> Hi Chao,
>
> But there are tests failing in core as well, e.g.
> org.apache.spark.FileSuite. These tests are passing in 3.1, so why do
> you think we should disable these tests for Hadoop versions < 3.x?
>
> Regards
> Venkata krishnan
>
>
> On Tue, Sep 21, 2021 at 3:33 PM Chao Sun  wrote:
>
>> I just created SPARK-36820 for the above LZ4 test issue. Will post a PR
>> there soon.
>>
>> On Tue, Sep 21, 2021 at 2:05 PM Chao Sun  wrote:
>>
>>> Mridul, is the LZ4 failure about Parquet? I think Parquet currently uses
>>> the Hadoop compression codec, while Hadoop 2.7 still depends on the native
>>> lib for LZ4. Maybe we should run the test only under the Hadoop 3.2 profile.
>>>
>>> On Tue, Sep 21, 2021 at 10:08 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>>
>>>> Signatures, digests, etc check out fine.
>>>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes, this
>>>> worked fine.
>>>>
>>>> I found that including "-Phadoop-2.7" failed on lz4 tests ("native lz4
>>>> library not available").
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> On Tue, Sep 21, 2021 at 10:18 AM Gengliang Wang 
>>>> wrote:
>>>>
>>>>> To Stephen: Thanks for pointing that out. I agree with that.
>>>>> To Sean: I made a PR
>>>>> <https://github.com/apache/spark/pull/34059> to
>>>>> remove the test dependency so that we can start RC4 ASAP.
>>>>>
>>>>> Gengliang
>>>>>
>>>>> On Tue, Sep 21, 2021 at 8:14 PM Sean Owen  wrote:
>>>>>
>>>>>> Hm yeah I tend to agree. See
>>>>>> https://github.com/apache/spark/pull/33912
>>>>>> This _is_ a test-only dependency which makes it less of an issue.
>>>>>> I'm guessing it's not in Maven as it's a small one-off utility; we
>>>>>> _could_ just inline the ~100 lines of code in test code instead?
>>>>>>
>>>>>> On Tue, Sep 21, 2021 at 12:33 AM Stephen Coy
>>>>>>  wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I was going to -1 this because of the
>>>>>>> com.github.rdblue:brotli-codec:0.1.1 dependency, which is not
>>>>>>> available on Maven Central, and therefore is not available from our
>>>>>>> repository manager (Nexus).
>>>>>>>
>>>>>>> Historically, most places I have worked have avoided other public
>>>>>>> Maven repositories because they are not well curated, i.e. artifacts
>>>>>>> with the same GAV have been known to change over time, which never
>>>>>>> happens with Maven Central.
>>>>>>>
>>>>>>> I know that I can address this by changing my settings.xml file.
>>>>>>>
>>>>>>> Anyway, I can see this biting other people so I thought that I would
>>>>>>> mention it.
>>>>>>>
>>>>>>> Steve C
>>>>>>>
>>>>>>> On 19 Sep 2021, at 1:18 pm, Gengliang Wang  wrote:
>>>>>>>
>>>>>>> Please vote on releasing the following candidate as
>>>>>>> Apache Spark version 3.2.0.
>>>>>>>
>>>>>>> The vote is open until 11:59pm Pacific time September 24 and passes
>>>>>>> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>>>> [ ] -1 Do not release this package because ...
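
A minimal sketch of the kind of guard SPARK-36820 calls for: skip the LZ4
Parquet round-trip unless the bundled Hadoop is 3.x, since Hadoop 2.x
delegates LZ4 to a native library the CI hosts lack, while the Hadoop 3.3.x
line ships a Java implementation. This assumes a Spark test suite context
(withTempPath and checkAnswer are Spark's own test helpers), and the exact
guard used in the actual fix may differ:

  import org.apache.hadoop.util.VersionInfo

  test("LZ4 Parquet round-trip (Hadoop 3.x only)") {
    // Skip on Hadoop 2.x, where LZ4 requires the native hadoop library.
    assume(VersionInfo.getVersion.startsWith("3."))
    withTempPath { dir =>
      spark.range(10).write.option("compression", "lz4").parquet(dir.getCanonicalPath)
      checkAnswer(spark.read.parquet(dir.getCanonicalPath), spark.range(10).toDF())
    }
  }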

Re: [VOTE] Release Spark 3.2.0 (RC3)

2021-09-21 Thread Chao Sun
I just created SPARK-36820 for the above LZ4 test issue. Will post a PR
there soon.

On Tue, Sep 21, 2021 at 2:05 PM Chao Sun  wrote:

> Mridul, is the LZ4 failure about Parquet? I think Parquet currently uses
> the Hadoop compression codec, while Hadoop 2.7 still depends on the native
> lib for LZ4. Maybe we should run the test only under the Hadoop 3.2 profile.
>
> On Tue, Sep 21, 2021 at 10:08 AM Mridul Muralidharan 
> wrote:
>
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes, this
>> worked fine.
>>
>> I found that including "-Phadoop-2.7" failed on lz4 tests ("native lz4
>> library not available").
>>
>> Regards,
>> Mridul
>>
>> On Tue, Sep 21, 2021 at 10:18 AM Gengliang Wang  wrote:
>>
>>> To Stephen: Thanks for pointing that out. I agree with that.
>>> To Sean: I made a PR <https://github.com/apache/spark/pull/34059> to
>>> remove the test dependency so that we can start RC4 ASAP.
>>>
>>> Gengliang
>>>
>>> On Tue, Sep 21, 2021 at 8:14 PM Sean Owen  wrote:
>>>
>>>> Hm yeah I tend to agree. See https://github.com/apache/spark/pull/33912
>>>> This _is_ a test-only dependency which makes it less of an issue.
>>>> I'm guessing it's not in Maven as it's a small one-off utility; we
>>>> _could_ just inline the ~100 lines of code in test code instead?
>>>>
>>>> On Tue, Sep 21, 2021 at 12:33 AM Stephen Coy
>>>>  wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I was going to -1 this because of the
>>>>> com.github.rdblue:brotli-codec:0.1.1 dependency, which is not available on
>>>>> Maven Central, and therefore is not available from our repository manager
>>>>> (Nexus).
>>>>>
>>>>> Historically, most places I have worked have avoided other public
>>>>> Maven repositories because they are not well curated, i.e. artifacts with
>>>>> the same GAV have been known to change over time, which never happens with
>>>>> Maven Central.
>>>>>
>>>>> I know that I can address this by changing my settings.xml file.
>>>>>
>>>>> Anyway, I can see this biting other people so I thought that I would
>>>>> mention it.
>>>>>
>>>>> Steve C
>>>>>
>>>>> On 19 Sep 2021, at 1:18 pm, Gengliang Wang  wrote:
>>>>>
>>>>> Please vote on releasing the following candidate as
>>>>> Apache Spark version 3.2.0.
>>>>>
>>>>> The vote is open until 11:59pm Pacific time September 24 and passes if
>>>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v3.2.0-rc3 (commit
>>>>> 96044e97353a079d3a7233ed3795ca82f3d9a101):
>>>>> https://github.com/apache/spark/tree/v3.2.0-rc3
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc3-bin/

Re: [VOTE] Release Spark 3.2.0 (RC3)

2021-09-21 Thread Chao Sun
Mridul, is the LZ4 failure about Parquet? I think Parquet currently uses
the Hadoop compression codec, while Hadoop 2.7 still depends on the native
lib for LZ4. Maybe we should run the test only under the Hadoop 3.2 profile.

On Tue, Sep 21, 2021 at 10:08 AM Mridul Muralidharan 
wrote:

>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes, this
> worked fine.
>
> I found that including "-Phadoop-2.7" failed on lz4 tests ("native lz4
> library not available").
>
> Regards,
> Mridul
>
> On Tue, Sep 21, 2021 at 10:18 AM Gengliang Wang  wrote:
>
>> To Stephen: Thanks for pointing that out. I agree with that.
>> To Sean: I made a PR  to
>> remove the test dependency so that we can start RC4 ASAP.
>>
>> Gengliang
>>
>> On Tue, Sep 21, 2021 at 8:14 PM Sean Owen  wrote:
>>
>>> Hm yeah I tend to agree. See https://github.com/apache/spark/pull/33912
>>> This _is_ a test-only dependency which makes it less of an issue.
>>> I'm guessing it's not in Maven as it's a small one-off utility; we
>>> _could_ just inline the ~100 lines of code in test code instead?
>>>
>>> On Tue, Sep 21, 2021 at 12:33 AM Stephen Coy
>>>  wrote:
>>>
 Hi there,

 I was going to -1 this because of the
 com.github.rdblue:brotli-codec:0.1.1 dependency, which is not available on
 Maven Central, and therefore is not available from our repository manager
 (Nexus).

 Historically, most places I have worked have avoided other public Maven
 repositories because they are not well curated, i.e. artifacts with the same
 GAV have been known to change over time, which never happens with Maven
 Central.

 I know that I can address this by changing my settings.xml file.

 Anyway, I can see this biting other people so I thought that I would
 mention it.

 Steve C

 On 19 Sep 2021, at 1:18 pm, Gengliang Wang  wrote:

 Please vote on releasing the following candidate as
 Apache Spark version 3.2.0.

 The vote is open until 11:59pm Pacific time September 24 and passes if
 a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.2.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/
 

 The tag to be voted on is v3.2.0-rc3 (commit
 96044e97353a079d3a7233ed3795ca82f3d9a101):
 https://github.com/apache/spark/tree/v3.2.0-rc3
 

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc3-bin/
 

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS
 

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1390
 

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread Chao Sun
Hi Xiao, I'm still checking with the Parquet community on this. Since the
fix is already +1'd, I'm hoping this won't take long. The delta in the
parquet-1.12.x branch is also small, with just 2 commits so far.

Chao

On Tue, Aug 31, 2021 at 12:03 PM Xiao Li  wrote:

> Hi, Chao,
>
> How long will it take? Normally, in the RC stage, we always revert the
> upgrade made in the current release. We did the Parquet upgrade multiple
> times in previous releases to avoid a major delay in our Spark
> release.
>
> Thanks,
>
> Xiao
>
>
> On Tue, Aug 31, 2021 at 11:03 AM Chao Sun  wrote:
>
>> The Apache Parquet community found an issue [1] in 1.12.0 which could
>> cause an incorrect file offset to be written and subsequent reads of the
>> same file to fail. A fix has been proposed in the same JIRA and we may have
>> to wait until a new release is available so that we can upgrade Spark with
>> the hot fix.
>>
>> [1]: https://issues.apache.org/jira/browse/PARQUET-2078
>>
>> On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:
>>
>>> Maybe, I'm just confused why it's needed at all. Other profiles that add
>>> a dependency seem OK, but something's different here.
>>>
>>> One thing we can/should change is to simply remove the
>>> <dependencyManagement> block in the profile. It should always be a direct
>>> dep in Scala 2.13 (which lets us take out the profiles in submodules, which
>>> just repeat that)
>>> We can also update the version, by the by.
>>>
>>> I tried this and the resulting POM still doesn't look like what I expect
>>> though.
>>>
>>> (The binary release is OK, FWIW - it gets pulled in as a JAR as expected)
>>>
>>> On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
>>> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
>>>> help you out here.
>>>>
>>>> Cheers,
>>>>
>>>> Steve C
>>>>
>>>> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>>>>
>>>> OK right, you would have seen a different error otherwise.
>>>>
>>>> Yes profiles are only a compile-time thing, but they should affect the
>>>> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
>>>> scala-parallel-collections as a dependency in the POM as expected (not in a
>>>> profile). However I see what you see in the .pom in the release repo, and
>>>> in my local repo after building - it's just sitting there as a profile as
>>>> if it weren't activated or something.
>>>>
>>>> I'm confused then, that shouldn't be what happens. I'd say maybe there
>>>> is a problem with the release script, but it seems to affect a simple local
>>>> build. Anyone else more expert in this see the problem, while I try to
>>>> debug more?
>>>> The binary distro may actually be fine, I'll check; it may even not
>>>> matter much for users who generally just treat Spark as a compile-time-only
>>>> dependency either. But I can see it would break exactly your case,
>>>> something like a self-contained test job.
>>>>
>>>> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
>>>> wrote:
>>>>
>>>>> I did indeed.
>>>>>
>>>>> The generated spark-core_2.13-3.2.0.pom that is created alongside the
>>>>> jar file in the local repo contains:
>>>>>
>>>>> <profile>
>>>>>   <id>scala-2.13</id>
>>>>>   <dependencies>
>>>>>     <dependency>
>>>>>       <groupId>org.scala-lang.modules</groupId>
>>>>>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>>>>     </dependency>
>>>>>   </dependencies>
>>>>> </profile>
>>>>>
>>>>> which means this dependency will be missing for unit tests that create
>>>>> SparkSessions from library code only, a technique inspired by Spark’s own
>>>>> unit tests.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Steve C
>>>>>
>>>>> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>>>>>
>>>>> Did you run ./dev/change-scala-version.sh 2.13 ? that's required first
>>>>> to update POMs. It works fine for me.
>>>>>
>>>>> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
>>>>> s...@infomedia.com.au.invalid> wrote:
>>>>>
>>>>>> Hi all

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread Chao Sun
The Apache Parquet community found an issue [1] in 1.12.0 which could cause
an incorrect file offset to be written and subsequent reads of the same
file to fail. A fix has been proposed in the same JIRA and we may have to
wait until a new release is available so that we can upgrade Spark with the
hot fix.

[1]: https://issues.apache.org/jira/browse/PARQUET-2078

On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:

> Maybe, I'm just confused why it's needed at all. Other profiles that add a
> dependency seem OK, but something's different here.
>
> One thing we can/should change is to simply remove the
> <dependencyManagement> block in the profile. It should always be a direct
> dep in Scala 2.13 (which lets us take out the profiles in submodules, which
> just repeat that)
> We can also update the version, by the by.
>
> I tried this and the resulting POM still doesn't look like what I expect
> though.
>
> (The binary release is OK, FWIW - it gets pulled in as a JAR as expected)
>
> On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
> wrote:
>
>> Hi Sean,
>>
>> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
>> help you out here.
>>
>> Cheers,
>>
>> Steve C
>>
>> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>>
>> OK right, you would have seen a different error otherwise.
>>
>> Yes profiles are only a compile-time thing, but they should affect the
>> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
>> scala-parallel-collections as a dependency in the POM as expected (not in a
>> profile). However I see what you see in the .pom in the release repo, and
>> in my local repo after building - it's just sitting there as a profile as
>> if it weren't activated or something.
>>
>> I'm confused then, that shouldn't be what happens. I'd say maybe there is
>> a problem with the release script, but it seems to affect a simple local
>> build. Anyone else more expert in this see the problem, while I try to
>> debug more?
>> The binary distro may actually be fine, I'll check; it may even not
>> matter much for users who generally just treat Spark as a compile-time-only
>> dependency either. But I can see it would break exactly your case,
>> something like a self-contained test job.
>>
>> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
>> wrote:
>>
>>> I did indeed.
>>>
>>> The generated spark-core_2.13-3.2.0.pom that is created alongside the
>>> jar file in the local repo contains:
>>>
>>> <profile>
>>>   <id>scala-2.13</id>
>>>   <dependencies>
>>>     <dependency>
>>>       <groupId>org.scala-lang.modules</groupId>
>>>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>>     </dependency>
>>>   </dependencies>
>>> </profile>
>>>
>>> which means this dependency will be missing for unit tests that create
>>> SparkSessions from library code only, a technique inspired by Spark’s own
>>> unit tests.
>>>
>>> Cheers,
>>>
>>> Steve C
>>>
>>> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>>>
>>> Did you run ./dev/change-scala-version.sh 2.13 ? that's required first
>>> to update POMs. It works fine for me.
>>>
>>> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
>>> s...@infomedia.com.au.invalid> wrote:
>>>
 Hi all,

 Being adventurous I have built the RC1 code with:

 -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver
 -Phive-2.3 -Pscala-2.13 -Dhadoop.version=3.2.2


 And then attempted to build my Java-based Spark application.

 However, I found a number of our unit tests were failing with:

 java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport

 at
 org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
 at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
 at
 org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
 at
 org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
 …


 I tracked this down to a missing dependency:

 <dependency>
   <groupId>org.scala-lang.modules</groupId>
   <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
 </dependency>


 which unfortunately appears only in a profile in the POM files
 associated with the various Spark dependencies.

 As far as I know it is not possible to activate profiles in
 dependencies in Maven builds.

 Therefore I suspect that right now a Scala 2.13 migration is not quite
 as seamless as we would like.

 I stress that this is only an issue for developers that write unit
 tests for their applications, as the Spark runtime environment will always
 have the necessary dependencies available to it.

 (You might consider upgrading the
 org.scala-lang.modules:scala-parallel-collections_2.13 version from 0.2 to
 1.0.3 though!)
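
Until the published POM is fixed, one possible workaround for Scala 2.13
users whose unit tests create their own SparkSessions is to declare the
missing dependency directly; a hypothetical sbt sketch follows (Maven users
would add the equivalent test-scoped dependency):

  // build.sbt -- declare the profile-scoped dependency explicitly (sketch).
  libraryDependencies +=
    "org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.3" % Test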

 

Re: [DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-25 Thread Chao Sun
Thanks all for the feedback! Yes, I agree that we should target this for
the Apache Spark 3.3 release. I'll put this aside for now and pick it up again
after the 3.2 release is finished.

> And maybe the current naming leaves the possibility for a "hadoop-3.5" or
something if that needed to be different.

Yes, that's a good point, although I was under the impression that the
Spark community aims to only support a single Hadoop 3.x profile, in which
case we won't have `hadoop-3` and `hadoop-3.5` in parallel.

Chao


On Thu, Jun 24, 2021 at 10:25 PM Gengliang Wang  wrote:

> +1 for targeting the renaming for Apache Spark 3.3 at the current phase.
>
> On Fri, Jun 25, 2021 at 6:55 AM DB Tsai  wrote:
>
>> +1 on renaming.
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Jun 24, 2021, at 11:41 AM, Chao Sun  wrote:
>>
>> Hi,
>>
>> As Spark master has upgraded to Hadoop 3.3.1, the current Maven profile
>> name hadoop-3.2 is no longer accurate, and it may confuse Spark users when
>> they realize the actual version is not Hadoop 3.2.x. Therefore, I created
>> https://issues.apache.org/jira/browse/SPARK-33880 to change the profile
>> names to hadoop-3 and hadoop-2 respectively. What do you think? Is this
>> something worth doing as part of the Spark 3.2.0 release?
>>
>> Best,
>> Chao
>>
>>
>>


[DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-24 Thread Chao Sun
Hi,

As Spark master has upgraded to Hadoop 3.3.1, the current Maven profile
name hadoop-3.2 is no longer accurate, and it may confuse Spark users when
they realize the actual version is not Hadoop 3.2.x. Therefore, I created
https://issues.apache.org/jira/browse/SPARK-33880 to change the profile
names to hadoop-3 and hadoop-2 respectively. What do you think? Is this
something worth doing as part of the Spark 3.2.0 release?

Best,
Chao


Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-27 Thread Chao Sun
+1 (non-binding) - thanks Dongjoon for the work!

On Wed, May 26, 2021 at 8:35 PM Dongjoon Hyun 
wrote:

> +1
>
> Bests,
> Dongjoon
>
> On Wed, May 26, 2021 at 7:55 PM Kent Yao  wrote:
>
>> +1, non-binding
>>
>> *Kent Yao*
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>> *kyuubi: a unified multi-tenant JDBC interface for large-scale data
>> processing and analytics, built on top of Apache Spark.*
>> *spark-authorizer: a Spark SQL extension which provides SQL Standard
>> Authorization for Apache Spark.*
>> *spark-postgres: a library for reading data from and transferring data to
>> Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.*
>> *itatchi: a library that brings useful functions from various modern
>> database management systems to Apache Spark.*
>>
>>
>>
>> On 05/27/2021 10:44,Yuming Wang 
>> wrote:
>>
>> +1 (non-binding)
>>
>> On Wed, May 26, 2021 at 11:27 PM Maxim Gekk 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Mon, May 24, 2021 at 9:14 AM Dongjoon Hyun 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 3.1.2.

 The vote is open until May 27th 1AM (PST) and passes if a majority +1
 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.1.2
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see https://spark.apache.org/

 The tag to be voted on is v3.1.2-rc1 (commit
 de351e30a90dd988b133b3d00fa6218bfcaba8b8):
 https://github.com/apache/spark/tree/v3.1.2-rc1

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1384/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-docs/

 The list of bug fixes going into 3.1.2 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12349602

 This release is using the release script of the tag v3.1.2-rc1.

 FAQ

 =================================
 How can I help test this release?
 =================================

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running it on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC to see if anything important breaks; in Java/Scala
 you can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).
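 
 For example, with Maven the staging repository listed above can be added
 as a resolver in a test project's POM (a sketch; the repository id is
 arbitrary):
 
 <repositories>
   <repository>
     <id>spark-3.1.2-rc1-staging</id>
     <url>https://repository.apache.org/content/repositories/orgapachespark-1384/</url>
   </repository>
 </repositories>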

 =========================================================
 What should happen to JIRA tickets still targeting 3.1.2?
 =========================================================

 The current list of open tickets targeted at 3.1.2 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.1.2

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 =======================
 But my bug isn't fixed?
 =======================

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted please ping me or a committer to
 help target the issue.

>>>


Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-18 Thread Chao Sun
Great work Liang-Chi!

On Tue, May 18, 2021 at 1:14 AM Maxim Gekk 
wrote:

> Congratulations everyone with the new release, and thanks to Liang-Chi.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
> On Tue, May 18, 2021 at 11:06 AM Yuming Wang  wrote:
>
>> Great work, Liang-Chi!
>>
>> On Tue, May 18, 2021 at 3:57 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks for the huge efforts on driving the release!
>>>
>>> On Tue, May 18, 2021 at 4:53 PM Wenchen Fan  wrote:
>>>
 Thank you, Liang-Chi!

 On Tue, May 18, 2021 at 1:32 PM Dongjoon Hyun 
 wrote:

> Finally! Thank you, Liang-Chi.
>
> Bests,
> Dongjoon.
>
>
> On Mon, May 17, 2021 at 10:14 PM Takeshi Yamamuro <
> linguin@gmail.com> wrote:
>
>> Thank you for the release work, Liang-Chi~
>>
>> On Tue, May 18, 2021 at 2:11 PM Hyukjin Kwon 
>> wrote:
>>
>>> Yay!
>>>
>>> On Tue, May 18, 2021 at 12:57 PM, Liang-Chi Hsieh wrote:
>>>
 We are happy to announce the availability of Spark 2.4.8!

 Spark 2.4.8 is a maintenance release containing stability,
 correctness, and
 security fixes.
 This release is based on the branch-2.4 maintenance branch of
 Spark. We
 strongly recommend that all 2.4 users upgrade to this stable release.

 To download Spark 2.4.8, head over to the download page:
 http://spark.apache.org/downloads.html

 Note that you might need to clear your browser cache or use
 `Private`/`Incognito` mode, depending on your browser.

 To view the release notes:
 https://spark.apache.org/releases/spark-release-2-4-8.html

 We would like to acknowledge all community members for contributing
 to this
 release. This release would not have been possible without you.




 --
 Sent from:
 http://apache-spark-developers-list.1001551.n3.nabble.com/


 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread Chao Sun
+1. Thanks Dongjoon for doing this!

On Mon, May 17, 2021 at 7:58 PM John Zhuge  wrote:

> +1, thanks Dongjoon!
>
> On Mon, May 17, 2021 at 7:50 PM Yuming Wang  wrote:
>
>> +1.
>>
>> On Tue, May 18, 2021 at 9:06 AM Hyukjin Kwon  wrote:
>>
>>> +1, thanks for driving this
>>>
>>> On Tue, 18 May 2021, 09:33 Holden Karau,  wrote:
>>>
 +1 and thanks for volunteering to be the RM :)

 On Mon, May 17, 2021 at 4:09 PM Takeshi Yamamuro 
 wrote:

> Thank you, Dongjoon~ sgtm, too.
>
> On Tue, May 18, 2021 at 7:34 AM Cheng Su 
> wrote:
>
>> +1 for a new release, thanks Dongjoon!
>>
>> Cheng Su
>>
>> On 5/17/21, 2:44 PM, "Liang-Chi Hsieh"  wrote:
>>
>> +1 sounds good. Thanks Dongjoon for volunteering on this!
>>
>>
>> Liang-Chi
>>
>>
>> Dongjoon Hyun-2 wrote
>> > Hi, All.
>> >
>> > Since the Apache Spark 3.1.1 tag creation (Feb 21),
>> > 172 new patches, including 9 correctness patches and 4 K8s patches,
>> > have arrived at branch-3.1.
>> >
>> > Shall we make a new release, Apache Spark 3.1.2, as the second
>> release at
>> > 3.1 line?
>> > I'd like to volunteer for the release manager for Apache Spark
>> 3.1.2.
>> > I'm thinking about starting the first RC next week.
>> >
>> > $ git log --oneline v3.1.1..HEAD | wc -l
>> >  172
>> >
>> > # Known correctness issues
>> > SPARK-34534 New protocol FetchShuffleBlocks in OneForOneBlockFetcher
>> >   lead to data loss or correctness
>> > SPARK-34545 PySpark Python UDF return inconsistent results when
>> >   applying 2 UDFs with different return type to 2 columns together
>> > SPARK-34681 Full outer shuffled hash join when building left side
>> >   produces wrong result
>> > SPARK-34719 fail if the view query has duplicated column names
>> > SPARK-34794 Nested higher-order functions broken in DSL
>> > SPARK-34829 transform_values return identical values when it's used
>> >   with udf that returns reference type
>> > SPARK-34833 Apply right-padding correctly for correlated subqueries
>> > SPARK-35381 Fix lambda variable name issues in nested DataFrame
>> >   functions in R APIs
>> > SPARK-35382 Fix lambda variable name issues in nested DataFrame
>> >   functions in Python APIs
>> >
>> > # Notable K8s patches since K8s GA
>> > SPARK-34674 Close SparkContext after the Main method has finished
>> > SPARK-34948 Add ownerReference to executor configmap to fix leakages
>> > SPARK-34820 Add apt-update before gnupg install
>> > SPARK-34361 In case of downscaling avoid killing of executors already
>> >   known by the scheduler backend in the pod allocator
>> >
>> > Bests,
>> > Dongjoon.
>>
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>
 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>
> --
> John Zhuge
>


Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Chao Sun
Congrats everyone!

On Fri, Mar 26, 2021 at 6:23 PM Mridul Muralidharan 
wrote:

>
> Congratulations, looking forward to more exciting contributions !
>
> Regards,
> Mridul
>
> On Fri, Mar 26, 2021 at 8:21 PM Dongjoon Hyun 
> wrote:
>
>>
>> Congratulations! :)
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Mar 26, 2021 at 5:55 PM angers zhu  wrote:
>>
>>> Congratulations
>>>
>>> On Sat, Mar 27, 2021 at 8:35 AM, Prashant Sharma wrote:
>>>
 Congratulations  all!!

 On Sat, Mar 27, 2021, 5:10 AM huaxin gao 
 wrote:

> Congratulations to you all!!
>
> On Fri, Mar 26, 2021 at 4:22 PM Yuming Wang  wrote:
>
>> Congrats!
>>
>> On Sat, Mar 27, 2021 at 7:13 AM Takeshi Yamamuro <
>> linguin@gmail.com> wrote:
>>
>>> Congrats, all~
>>>
>>> On Sat, Mar 27, 2021 at 7:46 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Congrats all!

 On Sat, Mar 27, 2021 at 6:56 AM, Liang-Chi Hsieh wrote:

> Congrats! Welcome!
>
>
> Matei Zaharia wrote
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers.
> Please join me
> > in welcoming them to their new role! Our new committers are:
> >
> > - Maciej Szymkiewicz (contributor to PySpark)
> > - Max Gekk (contributor to Spark SQL)
> > - Kent Yao (contributor to Spark SQL)
> > - Attila Zsolt Piros (contributor to decommissioning and Spark on
> > Kubernetes)
> > - Yi Wu (contributor to Spark Core and SQL)
> > - Gabor Somogyi (contributor to Streaming and security)
> >
> > All six of them contributed to Spark 3.1 and we’re very excited
> to have
> > them join as committers.
> >
> > Matei and the Spark PMC
> >
> -
> > To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
>
>
>
>
> --
> Sent from:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-08 Thread Chao Sun
+1 (non-binding)

On Mon, Mar 8, 2021 at 5:13 PM John Zhuge  wrote:

> +1 (non-binding)
>
> On Mon, Mar 8, 2021 at 4:32 PM Holden Karau  wrote:
>
>> +1 (binding)
>>
>> On Mon, Mar 8, 2021 at 3:56 PM Ryan Blue  wrote:
>>
>>> Hi everyone, I’d like to start a vote for the FunctionCatalog design
>>> proposal (SPIP).
>>>
>>> The proposal is to add a FunctionCatalog interface that can be used to
>>> load and list functions for Spark to call. There are interfaces for scalar
>>> and aggregate functions.
>>>
>>> In the discussion we’ve come to consensus and I’ve updated the design
>>> doc to match how functions will be called:
>>>
>>> In addition to produceResult(InternalRow), which is optional, functions
>>> can define produceResult methods with arguments that are Spark’s
>>> internal data types, like UTF8String. Spark will prefer these methods
>>> when calling the UDF using codegen.
>>>
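>>> For illustration, a minimal sketch of that convention (the interface is
>>> simplified for readability and is not the exact Spark API; the class and
>>> method bodies are hypothetical):
>>>
>>>   interface ScalarFunction<R> {
>>>     default R produceResult(InternalRow input) {
>>>       throw new UnsupportedOperationException();
>>>     }
>>>   }
>>>
>>>   class StrLen implements ScalarFunction<Integer> {
>>>     // typed "magic" variant preferred by codegen
>>>     public int produceResult(UTF8String str) {
>>>       return str.numChars();
>>>     }
>>>     // optional row-based fallback
>>>     @Override
>>>     public Integer produceResult(InternalRow input) {
>>>       return input.getUTF8String(0).numChars();
>>>     }
>>>   }
>>>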
>>> I’ve also updated the AggregateFunction interface and merged it with
>>> the partial aggregate interface because Spark doesn’t support non-partial
>>> aggregates.
>>>
>>> The full SPIP doc is here:
>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit#heading=h.82w8qxfl2uwl
>>>
>>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>>> do a final update of the PR and we can merge the API.
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>> --
>>> Ryan Blue
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> John Zhuge
>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Chao Sun
+1 on Dongjoon's proposal. Great to see this moving forward, and
thanks everyone for the insightful discussion!



On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue  wrote:

> Okay, great. I'll update the SPIP doc and call a vote in the next day or
> two.
>
> On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen  wrote:
>
>> +1 on Dongjoon's proposal. This is a very nice compromise between the
>> reflective/magic-method approach and the InternalRow approach, providing
>> a lot of flexibility for our users, and allowing for the more complicated
>> reflection-based approach to evolve at its own pace, since you can always
>> fall back to InternalRow for situations which aren't yet supported by
>> reflection.
>>
>> We can even consider having Spark code detect that you haven't overridden
>> the default produceResult (IIRC this is discoverable via reflection),
>> and raise an error at query analysis time instead of at runtime when it
>> can't find a reflective method or an overridden produceResult.
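>>
>> A rough sketch of that check (a hypothetical helper, not actual Spark
>> code), assuming produceResult(InternalRow) is a default method on the
>> interface:
>>
>>   static boolean overridesRowApi(ScalarFunction<?> udf)
>>       throws NoSuchMethodException {
>>     java.lang.reflect.Method m =
>>         udf.getClass().getMethod("produceResult", InternalRow.class);
>>     // if the declaring class is still the interface, only the
>>     // inherited default implementation exists
>>     return m.getDeclaringClass() != ScalarFunction.class;
>>   }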
>>
>> I'm very pleased we have found a compromise that everyone seems happy
>> with! Big thanks to everyone who participated.
>>
>> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge  wrote:
>>
>>> +1 Good plan to move forward.
>>>
>>> Thank you all for the constructive and comprehensive discussions in this
>>> thread! Decisions on this important feature will have ramifications for
>>> years to come.
>>>
>>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:
>>>
 +1 to this proposal. If people don't like the ScalarFunction0,1, ...
 variants and prefer the "magical methods", then we can have a single
 ScalarFunction interface which has the row-parameter API (with a
 default implementation that fails) and documentation describing the "magical
 methods" (which can be done later).

 I'll start the PR review this week to check the naming, doc, etc.

 Thanks all for the discussion here and let's move forward!

 On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:

> Good point, Dongjoon. I think we can probably come to some compromise
> here:
>
>- Remove SupportsInvoke since it isn’t really needed. We should
>always try to find the right method to invoke in the codegen path.
>- Add a default implementation of produceResult so that
>implementations don’t have to use it. If they don’t implement it and a
>magic function can’t be found, then it will throw
>UnsupportedOperationException
>
> This is assuming that we can agree not to introduce all of the
> ScalarFunction interface variations, which would have limited utility
> because of type erasure.
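>
> A quick sketch of that erasure limit (names illustrative, not Spark code):
>
>   interface ScalarFunction2<A1, A2, R> { R produceResult(A1 a1, A2 a2); }
>
>   class IntAdd implements ScalarFunction2<Integer, Integer, Integer> {
>     public Integer produceResult(Integer a, Integer b) { return a + b; }
>   }
>
>   // IntAdd cannot also implement ScalarFunction2<Long, Long, Long>:
>   // javac rejects inheriting the same interface with different type
>   // arguments, so each class gets exactly one (boxed) parameter set.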
>
> Does that sound like a good plan to everyone? If so, I’ll update the
> SPIP doc so we can move forward.
>
> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> We shared many opinions from different perspectives.
>> However, we didn't reach a consensus even on a partial merge by
>> excluding something
>> (on the PR by me, on this mailing thread by Wenchen).
>>
>> For the following claims, we have another alternative to mitigate it.
>>
>> > I don't like it because it promotes the row-parameter API and
>> forces users to implement it, even if the users want to only use the
>> individual-parameters API.
>>
>> Why don't we merge the AS-IS PR by adding something instead of
>> excluding something?
>>
>> - R produceResult(InternalRow input);
>> + default R produceResult(InternalRow input) throws Exception {
>> +   throw new UnsupportedOperationException();
>> + }
>>
>> By providing the default implementation, it will not *force users
>> to implement it* technically.
>> And, we can provide a document about our expected usage properly.
>> What do you think?
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>>
>>> Yes, GenericInternalRow is safe when types mismatch, with the
>>> cost of using Object[], and primitive types need to do boxing
>>>
>>> The question is not whether to use the magic functions, which would
>>> not need boxing. The question here is whether to use multiple
>>> ScalarFunction interfaces. Those interfaces would require boxing or
>>> using Object[] so there isn’t a benefit.
>>>
>>> If we do want to reuse one UDF for different types, using “magical
>>> methods” solves the problem
>>>
>>> Yes, that’s correct. We agree that magic methods are a good option
>>> for this.
>>>
>>> Again, the question we need to decide is whether to use InternalRow
>>> or interfaces like ScalarFunction2 for non-codegen. The option to
>>> use multiple interfaces is limited by type erasure because you can only
>>> have one set of type parameters. If you wanted to support both 
>>> 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Chao Sun
This is an important feature which can unblock several other projects
including bucket join support for DataSource v2, complete support for
enforcing DataSource v2 distribution requirements on the write path, etc. I
like Ryan's proposals which look simple and elegant, with nice support on
function overloading and variadic arguments. On the other hand, I think
Wenchen made a very good point about performance. Overall, I'm excited to
see active discussions on this topic and believe the community will come to
a proposal that captures the best of both sides.

Chao

On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon  wrote:

> +1 for Liang-chi's.
>
> Thanks Ryan and Wenchen for leading this.
>
>
> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh wrote:
>
>> Basically I think the proposal makes sense to me and I'd like to support
>> the
>> SPIP as it looks like we have strong need for the important feature.
>>
>> Thanks Ryan for working on this and I do also look forward to Wenchen's
>> implementation. Thanks for the discussion too.
>>
>> Actually I think the SupportsInvoke proposed by Ryan looks a good
>> alternative to me. Besides Wenchen's alternative implementation, is there
>> a
>> chance we also have the SupportsInvoke for comparison?
>>
>>
>> John Zhuge wrote
>> > Excited to see our Spark community rallying behind this important
>> feature!
>> >
>> > The proposal lays a solid foundation of minimal feature set with careful
>> > considerations for future optimizations and extensions. Can't wait to
>> see
>> > it leading to more advanced functionalities like views with shared
>> custom
>> > functions, function pushdown, lambda, etc. It has already borne fruit
>> from
>> > the constructive collaborations in this thread. Looking forward to
>> > Wenchen's prototype and further discussions including the SupportsInvoke
>> > extension proposed by Ryan.
>> >
>> >
>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
>>
>> > owen.omalley@
>>
>> > 
>> > wrote:
>> >
>> >> I think this proposal is a very good thing giving Spark a standard way
>> of
>> >> getting to and calling UDFs.
>> >>
>> >> I like having the ScalarFunction as the API to call the UDFs. It is
>> >> simple, yet covers all of the polymorphic type cases well. I think it
>> >> would
>> >> also simplify using the functions in other contexts like pushing down
>> >> filters into the ORC & Parquet readers although there are a lot of
>> >> details
>> >> that would need to be considered there.
>> >>
>> >> .. Owen
>> >>
>> >>
>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
>>
>> > ekrogen@.com
>>
>> > 
>> >> wrote:
>> >>
>> >>> I agree that there is a strong need for a FunctionCatalog within Spark
>> >>> to
>> >>> provide support for shareable UDFs, as well as make movement towards
>> >>> more
>> >>> advanced functionality like views which themselves depend on UDFs, so
>> I
>> >>> support this SPIP wholeheartedly.
>> >>>
>> >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>> >>> and
>> >>> extensible. I generally think Wenchen's proposal is easier for a user
>> to
>> >>> work with in the common case, but has greater potential for confusing
>> >>> and
>> >>> hard-to-debug behavior due to use of reflective method signature
>> >>> searches.
>> >>> The merits on both sides can hopefully be more properly examined with
>> >>> code,
>> >>> so I look forward to seeing an implementation of Wenchen's ideas to
>> >>> provide
>> >>> a more concrete comparison. I am optimistic that we will not let the
>> >>> debate
>> >>> over this point unreasonably stall the SPIP from making progress.
>> >>>
>> >>> Thank you to both Wenchen and Ryan for your detailed consideration and
>> >>> evaluation of these ideas!
>> >>> --
>> >>> *From:* Dongjoon Hyun 
>>
>> > dongjoon.hyun@
>>
>> > 
>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>> >>> *To:* Ryan Blue 
>>
>> > blue@
>>
>> > 
>> >>> *Cc:* Holden Karau 
>>
>> > holden@
>>
>> > ; Hyukjin Kwon <
>> >>>
>>
>> > gurwls223@
>>
>> >>; Spark Dev List 
>>
>> > dev@.apache
>>
>> > ; Wenchen Fan
>> >>> 
>>
>> > cloud0fan@
>>
>> > 
>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>> >>>
>> >>> BTW, I forgot to add my opinion explicitly in this thread because I
>> was
>> >>> on the PR before this thread.
>> >>>
>> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>> >>> there for almost two years.
>> >>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>> >>> the latest updated design docs and AS-IS PR.
>> >>>
>> >>> And, the rest of the progress in this thread is also very satisfying
>> to
>> >>> me.
>> >>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>> >>>
>> >>> To All:
>> >>> Please take a look at the design doc and the PR, and give us some
>> >>> opinions.
>> >>> We really need your participation in order to make DSv2 more complete.
>> >>> This will unblock other DSv2 features, too.
>> >>>
>> >>> Bests,
>> >>> Dongjoon.
>> >>>
>> >>>
>> 

Migrating BinaryFileFormat to DSv2?

2020-09-10 Thread Chao Sun
Hi all,

As we are moving all data sources to v2, I'm wondering whether it makes
sense to do the same for `BinaryFileFormat`, which only has a v1 impl at the
moment.

Also curious to know which other data sources haven't been migrated yet.

Thanks,
Chao


Re: [SparkSql] Casting of Predicate Literals

2020-08-26 Thread Chao Sun
Thanks Bart. I'll give it a try. Presto has done something very similar on
this (thanks DB for finding this!). They published an article ([1]) last
year with a very thorough analysis on all the cases which I think can be
used as a reference for the implementation in Spark.

[1]: https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html

On Wed, Aug 26, 2020 at 1:37 AM Bart Samwel 
wrote:

> IMO it's worth an attempt. The previous attempts seem to be closed because
> of a general sense that this gets messy and leads to lots of special cases,
> but that's just how it is. This optimization would make the difference
> between getting sub-par performance for using some of these datatypes and
> getting decent performance. Also, even if the predicate doesn't get pushed
> down, the transformation can make execution of the predicate faster. So
> this can be an early optimization rule, not tied to pushdowns specifically.
>
> I agree that it gets tricky for some data types. So I'd suggest starting
> small and doing this only for integers. Then cover decimals. For those data
> types at least you can easily reason that the conversion is correct. Other
> data types are a lot trickier and we should analyze them one by one.
>
> On Tue, Aug 25, 2020 at 7:31 PM Chao Sun  wrote:
>
>> Hi,
>>
>> So just realized there were already multiple attempts on this issue in
>> the past. From the discussion it seems the preferred approach is to
>> eliminate the cast before they get pushed to data sources, at least for a
>> few common cases such as numeric types. However, a few PRs following this
>> direction were rejected (see [1] and [2]), so I'm wondering if this is
>> still something worth trying, or if the community thinks this is risky and
>> better not to touch it.
>>
>> On the other hand, perhaps we can do the minimum and generate some sort
>> of warning to remind users that they need to explicitly add cast to enable
>> pushdown in this case. What do you think?
>>
>> Thanks for your input!
>> Chao
>>
>>
>> [1]: https://github.com/apache/spark/pull/8718
>> [2]: https://github.com/apache/spark/pull/27648
>>
>> On Mon, Aug 24, 2020 at 1:57 PM Chao Sun  wrote:
>>
>>> > Currently we can't. This is something we should improve, by either
>>> pushing down the cast to the data source, or simplifying the predicates
>>> to eliminate the cast.
>>>
>>> Hi all, I've created https://issues.apache.org/jira/browse/SPARK-32694 to
>>> track this. Welcome to comment on the JIRA.
>>>
>>> On Wed, Aug 19, 2020 at 7:08 AM Wenchen Fan  wrote:
>>>
>>>> Currently we can't. This is something we should improve, by either
>>>> pushing down the cast to the data source, or simplifying the predicates to
>>>> eliminate the cast.
>>>>
>>>> On Wed, Aug 19, 2020 at 5:09 PM Bart Samwel 
>>>> wrote:
>>>>
>>>>> And how are we doing here on integer pushdowns? If someone does e.g.
>>>>> CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000"
>>>>> without the cast?
>>>>>
>>>>> On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Thanks! That's exactly what I was hoping for! Thanks for finding the
>>>>>> Jira for me!
>>>>>>
>>>>>> On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan 
>>>>>> wrote:
>>>>>>
>>>>>>> I think this is not a problem in 3.0 anymore, see
>>>>>>> https://issues.apache.org/jira/browse/SPARK-27638
>>>>>>>
>>>>>>> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I've just run into this issue again with another user and I feel
>>>>>>>> like most folks here have seen some flavor of this at some point.
>>>>>>>>
>>>>>>>> The user registers a Datasource with a column of type Date (or some
>>>>>>>> non string) then performs a query that looks like.
>>>>>>>>
>>>>>>>> *SELECT * from Source WHERE date_col > '2020-08-03'*
>>>>>>>>
>>>>>>>> Seeing that the predicate literal here is a String, Spark needs to
>>>>>>>> make a change so that the DataSource column will be of the same type
>>>

Re: [SparkSql] Casting of Predicate Literals

2020-08-25 Thread Chao Sun
Hi,

So just realized there were already multiple attempts on this issue in the
past. From the discussion it seems the preferred approach is to eliminate
the cast before they get pushed to data sources, at least for a few
common cases such as numeric types. However, a few PRs following this
direction were rejected (see [1] and [2]), so I'm wondering if this is
still something worth trying, or if the community thinks this is risky and
better not to touch it.
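
To make the numeric case concrete, this is the property such a rewrite
would rely on (an illustrative check, not an actual optimizer rule):

    // Original predicate: CAST(short_col AS BIGINT) < lit
    static boolean ltAfterRewrite(short col, long lit) {
      if (lit > Short.MAX_VALUE) return true;   // folds to TRUE
      if (lit < Short.MIN_VALUE) return false;  // folds to FALSE
      return col < (short) lit;                 // cast moved to the literal
    }
    // For all inputs this matches the original ((long) col < lit), so the
    // column-side cast can be dropped and the comparison pushed down.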

On the other hand, perhaps we can do the minimum and generate some sort of
warning to remind users that they need to explicitly add cast to enable
pushdown in this case. What do you think?

Thanks for your input!
Chao


[1]: https://github.com/apache/spark/pull/8718
[2]: https://github.com/apache/spark/pull/27648

On Mon, Aug 24, 2020 at 1:57 PM Chao Sun  wrote:

> > Currently we can't. This is something we should improve, by either
> pushing down the cast to the data source, or simplifying the predicates
> to eliminate the cast.
>
> Hi all, I've created https://issues.apache.org/jira/browse/SPARK-32694 to
> track this. Welcome to comment on the JIRA.
>
> On Wed, Aug 19, 2020 at 7:08 AM Wenchen Fan  wrote:
>
>> Currently we can't. This is something we should improve, by either
>> pushing down the cast to the data source, or simplifying the predicates to
>> eliminate the cast.
>>
>> On Wed, Aug 19, 2020 at 5:09 PM Bart Samwel 
>> wrote:
>>
>>> And how are we doing here on integer pushdowns? If someone does e.g.
>>> CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000"
>>> without the cast?
>>>
>>> On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> Thanks! That's exactly what I was hoping for! Thanks for finding the
>>>> Jira for me!
>>>>
>>>> On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan 
>>>> wrote:
>>>>
>>>>> I think this is not a problem in 3.0 anymore, see
>>>>> https://issues.apache.org/jira/browse/SPARK-27638
>>>>>
>>>>> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I've just run into this issue again with another user and I feel like
>>>>>> most folks here have seen some flavor of this at some point.
>>>>>>
>>>>>> The user registers a Datasource with a column of type Date (or some
>>>>>> non string) then performs a query that looks like.
>>>>>>
>>>>>> *SELECT * from Source WHERE date_col > '2020-08-03'*
>>>>>>
>>>>>> Seeing that the predicate literal here is a String, Spark needs to
>>>>>> make a change so that the DataSource column will be of the same type
>>>>>> (Date),
>>>>>> so it places a "Cast" on the Datasource column so our plan ends up
>>>>>> looking like.
>>>>>>
>>>>>> Cast(date_col as String) > '2020-08-03'
>>>>>>
>>>>>> Since the Datasource Strategies can't handle a push down of the
>>>>>> "Cast" function we lose the predicate pushdown we could
>>>>>> have had. This can change a Job from a single partition lookup into a
>>>>>> full scan leading to a very confusing situation for
>>>>>> the end user. I also wonder about the relative cost here since we
>>>>>> could be avoiding doing X casts and instead just do a single
>>>>>> one on the predicate, in addition we could be doing the cast at the
>>>>>> Analysis phase and cut the run short before any work even
>>>>>> starts rather than doing a perhaps meaningless comparison between a
>>>>>> date and a non-date string.
>>>>>>
>>>>>> I think we should seriously consider whether in cases like this we
>>>>>> should attempt to cast the literal rather than casting the
>>>>>> source column.
>>>>>>
>>>>>> Please let me know if anyone has thoughts on this, or has some
>>>>>> previous Jiras I could dig into if it's been discussed before,
>>>>>> Russ
>>>>>>
>>>>>
>>>
>>> --
>>> Bart Samwel
>>> bart.sam...@databricks.com
>>>
>>>
>>>


Re: [SparkSql] Casting of Predicate Literals

2020-08-24 Thread Chao Sun
> Currently we can't. This is something we should improve, by either
pushing down the cast to the data source, or simplifying the predicates to
eliminate the cast.

Hi all, I've created https://issues.apache.org/jira/browse/SPARK-32694 to
track this. Welcome to comment on the JIRA.

On Wed, Aug 19, 2020 at 7:08 AM Wenchen Fan  wrote:

> Currently we can't. This is something we should improve, by either pushing
> down the cast to the data source, or simplifying the predicates to
> eliminate the cast.
>
> On Wed, Aug 19, 2020 at 5:09 PM Bart Samwel 
> wrote:
>
>> And how are we doing here on integer pushdowns? If someone does e.g.
>> CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000"
>> without the cast?
>>
>> On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer 
>> wrote:
>>
>>> Thanks! That's exactly what I was hoping for! Thanks for finding the
>>> Jira for me!
>>>
>>> On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan  wrote:
>>>
 I think this is not a problem in 3.0 anymore, see
 https://issues.apache.org/jira/browse/SPARK-27638

 On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <
 russell.spit...@gmail.com> wrote:

> I've just run into this issue again with another user and I feel like
> most folks here have seen some flavor of this at some point.
>
> The user registers a Datasource with a column of type Date (or some
> non string) then performs a query that looks like.
>
> *SELECT * from Source WHERE date_col > '2020-08-03'*
>
> Seeing that the predicate literal here is a String, Spark needs to
> make a change so that the DataSource column will be of the same type
> (Date),
> so it places a "Cast" on the Datasource column so our plan ends up
> looking like.
>
> Cast(date_col as String) > '2020-08-03'
>
> Since the Datasource Strategies can't handle a push down of the "Cast"
> function we lose the predicate pushdown we could
> have had. This can change a Job from a single partition lookup into a
> full scan leading to a very confusing situation for
> the end user. I also wonder about the relative cost here since we
> could be avoiding doing X casts and instead just do a single
> one on the predicate, in addition we could be doing the cast at the
> Analysis phase and cut the run short before any work even
> starts rather than doing a perhaps meaningless comparison between a
> date and a non-date string.
>
> I think we should seriously consider whether in cases like this we
> should attempt to cast the literal rather than casting the
> source column.
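>
> Concretely (illustrative SQL, mirroring the example above), that means
> analyzing the filter as
>
>   date_col > CAST('2020-08-03' AS DATE)
>
> instead of
>
>   CAST(date_col AS STRING) > '2020-08-03'
>
> so the source sees a bare column comparison it can push down.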
>
> Please let me know if anyone has thoughts on this, or has some
> previous Jiras I could dig into if it's been discussed before,
> Russ
>

>>
>> --
>> Bart Samwel
>> bart.sam...@databricks.com
>>
>>
>>


subscribe

2018-04-05 Thread Chao Sun