[VOTE][RESULT] Release Apache Spark 3.5.0 (RC5)

2023-09-12 Thread Yuanjian Li
The vote passes with 13 +1s (8 binding +1s). Thank you all who helped with the release! (* = binding) +1: - Mridul Muralidharan (*) - Yuanjian Li - Xiao Li (*) - Gengliang Wang (*) - Hyukjin Kwon (*) - Ruifeng Zheng (*) - Jungtaek Lim - Wenchen Fan (*) - Jia Fan - Jie Yang - Yuming Wang (*) -

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-12 Thread Dongjoon Hyun
+1 Dongjoon. On 2023/09/12 03:38:37 Kent Yao wrote: > +1 (non-binding), great work! > > Kent Yao > > On Tue, Sep 12, 2023 at 11:32, Yuming Wang wrote: > > > > +1. > > > > On Tue, Sep 12, 2023 at 10:57 AM yangjie01 > > wrote: > >> > >> +1 > >> > >> > >> > >> From: Jia Fan > >> Date: Tuesday, September 12, 2023 10:08 > >>

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Kent Yao
+1 (non-binding), great work! Kent Yao On Tue, Sep 12, 2023 at 11:32, Yuming Wang wrote: > > +1. > > On Tue, Sep 12, 2023 at 10:57 AM yangjie01 > wrote: >> >> +1 >> >> >> >> From: Jia Fan >> Date: Tuesday, September 12, 2023 10:08 >> To: Ruifeng Zheng >> Cc: Hyukjin Kwon , Xiao Li , >> Mridul Muralidharan , Peter Toth

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Yuming Wang
+1. On Tue, Sep 12, 2023 at 10:57 AM yangjie01 wrote: > +1 > > > > *From:* Jia Fan > *Date:* Tuesday, September 12, 2023 10:08 > *To:* Ruifeng Zheng > *Cc:* Hyukjin Kwon , Xiao Li , > Mridul Muralidharan , Peter Toth , > Spark dev list , Yuanjian Li > > *Subject:* Re: [VOTE] Release Apache Spark 3.5.0

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread yangjie01
+1 From: Jia Fan Date: Tuesday, September 12, 2023 10:08 To: Ruifeng Zheng Cc: Hyukjin Kwon , Xiao Li , Mridul Muralidharan , Peter Toth , Spark dev list , Yuanjian Li Subject: Re: [VOTE] Release Apache Spark 3.5.0 (RC5) +1 On Tue, Sep 12, 2023 at 08:46, Ruifeng Zheng <ruife...@apache.org> wrote: +1 On Tue, Sep 12,

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Jia Fan
+1 On Tue, Sep 12, 2023 at 08:46, Ruifeng Zheng wrote: > +1 > > On Tue, Sep 12, 2023 at 7:24 AM Hyukjin Kwon wrote: > >> +1 >> >> On Tue, Sep 12, 2023 at 7:05 AM Xiao Li wrote: >> >>> +1 >>> >>> Xiao >>> >>> On Mon, Sep 11, 2023 at 10:53, Yuanjian Li wrote: >>> @Peter Toth I've looked into the details of this

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Wenchen Fan
+1 On Tue, Sep 12, 2023 at 9:00 AM Yuanjian Li wrote: > +1 (non-binding) > > On Mon, Sep 11, 2023 at 09:36, Yuanjian Li wrote: > >> @Peter Toth I've looked into the details of this >> issue, and it appears that it's neither a regression in version 3.5.0 nor a >> correctness issue. It's a bug related to a

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Jungtaek Lim
+1 (non-binding) Thanks for driving this release and the patience on multiple RCs! On Tue, Sep 12, 2023 at 10:00 AM Yuanjian Li wrote: > +1 (non-binding) > > On Mon, Sep 11, 2023 at 09:36, Yuanjian Li wrote: > >> @Peter Toth I've looked into the details of this >> issue, and it appears that it's neither a

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Ruifeng Zheng
+1 On Tue, Sep 12, 2023 at 7:24 AM Hyukjin Kwon wrote: > +1 > > On Tue, Sep 12, 2023 at 7:05 AM Xiao Li wrote: > >> +1 >> >> Xiao >> >> On Mon, Sep 11, 2023 at 10:53, Yuanjian Li wrote: >> >>> @Peter Toth I've looked into the details of this >>> issue, and it appears that it's neither a regression in

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Hyukjin Kwon
+1 On Tue, Sep 12, 2023 at 7:05 AM Xiao Li wrote: > +1 > > Xiao > > On Mon, Sep 11, 2023 at 10:53, Yuanjian Li wrote: > >> @Peter Toth I've looked into the details of this >> issue, and it appears that it's neither a regression in version 3.5.0 nor a >> correctness issue. It's a bug related to a new

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Gengliang Wang
+1 On Mon, Sep 11, 2023 at 11:28 AM Xiao Li wrote: > +1 > > Xiao > > On Mon, Sep 11, 2023 at 10:53, Yuanjian Li wrote: > >> @Peter Toth I've looked into the details of this >> issue, and it appears that it's neither a regression in version 3.5.0 nor a >> correctness issue. It's a bug related to a new

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Xiao Li
+1 Xiao On Mon, Sep 11, 2023 at 10:53, Yuanjian Li wrote: > @Peter Toth I've looked into the details of this > issue, and it appears that it's neither a regression in version 3.5.0 nor a > correctness issue. It's a bug related to a new feature. I think we can fix > this in 3.5.1 and list it as a known issue

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Peter Toth
Thanks Yuanjian. Please disregard my -1 then. On Mon, Sep 11, 2023 at 18:36, Yuanjian Li wrote: > @Peter Toth I've looked into the details of this > issue, and it appears that it's neither a regression in version 3.5.0 nor a > correctness issue. It's a bug related to a new feature. I

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Yuanjian Li
+1 (non-binding) On Mon, Sep 11, 2023 at 09:36, Yuanjian Li wrote: > @Peter Toth I've looked into the details of this > issue, and it appears that it's neither a regression in version 3.5.0 nor a > correctness issue. It's a bug related to a new feature. I think we can fix > this in 3.5.1 and list it as a

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Yuanjian Li
@Peter Toth I've looked into the details of this issue, and it appears that it's neither a regression in version 3.5.0 nor a correctness issue. It's a bug related to a new feature. I think we can fix this in 3.5.1 and list it as a known issue of the Scala client of Spark Connect in 3.5.0. Mridul

unsubscribe

2023-09-11 Thread Sairam Natarajan
unsubscribe

unsubscribe

2023-09-10 Thread Cenk Ariöz
unsubscribe

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-10 Thread Mridul Muralidharan
+1 Signatures, digests, etc check out fine. Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes Regards, Mridul On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li wrote: > Please vote on releasing the following candidate(RC5) as Apache Spark > version 3.5.0. > > The vote is open

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-10 Thread Peter Toth
Hi Yuanjian, Sorry, -1 from me. Let's not introduce this bug in 3.5: https://issues.apache.org/jira/browse/SPARK-45109 / https://github.com/apache/spark/pull/42863 Best, Peter On Sun, Sep 10, 2023 at 10:39, Yuanjian Li wrote: > Yes, SPARK-44805 has been included. For the commits

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-10 Thread Yuanjian Li
@ian.a.mann...@gmail.com Thank you for your question. Because the voting period hasn't ended yet and this fix has just been merged, we don't want to release version 3.5.0 with a known correctness bug. We've quickly cut RC5, and we welcome you to continue assisting with the testing. Ian Manning

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-10 Thread Yuanjian Li
Yes, SPARK-44805 has been included. For the commits from RC4 to RC5, please refer to https://github.com/apache/spark/commits/v3.5.0-rc5. On Sat, Sep 9, 2023 at 08:09, Mich Talebzadeh wrote: > Apologies that should read ... release 3.5.0 (RC4) plus .. > > Mich Talebzadeh, > Distinguished Technologist,

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-09 Thread Mich Talebzadeh
Apologies that should read ... release 3.5.0 (RC4) plus .. Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-09 Thread Mich Talebzadeh
Hi, Can you please confirm that this cut is release 3.4.0 plus the resolved Jira https://issues.apache.org/jira/browse/SPARK-44805 which was already fixed yesterday? Nothing else I believe? Thanks Mich view my Linkedin profile

[VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-09 Thread Yuanjian Li
Please vote on releasing the following candidate(RC5) as Apache Spark version 3.5.0. The vote is open until 11:59pm Pacific time Sep 11th and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.5.0 [ ] -1 Do not release this
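
Note: the pass rule quoted in the announcement above (a majority of +1 PMC votes, with a minimum of 3 +1 votes) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it reads "majority" as binding +1s outnumbering binding -1s, and the `Ballot` type, the `vote_passes` helper, and the sample voters are all hypothetical, not part of any Apache tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ballot:
    """One vote on the thread (hypothetical type, for illustration only)."""
    voter: str
    value: int      # +1 or -1
    binding: bool   # True when cast by a PMC member


def vote_passes(ballots, min_binding_plus_ones=3):
    """Assumed reading of the rule: only binding (PMC) votes count;
    there must be at least `min_binding_plus_ones` binding +1s,
    and binding +1s must outnumber binding -1s."""
    plus = sum(1 for b in ballots if b.binding and b.value > 0)
    minus = sum(1 for b in ballots if b.binding and b.value < 0)
    return plus >= min_binding_plus_ones and plus > minus


# Hypothetical ballots, not the real RC5 tally.
ballots = [
    Ballot("pmc-a", +1, True),
    Ballot("pmc-b", +1, True),
    Ballot("pmc-c", +1, True),
    Ballot("contributor-x", +1, False),
    Ballot("contributor-y", -1, False),
]
print(vote_passes(ballots))
```

Non-binding votes are ignored by the rule in this sketch, although the thread itself shows that a non-binding -1 with a credible bug report can still lead the release manager to cut a new RC.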

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-09 Thread Ian Manning
This issue is not a regression and yet we fail the vote? Couldn't this issue have been fixed in 3.5.1? Sorry I am new, so maybe this is how it works? On Sat, 9 Sep 2023, 02:29 Dongjoon Hyun, wrote: > Sorry but I'm -1 because there exists a late-arrival correctness patch > although it's not a

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-08 Thread Yuanjian Li
@Dongjoon Hyun Thank you for reporting this and for your prompt response. The vote has failed. I'll cut RC5 tonight, PST time. On Fri, Sep 8, 2023 at 15:57, Dongjoon Hyun wrote: > Sorry but I'm -1 because there exists a late-arrival correctness patch > although it's not a regression. > > -

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-08 Thread Xinrong Meng
+1 Thank you for driving the release! On Fri, Sep 8, 2023 at 10:12 AM Jungtaek Lim wrote: > +1 (non-binding) > > Thanks for driving this release! > > On Fri, Sep 8, 2023 at 11:29 AM Holden Karau wrote: > >> +1 pip installing seems to function :) >> >> On Thu, Sep 7, 2023 at 7:22 PM Yuming

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-08 Thread Dongjoon Hyun
Sorry but I'm -1 because there exists a late-arrival correctness patch although it's not a regression. - https://issues.apache.org/jira/browse/SPARK-44805 "Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true" - https://github.com/apache/spark/pull/42850 -

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-08 Thread Jungtaek Lim
+1 (non-binding) Thanks for driving this release! On Fri, Sep 8, 2023 at 11:29 AM Holden Karau wrote: > +1 pip installing seems to function :) > > On Thu, Sep 7, 2023 at 7:22 PM Yuming Wang wrote: > >> +1. >> >> On Thu, Sep 7, 2023 at 10:33 PM yangjie01 >> wrote: >> >>> +1 >>> >>> >>> >>>

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
@Alfie Davidson: Awesome, it worked with "org.elasticsearch.spark.sql". But as soon as I switched to *elasticsearch-spark-20_2.12*, "es" also worked. On Fri, Sep 8, 2023 at 12:45 PM Dipayan Dev wrote: > > Let me try that and get back. Just wondering, is there a change in the > way we pass

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
Let me try that and get back. Just wondering, is there a change in the way we pass the format in connector from Spark 2 to 3? On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson wrote: > I am pretty certain you need to change the write.format from “es” to > “org.elasticsearch.spark.sql” > > Sent

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Alfie Davidson
I am pretty certain you need to change the write.format from “es” to “org.elasticsearch.spark.sql” Sent from my iPhone On 8 Sep 2023, at 03:10, Dipayan Dev wrote: ++ Dev On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev wrote: Hi, Can you please elaborate your last response? I

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-07 Thread Holden Karau
+1 pip installing seems to function :) On Thu, Sep 7, 2023 at 7:22 PM Yuming Wang wrote: > +1. > > On Thu, Sep 7, 2023 at 10:33 PM yangjie01 > wrote: > >> +1 >> >> >> >> *From:* Gengliang Wang >> *Date:* Thursday, September 7, 2023 12:53 >> *To:* Yuanjian Li >> *Cc:* Xiao Li ,

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-07 Thread Yuming Wang
+1. On Thu, Sep 7, 2023 at 10:33 PM yangjie01 wrote: > +1 > > > > *From:* Gengliang Wang > *Date:* Thursday, September 7, 2023 12:53 > *To:* Yuanjian Li > *Cc:* Xiao Li , "her...@databricks.com.invalid" > , Spark dev list > *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC4) > > > > +1 > > > > On

Re: Making spark plan UI interactive

2023-09-07 Thread Calili dos Santos Silva
I really appreciate the idea. Another inspiration could be Datadog with its line graph and run logs below. Any way to graphically understand the application breakpoint can be great. On Wed, Sep 6, 2023 at 08:04, Santosh Pingale wrote: > Hey community > > Spark UI with the plan

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
++ Dev On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev wrote: > Hi, > > Can you please elaborate your last response? I don’t have any external > dependencies added, and just updated the Spark version as mentioned below. > > Can someone help me with this? > > On Fri, 1 Sep 2023 at 5:58 PM, Koert

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-07 Thread yangjie01
+1 From: Gengliang Wang Date: Thursday, September 7, 2023 12:53 To: Yuanjian Li Cc: Xiao Li , "her...@databricks.com.invalid" , Spark dev list Subject: Re: [VOTE] Release Apache Spark 3.5.0 (RC4) +1 On Wed, Sep 6, 2023 at 9:46 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote: +1 (non-binding) Xiao Li

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-07 Thread Kent Yao
+1 (Non-binding) Kent On Thu, Sep 7, 2023 at 14:09, Gengliang Wang wrote: > > +1 > > On Wed, Sep 6, 2023 at 9:46 PM Yuanjian Li wrote: >> >> +1 (non-binding) >> >> On Wed, Sep 6, 2023 at 15:27, Xiao Li wrote: >>> >>> +1 >>> >>> Xiao >>> >>> On Wed, Sep 6, 2023 at 22:08, Herman van Hovell wrote: Tested connect, and everything

Re: Making spark plan UI interactive

2023-09-06 Thread 泽民 朴
+1 Making it interactive can boost the productivity of developers who deal with complex plans. On 6 Sep 2023, at 14:39, Abdeali Kothari wrote: I feel this pain frequently. Something more interactive would be great. On Wed, 6 Sep 2023 at 4:34 PM, Santosh Pingale wrote: Hey community Spark

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-06 Thread Gengliang Wang
+1 On Wed, Sep 6, 2023 at 9:46 PM Yuanjian Li wrote: > +1 (non-binding) > > On Wed, Sep 6, 2023 at 15:27, Xiao Li wrote: > >> +1 >> >> Xiao >> >> On Wed, Sep 6, 2023 at 22:08, Herman van Hovell wrote: >> >>> Tested connect, and everything looks good. >>> >>> +1 >>> >>> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li >>>

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-06 Thread Yuanjian Li
+1 (non-binding) On Wed, Sep 6, 2023 at 15:27, Xiao Li wrote: > +1 > > Xiao > > On Wed, Sep 6, 2023 at 22:08, Herman van Hovell wrote: > >> Tested connect, and everything looks good. >> >> +1 >> >> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li >> wrote: >> >>> Please vote on releasing the following candidate(RC4) as Apache

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-06 Thread Xiao Li
+1 Xiao On Wed, Sep 6, 2023 at 22:08, Herman van Hovell wrote: > Tested connect, and everything looks good. > > +1 > > On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li wrote: > >> Please vote on releasing the following candidate(RC4) as Apache Spark >> version 3.5.0. >> >> The vote is open until 11:59pm Pacific

Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-06 Thread Herman van Hovell
Tested connect, and everything looks good. +1 On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li wrote: > Please vote on releasing the following candidate(RC4) as Apache Spark > version 3.5.0. > > The vote is open until 11:59pm Pacific time Sep 8th and passes if a > majority +1 PMC votes are cast,

Re: Making spark plan UI interactive

2023-09-06 Thread Abdeali Kothari
I feel this pain frequently Something more interactive would be great On Wed, 6 Sep 2023 at 4:34 PM, Santosh Pingale wrote: > Hey community > > Spark UI with the plan visualisation is an excellent resource for finding > out crucial information about how your application is doing and what parts

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-09-06 Thread Mich Talebzadeh
Thanks Allison for your explanation. 1. As a matter of interest, what does "sessionCatalog.resolveProcedure" do? Does it recompile the stored procedure (SP)? 2. If the SP makes a reference to an underlying table and table schema is changed, then by definition that SP compiled plan will

Making spark plan UI interactive

2023-09-06 Thread Santosh Pingale
Hey community Spark UI with the plan visualisation is an excellent resource for finding out crucial information about how your application is doing and what parts of the execution can still be optimized to fulfill time/resource constraints. The graph in its current form is sufficient for simpler

Re: [DISCUSS] Incremental statistics collection

2023-09-06 Thread Rakesh Raushan
Hi all, I would like to hear more from community on this topic. I believe it would significantly improve statistics collection in spark. Thanks Rakesh On Sat, 2 Sep 2023 at 10:36 AM, Rakesh Raushan wrote: > Thanks all for all your insights. > > @Mich > I am not trying to introduce any

Release Note of Apache Spark 3.5.0

2023-09-06 Thread Yuanjian Li
Hi All, Thank you all for your valuable contributions to the Spark 3.5 release so far! I would appreciate your review and feedback on the release note. Please see here for the draft

[VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-06 Thread Yuanjian Li
Please vote on releasing the following candidate(RC4) as Apache Spark version 3.5.0. The vote is open until 11:59pm Pacific time Sep 8th and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.5.0 [ ] -1 Do not release this

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-09-05 Thread Allison Wang
Hi Mich, Thank you for your comments! I've left some comments on the SPIP, but let's continue the discussion here. You've highlighted the potential advantages of Python stored procedures, and I'd like to emphasize two important aspects: 1. *Versatility*: Integrating Python into SQL provides

Re: Feature to restart Spark job from previous failure point

2023-09-05 Thread Mich Talebzadeh
Hi Dipayan, You ought to maintain data source consistency, minimising changes upstream. Spark is not a Swiss Army knife :) Anyhow, we already do this in Spark Structured Streaming with the concept of checkpointing. You can do so by implementing - Checkpointing - Stateful processing in

Feature to restart Spark job from previous failure point

2023-09-04 Thread Dipayan Dev
Hi Team, One of the biggest pain points we're facing is when Spark reads upstream partition data and during Action, the upstream also gets refreshed and the application fails with 'File not exists' error. It could happen that the job has already spent a reasonable amount of time, and re-running

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-09-03 Thread Mich Talebzadeh
On this subject of launching both the driver and the executors using lazy executor IDs, this can introduce complexity but potentially could be a viable strategy in certain scenarios. Basically your mileage varies Pros: 1. Faster Startup: launching the driver and initial executors

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Yuanjian Li
Sure, no problem. On Sat, Sep 2, 2023 at 22:10, Holden Karau wrote: > Can we delay the next RC cut until after Labor Day? > > On Sat, Sep 2, 2023 at 9:59 PM Yuanjian Li wrote: > >> Thank you for all the reports! >> The vote has failed. I plan to cut RC4 in two days. >> >> @Dipayan Dev I quickly skimmed

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Holden Karau
Can we delay the next RC cut until after Labor Day? On Sat, Sep 2, 2023 at 9:59 PM Yuanjian Li wrote: > Thank you for all the reports! > The vote has failed. I plan to cut RC4 in two days. > > @Dipayan Dev I quickly skimmed through the > corresponding ticket, and it doesn't seem to be a

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Yuanjian Li
Thank you for all the reports! The vote has failed. I plan to cut RC4 in two days. @Dipayan Dev I quickly skimmed through the corresponding ticket, and it doesn't seem to be a regression introduced in 3.5. Additionally, someone is asking if this is the same issue as SPARK-35279. @Yuming Wang I

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-09-02 Thread Mich Talebzadeh
I have noticed a worthy discussion in the SPIP comments regarding the definition of "stored procedure" in the context of Spark, and I believe it is an important point to address. To provide some historical context, Sybase, a

Re: [DISCUSS] Incremental statistics collection

2023-09-01 Thread Rakesh Raushan
Thanks all for all your insights. @Mich I am not trying to introduce any sampling model here. This idea is about collecting the task write metrics while writing the data and aggregating it with the existing values present in the catalog(create a new entry if it's a CTAS command). This approach is
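
Note: the idea described above (aggregate per-task write metrics into the statistics already stored in the catalog, or create a new entry for a CTAS) can be sketched in Python roughly as follows. This is a minimal sketch, assuming an append-only write; `TableStats`, `TaskWriteMetrics`, and `merge_write_metrics` are hypothetical names chosen for illustration and are not Spark APIs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TableStats:
    """Table-level statistics as they might be kept in a catalog (hypothetical)."""
    row_count: int
    size_in_bytes: int


@dataclass(frozen=True)
class TaskWriteMetrics:
    """What a single write task reports when it finishes (hypothetical)."""
    rows_written: int
    bytes_written: int


def merge_write_metrics(
    existing: Optional[TableStats],
    task_metrics: Iterable[TaskWriteMetrics],
) -> TableStats:
    """Fold task-level write metrics into the existing catalog statistics.

    `existing` is None when the table is new (the CTAS case in the thread),
    in which case the written totals become the initial statistics.
    Assumes an append; an overwrite would replace the totals instead.
    """
    metrics = list(task_metrics)
    rows = sum(m.rows_written for m in metrics)
    size = sum(m.bytes_written for m in metrics)
    if existing is None:
        return TableStats(row_count=rows, size_in_bytes=size)
    return TableStats(
        row_count=existing.row_count + rows,
        size_in_bytes=existing.size_in_bytes + size,
    )


# Illustrative numbers only.
before = TableStats(row_count=1000, size_in_bytes=50_000)
after = merge_write_metrics(
    before, [TaskWriteMetrics(10, 400), TaskWriteMetrics(5, 100)]
)
print(after)
```

The appeal of this approach, as the thread frames it, is that the statistics stay usable after a write without a full rescan of the table.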

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-01 Thread Jungtaek Lim
My apologies, I have to add another ticket for a blocker, SPARK-45045. That said, I'm -1 (non-binding). SPARK-43183 made a behavioral change regarding the StreamingQueryListener as well as

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-31 Thread Wenchen Fan
Sorry for the last-minute bug report, but we found a regression in 3.5: the SQL INSERT command without a column list fills missing columns with NULL while Spark 3.4 does not allow it. According to the SQL standard, this shouldn't be allowed and thus a regression in 3.5. The fix has been merged

Re: [DISCUSS] Updating documentation hosted for EOL and maintenance releases

2023-08-31 Thread Matei Zaharia
It would be great to do this IMO, because there are often usability and formatting fixes needed to docs over time, and people naturally search for docs from their *deployed* version of the project — not the latest version, hoping that it also applies to their release. For example, right now

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
I concur with the viewpoint raised by @Sean Owen. While this might introduce some challenges related to compatibility and environment issues, it is not fundamentally different from how the users currently import and use common code in Python. The main difference is that now this shared code would

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-31 Thread Ian Manning
+1 (non-binding) Using Spark Core, Spark SQL, Structured Streaming. On Tue, Aug 29, 2023 at 8:12 PM Yuanjian Li wrote: > Please vote on releasing the following candidate(RC3) as Apache Spark > version 3.5.0. > > The vote is open until 11:59pm Pacific time Aug 31st and passes if a > majority +1

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Sean Owen
I think you're talking past Hyukjin here. I think the response is: none of that is managed by Pyspark now, and this proposal does not change that. Your current interpreter and environment is used to execute the stored procedure, which is just Python code. It's on you to bring an environment that

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
These are my initial thoughts: As usual your mileage varies. Depending on the use case, introducing support for stored procedures (SP) in Spark SQL with Python as the procedural language *Pros* - Can potentially provide more flexibility and capabilities in the respective SQL workflows. We

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
Thanks Allison! Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any

[DISCUSS] Updating documentation hosted for EOL and maintenance releases

2023-08-30 Thread Hyukjin Kwon
Hi all, I would like to raise a discussion about updating documentation hosted for EOL and maintenance versions. To provide some context, we currently host the documentation for EOL versions of Apache Spark, which can be found at links like

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Alexander Shorin
> Which Python version will run that stored procedure? > > All Python versions supported in PySpark > Where in stored procedure defines the exact python version which will run the code? That was the question. > How to manage external dependencies? > > Existing way we have >

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Hyukjin Kwon
Which Python version will run that stored procedure? All Python versions supported in PySpark How to manage external dependencies? Existing way we have https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html . In fact, this will use the external dependencies within your

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Alexander Shorin
-1 Great idea to ignore the experience of others and copy bad practices back for nothing. If you are familiar with Python ecosystem then you should answer the questions: 1. Which Python version will run that stored procedure? 2. How to manage external dependencies? 3. How to test it via a common

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Yuming Wang
It seems the signature cannot be checked: yumwang@G9L07H60PK Downloads % gpg --keyserver hkps://keys.openpgp.org --recv-key FC3AE3A7EAA1BAC98770840E7E1ABCC53AAA2216 gpg: key 7E1ABCC53AAA2216: no user ID gpg: Total number processed: 1 yumwang@G9L07H60PK Downloads % gpg --batch --verify

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Sean Owen
It worked fine after I ran it again and included "package test" instead of "test" (I had previously run "install"). +1 On Wed, Aug 30, 2023 at 6:06 AM yangjie01 wrote: > Hi, Sean > > > > I have performed testing with Java 17 and Scala 2.13 using maven (`mvn > clean install` and `mvn package

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Hyukjin Kwon
+1 we should have this .. a lot of other projects and DBMSes have this too, and we currently don't have a way to handle them within Apache Spark. Disclaimer: I am the shepherd of this SPIP. On Thu, 31 Aug 2023 at 09:31, Allison Wang wrote: > Hi Mich, > > I've updated the permissions on the

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Mridul Muralidharan
+1 Signatures, digests, etc check out fine. Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes Regards, Mridul On Wed, Aug 30, 2023 at 6:10 AM yangjie01 wrote: > Hi, Sean > > > > I have performed testing with Java 17 and Scala 2.13 using maven (`mvn > clean install` and

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Allison Wang
Hi Mich, I've updated the permissions on the document. Please feel free to leave comments. Thanks, Allison On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh wrote: > Hi, > > Great. Please allow edit access on SPIP or ability to comment. > > Thanks > > Mich Talebzadeh, > Distinguished

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Mich Talebzadeh
Hi, Great. Please allow edit access on SPIP or ability to comment. Thanks Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile

[DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Allison Wang
Hi all, I would like to start a discussion on “Python Stored Procedures". This proposal aims to extend Spark SQL by introducing support for stored procedures, starting with Python as the procedural language. This will enable users to run complex logic using Python within their SQL workflows and

Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Sorry, I missed this one. In the context of what has been changed, we ought to have an additional column, timestamp. In short, we can have datachange(object_name, partition_name, colname, timestamp), where timestamp is the point in time you want to compare against for changes. Example: SELECT * FROM WHERE

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread yangjie01
Hi, Sean I have performed testing with Java 17 and Scala 2.13 using maven (`mvn clean install` and `mvn package test`), and have not encountered the issue you mentioned. The test for the connect module depends on the `spark-protobuf` module to complete the `package,` was it successful? Or

Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Another idea that came to my mind from the old days, is the concept of having a function called *datachange* This datachange function should measure the amount of change in the data distribution since ANALYZE STATISTICS last ran. Specifically, it should measure the number of inserts, updates and
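
Note: the *datachange* idea above can be sketched in Python roughly as follows. This is a sketch under stated assumptions, not a description of any existing Spark function: it assumes the truncated list ends with deletes (as the DML framing elsewhere in this thread suggests), it expresses the change as a percentage of the row count recorded at the last ANALYZE, and the names `datachange_percent` and `needs_reanalyze` and the 20% threshold are hypothetical.

```python
def datachange_percent(inserts, updates, deletes, rows_at_last_analyze):
    """Percentage of rows changed since statistics were last collected.

    Counts inserts, updates, and deletes against the row count recorded
    at the last ANALYZE. If the table was empty back then, any change at
    all is treated as 100% (an assumption made for this sketch).
    """
    changed = inserts + updates + deletes
    if rows_at_last_analyze <= 0:
        return 100.0 if changed > 0 else 0.0
    return 100.0 * changed / rows_at_last_analyze


def needs_reanalyze(inserts, updates, deletes, rows_at_last_analyze,
                    threshold_percent=20.0):
    """True when the measured change reaches the (hypothetical) threshold."""
    pct = datachange_percent(inserts, updates, deletes, rows_at_last_analyze)
    return pct >= threshold_percent


# 100 changed rows against 1000 rows at the last ANALYZE.
print(datachange_percent(50, 25, 25, 1000))
print(needs_reanalyze(50, 25, 25, 1000))
```

A scheduler could poll such a measure per table or partition and trigger statistics collection only when the threshold is crossed, rather than after every data-changing command.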

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Dipayan Dev
Can we fix this bug in Spark 3.5.0? https://issues.apache.org/jira/browse/SPARK-44884 On Wed, Aug 30, 2023 at 11:51 AM Sean Owen wrote: > It looks good except that I'm getting errors running the Spark Connect > tests at the end (Java 17, Scala 2.13) It looks like I missed something > necessary

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Sean Owen
It looks good except that I'm getting errors running the Spark Connect tests at the end (Java 17, Scala 2.13) It looks like I missed something necessary to build; is anyone getting this? [ERROR] [Error]

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Thanks for the detailed explanation. Regards, Chetan On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh wrote: > OK, let us take a deeper look here > > ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3* > > In above, we are *explicitly grouping columns c1 and c2 together for >

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Martin Grund
+1 (non binding) Tested Spark Connect fully isolated and with PySpark build. Tested as well some of the new PySpark ML Connect features On Tue 29. Aug 2023 at 18:25 Yuanjian Li wrote: > Please vote on releasing the following candidate(RC3) as Apache Spark > version 3.5.0. > > The vote is open

[VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Yuanjian Li
Please vote on releasing the following candidate(RC3) as Apache Spark version 3.5.0. The vote is open until 11:59pm Pacific time Aug 31st and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.5.0 [ ] -1 Do not release this

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
OK, let us take a deeper look here. ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3* In the above, we are *explicitly grouping columns c1 and c2 together for which we want to compute statistics*. Additionally, we are also *computing statistics for column c3 independently*. This

Re: Spark Connect: API mismatch in SparkSession#execute

2023-08-29 Thread Stefan Hagedorn
Thank you, Martin! I got it working now using the same shading rules in my project as in Spark. From: Martin Grund Date: Monday, 28. August 2023 at 17:58 To: Stefan Hagedorn Cc: dev@spark.apache.org Subject: Re: Spark Connect: API mismatch in SparkSession#execute Hi Stefan, There are some

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Hi, If we are taking this up, could we also support multi-column stats, such as: ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3. This should give better estimates for conditions involving c1 and c2. Thanks. On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh wrote: >

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
Short answer, off the top of my head: my point was with regard to the Cost Based Optimizer (CBO) in traditional databases. The concept of a rowkey in HBase is somewhat similar to that of a primary key in RDBMS. Now in databases with automatic deduplication features (i.e. ignore duplication of rowkey),

Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Jia Fan
For those databases with automatic deduplication capabilities, such as hbase, we have inserted 100 rows with the same rowkey, but in fact there is only one in hbase. Is the new statistical value we added 100 or 1, or hbase already contains this rowkey, the value would be 0. How should we handle

Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Mich Talebzadeh
I have never been fond of the notion that measuring inserts, updates, and deletes (referred to as DML) is the sole criterion for signaling a necessity to update statistics for Spark's CBO. Nevertheless, in the absence of an alternative mechanism, it seems this is the only approach at our disposal

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-28 Thread Mich Talebzadeh
Thanks Qian for your feedback. I will have a look Regards, Mich view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or

Re: Spark Connect: API mismatch in SparkSession#execute

2023-08-28 Thread Martin Grund
Hi Stefan, There are some current limitations around how protobuf is embedded in Spark Connect. One of the challenges there is that for compatibility reasons we currently shade protobuf that then shades the `protobuf.GeneratedMessage` class. The way to work around this is to shade the protobuf

Spark Connect: API mismatch in SparkSession#execute

2023-08-28 Thread Stefan Hagedorn
Hi everyone, Trying my luck here, after no success in the user mailing list :) I’m trying to use the "extension" feature of the Spark Connect CommandPlugin (Spark 3.4.1) [1]. I created a simple protobuf message `MyMessage` that I want to send from the connect client-side to the connect server

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-27 Thread Qian Sun
Hi Mich, ImageCache is an Alibaba Cloud ECI feature [1]. An image cache is a cluster-level resource that you can use to accelerate the creation of pods in different namespaces. If the Spark image needs to be updated, an image cache will be created in the cluster. And specify pod annotation to use image

Re: Beginner - Looking for starter issues

2023-08-27 Thread Harry
Thanks, I'll check it out. On Thu, Jun 29, 2023 at 2:42 AM Jia Fan wrote: > Hi Harry, > Maybe you can start with https://issues.apache.org/jira/browse/SPARK-37935 > > > > Jia Fan > > > On Jun 28, 2023, at 08:09, Harry wrote: > > Hi, > > I am looking to pick up some tasks on ASF

Re: [DISCUSS] Incremental statistics collection

2023-08-26 Thread Mich Talebzadeh
Hi, Impressive, yet in the realm of classic DBMSs, it could be seen as a case of old wine in a new bottle. The objective, I assume, is to employ dynamic sampling to enhance the optimizer's capacity to create effective execution plans without the burden of complete I/O and in less time. For

Two new tickets for Spark on K8s

2023-08-26 Thread Mich Talebzadeh
Hi, @holden Karau recently created two Jiras that deal with two items of interest namely: 1. Improve Spark Driver Launch Time SPARK-44950 2. Improve Spark Dynamic Allocation SPARK-44951

[DISCUSS] Incremental statistics collection

2023-08-26 Thread RAKSON RAKESH
Hi all, I would like to propose the incremental collection of statistics in spark. SPARK-44817 has been raised for the same. Currently, spark invalidates the stats after data changing commands which would make CBO non-functional. To update
