Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-23 Thread Hyukjin Kwon
sion` will be there in any way, this is an > architectural change which we need to decide explicitly, not implicitly. > > > On 2024/07/13 05:33:32 Hyukjin Kwon wrote: > > We actually get the active Spark session so it doesn't cause overhead. > Also > > even we create, it w

Re: [VOTE] Differentiate Spark without Spark Connect from Spark Connect

2024-07-22 Thread Hyukjin Kwon
Starting with my own +1. On Tue, 23 Jul 2024 at 09:12, Hyukjin Kwon wrote: > Hi all, > > I’d like to start a vote for differentiating "Spark without Spark Connect" > as "Spark Classic". > > Please also refer to: > >- Discussi

[VOTE] Differentiate Spark without Spark Connect from Spark Connect

2024-07-22 Thread Hyukjin Kwon
Hi all, I’d like to start a vote for differentiating "Spark without Spark Connect" as "Spark Classic". Please also refer to: - Discussion thread: https://lists.apache.org/thread/ys7zsod8cs9c7qllmf0p0msk6z2mz2ym Please vote on the SPIP for the next 72 hours: [ ] +1: Accept the proposal [ ]

Re: [DISCUSS] Differentiate Spark without Spark Connect from Spark Connect

2024-07-22 Thread Hyukjin Kwon
t;> > >> On Mon, Jul 22, 2024 at 1:12 PM Jungtaek Lim < >> > >> kabhwan.opensou...@gmail.com> wrote: >> > >> >> > >>> I'd propose not to change the name of "Spark Connect" - the name >> > >>> represents

Re: [DISCUSS] Differentiate Spark without Spark Connect from Spark Connect

2024-07-21 Thread Hyukjin Kwon
MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On Sat, Jul 20, 2024 at 8:02 PM Xiao Li wrote: > >> Classic is much better than Legacy. : ) >> >> On Thu, Jul 18, 2024 at 16:58, Hyukjin Kwon wrote: >> >>> Hi all, >>> >>

[DISCUSS] Differentiate Spark without Spark Connect from Spark Connect

2024-07-18 Thread Hyukjin Kwon
Hi all, I noticed that we need to standardize our terminology before moving forward. For instance, when documenting, 'Spark without Spark Connect' is too long and verbose. Additionally, I've observed that we use various names for Spark without Spark Connect: Spark Classic, Classic Spark, Legacy

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-13 Thread Hyukjin Kwon
outube.com/user/holdenkarau > > > On Sat, Jul 13, 2024 at 1:37 AM Hyukjin Kwon wrote: > >> Reverted, and opened a new one https://github.com/apache/spark/pull/47341 >> . >> >> On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon wrote: >> >>> Yeah that's fine.

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-13 Thread Hyukjin Kwon
Reverted, and opened a new one https://github.com/apache/spark/pull/47341. On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon wrote: > Yeah that's fine. I'll revert and open a fresh PR including my own > followup when I get back home later today. > > On Sat, Jul 13, 2024 at 3:08 PM

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-13 Thread Hyukjin Kwon
nd the > approver & code author work for the same employer mentioned as the > justification for the change. > > On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon wrote: > >> I think we should have not mentioned a specific vendor there. The change >> als

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Hyukjin Kwon
We actually get the active Spark session so it doesn't cause overhead. Also, even if we create one, it is created only once, which should be pretty trivial overhead. I don't think we can deprecate the RDD API IMHO in any event. On Sat, Jul 13, 2024 at 1:30 PM Martin Grund wrote: > Mridul, I really just wanted

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Hyukjin Kwon
> > On Sat, Jul 13, 2024 at 9:42 AM Hyukjin Kwon wrote: > >> I think we should have not mentioned a specific vendor there. The change >> also shouldn't repartition. We should create a partition 1. >> >> But in general leveraging Catalyst optimizer and SQL en

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Hyukjin Kwon
I think we should not have mentioned a specific vendor there. The change also shouldn't repartition. We should create a partition 1. But in general leveraging Catalyst optimizer and SQL engine there is a good idea as we can leverage all optimization there. For example, it will use UTF8 encoding

Re: [DISCUSS] Release Apache Spark 3.5.2

2024-07-11 Thread Hyukjin Kwon
+1 On Fri, Jul 12, 2024 at 11:13 AM L. C. Hsieh wrote: > +1 > > On Thu, Jul 11, 2024 at 3:22 PM Zhou Jiang wrote: > > > > +1 for releasing 3.5.2, which would also benefit the Spark Operator > multi-version support. > > > > On Thu, Jul 11, 2024 at 7:56 AM Dongjoon Hyun > wrote: > >> > >> Thank

[VOTE][RESULT] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-10 Thread Hyukjin Kwon
The vote passes with 16 +1s (9 binding +1s). (* = binding) +1: Hyukjin Kwon (*) Mich Talebzadeh Denny Lee Holden Karau (*) Martin Grund Zhou Jiang Yuanjian Li (*) Takuya Ueshin (*) Haydn Reynold Xin (*) Wenchen Fan (*) Liang-Chi Hsieh (*) Xianjin Ye Huaxin Gao (*) Mridul Muralidharan (*) Bo Yang

[VOTE][RESULT] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-10 Thread Hyukjin Kwon
The vote passes with 21 +1s (12 binding +1s). (* = binding) +1: Hyukjin Kwon (*) Denny Lee Cheng Pan Holden Karau (*) Martin Grund Kent Yao Bo Yang Xinrong Meng (*) Takuya Ueshin (*) Matthew Powers Dongjoon Hyun (*) Liang-Chi Hsieh (*) Reynold Xin (*) Gengliang Wang (*) Jungtaek Lim Wenchen Fan

Re: [VOTE] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-04 Thread Hyukjin Kwon
u >> >> >> On Thu, Jul 4, 2024 at 7:33 AM Denny Lee wrote: >> >>> +1 (non-binding) >>> >>> On Thu, Jul 4, 2024 at 19:13 Hyukjin Kwon wrote: >>> >>>> Hi all, >>>> >>>> I’d like to start a vote for a

Re: [VOTE] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-04 Thread Hyukjin Kwon
;>> >>>>> +1 >>>>> >>>>> On Wed, Jul 3, 2024 at 3:54 PM Dongjoon Hyun >>>>> wrote: >>>>> > >>>>> > +1 >>>>> > >>>>> > Dongjoon >>>>> > >>&

[VOTE] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-04 Thread Hyukjin Kwon
Hi all, I’d like to start a vote for allowing GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go. Please also refer to: - Discussion thread: https://lists.apache.org/thread/tsqm0dv01f7jgkv5l4kyvtpw4tc6f420 - JIRA ticket:

Re: [DISCUSS] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-04 Thread Hyukjin Kwon
st contribution it would >> automatically allow kicking off the workflows. >> >> On Thu, Jul 4, 2024 at 04:20 Matthew Powers >> wrote: >> >>> Yea, this would be great. >>> >>> spark-connect-go is still experimental and anything we can do to get

[DISCUSS] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-03 Thread Hyukjin Kwon
Hi all, The Spark Connect Go client repository ( https://github.com/apache/spark-connect-go) requires GitHub Actions runs for individual commits within contributors' PRs. This policy was intentionally applied ( https://issues.apache.org/jira/browse/INFRA-24387), but we can change this default

Re: [VOTE] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Hyukjin Kwon
Starting with my own +1. On Wed, 3 Jul 2024 at 09:59, Hyukjin Kwon wrote: > Hi all, > > I’d like to start a vote for moving Spark Connect server to builtin > package (Client API layer stays external). > > Please also refer to: > >- Discussion thread: > http

[VOTE] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Hyukjin Kwon
Hi all, I’d like to start a vote for moving Spark Connect server to builtin package (Client API layer stays external). Please also refer to: - Discussion thread: https://lists.apache.org/thread/odlx9b552dp8yllhrdlp24pf9m9s4tmx - JIRA ticket:

Re: [DISCUSS] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Hyukjin Kwon
Alrighty, let me start the vote to make sure everybody is happy :-). On Wed, 3 Jul 2024 at 09:55, Hyukjin Kwon wrote: > It will be fine for non-connect users. When we are actually moving client > one, I think we should go with an SPIP cuz that might affect end users > > On Tue

Re: [DISCUSS] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Hyukjin Kwon
d would be a great quality of life improvement. >> >> +1 (non-binding) >> >> On Tue, Jul 2, 2024 at 4:56 AM Hyukjin Kwon wrote: >> >>> > while leaving the connect jvm client in a separate folder looks weird >>> >>> I plan to actually put it at th

Re: [DISCUSS] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Hyukjin Kwon
`resource-managers/kubernetes/{docker,integration-tests}`, `hadoop-cloud`. > What about moving the whole `connect` folder to the top level? > > Thanks, > Cheng Pan > > > On Jul 2, 2024, at 08:19, Hyukjin Kwon wrote: > > Hi all, > > I would like to discuss moving Spark Connect

Re: [External Mail] [DISCUSS] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-01 Thread Hyukjin Kwon
> > > Jie Yang > > > > *From:* Hyukjin Kwon > *Date:* Tuesday, July 2, 2024 08:19 > *To:* dev > *Subject:* [External Mail] [DISCUSS] Move Spark Connect server to builtin package > (Client API layer stays external) > > > > Hi all, > > I would like to discuss movin

[DISCUSS] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-01 Thread Hyukjin Kwon
Hi all, I would like to discuss moving Spark Connect server to builtin package. Right now, users have to specify --packages when they run the Spark Connect server script, for example: ./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` or

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-10 Thread Hyukjin Kwon
I am +1 on this but as you guys mentioned, we should really be clear on how to address different versions. On Wed, 5 Jun 2024 at 18:27, Matthew Powers wrote: > I am a huge fan of the Apache Spark docs and I regularly look at the > analytics on this page >

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread Hyukjin Kwon
+1 On Tue, 14 May 2024 at 16:39, Wenchen Fan wrote: > +1 > > On Tue, May 14, 2024 at 8:19 AM Zhou Jiang wrote: > >> +1 (non-binding) >> >> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >>> Hi all, >>> >>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >>> >>>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Hyukjin Kwon
SGTM On Thu, 2 May 2024 at 02:06, Dongjoon Hyun wrote: > +1 for next Monday. > > Dongjoon. > > On Wed, May 1, 2024 at 8:46 AM Tathagata Das > wrote: > >> Next week sounds great! Thank you Wenchen! >> >> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: >> >>> Yea I think a preview release

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
Mich, It is a legacy config we should get rid of in the end, and it has been tested in production for a very long time. Spark should create a Spark table by default. On Tue, Apr 30, 2024 at 5:38 AM Mich Talebzadeh wrote: > Your point > > ".. t's a surprise to me to see that someone has different

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
+1 It's a legacy conf that we should eventually remove. Spark should create a Spark table by default, not a Hive table. Mich, for your workload, you can simply switch that conf off if it concerns you. We also enabled ANSI (which you agreed on). It's a bit awkward to stop in the middle

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Hyukjin Kwon
+1 On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh wrote: > +1 > > On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan wrote: > > > > +1 > > > > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun > wrote: > >> > >> I'll start with my +1. > >> > >> - Checked checksum and signature > >> - Checked

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Hyukjin Kwon
+1 On Sun, Apr 14, 2024 at 7:46 AM Chao Sun wrote: > +1. > > This feature is very helpful for guarding against correctness issues, such > as null results due to invalid input or math overflows. It’s been there for > a while now and it’s a good time to enable it by default as Spark enters > the

[VOTE][RESULT] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-03 Thread Hyukjin Kwon
The vote passes with 19 +1s (13 binding +1s). (* = binding) +1: Haejoon Lee Ruifeng Zheng(*) Dongjoon Hyun(*) Gengliang Wang(*) Mridul Muralidharan(*) Liang-Chi Hsieh(*) Takuya Ueshin(*) Kent Yao Chao Sun(*) Hussein Awala Xiao Li(*) Yuanjian Li(*) Denny Lee Felix Cheung(*) Bo Yang Xinrong Meng(*)

Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Hyukjin Kwon
10:07 PM, Haejoon Lee > wrote: > > > +1 > > On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon wrote: > >> Hi all, >> >> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark >> Connect) >> >> JIRA <https://issues.apache.o

Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Hyukjin Kwon
? > I was not able to find it, but I was on vacation, and so might have > missed this … > > > Regards, > Mridul > > On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee > wrote: > >> +1 >> >> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon >> wrote: >> >

[VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Hyukjin Kwon
Hi all, I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark Connect) JIRA Prototype SPIP doc

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is SparkR releases in Conda channel ( https://github.com/conda-forge/r-sparkr-feedstock). This is fully run by the community unofficially. On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh wrote: > +1 for me > > Mich Talebzadeh, > Dad | Technologist | Solutions Architect |

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Hyukjin Kwon
+1 On Mon, 11 Mar 2024 at 18:11, yangjie01 wrote: > +1 > > > > Jie Yang > > > > *From:* Haejoon Lee > *Date:* Monday, March 11, 2024 17:09 > *To:* Gengliang Wang > *Cc:* dev > *Subject:* Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark > > > > +1 > > > > On Mon, Mar 11, 2024 at

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions.

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-20 Thread Hyukjin Kwon
+1 On Tue, 20 Feb 2024 at 22:00, Cheng Pan wrote: > +1 (non-binding) > > - Build successfully from source code. > - Pass integration tests with Spark ClickHouse Connector[1] > > [1] https://github.com/housepower/spark-clickhouse-connector/pull/299 > > Thanks, > Cheng Pan > > > > On Feb 20,

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome! On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun wrote: > Hi, All. > > As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community > starts to have test coverage for all supported Python versions from Today. > > - https://github.com/apache/spark/actions/runs/7061665420 > >

Help for testing Windows specific fix (SPARK-23015)

2023-11-21 Thread Hyukjin Kwon
Hi all, I used to have my Windows environment in another laptop but that laptop is broken now so I don't have Windows env to test Windows PRs out (e.g., https://github.com/apache/spark/pull/43706). If anyone has a Windows env, would appreciate it if you take a look at this. Thanks.

Re: On adding applyInArrow to groupBy and cogroup

2023-11-06 Thread Hyukjin Kwon
Sounds good, I'll review the PR. On Fri, 3 Nov 2023 at 14:08, Abdeali Kothari wrote: > Seeing more support for arrow based functions would be great. > Gives more control to application developers. And so pandas just becomes 1 > of the available options. > > On Fri, 3 Nov 2023, 21:23 Luca

Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-03 Thread Hyukjin Kwon
Woohoo! On Tue, 3 Oct 2023 at 22:47, Hussein Awala wrote: > Congrats to all of you! > > On Tue 3 Oct 2023 at 08:15, Rui Wang wrote: > >> Congratulations! Well deserved! >> >> -Rui >> >> >> On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang wrote: >> >>> Congratulations to all! Well deserved! >>>

[RESULT] Updating documentation hosted for EOL and maintenance releases

2023-09-29 Thread Hyukjin Kwon
The vote passes with 9 +1s (6 binding +1s). (* = binding) +1: - Hyukjin Kwon * - Ruifeng Zheng * - Jiaan Geng - Yikun Jiang * - Herman van Hovell * - Michel Miotto Barbosa - Maciej Szymkiewicz * - Denny Lee - Yuanjian Li *

Re: [ANNOUNCE] Apache Spark 3.5.0 released

2023-09-26 Thread Hyukjin Kwon
Awesome! On Wed, 27 Sept 2023 at 11:02, Hussein Awala wrote: > I installed the package, tested it with kubernetes master from Jupyter, > and tested it with Spark Connect server, all looks good. > > On Tue, Sep 26, 2023 at 10:45 PM Yuanjian Li > wrote: > >> FYI, we received the handling from

[VOTE] Updating documentation hosted for EOL and maintenance releases

2023-09-25 Thread Hyukjin Kwon
Hi all, I would like to start the vote for updating documentation hosted for EOL and maintenance releases to improve the usability here, and in order for end users to read the proper and correct documentation. For discussion thread, please refer to

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Hyukjin Kwon
+1 On Tue, Sep 12, 2023 at 7:05 AM Xiao Li wrote: > +1 > > Xiao > > On Mon, Sep 11, 2023 at 10:53, Yuanjian Li wrote: > >> @Peter Toth I've looked into the details of this >> issue, and it appears that it's neither a regression in version 3.5.0 nor a >> correctness issue. It's a bug related to a new

[DISCUSS] Updating documentation hosted for EOL and maintenance releases

2023-08-30 Thread Hyukjin Kwon
Hi all, I would like to raise a discussion about updating documentation hosted for EOL and maintenance versions. To provide some context, we currently host the documentation for EOL versions of Apache Spark, which can be found at links like

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Hyukjin Kwon
Which Python version will run that stored procedure? All Python versions supported in PySpark How to manage external dependencies? Existing way we have https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html . In fact, this will use the external dependencies within your

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Hyukjin Kwon
+1 we should have this .. a lot of other projects and DBMSes have this too, and we currently don't have a way to handle them within Apache Spark. Disclaimer: I am the shepherd of this SPIP. On Thu, 31 Aug 2023 at 09:31, Allison Wang wrote: > Hi Mich, > > I've updated the permissions on the

Re: Welcome two new Apache Spark committers

2023-08-06 Thread Hyukjin Kwon
Woohoo! On Mon, 7 Aug 2023 at 11:28, Ruifeng Zheng wrote: > Congratulations! Peter and Xiduo! > > On Mon, Aug 7, 2023 at 10:13 AM Xiao Li wrote: > >> Congratulations, Peter and Xiduo! >> >> >> >> On Sun, Aug 6, 2023 at 19:08, Debasish Das wrote: >> >>> Congratulations Peter and Xiduo. >>> >>> On Sun, Aug 6,

Re: LLM script for error message improvement

2023-08-02 Thread Hyukjin Kwon
I think adding that dev tool script to improve the error message is fine. On Thu, 3 Aug 2023 at 10:24, Haejoon Lee wrote: > Dear contributors, I hope you are doing well! > > I see there are contributors who are interested in working on error > message improvements and persistent contribution,

Re: [VOTE] SPIP: XML data source support

2023-07-29 Thread Hyukjin Kwon
+1 On Sat, 29 Jul 2023 at 22:49, Maciej wrote: > +1 > > Best regards, > Maciej Szymkiewicz > > Web: https://zero323.net > PGP: A30CEF0C31A501EC > > On 7/29/23 11:28, Mich Talebzadeh wrote: > > +1 for me. > > Though Databricks did a good job releasing the code. > > GitHub - databricks/spark-xml:

Re: Spark 3.0.0 EOL

2023-07-24 Thread Hyukjin Kwon
It's already EOL On Mon, Jul 24, 2023 at 4:17 PM Pralabh Kumar wrote: > Hi Dev Team > > If possible , can you please provide the Spark 3.0.0 EOL timelines . > > Regards > Pralabh Kumar > > > > >

Re: Spark Docker Official Image is now available

2023-07-19 Thread Hyukjin Kwon
This is amazing, finally! On Thu, 20 Jul 2023 at 10:10, Yikun Jiang wrote: > The spark Docker Official Image is now available: > https://hub.docker.com/_/spark > > $ docker run -it --rm *spark* /opt/spark/bin/spark-shell > $ docker run -it --rm *spark*:python3 /opt/spark/bin/pyspark > $ docker

Re: [DISCUSS] SPIP: XML data source support

2023-07-19 Thread Hyukjin Kwon
ort is it to use the spark-xml library today? What's the > drawback to keeping this as an external library as-is? > > Best Regards, Martin > -- > *From:* Hyukjin Kwon > *Sent:* Wednesday, July 19, 2023 01:27 > *To:* Sandip Agarwala > *Cc:* dev@spark.

Re: [DISCUSS] SPIP: XML data source support

2023-07-18 Thread Hyukjin Kwon
> XML data in spark. Making spark-xml built-in will provide a better user > experience for Spark SQL and structured streaming. The proposal is to > inline code from the spark-xml package. > I am collaborating with Hyukjin Kwon, who is the original author of > spark-xml, for this e

Re: [VOTE][SPIP] Python Data Source API

2023-07-05 Thread Hyukjin Kwon
+1. See https://youtu.be/yj7XlTB1Jvc?t=604 :-). On Thu, 6 Jul 2023 at 09:15, Allison Wang wrote: > Hi all, > > I'd like to start the vote for SPIP: Python Data Source API. > > The high-level summary for the SPIP is that it aims to introduce a simple > API in Python for Data Sources. The idea

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking tool that >> aims to make Apache

Re: Time for Spark v3.5.0 release

2023-07-03 Thread Hyukjin Kwon
Yeah one day postponed shouldn't be a big deal. On Tue, Jul 4, 2023 at 7:10 AM Yuanjian Li wrote: > Hi All, > > According to the Spark versioning policy at > https://spark.apache.org/versioning-policy.html, should we cut > *branch-3.5* on *July 17th, 2023*? (We initially proposed January 16th,

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks! On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: > > Thanks Dongjoon ! > > Regards, > Mridul > > On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > >> We are happy to announce the availability of Apache Spark 3.4.1! >> >> Spark 3.4.1 is a maintenance release containing

Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Hyukjin Kwon
+1 On Thu, 22 Jun 2023 at 02:20, Jacek Laskowski wrote: > +0 > > Pozdrawiam, > Jacek Laskowski > > "The Internals Of" Online Books > Follow me on https://twitter.com/jaceklaskowski > > > > > On Wed, Jun 21, 2023 at 5:11 PM

Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-21 Thread Hyukjin Kwon
+1 On Wed, 21 Jun 2023 at 14:23, yangjie01 wrote: > +1 > > > On 2023/6/21 13:20, "L. C. Hsieh" <vii...@gmail.com> > wrote: > > > +1 > > > On Tue, Jun 20, 2023 at 8:48 PM Dongjoon Hyun > wrote: > > > > +1 > > > > Dongjoon > > > > On 2023/06/20 02:51:32 Jia Fan

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Hyukjin Kwon
Actually I support this idea in a way that Python developers don't have to learn Scala to write their own source (and separate packaging). This is more crucial especially when you want to write a simple data source that interacts with the Python ecosystem. On Tue, 20 Jun 2023 at 03:08, Denny Lee

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-18 Thread Hyukjin Kwon
With the spirit of open source, -1. At least there have been other cases mentioned in the discussion thread, and solely doing it for one specific vendor would not solve the problem, and I wouldn't also expect to cast a vote for each case publicly. I would prefer to start this in the narrower

Re: [VOTE][RESULT] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-18 Thread Hyukjin Kwon
The major concerns raised in the thread were that we should initiate the discussion for the below first: - Apache Spark 4.0.0 Preview (and Dates) - Apache Spark 4.0.0 Items - Apache Spark 4.0.0 Plan Adjustment before setting the timeline for Spark 4.0.0 because we're unclear on the picture of

Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-15 Thread Hyukjin Kwon
I am supportive of setting the timeline for Spark 4.0, and I think it has to be done soon. If my understanding is correct, we better need to set up the goals and major changes to happen in 4.0.0? That one I agree with too. Having a preview sounds good to me too so people can try it out. Given

Re: Add user as a contributor

2023-06-14 Thread Hyukjin Kwon
You can open a PR first. When that's merged, the ticket will be assigned to you along with contributor access. On Thu, Jun 15, 2023 at 1:07 PM Aman Raj wrote: > Hi team, > > Can someone please help giving contributor access to amanraj2520 username. > I have raised a Spark Ticket :

Re: [DISCUSS] SPIP: Add PySpark Test Framework

2023-06-13 Thread Hyukjin Kwon
Yeah, I have been thinking about this too, and Holden did some work here that this SPIP will reuse. I support this. On Wed, 14 Jun 2023 at 08:10, Amanda Liu wrote: > Hi all, > > I'd like to start a discussion about implementing an official PySpark test > framework. Currently, there's no

Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-31 Thread Hyukjin Kwon
Thanks all. I created a JIRA at https://issues.apache.org/jira/browse/SPARK-43907. On Mon, 29 May 2023 at 09:12, Hyukjin Kwon wrote: > Yes, some were cases like you mentioned. > But I found myself explaining that reason to a lot of people, not only > developers but users - I

Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Hyukjin Kwon
While I support going forward with a higher version, actually using Scala 2.13 by default is a big deal especially in a way that: - Users would likely download the built-in version assuming that it’s backward binary compatible. - PyPI doesn't allow specifying the Scala version, meaning

Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-28 Thread Hyukjin Kwon
gt;>>> 5808 W Sunset Blvd | Los Angeles, CA 90028 >>>> <https://www.google.com/maps/search/5808+W+Sunset+Blvd%C2%A0+%7C%C2%A0+Los+Angeles,+CA+90028?entry=gmail=g> >>>> >>>> >>>> >>>> On Wed, May 24, 2023 at 12:44 AM Enr

Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-25 Thread Hyukjin Kwon
>>> On Wed, May 24, 2023 at 12:44 AM Enrico Minack >>> wrote: >>> >>>> +1 >>>> >>>> Functions available in SQL (more general in one API) should be >>>> available in all APIs. I am very much in favor of this. >>>

[DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-24 Thread Hyukjin Kwon
Hi all, I would like to discuss adding all SQL functions to the Scala, Python and R APIs. Around 175 SQL functions do not exist in Scala, Python and R. For example, we don’t have pyspark.sql.functions.percentile but you can invoke it as a SQL function, e.g., SELECT percentile(...). The

Re: [CONNECT] New Clients for Go and Rust

2023-05-24 Thread Hyukjin Kwon
I think we can just start this with a separate repo. I am fine with the second option too but in this case we would have to triage which language to add into the main repo. On Fri, 19 May 2023 at 22:28, Maciej wrote: > Hi, > > Personally, I'm strongly against the second option and have some >

Re: PR builder broken

2023-05-10 Thread Hyukjin Kwon
I think this happens globally https://www.githubstatus.com/ On Thu, May 11, 2023 at 6:50 AM Xingbo Jiang wrote: > Hi dev, > > I've seen multiple PR builder failures like below since this morning: > ``` > TypeError: Cannot read properties of undefined (reading 'head_sha') > at eval (eval at

Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Hyukjin Kwon
+1 On Tue, 11 Apr 2023 at 11:04, Ruifeng Zheng wrote: > +1 (non-binding) > > Thank you for driving this release! > > -- > Ruifeng Zheng > ruife...@foxmail.com > >

Re: [VOTE] Release Apache Spark 3.4.0 (RC6)

2023-04-06 Thread Hyukjin Kwon
Merged the fix. On Fri, 7 Apr 2023 at 10:07, Xinrong Meng wrote: > Thanks @yangjie01. I marked SPARK-39696 as a blocker. > > On Thu, Apr 6, 2023 at 4:35 PM yangjie01 wrote: > >> -1 for me due to this RC not include the fix of SPARK-39696, SPARK-39696 >> will fix a data race issue in access to

Re: Apache Spark 3.2.4 EOL Release?

2023-04-04 Thread Hyukjin Kwon
+1 On Wed, 5 Apr 2023 at 07:31, Mridul Muralidharan wrote: > > +1 > Sounds good to me. > > Thanks, > Mridul > > > On Tue, Apr 4, 2023 at 1:39 PM huaxin gao wrote: > >> +1 >> >> On Tue, Apr 4, 2023 at 11:17 AM Chao Sun wrote: >> >>> +1 >>> >>> On Tue, Apr 4, 2023 at 11:12 AM Holden Karau >>>

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Hyukjin Kwon
BTW doing another RC isn't a very big deal (compared to what I did before :-) ) since it's not a canonical release yet. On Fri, Mar 10, 2023 at 7:58 AM Hyukjin Kwon wrote: > I guess directly tagging is fine too. > I don't mind cutting the RC4 right away either if that's what you

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Hyukjin Kwon
I guess directly tagging is fine too. I don't mind cutting the RC4 right away either if that's what you prefer. On Fri, Mar 10, 2023 at 7:06 AM Xinrong Meng wrote: > Hi All, > > Thank you all for catching that. Unfortunately, the release script failed > to push the release tag

Re: [Question] Can't start Spark Connect

2023-03-08 Thread Hyukjin Kwon
Just doing a clean build with Maven, and running a test case like `SparkConnectServiceSuite` in IntelliJ should work. On Wed, 8 Mar 2023 at 15:02, Jia Fan wrote: > Hi developers, >I want to contribute some code for Spark Connect. Any doc for starters? > I want to debug

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-26 Thread Hyukjin Kwon
destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Sun, 26 Feb 2023

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-26 Thread Hyukjin Kwon
Probably it's worthwhile discussing the order for others but I would keep it separate from this thread to focus on Python as the default since that can be done as an incremental improvement. On Mon, Feb 27, 2023 at 3:36 AM Mich Talebzadeh wrote: > > To me as I stated before this is a

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Hyukjin Kwon
gt;>>> mich.talebza...@gmail.com> wrote: >>>> >>>>> If this is not just flip flopping the document pages and involves >>>>> other changes, then a proper impact analysis needs to be done to assess >>>>> the >>>>> eff

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-23 Thread Hyukjin Kwon
Yes we should fix. I will take a look On Thu, 23 Feb 2023 at 07:32, Jonathan Kelly wrote: > Thanks! I was wondering about that ClientE2ETestSuite failure today, so > I'm glad to know that it's also being experienced by others. > > On a similar note, I am experiencing the following error when

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Hyukjin Kwon
how Python code examples first in Spark >> documentation >> >> +1 Good idea! >> >> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson >> wrote: >> >>> Good idea, at the company I work at we discussed using Scala as our >>> primary language b

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Hyukjin Kwon
+1 I like this idea too. On Thu, Feb 23, 2023 at 6:00 AM Allan Folting wrote: > Hi all, > > I would like to propose that we show Python code examples first in the > Spark documentation where we have multiple programming language examples. > An example is on the Quick Start page: >

Re: [DISCUSS] Make release cadence predictable

2023-02-15 Thread Hyukjin Kwon
>>> If people are OK with that discipline, sure. >>> A hard 6-month cycle would mean the minor releases are more frequent and >>> have less change in them. That's probably OK. We could also decide to >>> choose a longer cadence like 9 months, but I don't kno

Re: [VOTE][RESULT] Release Spark 3.3.2 (RC1)

2023-02-15 Thread Hyukjin Kwon
Awesome! On Thu, 16 Feb 2023 at 06:39, Dongjoon Hyun wrote: > Great! Thank you, Liang-Chi! > > Dongjoon. > > On Wed, Feb 15, 2023 at 9:22 AM L. C. Hsieh wrote: > >> The vote passes with 12 +1s (4 binding +1s). >> Thanks to all who helped with the release! >> >> (* = binding) >> +1: >> - Mridul

Re: Time for release v3.3.2

2023-01-30 Thread Hyukjin Kwon
+100! On Tue, 31 Jan 2023 at 10:54, Chao Sun wrote: > +1, thanks Liang-Chi for volunteering! > > Chao > > On Mon, Jan 30, 2023 at 5:51 PM L. C. Hsieh wrote: > > > > Hi Spark devs, > > > > As you know, it has been 4 months since Spark 3.3.1 was released on > > 2022/10, it seems a good time to

Re: Time for Spark 3.4.0 release?

2023-01-24 Thread Hyukjin Kwon
Thanks Xinrong. On Wed, 25 Jan 2023 at 12:01, Xinrong Meng wrote: > Hi All, > > Apache Spark 3.4 is cut as https://github.com/apache/spark/tree/branch-3.4 > . > > Thanks, > > Xinrong Meng > > On Wed, Jan 18, 2023 at 3:45 PM Hyukjin Kwon wrote: > >>

Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Hyukjin Kwon
r point? What is > the estimate deadline for that? > > Enrico > > > Am 18.01.23 um 07:59 schrieb Hyukjin Kwon: > > These look like we can fix it after the branch-cut so should be fine. > > On Wed, 18 Jan 2023 at 15:57, Enrico Minack > wrote: > >> Hi Xinrong, >> >

Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Hyukjin Kwon
3.4 to be ready by that time. > > Feel free to reply to the email if you have other ongoing big items for > Spark 3.4. > > Thanks, > > Xinrong Meng > > On Sat, Jan 7, 2023 at 9:16 AM Hyukjin Kwon wrote: > >> Thanks Xinrong. >> >> On Sat, Jan 7, 202

Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Hyukjin Kwon
nch-3.4* at *18:30 PT, January 24, 2023*. Please ensure > your changes for Apache Spark 3.4 to be ready by that time. > > Feel free to reply to the email if you have other ongoing big items for > Spark 3.4. > > Thanks, > > Xinrong Meng > > On Sat, Jan 7, 2023 at 9:16 A

SparkR build with AppVeyor, broken by external reason

2023-01-16 Thread Hyukjin Kwon
Hi all, AppVeyor is currently broken, presumably because of the flaky GitHub authorization issue ( https://help.appveyor.com/discussions/problems/11287-the-build-phase-is-set-to-msbuild-mode-default-but-no-visual-studio-project-or-solution-files-were-found ). AppVeyor build is specific to SparkR (on Windows)

Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Hyukjin Kwon
+1 On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim wrote: > bump for more visibility. > > On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Hi dev, >> >> I'd like to propose the deprecation of DStream in Spark 3.4, in favor of >> promoting Structured
