Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread Hyukjin Kwon
+1

On Tue, 14 May 2024 at 16:39, Wenchen Fan  wrote:

> +1
>
> On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:
>
>> +1 (non-binding)
>>
>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>>
>>> Thank you!
>>>
>>> Liang-Chi Hsieh
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> *Zhou JIANG*
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Hyukjin Kwon
SGTM

On Thu, 2 May 2024 at 02:06, Dongjoon Hyun  wrote:

> +1 for next Monday.
>
> Dongjoon.
>
> On Wed, May 1, 2024 at 8:46 AM Tathagata Das 
> wrote:
>
>> Next week sounds great! Thank you Wenchen!
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Hey all,

 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.

 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.

 Thanks!


 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config
> and finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can treat
> the Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>> 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are 
>> demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my
>> understanding is that we want to release the feature to 4.0.0, but there
>> are several remaining works to be done. While the tentative timeline for
>> releasing is June 2024, what would be the tentative timeline for the RC 
>> cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June
>> 2024), and I think it's time to prepare for it and discuss the ongoing
>> projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> Please help add any items that are missing from this list. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
Mich, it is a legacy config we should get rid of in the end, and it has
been tested in production for a very long time. Spark should create a Spark
table by default.

On Tue, Apr 30, 2024 at 5:38 AM Mich Talebzadeh 
wrote:

> Your point
>
> ".. t's a surprise to me to see that someone has different positions in a
> very short period of time in the community"
>
> Well, I have been with Spark since 2015, and this is an article on
> Medium dated February 7, 2016, with regard to both Hive and Spark, also
> presented at a Hortonworks meet-up.
>
> Hive on Spark Engine Versus Spark Using Hive Metastore
> 
>
> With regard to why I cast a +1 vote for one and -1 for the other, I
> think it is my prerogative how I vote, and we can leave it at that.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> On Mon, 29 Apr 2024 at 17:32, Dongjoon Hyun 
> wrote:
>
>> It's a surprise to me to see that someone has different positions
>> in a very short period of time in the community.
>>
>> Mitch cast +1 for SPARK-44444 and -1 for SPARK-46122.
>> - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
>> - https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p
>>
>> To Mitch, what I'm interested in is the following specifically.
>> > 2. Compatibility: Changing the default behavior could potentially
>> >  break existing workflows or pipelines that rely on the current
>> behavior.
>>
>> May I ask you the following questions?
>> A. What is the purpose of the migration guide in the ASF projects?
>>
>> B. Do you claim that there is incompatibility when you have
>>  spark.sql.legacy.createHiveTableByDefault=true which is described
>>  in the migration guide?
>>
>> C. Do you know that ANSI SQL has new RUNTIME exceptions
>>  which are harder than SPARK-46122?
>>
>> D. Or, did you cast +1 for SPARK-44444 because
>>  you think there is no breaking change by default?
>>
>> I guess there is some misunderstanding on the proposal.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I would like to add a side note regarding the discussion process and the
>>> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
>>> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
>>> configuration parameter, which might lead some participants to overlook its
>>> broader implications (as was raised by myself and others). I believe that a
>>> more descriptive title, encompassing the broader discussion on default
>>> behaviours for creating Hive tables in Spark SQL, could enable greater
>>> engagement within the community. This is an important topic that deserves
>>> thorough consideration.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner Von Braun)".
>>>
>>>
>>> On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh  wrote:
>>>
 +1

 On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:

> +1
>
> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek 
> wrote:
>
>> Of course, I can't think of a scenario of thousands of tables with a
>> single in-memory Spark cluster with an in-memory catalog.
>> Thanks for the help!
>>
>> On Thu, Apr 25, 2024, 23:56, Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>>
>>>
>>> Agreed. In scenarios where most of the interactions with the catalog
>>> are related to query planning, saving and metadata management, the 
>>> choice
>>> of catalog implementation may have less impact on query runtime 
>>> performance.
>>> This is because the time spent on metadata operations is generally
>>> minimal 

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
+1

It's a legacy conf that we should eventually remove. Spark should
create a Spark table by default, not a Hive table.

Mich, for your workload, you can simply switch that conf off if it concerns
you. We enabled ANSI as well (which you agreed on). It's a bit awkward
to stop in the middle for this compatibility reason while making Spark
sound. The compatibility has been tested in production for a long time, so I
don't see any particular issue with the compatibility case you mentioned.
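
(For context, a minimal sketch of what this conf controls, assuming Hive
support is enabled; the table names are illustrative:)

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  # Old default (true): CREATE TABLE without USING/STORED AS makes a Hive SerDe table.
  spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
  spark.sql("CREATE TABLE t_hive (id INT)")

  # Proposed default (false): the same statement makes a Spark native data source
  # table, using the format from spark.sql.sources.default (Parquet by default).
  spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
  spark.sql("CREATE TABLE t_native (id INT)")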

On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh 
wrote:

>
> Hi @Wenchen Fan 
>
> Thanks for your response. I believe we have not had enough time to
> "DISCUSS" this matter.
>
> Currently in order to make Spark take advantage of Hive, I create a soft
> link in $SPARK_HOME/conf. FYI, my spark version is 3.4.0 and Hive is 3.1.1
>
>  /opt/spark/conf/hive-site.xml ->
> /data6/hduser/hive-3.1.1/conf/hive-site.xml
>
> This works fine for me in my lab. So in the future, if we opt to set
> "spark.sql.legacy.createHiveTableByDefault" to false, there will
> no longer be a need for this soft link?
> On the face of it, this looks fine, but in real life it may require a
> number of changes to old scripts. Hence my concern.
> As a matter of interest, has anyone liaised with the Hive team to ensure
> they have introduced the additional changes you outlined?
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> On Sun, 28 Apr 2024 at 09:34, Wenchen Fan  wrote:
>
>> @Mich Talebzadeh  thanks for sharing your
>> concern!
>>
>> Note: creating Spark native data source tables is usually Hive compatible
>> as well, unless we use features that Hive does not support (TIMESTAMP NTZ,
>> ANSI INTERVAL, etc.). I think it's a better default to create a Spark native
>> table in this case, instead of creating a Hive table and failing.
>>
>> On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> >
>>> >
>>> > On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun 
>>> wrote:
>>> >> >
>>> >> > I'll start with my +1.
>>> >> >
>>> >> > Dongjoon.
>>> >> >
>>> >> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>>> >> > > Please vote on SPARK-46122 to set
>>> spark.sql.legacy.createHiveTableByDefault
>>> >> > > to `false` by default. The technical scope is defined in the
>>> following PR.
>>> >> > >
>>> >> > > - DISCUSSION:
>>> >> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>>> >> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>>> >> > > - PR: https://github.com/apache/spark/pull/46207
>>> >> > >
>>> >> > > The vote is open until April 30th 1AM (PST) and passes
>>> >> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >> > >
>>> >> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by
>>> default
>>> >> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault
>>> because ...
>>> >> > >
>>> >> > > Thank you in advance.
>>> >> > >
>>> >> > > Dongjoon
>>> >> > >
>>> >> >
>>> >> >
>>> -
>>> >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Hyukjin Kwon
+1
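
(For anyone following along, a minimal sketch of the PySpark part of such a
verification: pip-install the RC's pyspark tarball from the -bin/ directory
linked below, then run a quick smoke test; the exact tarball name is assumed:)

  # pip install https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/pyspark-3.4.3.tar.gz
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.master("local[2]").appName("rc-smoke").getOrCreate()
  df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
  assert df.count() == 10 and df.where("doubled = 18").count() == 1
  spark.stop()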

On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:

> +1
>
> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
> >
> > +1
> >
> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun 
> wrote:
> >>
> >> I'll start with my +1.
> >>
> >> - Checked checksum and signature
> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
> >> - Checked published Maven artifacts
> >> - All CIs passed.
> >>
> >> Thanks,
> >> Dongjoon.
> >>
> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> >> > Please vote on releasing the following candidate as Apache Spark
> version
> >> > 3.4.3.
> >> >
> >> > The vote is open until April 18th 1AM (PDT) and passes if a majority
> +1 PMC
> >> > votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 3.4.3
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see
> https://spark.apache.org/
> >> >
> >> > The tag to be voted on is v3.4.3-rc2 (commit
> >> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
> >> > https://github.com/apache/spark/tree/v3.4.3-rc2
> >> >
> >> > The release files, including signatures, digests, etc. can be found
> at:
> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
> >> >
> >> > Signatures used for Spark RCs can be found in this file:
> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1453/
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
> >> >
> >> > The list of bug fixes going into 3.4.3 can be found at the following
> URL:
> >> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
> >> >
> >> > This release is using the release script of the tag v3.4.3-rc2.
> >> >
> >> > FAQ
> >> >
> >> > =
> >> > How can I help test this release?
> >> > =
> >> >
> >> > If you are a Spark user, you can help us test this release by taking
> >> > an existing Spark workload and running on this release candidate, then
> >> > reporting any regressions.
> >> >
> >> > If you're working in PySpark you can set up a virtual env and install
> >> > the current RC and see if anything important breaks, in the Java/Scala
> >> > you can add the staging repository to your project's resolvers and test
> >> > with the RC (make sure to clean up the artifact cache before/after so
> >> > you don't end up building with an out-of-date RC going forward).
> >> >
> >> > ===
> >> > What should happen to JIRA tickets still targeting 3.4.3?
> >> > ===
> >> >
> >> > The current list of open tickets targeted at 3.4.3 can be found at:
> >> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> >> > Version/s" = 3.4.3
> >> >
> >> > Committers should look at those and triage. Extremely important bug
> >> > fixes, documentation, and API tweaks that impact compatibility should
> >> > be worked on immediately. Everything else please retarget to an
> >> > appropriate release.
> >> >
> >> > ==
> >> > But my bug isn't fixed?
> >> > ==
> >> >
> >> > In order to make timely releases, we will typically not hold the
> >> > release unless the bug in question is a regression from the previous
> >> > release. That being said, if there is something which is a regression
> >> > that has not been correctly targeted please ping me or a committer to
> >> > help target the issue.
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Hyukjin Kwon
+1
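
(For context, a minimal sketch of the behavior this flips, assuming a local
SparkSession; the exact error class can vary by version:)

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.master("local[1]").getOrCreate()

  spark.conf.set("spark.sql.ansi.enabled", "false")  # old default
  spark.sql("SELECT 1/0 AS x").show()                # x is silently NULL

  spark.conf.set("spark.sql.ansi.enabled", "true")   # proposed new default
  spark.sql("SELECT 1/0 AS x").show()                # raises a divide-by-zero error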

On Sun, Apr 14, 2024 at 7:46 AM Chao Sun  wrote:

> +1.
>
> This feature is very helpful for guarding against correctness issues, such
> as null results due to invalid input or math overflows. It’s been there for
> a while now and it’s a good time to enable it by default as Spark enters
> the next major release.
>
> On Sat, Apr 13, 2024 at 3:27 PM Dongjoon Hyun  wrote:
>
>> I'll start from my +1.
>>
>> Dongjoon.
>>
>> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
>> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
>> > The technical scope is defined in the following PR which is
>> > one line of code change and one line of migration guide.
>> >
>> > - DISCUSSION:
>> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
>> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
>> > - PR: https://github.com/apache/spark/pull/46013
>> >
>> > The vote is open until April 17th 1AM (PST) and passes
>> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Use ANSI SQL mode by default
>> > [ ] -1 Do not use ANSI SQL mode by default because ...
>> >
>> > Thank you in advance.
>> >
>> > Dongjoon
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[VOTE][RESULT] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-03 Thread Hyukjin Kwon
The vote passes with 19 +1s (13 binding +1s).

(* = binding)
+1:
Haejoon Lee
Ruifeng Zheng(*)
Dongjoon Hyun(*)
Gengliang Wang(*)
Mridul Muralidharan(*)
Liang-Chi Hsieh(*)
Takuya Ueshin(*)
Kent Yao
Chao Sun(*)
Hussein Awala
Xiao Li(*)
Yuanjian Li(*)
Denny Lee
Felix Cheung(*)
Bo Yang
Xinrong Meng(*)
Holden Karau(*)
Femi Anthony
Tom Graves(*)

+0: None

-1: None

Thanks.


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Hyukjin Kwon
Yes
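
(A minimal sketch of the usage this enables, assuming a Spark Connect server
is already running; the host/port and the eventual PyPI package name are
illustrative:)

  # pip install the proposed pure-Python package -- no local JVM or jars needed
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.remote("sc://spark-host:15002").getOrCreate()
  spark.range(5).show()  # planned locally, executed on the remote cluster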

On Tue, Apr 2, 2024 at 6:36 PM Femi Anthony  wrote:

> So, to clarify - the purpose of this package is to enable connectivity to
> a remote Spark cluster without having to install any local JVM
> dependencies, right ?
>
> Sent from my iPhone
>
> On Mar 31, 2024, at 10:07 PM, Haejoon Lee
>  wrote:
>
> 
>
> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA <https://issues.apache.org/jira/browse/SPARK-47540>
>> Prototype <https://github.com/apache/spark/pull/45053>
>> SPIP doc
>> <https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Hyukjin Kwon
Oh, I didn't send the discussion thread out as it's pretty simple and
non-invasive, and the discussion was sort of done as part of the initial
Spark Connect discussion.

On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan  wrote:

>
> Can you point me to the SPIP’s discussion thread please ?
> I was not able to find it, but I was on vacation, and so might have
> missed this …
>
>
> Regards,
> Mridul
>
> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>  wrote:
>
>> +1
>>
>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>>> Connect)
>>>
>>> JIRA <https://issues.apache.org/jira/browse/SPARK-47540>
>>> Prototype <https://github.com/apache/spark/pull/45053>
>>> SPIP doc
>>> <https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks.
>>>
>>


[VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Hyukjin Kwon
Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
Connect)

JIRA 
Prototype 
SPIP doc


Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is the SparkR releases in the Conda channel (
https://github.com/conda-forge/r-sparkr-feedstock).
This is fully run by the community, unofficially.

On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh 
wrote:

> +1 for me
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
> wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> dev@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Some of you may be aware that the Databricks community (Home | Databricks)
>> has just launched a knowledge sharing hub. I thought it would be a
>> good idea for the Apache Spark user group to have the same, especially
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>> Streaming, Spark MLlib and so forth.
>>
>> Apache Spark user and dev groups have been around for a good while.
>> They are serving their purpose. We went through creating a Slack
>> community that managed to create more heat than light. This is
>> what the Databricks community came up with, and I quote:
>>
>> "Knowledge Sharing Hub
>> Dive into a collaborative space where members like YOU can exchange
>> knowledge, tips, and best practices. Join the conversation today and
>> unlock a wealth of collective wisdom to enhance your experience and
>> drive success."
>>
>> I don't know the logistics of setting it up, but I am sure that should
>> not be that difficult. If anyone is supportive of this proposal, let
>> the usual +1, 0, -1 decide.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>   view my Linkedin profile
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Hyukjin Kwon
+1

On Mon, 11 Mar 2024 at 18:11, yangjie01  wrote:

> +1
>
>
>
> Jie Yang
>
>
>
> *From:* Haejoon Lee 
> *Date:* Monday, March 11, 2024 17:09
> *To:* Gengliang Wang 
> *Cc:* dev 
> *Subject:* Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark
>
>
>
> +1
>
>
>
> On Mon, Mar 11, 2024 at 10:36 AM Gengliang Wang  wrote:
>
> Hi all,
>
> I'd like to start the vote for SPIP: Structured Logging Framework for
> Apache Spark
>
>
> References:
>
>- JIRA ticket
>
> 
>- SPIP doc
>
> 
>- Discussion thread
>
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>
> Gengliang Wang
>
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428?

cc @Yang,Jie(INF) 
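
(Context for the thread below: the dropdown comes from the Sphinx theme's
version switcher, configured at doc build time. A sketch of the typical
pydata-sphinx-theme wiring in conf.py; Spark's exact setup in the PR above
may differ, and the URL is illustrative:)

  # conf.py (Sphinx)
  html_theme_options = {
      "switcher": {
          # JSON listing the versions shown in the dropdown
          "json_url": "https://spark.apache.org/static/versions.json",
          # which entry this build highlights as "current"
          "version_match": "3.5.1",
      },
      "navbar_end": ["version-switcher"],
  }

If the JSON ships inside each release's docs instead of living at one central
URL, every already-published release must be rebuilt to learn about newer
versions, which is the maintenance problem raised below.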

On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
wrote:

> Shall we revisit this functionality? The API doc is built with individual
> versions, and for each individual version we depend on other released
> versions. This does not seem right to me. Also, the functionality is
> only in the PySpark API doc, which does not seem consistent either.
>
> I don't think this is manageable with the current approach (listing
> versions in version-dependent docs). Let's say we release 3.4.3 after 3.5.1.
> Should we update the versions in 3.5.1 to add 3.4.3 to the version switcher?
> What about when we release a new version after releasing
> 10 versions? What are the criteria for pruning versions?
>
> Unless we have a good answer to these questions, I think it's better to
> revert the functionality - it missed various considerations.
>
> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> wrote:
>
>> Thanks for reporting - this is odd - the dropdown did not exist in other
>> recent releases.
>>
>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>
>> Looks like the dropdown feature was recently introduced but only partially
>> done. The dropdown itself was added, but how to bump the
>> version was never documented.
>> The contributor proposed a way to update the version "automatically",
>> but the PR wasn't merged. As a result, we have neither
>> instructions for bumping the version manually, nor an automatic bump.
>>
>> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
>> * PR for automatically bumping version:
>> https://github.com/apache/spark/pull/42881
>>
>> We will probably need to add an instruction in the release process to
>> update the version. (For automatic bumping I don't have a good idea.)
>> I'll look into it. Please expect some delay during the holiday weekend
>> in S. Korea.
>>
>> Thanks again.
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, Jungtaek.
>>>
>>> The PySpark document seems to show the wrong branch. At this time, `master`.
>>>
>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>>
>>> PySpark Overview
>>> 
>>>
>>>Date: Feb 24, 2024 Version: master
>>>
>>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>>
>>>
>>> Could you do the follow-up, please?
>>>
>>> Thank you in advance.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>>
 Excellent work, congratulations!

 On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
 wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly
>> recommend all 3.5 users to upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing
>> to this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us through releasing the official docker image
>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>> available.
>>
>>

 --
 John Zhuge

>>>


Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-20 Thread Hyukjin Kwon
+1
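
(For context on the `Test observe` failure discussed further down this
thread, a minimal sketch of the API whose plan, the CollectMetrics node, is
being compared; the column and metric names are illustrative:)

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.getOrCreate()
  df = spark.range(10).observe(
      "my_metric",
      F.min("id").alias("min_val"),
      F.max("id").alias("max_val"),
      F.sum("id"))
  df.collect()  # planning this query produces the CollectMetrics node in the diff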

On Tue, 20 Feb 2024 at 22:00, Cheng Pan  wrote:

> +1 (non-binding)
>
> - Build successfully from source code.
> - Pass integration tests with Spark ClickHouse Connector[1]
>
> [1] https://github.com/housepower/spark-clickhouse-connector/pull/299
>
> Thanks,
> Cheng Pan
>
>
> > On Feb 20, 2024, at 10:56, Jungtaek Lim 
> wrote:
> >
> > Thanks Sean, let's continue the process for this RC.
> >
> > +1 (non-binding)
> >
> > - downloaded all files from URL
> > - checked signature
> > - extracted all archives
> > - ran all tests from source files in source archive file, via running
> "sbt clean test package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9.
> >
> > Also bump to dev@ to encourage participation - looks like the timing is
> not good for US folks but let's see more days.
> >
> >
> > On Sat, Feb 17, 2024 at 1:49 AM Sean Owen  wrote:
> > Yeah let's get that fix in, but it seems to be a minor test only issue
> so should not block release.
> >
> > On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
> > Very sorry. When I was fixing `SPARK-45242` (
> https://github.com/apache/spark/pull/43594),
> I noticed that the
> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
> didn't realize that it had also been merged into branch-3.5, so I didn't
> advocate for SPARK-45357 to be backported to branch-3.5.
> >  As far as I know, the condition to trigger this test failure is: when
> using Maven to test the `connect` module, if  `sparkTestRelation` in
> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
> is indeed related to the order in which Maven executes the test cases in
> the `connect` module.
> >  I have submitted a backport PR to branch-3.5, and if necessary, we can
> merge it to fix this test issue.
> >  Jie Yang
> >   From: Jungtaek Lim 
> > Date: Friday, February 16, 2024 22:15
> > To: Sean Owen , Rui Wang 
> > Cc: dev 
> > Subject: Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
> >   I traced back relevant changes and got a sense of what happened.
> >   Yangjie figured out the issue via link. It's a tricky issue according
> to the comments from Yangjie - the test is dependent on ordering of
> execution for test suites. He said it does not fail in sbt, hence CI build
> couldn't catch it.
> > He fixed it via link, but we missed that the offending commit was also
> ported back to 3.5 as well, hence the fix wasn't ported back to 3.5.
> >   Surprisingly, I can't reproduce locally even with Maven. In my attempt
> to reproduce, SparkConnectProtoSuite was executed third:
> SparkConnectStreamingQueryCacheSuite, then ExecuteEventsManagerSuite, and
> then SparkConnectProtoSuite. Maybe it is very specific to the environment, not
> just Maven? My env: MBP M1 Pro chip, macOS 14.3.1, OpenJDK 17.0.9. I used
> build/mvn (Maven 3.8.8).
> >   I'm not 100% sure this is something we should fail the release as it's
> a test only and sounds very environment dependent, but I'll respect your
> call on vote.
> >   Btw, looks like Rui also made a relevant fix via link (not to fix the
> failing test but to fix other issues), but this also wasn't ported back to
> 3.5. @Rui Wang Do you think this is a regression issue and warrants a new
> RC?
> > On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
> > Is anyone seeing this Spark Connect test failure? then again, I have
> some weird issue with this env that always fails 1 or 2 tests that nobody
> else can replicate.
> >   - Test observe *** FAILED ***
> >   == FAIL: Plans do not match ===
> >   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0
> >    CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 44
> >    +- LocalRelation , [id#0, name#0]
> >    +- LocalRelation , [id#0, name#0]
> >   (PlanTest.scala:179)
> >   On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> > DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately
> figured out doc generation issue after tagging RC1.
> >   Please vote on releasing the following candidate as Apache Spark
> version 3.5.1.
> >
> > The vote is open until February 18th 9AM (PST) and passes if a majority
> +1 PMC votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.5.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> > The tag to be voted on is v3.5.1-rc2 (commit
> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
> > https://github.com/apache/spark/tree/v3.5.1-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > 

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome!

On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community
> starts to have test coverage for all supported Python versions from Today.
>
> - https://github.com/apache/spark/actions/runs/7061665420
>
> Here is a summary.
>
> 1. Main CI: All PRs and commits on `master` branch are tested with Python
> 3.9.
> 2. Daily CI:
> https://github.com/apache/spark/actions/workflows/build_python.yml
> - PyPy 3.8
> - Python 3.10
> - Python 3.11
> - Python 3.12
>
> This is a great addition for PySpark 4.0+ users and an extensible
> framework for all future Python versions.
>
> Thank you all for making this together!
>
> Best,
> Dongjoon.
>


Help for testing Windows specific fix (SPARK-23015)

2023-11-21 Thread Hyukjin Kwon
Hi all,

I used to have my Windows environment on another laptop, but that laptop is
broken now, so I don't have a Windows env to test Windows PRs out (e.g.,
https://github.com/apache/spark/pull/43706).

If anyone has a Windows env, I would appreciate it if you could take a look at this.

Thanks.


Re: On adding applyInArrow to groupBy and cogroup

2023-11-06 Thread Hyukjin Kwon
Sounds good, I'll review the PR.
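
(A sketch of the proposed API from the PR, mirroring applyInPandas but with
pyarrow.Table in and out; details may change in review:)

  import pyarrow as pa
  import pyarrow.compute as pc
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

  def demean(table: pa.Table) -> pa.Table:
      # subtract the group mean from "v", staying in Arrow end to end
      v = table.column("v")
      centered = pc.subtract(v, pc.mean(v))
      return table.set_column(table.schema.get_field_index("v"), "v", centered)

  df.groupBy("id").applyInArrow(demean, schema="id long, v double").show()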

On Fri, 3 Nov 2023 at 14:08, Abdeali Kothari 
wrote:

> Seeing more support for Arrow-based functions would be great.
> It gives more control to application developers, and pandas just becomes one
> of the available options.
>
> On Fri, 3 Nov 2023, 21:23 Luca Canali,  wrote:
>
>> Hi Enrico,
>>
>>
>>
>> +1 on supporting Arrow on par with Pandas. Besides the frameworks and
>> libraries that you mentioned I add awkward array, a library used in High
>> Energy Physics
>>
>> (for those interested more details on how we tested awkward array with
>> Spark from back when mapInArrow was introduced can be found at
>> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md
>> )
>>
>>
>>
>> Cheers,
>>
>> Luca
>>
>>
>>
>> *From:* Enrico Minack 
>> *Sent:* Thursday, October 26, 2023 15:33
>> *To:* dev 
>> *Subject:* On adding applyInArrow to groupBy and cogroup
>>
>>
>>
>> Hi devs,
>>
>> PySpark allows transforming a DataFrame via the Pandas *and* Arrow APIs:
>>
>> df.mapInArrow(map_arrow, schema="...")
>> df.mapInPandas(map_pandas, schema="...")
>>
>> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
>> Pandas interface, no Arrow interface:
>>
>> df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>>
>> Providing a pure Arrow interface allows user code to use *any*
>> Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow
>> interfaces reduces the need to add more framework-specific support.
>>
>> We need your thoughts on whether PySpark should support Arrow on a par
>> with Pandas, or not: https://github.com/apache/spark/pull/38624
>>
>> Cheers,
>> Enrico
>>
>


Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-03 Thread Hyukjin Kwon
Woohoo!

On Tue, 3 Oct 2023 at 22:47, Hussein Awala  wrote:

> Congrats to all of you!
>
> On Tue 3 Oct 2023 at 08:15, Rui Wang  wrote:
>
>> Congratulations! Well deserved!
>>
>> -Rui
>>
>>
>> On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang  wrote:
>>
>>> Congratulations to all! Well deserved!
>>>
>>> On Mon, Oct 2, 2023 at 10:16 PM Xiao Li  wrote:
>>>
 Hi all,

 The Spark PMC is delighted to announce that we have voted to add one
 new committer and two new PMC members. These individuals have consistently
 contributed to the project and have clearly demonstrated their expertise.

 New Committer:
 - Jiaan Geng (focusing on Spark Connect and Spark SQL)

 New PMCs:
 - Yuanjian Li
 - Yikun Jiang

 Please join us in extending a warm welcome to them in their new roles!

 Sincerely,
 The Spark PMC

>>>


[RESULT] Updating documentation hosted for EOL and maintenance releases

2023-09-29 Thread Hyukjin Kwon
The vote passes with 9 +1s (6 binding +1s).

(* = binding)
+1:
- Hyukjin Kwon *
- Ruifeng Zheng *
- Jiaan Geng
- Yikun Jiang *
- Herman van Hovell *
- Michel Miotto Barbosa
- Maciej Szymkiewicz *
- Denny Lee
- Yuanjian Li *


Re: [ANNOUNCE] Apache Spark 3.5.0 released

2023-09-26 Thread Hyukjin Kwon
Awesome!

On Wed, 27 Sept 2023 at 11:02, Hussein Awala  wrote:

> I installed the package, tested it with the Kubernetes master from Jupyter,
> and tested it with the Spark Connect server; all looks good.
>
> On Tue, Sep 26, 2023 at 10:45 PM Yuanjian Li 
> wrote:
>
>> FYI, we received the handling from Pypi
>>  org yesterday, and the
>> upload of version 3.5.0 has just been completed. Please assist in verifying
>> it. Thank you!
>>
>>> Ruifeng Zheng  wrote on Sun, Sep 17, 2023 at 23:28:
>>
>>> Thanks Yuanjian for driving this release, Congratulations!
>>>
>>> On Mon, Sep 18, 2023 at 2:16 PM Maxim Gekk
>>>  wrote:
>>>
 Thank you for the work, Yuanjian!

 On Mon, Sep 18, 2023 at 6:28 AM beliefer  wrote:

> Congratulations! Apache Spark.
>
>
>
> At 2023-09-16 01:01:40, "Yuanjian Li"  wrote:
>
> Hi All,
>
> We are happy to announce the availability of *Apache Spark 3.5.0*!
>
> Apache Spark 3.5.0 is the sixth release of the 3.x line.
>
> To download Spark 3.5.0, head over to the download page:
> https://spark.apache.org/downloads.html
> (Please note: the PyPi upload is pending due to a size limit request;
> we're actively following up here
>  with the PyPi
> organization)
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-5-0.html
>
> We would like to acknowledge all community members for contributing to
> this
> release. This release would not have been possible without you.
>
> Best,
> Yuanjian
>
>
>>>
>>> --
>>> Ruifeng Zheng
>>> E-mail: zrfli...@gmail.com
>>>
>>


[VOTE] Updating documentation hosted for EOL and maintenance releases

2023-09-25 Thread Hyukjin Kwon
Hi all,

I would like to start the vote for updating documentation hosted for EOL
and maintenance releases to improve the usability here, and in order for
end users to read the proper and correct documentation.

For discussion thread, please refer to
https://lists.apache.org/thread/1675rzxx5x4j2x03t9x0kfph8tlys0cx.

Here is one example:
- https://github.com/apache/spark/pull/42989
- https://github.com/apache/spark-website/pull/480

Starting with my own +1.


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Hyukjin Kwon
+1

On Tue, Sep 12, 2023 at 7:05 AM Xiao Li  wrote:

> +1
>
> Xiao
>
> Yuanjian Li  wrote on Mon, Sep 11, 2023 at 10:53:
>
>> @Peter Toth  I've looked into the details of this
>> issue, and it appears that it's neither a regression in version 3.5.0 nor a
>> correctness issue. It's a bug related to a new feature. I think we can fix
>> this in 3.5.1 and list it as a known issue of the Scala client of Spark
>> Connect in 3.5.0.
>>
>> Mridul Muralidharan  wrote on Sun, Sep 10, 2023 at 04:12:
>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li 
>>> wrote:
>>>
 Please vote on releasing the following candidate(RC5) as Apache Spark
 version 3.5.0.

 The vote is open until 11:59pm Pacific time Sep 11th and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.5.0

 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.5.0-rc5 (commit
 ce5ddad990373636e94071e7cef2f31021add07b):

 https://github.com/apache/spark/tree/v3.5.0-rc5

 The release files, including signatures, digests, etc. can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/

 Signatures used for Spark RCs can be found in this file:

 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1449

 The documentation corresponding to this release can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/

 The list of bug fixes going into 3.5.0 can be found at the following
 URL:

 https://issues.apache.org/jira/projects/SPARK/versions/12352848

 This release is using the release script of the tag v3.5.0-rc5.


 FAQ

 =

 How can I help test this release?

 =

 If you are a Spark user, you can help us test this release by taking

 an existing Spark workload and running on this release candidate, then

 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install

 the current RC and see if anything important breaks, in the Java/Scala

 you can add the staging repository to your projects resolvers and test

 with the RC (make sure to clean up the artifact cache before/after so

 you don't end up building with an out of date RC going forward).

 ===

 What should happen to JIRA tickets still targeting 3.5.0?

 ===

 The current list of open tickets targeted at 3.5.0 can be found at:

 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.5.0

 Committers should look at those and triage. Extremely important bug

 fixes, documentation, and API tweaks that impact compatibility should

 be worked on immediately. Everything else please retarget to an

 appropriate release.

 ==

 But my bug isn't fixed?

 ==

 In order to make timely releases, we will typically not hold the

 release unless the bug in question is a regression from the previous

 release. That being said, if there is something which is a regression

 that has not been correctly targeted please ping me or a committer to

 help target the issue.

 Thanks,

 Yuanjian Li

>>>


[DISCUSS] Updating documentation hosted for EOL and maintenance releases

2023-08-30 Thread Hyukjin Kwon
Hi all,

I would like to raise a discussion about updating documentation hosted for
EOL and maintenance
versions.

To provide some context, we currently host the documentation for EOL
versions of Apache Spark,
which can be found at links like
https://spark.apache.org/docs/2.3.1/api/python/index.html. Some
of this documentation appears at the top of search results when you google.
The same applies to
maintenance releases. Once technical mistakes in the documentation,
incorrect information,
etc. land mistakenly, they become permanent and/or cannot easily be
fixed, e.g., until
the next maintenance release.

In practice, we’ve already taken steps to update and fix the documentation
for these EOL and
maintenance releases, including:

   - Algolia and DocSearch, which require some changes after each
   individual release
   to enable search results on the Apache Spark website and documentation
   - Regenerating the documentation that was incorrectly generated.
   - Fixing the malformed download page
   - …

I would like to take this a step further, and have doc changes for
improvements and better examples,
in maintenance branches, also land in the hosted documentation for
better usability.
The changes landed in EOL or maintenance branches, according to SemVer,
are usually only bug
fixes, so documentation changes such as fixing examples would not
introduce any surprises.

This documentation is critical to end users; it is the area
I hear most often
that we should improve, and I would eagerly like to improve the
usability here.

*TL;DR*: what I propose is to improve our current practice by
landing updates in the
documentation hosted for EOL and maintenance versions, so that we can show
better search
results for Spark documentation, end users can read the correct information
for the versions they use,
and they can follow the better examples provided in the Spark documentation.


Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Hyukjin Kwon
> Which Python version will run that stored procedure?

All Python versions supported in PySpark.

> How to manage external dependencies?

The existing way we have:
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
In fact, this will use the external dependencies within your Python
interpreter, so you can use all existing conda or virtual environments.
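
(A minimal sketch of that flow from the linked guide: pack a conda env and
ship it with the app; the paths and env name are illustrative:)

  import os
  from pyspark.sql import SparkSession

  # Built beforehand with: conda pack -f -o pyspark_conda_env.tar.gz
  os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
  spark = (SparkSession.builder
           .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
           .getOrCreate())
  # Python code run by Spark, stored procedures included, would then resolve
  # imports from the shipped environment.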

> How to test it via a common CI process?

The existing way of PySpark unit tests; see
https://github.com/apache/spark/tree/master/python/pyspark/tests

> How to manage versions and do upgrades? Migrations?

This is a new feature, so no migration is needed. We will keep
compatibility according to the SemVer we follow.

> Current Python UDF solution handles these problems in a good way since they
> delegate them to project level.

The current UDF solution cannot handle stored procedures because UDFs run on
the worker side; stored procedures are on the driver side.

In my opinion, the concerns raised here look orthogonal to the Stored
Procedure itself.
Let me know if this does not address your concern.

On Thu, 31 Aug 2023 at 12:49, Alexander Shorin  wrote:

> -1
>
> Great idea to ignore the experience of others and copy bad practices back
> for nothing.
>
> If you are familiar with Python ecosystem then you should answer the
> questions:
> 1. Which Python version will run that stored procedure?
> 2. How to manage external dependencies?
> 3. How to test it via a common CI process?
> 4. How to manage versions and do upgrades? Migrations?
>
> Current Python UDF solution handles these problems in a good way since
> they delegate them to project level.
>
> --
> ,,,^..^,,,
>
>
> On Thu, Aug 31, 2023 at 1:29 AM Allison Wang
>  wrote:
>
>> Hi all,
>>
>> I would like to start a discussion on “Python Stored Procedures".
>>
>> This proposal aims to extend Spark SQL by introducing support for stored
>> procedures, starting with Python as the procedural language. This will
>> enable users to run complex logic using Python within their SQL workflows
>> and save these routines in catalogs like HMS for future use.
>>
>> *SPIP*:
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>
>> Looking forward to your feedback!
>>
>> Thanks,
>> Allison
>>
>>


Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Hyukjin Kwon
+1, we should have this. A lot of other projects and DBMSes have this too,
and we currently don't have a way to handle them within Apache Spark.

Disclaimer: I am the shepherd of this SPIP.

On Thu, 31 Aug 2023 at 09:31, Allison Wang
 wrote:

> Hi Mich,
>
> I've updated the permissions on the document. Please feel free to leave
> comments.
> Thanks,
> Allison
>
> On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Great. Please allow edit access on SPIP or ability to comment.
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 30 Aug 2023 at 23:29, Allison Wang
>>  wrote:
>>
>>> Hi all,
>>>
>>> I would like to start a discussion on “Python Stored Procedures".
>>>
>>> This proposal aims to extend Spark SQL by introducing support for stored
>>> procedures, starting with Python as the procedural language. This will
>>> enable users to run complex logic using Python within their SQL workflows
>>> and save these routines in catalogs like HMS for future use.
>>>
>>> *SPIP*:
>>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>>
>>> Looking forward to your feedback!
>>>
>>> Thanks,
>>> Allison
>>>
>>>


Re: Welcome two new Apache Spark committers

2023-08-06 Thread Hyukjin Kwon
Woohoo!

On Mon, 7 Aug 2023 at 11:28, Ruifeng Zheng  wrote:

> Congratulations! Peter and Xiduo!
>
> On Mon, Aug 7, 2023 at 10:13 AM Xiao Li  wrote:
>
>> Congratulations, Peter and Xiduo!
>>
>>
>>
>> Debasish Das  wrote on Sun, Aug 6, 2023 at 19:08:
>>
>>> Congratulations Peter and Xiduo.
>>>
>>> On Sun, Aug 6, 2023, 7:05 PM Wenchen Fan  wrote:
>>>
 Hi all,

 The Spark PMC recently voted to add two new committers. Please join me
 in welcoming them to their new role!

 - Peter Toth (Spark SQL)
 - Xiduo You (Spark SQL)

 They consistently make contributions to the project and clearly showed
 their expertise. We are very excited to have them join as committers.

>>>


Re: LLM script for error message improvement

2023-08-02 Thread Hyukjin Kwon
I think adding that dev tool script to improve error messages is fine.

On Thu, 3 Aug 2023 at 10:24, Haejoon Lee 
wrote:

> Dear contributors, I hope you are doing well!
>
> I see there are contributors who are interested in working on error
> message improvements and contributing persistently, so I want to share an
> LLM-based error message improvement script to help with your contributions.
>
> You can find details about the script at
> https://github.com/apache/spark/pull/41711. I believe this can help your
> error message improvement work, so I encourage you to take a look at the
> pull request and leverage the script.
>
> Please let me know if you have any questions or concerns.
>
> Thanks all for your time and contributions!
>
> Best regards,
>
> Haejoon
>


Re: [VOTE] SPIP: XML data source support

2023-07-29 Thread Hyukjin Kwon
+1

On Sat, 29 Jul 2023 at 22:49, Maciej  wrote:

> +1
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 7/29/23 11:28, Mich Talebzadeh wrote:
>
> +1 for me.
>
> Though Databricks did a good job releasing the code.
>
> GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames
> 
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 29 Jul 2023 at 06:34, Jia Fan 
>  wrote:
>
>>
>> + 1
>>
>>
>> 2023年7月29日 13:06,Adrian Pop-Tifrea  写道:
>>
>> +1, the more data source formats, the better, and if the solution is
>> already thoroughly tested, I say we should go for it.
>>
>> On Sat, Jul 29, 2023, 06:35 Xiao Li  wrote:
>>
>>> +1
>>>
>>> On Fri, Jul 28, 2023 at 15:54 Sean Owen  wrote:
>>>
 +1 I think that porting the package 'as is' into Spark is probably
 worthwhile.
 That's relatively easy; the code is already pretty battle-tested and
 not that big and even originally came from Spark code, so is more or less
 similar already.

 One thing it never got was DSv2 support, which means XML reading would
 still be somewhat behind other formats. (I was not able to implement it.)
 This isn't a necessary goal right now, but would be possibly part of
 the logic of moving it into the Spark code base.

 On Fri, Jul 28, 2023 at 5:38 PM Sandip Agarwala
 
  wrote:

> Dear Spark community,
>
> I would like to start the vote for "SPIP: XML data source support".
>
> XML is a widely used data format. An external spark-xml package (
> https://github.com/databricks/spark-xml) is available to read and
> write XML data in spark. Making spark-xml built-in will provide a better
> user experience for Spark SQL and structured streaming. The proposal is to
> inline code from the spark-xml package.
>
> SPIP link:
>
> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>
> JIRA:
> https://issues.apache.org/jira/browse/SPARK-44265
>
> Discussion Thread:
> https://lists.apache.org/thread/q32hxgsp738wom03mgpg9ykj9nr2n1fh
>
> Please vote on the SPIP for the next 72 hours:
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
>
> Thanks, Sandip
>

>>


Re: Spark 3.0.0 EOL

2023-07-24 Thread Hyukjin Kwon
It's already EOL

On Mon, Jul 24, 2023 at 4:17 PM Pralabh Kumar 
wrote:

> Hi Dev Team
>
> If possible, can you please provide the Spark 3.0.0 EOL timelines.
>
> Regards
> Pralabh Kumar
>
>
>
>
>


Re: Spark Docker Official Image is now available

2023-07-19 Thread Hyukjin Kwon
This is amazing, finally!

On Thu, 20 Jul 2023 at 10:10, Yikun Jiang  wrote:

> The spark Docker Official Image is now available:
> https://hub.docker.com/_/spark
>
> $ docker run -it --rm *spark* /opt/spark/bin/spark-shell
> $ docker run -it --rm *spark*:python3 /opt/spark/bin/pyspark
> $ docker run -it --rm *spark*:r /opt/spark/bin/sparkR
>
> We had a longer review journey than we expected. If you are also
> interested in this journey, you can see more in:
>
> https://github.com/docker-library/official-images/pull/13089
>
> Thanks to everyone who helps in the Docker and Apache Spark community!
>
> Some background you might want to know:
> *- apache/spark*: https://hub.docker.com/r/apache/spark, the Apache Spark
> Docker image, published by the *Apache Spark community* when
> Apache Spark is released, with no further updates.
> *- spark*: https://hub.docker.com/_/spark, the Docker Official Image,
> published by the *Docker community*, which keeps actively rebuilding it for
> updates and security fixes.
> - The source repo of *apache/spark *and *spark: *
> https://github.com/apache/spark-docker
>
> See more in:
> [1] [DISCUSS] SPIP: Support Docker Official Image for Spark:
> https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3
> [2] [VOTE] SPIP: Support Docker Official Image for Spark:
> https://lists.apache.org/thread/ro6olodm1jzdffwjx4oc7ol7oh6kshbl
> [3] https://github.com/docker-library/official-images/pull/13089
> [4]
> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o/
> [5] https://issues.apache.org/jira/browse/SPARK-40513
>
> Regards,
> Yikun
>


Re: [DISCUSS] SPIP: XML data source support

2023-07-19 Thread Hyukjin Kwon
Here are the benefits of having it as a built-in source:

   - We can leverage the community to improve Spark XML (not within the
   Databricks repositories).
   - We can share the same core for XML expressions (e.g., from_xml and
   to_xml like from_csv, from_json, etc.); see the sketch at the end of
   this message.
   - It further embraces a commonly used data source, just like the
   existing built-in data sources we have.
   - Users wouldn't have to set up jars or Maven coordinates; today, if
   they have network problems, for example, it is harder to use the
   package at all.

XML is arguably more widely used than CSV, which is already a built-in
source; see e.g., https://insights.stackoverflow.com/trends?tags=xml%2Cjson%2Ccsv and
https://www.reddit.com/r/programming/comments/bak5qt/a_comparison_of_serialization_formats_csv_json/
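
For illustration, a minimal PySpark sketch of where this could go. The read
path below works today only with the external spark-xml package on the
classpath, and the expression-level support is a hypothetical mirror of
from_json/to_json, not a finalized API; the file path and rowTag value are
illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Works today, but only after adding the external spark-xml package.
books = spark.read.format("xml").option("rowTag", "book").load("books.xml")
books.printSchema()

# Hypothetical built-in expressions mirroring from_json/to_json, e.g.:
#   df.select(from_xml(df.raw, "id INT, title STRING"))
#   df.select(to_xml(struct(df.id, df.title)))

Once built-in, neither the extra jar nor the Maven coordinates would be
needed for the read above.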


On Wed, 19 Jul 2023 at 17:51, Martin Andersson 
wrote:

> How much of an effort is it to use the spark-xml library today? What's the
> drawback to keeping this as an external library as-is?
>
> Best Regards, Martin
> --
> *From:* Hyukjin Kwon 
> *Sent:* Wednesday, July 19, 2023 01:27
> *To:* Sandip Agarwala 
> *Cc:* dev@spark.apache.org 
> *Subject:* Re: [DISCUSS] SPIP: XML data source support
>
>
> EXTERNAL SENDER. Do not click links or open attachments unless you
> recognize the sender and know the content is safe. DO NOT provide your
> username or password.
>
> Yeah I support this. XML is a pretty outdated format TBH but still used in
> many legacy systems. For example, the Wikipedia dump is one case.
>
> Even when you look at the stats for CSV vs XML vs JSON, some show that
> XML is more used than CSV.
>
> On Wed, Jul 19, 2023 at 12:58 AM Sandip Agarwala <
> sandip.agarw...@databricks.com> wrote:
>
> Dear Spark community,
>
> I would like to start a discussion on "XML data source support".
>
> XML is a widely used data format. An external spark-xml package (
> https://github.com/databricks/spark-xml) is available to read and write
> XML data in spark. Making spark-xml built-in will provide a better user
> experience for Spark SQL and structured streaming. The proposal is to
> inline code from the spark-xml package.
> I am collaborating with Hyukjin Kwon, who is the original author of
> spark-xml, for this effort.
>
> SPIP link:
>
> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>
> JIRA:
> https://issues.apache.org/jira/browse/SPARK-44265
>
> Looking forward to your feedback.
> Thanks, Sandip
>
>


Re: [DISCUSS] SPIP: XML data source support

2023-07-18 Thread Hyukjin Kwon
Yeah I support this. XML is a pretty outdated format TBH but still used in
many legacy systems. For example, the Wikipedia dump is one case.

Even when you look at the stats for CSV vs XML vs JSON, some show that XML
is more used than CSV.

On Wed, Jul 19, 2023 at 12:58 AM Sandip Agarwala <
sandip.agarw...@databricks.com> wrote:

> Dear Spark community,
>
> I would like to start a discussion on "XML data source support".
>
> XML is a widely used data format. An external spark-xml package (
> https://github.com/databricks/spark-xml) is available to read and write
> XML data in spark. Making spark-xml built-in will provide a better user
> experience for Spark SQL and structured streaming. The proposal is to
> inline code from the spark-xml package.
> I am collaborating with Hyukjin Kwon, who is the original author of
> spark-xml, for this effort.
>
> SPIP link:
>
> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>
> JIRA:
> https://issues.apache.org/jira/browse/SPARK-44265
>
> Looking forward to your feedback.
> Thanks, Sandip
>


Re: [VOTE][SPIP] Python Data Source API

2023-07-05 Thread Hyukjin Kwon
+1.

See https://youtu.be/yj7XlTB1Jvc?t=604 :-).

On Thu, 6 Jul 2023 at 09:15, Allison Wang
 wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Python Data Source API.
>
> The high-level summary for the SPIP is that it aims to introduce a simple
> API in Python for Data Sources. The idea is to enable Python developers to
> create data sources without learning Scala or dealing with the complexities
> of the current data source APIs. This would make Spark more accessible to
> the wider Python developer community.
>
> References:
>
>- SPIP doc
>
> 
>- JIRA ticket 
>- Discussion thread
>
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
>
> Thanks,
> Allison
>


Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing.
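
For anyone who wants a quick feel for it, here is a short sketch based on
the usage examples in the repository README at the time; treat the exact
method names, the URL, and the prompt as illustrative rather than a stable
API:

from pyspark_ai import SparkAI

spark_ai = SparkAI()  # uses an LLM under the hood; needs API credentials
spark_ai.activate()   # enables the df.ai namespace on DataFrames

# Create a DataFrame from a web page, then query it in plain English
# (illustrative URL and prompt).
auto_df = spark_ai.create_df("https://example.com/2022-auto-sales")
auto_df.ai.transform("Which brand sold the most cars?").show()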

On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri 
wrote:

> This is wonderful news!
>
> On Tue, 4 Jul 2023 at 01:14, Gengliang Wang  wrote:
>
>> Dear Apache Spark community,
>>
>> We are delighted to announce the launch of a groundbreaking tool that
>> aims to make Apache Spark more user-friendly and accessible - the
>> English SDK . Powered by
>> the application of Generative AI, the English SDK
>>  allows you to execute
>> complex tasks with simple English instructions. This exciting news was 
>> announced
>> recently at the Data+AI Summit
>>  and also introduced
>> through a detailed blog post
>> 
>> .
>>
>> Now, we need your invaluable feedback and contributions. The aim of the
>> English SDK is not only to simplify and enrich your Apache Spark experience
>> but also to grow with the community. We're calling upon Spark developers
>> and users to explore this innovative tool, offer your insights, provide
>> feedback, and contribute to its evolution.
>>
>> You can find more details about the SDK and usage examples on the GitHub
>> repository https://github.com/databrickslabs/pyspark-ai/. If you have
>> any feedback or suggestions, please feel free to open an issue directly on
>> the repository. We are actively monitoring the issues and value your
>> insights.
>>
>> We also welcome pull requests and are eager to see how you might extend
>> or refine this tool. Let's come together to continue making Apache Spark
>> more approachable and user-friendly.
>>
>> Thank you in advance for your attention and involvement. We look forward
>> to hearing your thoughts and seeing your contributions!
>>
>> Best,
>> Gengliang Wang
>>
> --
>
>
> *Farshid Ashouri*,
> Senior Vice President,
> J.P. Morgan & Chase Co.
> +44 7932 650 788
>
>


Re: Time for Spark v3.5.0 release

2023-07-03 Thread Hyukjin Kwon
Yeah, postponing by one day shouldn't be a big deal.

On Tue, Jul 4, 2023 at 7:10 AM Yuanjian Li  wrote:

> Hi All,
>
> According to the Spark versioning policy at
> https://spark.apache.org/versioning-policy.html, should we cut
> *branch-3.5* on *July 17th, 2023*? (We initially proposed July 16th,
> but since it's a Sunday, I suggest we postpone it by one day).
>
> I would like to volunteer as the release manager for Apache Spark 3.5.0.
>
> Best,
> Yuanjian
>


Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks!

On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan 
wrote:

>
> Thanks Dongjoon !
>
> Regards,
> Mridul
>
> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun  wrote:
>
>> We are happy to announce the availability of Apache Spark 3.4.1!
>>
>> Spark 3.4.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.4 maintenance branch of Spark. We
>> strongly
>> recommend all 3.4 users to upgrade to this stable release.
>>
>> To download Spark 3.4.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-4-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>>
>> Dongjoon Hyun
>>
>


Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Hyukjin Kwon
+1

On Thu, 22 Jun 2023 at 02:20, Jacek Laskowski  wrote:

> +0
>
> Pozdrawiam,
> Jacek Laskowski
> 
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Wed, Jun 21, 2023 at 5:11 PM Amanda Liu 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: PySpark Test Framework.
>>
>> The high-level summary for the SPIP is that it proposes an official test
>> framework for PySpark. Currently, there are only disparate open-source
>> repos and blog posts for PySpark testing resources. We can streamline and
>> simplify the testing process by incorporating test features, such as a
>> PySpark Test Base class (which allows tests to share Spark sessions) and
>> test util functions (for example, asserting dataframe and schema equality).
>>
>> *SPIP doc:*
>> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
>>
>> *JIRA ticket:* https://issues.apache.org/jira/browse/SPARK-44042
>>
>> *Discussion thread:*
>> https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n
>>
>> Please vote on the SPIP for the next 72 hours:
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because __.
>>
>> Thank you!
>>
>> Best,
>> Amanda Liu
>>
>


Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-21 Thread Hyukjin Kwon
+1

On Wed, 21 Jun 2023 at 14:23, yangjie01  wrote:

> +1
>
>
> On 2023/6/21 13:20, "L. C. Hsieh"  wrote:
>
>
> +1
>
>
> On Tue, Jun 20, 2023 at 8:48 PM Dongjoon Hyun  wrote:
> >
> > +1
> >
> > Dongjoon
> >
> > On 2023/06/20 02:51:32 Jia Fan wrote:
> > > +1
> > >
> > > On Tue, Jun 20, 2023 at 10:41, Dongjoon Hyun  wrote:
> > >
> > > > Please vote on releasing the following candidate as Apache Spark
> version
> > > > 3.4.1.
> > > >
> > > > The vote is open until June 23rd 1AM (PST) and passes if a majority
> +1 PMC
> > > > votes are cast, with a minimum of 3 +1 votes.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 3.4.1
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > > To learn more about Apache Spark, please see
> > > > https://spark.apache.org/
> > > >
> > > > The tag to be voted on is v3.4.1-rc1 (commit
> > > > 6b1ff22dde1ead51cbf370be6e48a802daae58b6)
> > > > https://github.com/apache/spark/tree/v3.4.1-rc1
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
> > > >
> > > > Signatures used for Spark RCs can be found in this file:
> > > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > > >
> > > > The staging repository for this release can be found at:
> > > >
> > > > https://repository.apache.org/content/repositories/orgapachespark-1443/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
> > > >
> > > > The list of bug fixes going into 3.4.1 can be found at the following
> URL:
> > > > https://issues.apache.org/jira/projects/SPARK/versions/12352874
> > > >
> > > > This release is using the release script of the tag v3.4.1-rc1.
> > > >
> > > > FAQ
> > > >
> > > > =
> > > > How can I help test this release?
> > > > =
> > > >
> > > > If you are a Spark user, you can help us test this release by taking
> > > > an existing Spark workload and running on this release candidate,
> then
> > > > reporting any regressions.
> > > >
> > > > If you're working in PySpark you can set up a virtual env and install
> > > > the current RC and see if anything important breaks, in the
> Java/Scala
> > > > you can add the staging repository to your projects resolvers and
> test
> > > > with the RC (make sure to clean up the artifact cache before/after so
> > > > you don't end up building with an out-of-date RC going forward).
> > > >
> > > > ===
> > > > What should happen to JIRA tickets still targeting 3.4.1?
> > > > ===
> > > >
> > > > The current list of open tickets targeted at 3.4.1 can be found at:
> > > > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > > > Version/s" = 3.4.1
> > > >
> > > > Committers should look at those and triage. Extremely important bug
> > > > fixes, documentation, and API tweaks that impact compatibility should
> > > > be worked on immediately. Everything else please retarget to an
> > > > appropriate release.
> > > >
> > > > ==
> > > > But my bug isn't fixed?
> > > > ==
> > > >
> > > > In order to make timely releases, we will typically not hold the
> > > > release unless the bug in question is a regression from the previous
> > > > release. That being said, if there is something which is a regression
> > > > that has not been correctly targeted please ping me or a committer to
> > > > help target the issue.
> > > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Hyukjin Kwon
Actually, I support this idea in that Python developers wouldn't have to
learn Scala (or deal with separate packaging) to write their own source.
This is especially crucial when you want to write a simple data source
that interacts with the Python ecosystem.
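
To make this concrete, here is a purely hypothetical sketch of what a
simple batch source could look like under such an API. The module, class,
and method names below are my assumptions for illustration; the SPIP doc is
the authoritative reference:

# Hypothetical API shape; all names here are illustrative, not final.
from pyspark.sql.datasource import DataSource, DataSourceReader


class FibonacciSource(DataSource):
    """A toy source yielding Fibonacci numbers, written in pure Python."""

    @classmethod
    def name(cls):
        return "fibonacci"

    def schema(self):
        return "n INT, value BIGINT"

    def reader(self, schema):
        return FibonacciReader(int(self.options.get("count", "10")))


class FibonacciReader(DataSourceReader):
    def __init__(self, count):
        self.count = count

    def read(self, partition):
        a, b = 0, 1
        for n in range(self.count):
            yield (n, a)
            a, b = b, a + b

A user would then register the class and read it like any other format,
e.g., spark.read.format("fibonacci").option("count", 20).load(), without
touching Scala or packaging a jar.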

On Tue, 20 Jun 2023 at 03:08, Denny Lee  wrote:

> Slightly biased, but per my conversations - this would be awesome to have!
>
> On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari 
> wrote:
>
>> I would definitely use it - if it's available :)
>>
>> On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,  wrote:
>>
>>> Hi Allison and devs,
>>>
>>> Although I was against this idea at first sight (probably because I'm a
>>> Scala dev), I think it could work as long as there are people who'd be
>>> interested in such an API. Were there any? I'm just curious. I've seen no
>>> emails requesting it.
>>>
>>> I also doubt that Python devs would like to work on new data sources but
>>> support their wishes wholeheartedly :)
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> "The Internals Of" Online Books 
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>> 
>>>
>>>
>>> On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
>>>  wrote:
>>>
 Hi everyone,

 I would like to start a discussion on “Python Data Source API”.

 This proposal aims to introduce a simple API in Python for Data
 Sources. The idea is to enable Python developers to create data sources
 without having to learn Scala or deal with the complexities of the current
 data source APIs. The goal is to make a Python-based API that is simple and
 easy to use, thus making Spark more accessible to the wider Python
 developer community. This proposed approach is based on the recently
 introduced Python user-defined table functions with extensions to support
 data sources.

 *SPIP Doc*:
 https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing

 *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076

 Looking forward to your feedback.

 Thanks,
 Allison

>>>


Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-18 Thread Hyukjin Kwon
With the spirit of open source, -1. At least there have been other cases
mentioned in the discussion thread, and doing this solely for one specific
vendor would not solve the problem; I also wouldn't expect to cast a
vote for each case publicly.
I would prefer to start this with a narrower scope, for example, contacting
the vendor first and/or starting from a private mailing list instead of
raising this publicly on the dev mailing list.


On Sat, 17 Jun 2023 at 07:22, Dongjoon Hyun  wrote:

> Here are my replies, Sean.
>
> > Since we're here, fine: I vote -1, simply because this states no reason
> for the action at all.
>
> Thank you for your explicit vote because
> this vote was explicitly triggered by this controversial comment,
> "I do not see some police action from the PMC must follow".
>
>
> > I would again ask we not simply repeat the same thread again.
>
> We are at the next stage after the previous discussion, which identified
> our diverse perspectives. A vote is the only official way to reach a
> conclusion, isn't it?
>
>
> > - Relevant ASF policy seems to say this is fine,
> > as argued at
> https://lists.apache.org/thread/p15tc772j9qwyvn852sh8ksmzrol9cof
>
> I already disagreed with the above point, "this is fine", at
> https://lists.apache.org/thread/crp01jg4wr27w10mc9dsbsogxm1qj6co .
>
>
> > - There is no argument any of this has caused a problem
> > for the community anyway
>
> Shall we focus on the legal scope in this vote, since we are
> talking about ASF branding policy? For the record, the above perspective
> implies the Apache Spark PMC should ignore ASF branding policy.
>
>
> > Given that this has stopped being about ASF policy, ...
>
> I want to emphasize that this statement vote is only about
> Apache Spark PMC's stance ("Ask or not Ask").
> If the vote decides not to ask, that's it.
>
>
> Dongjoon.
>
>
> On Fri, Jun 16, 2023 at 2:23 PM Sean Owen  wrote:
>
>> On Fri, Jun 16, 2023 at 3:58 PM Dongjoon Hyun 
>> wrote:
>>
>>> I started the thread about already publicly visible version issues
>>> according to the ASF PMC communication guideline. It's no confidential,
>>> personal, or security-related stuff. Are you insisting this is confidential?
>>>
>>
>> Discussion about a particular company should be on private@ - this is
>> IMHO like "personnel matters", in the doc you link. The principle is that
>> discussing whether an entity is doing something right or wrong is better in
>> private, because, hey, if the conclusion is "nothing's wrong here" then you
>> avoid disseminating any implication to the contrary.
>>
>> I agreed with you, there's some value in discussing the general issue on
>> dev@. (I even said who the company was, though, it was I think clear
>> before)
>>
>> But, your thread title here is: "Apache Spark PMC asks Databricks to
>> differentiate its Spark version string"
>> (You separately claim this vote is about whether the PMC has a role here,
>> but, that's plainly not how this thread begins.)
>>
>> Given that this has stopped being about ASF policy, and seems to be about
>> taking some action related to a company, I find it inappropriate again for
>> dev@, for exactly the reason I gave above. We have a PMC member
>> repeating this claim over and over, without support. This is why we don't
>> do this in public.
>>
>>
>>
>>> May I ask which relevant context you are insisting not to receive
>>> specifically? I gave the specific examples (UI/logs/screenshot), and got
>>> the specific legal advice from `legal-discuss@` and replied why the
>>> version should be different.
>>>
>>
>> It is the thread I linked in my reply:
>> https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
>> This has already been discussed at length, and you're aware of it, but,
>> didn't mention it. I think that's critical; your text contains no problem
>> statement at all by itself.
>>
>> Since we're here, fine: I vote -1, simply because this states no reason
>> for the action at all.
>> If we assume the thread ^^^ above is the extent of the logic, then, -1
>> for the following reasons:
>> - Relevant ASF policy seems to say this is fine, as argued at
>> https://lists.apache.org/thread/p15tc772j9qwyvn852sh8ksmzrol9cof
>> - There is no argument any of this has caused a problem for the community
>> anyway; there is just nothing to 'fix'
>>
>> I would again ask we not simply repeat the same thread again.
>>
>>


Re: [VOTE][RESULT] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-18 Thread Hyukjin Kwon
The major concern raised in the thread was that we should initiate the
discussion on the items below first:
- Apache Spark 4.0.0 Preview (and Dates)
- Apache Spark 4.0.0 Items
- Apache Spark 4.0.0 Plan Adjustment

before setting the timeline for Spark 4.0.0, because we're unclear on the
overall picture of Spark 4.0.0. So discussing the 4.0.0 timeline first is
procedurally the wrong order.
The vote passed as a procedural matter, but I would prefer to treat this
as a tentative date; we will probably need another vote to adjust the
date considering the plans, preview dates, and items we aim for in 4.0.0.


On Sat, 17 Jun 2023 at 04:33, Dongjoon Hyun  wrote:

> This was a part of the following on-going discussions.
>
> 2023-05-28  Apache Spark 3.5.0 Expectations (?)
> https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
>
> 2023-05-30 Apache Spark 4.0 Timeframe?
> https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6
>
> 2023-06-05 ASF policy violation and Scala version issues
> https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
>
> 2023-06-12 [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)
> https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb
>
> I'm looking forward to seeing the upcoming detailed discussions including
> the following
> - Apache Spark 4.0.0 Preview (and Dates)
> - Apache Spark 4.0.0 Items
> - Apache Spark 4.0.0 Plan Adjustment
>
> Please initiate the discussion.
>
> Thanks,
> Dongjoon.
>
>
> On 2023/06/16 19:30:42 Dongjoon Hyun wrote:
> > The vote passes with 6 +1s (4 binding +1s), one -0, and one -1.
> > Thank you all for your participation and
> > especially your additional comments during this voting,
> > Mridul, Hyukjin, and Jungtaek.
> >
> > (* = binding)
> > +1:
> > - Dongjoon Hyun *
> > - Huaxin Gao *
> > - Liang-Chi Hsieh *
> > - Kazuyuki Tanimura
> > - Chao Sun *
> > - Jia Fan
> >
> > -0: Holden Karau
> >
> > -1: Xiao Li *
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-15 Thread Hyukjin Kwon
I am supportive of setting the timeline for Spark 4.0, and I think it has
to be done soon.
If my understanding is correct, we first need to set the goals and
major changes for 4.0.0? I agree with that too.
Having a preview sounds good to me too, so people can try it out.

Given all that, shall we settle the preview date, major items/targets,
and 4.0 timeline together in a discussion thread?
I think we can initiate this discussion, and start another vote for the
tentative date.


On Fri, 16 Jun 2023 at 00:54, Xiao Li  wrote:

> Since the vote includes the release date for Spark 4.0, I cast my vote as
> -1, in light of the discussions from the three other PMCs.
>
> Also, considering recent discussions on the dev list, numerous breaking
> changes, such as Scala 2.13, JDK 17 support, and pandas 2.0 support, will
> be incorporated into Spark 4.0. I propose that we first release a preview
> so that the entire community can provide more comprehensive feedback before
> the final release.
>
>
> On Mon, Jun 12, 2023 at 19:28, Jungtaek Lim  wrote:
>
>> I concur with Holden and Mridul. Let's build a plan before we call the
>> tentative deadline. I understand setting a tentative deadline would
>> definitely help in pushing back features which "never ever end", but at
>> least we may want to list the features and discuss priority. It is still
>> possible that we might even want to treat some features as hard blockers on
>> the release for any reason, based on discussion of course.
>>
>> On Tue, Jun 13, 2023 at 10:58 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> I agree with Holden, we should have some understanding of what we are
>>> targeting for 4.0, given it is a major ver bump - and work from there on
>>> the release date.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Mon, Jun 12, 2023 at 8:53 PM Jia Fan  wrote:
>>>
 By the way, like Holden said, what are the big features for 4.0.0? I think
 a very big version change always brings some differences.

 On Tue, Jun 13, 2023 at 08:25, Jia Fan  wrote:

> +1
>
> 
>
> Jia Fan
>
>
>
> On Jun 13, 2023 at 03:51, Chao Sun  wrote:
>
> +1
>
> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
>  wrote:
>
>> +1 (non-binding)
>>
>> Thank you!
>> Kazu
>>
>>
>> On Jun 12, 2023, at 11:32 AM, Holden Karau 
>> wrote:
>>
>> -0
>>
>> I'd like to see more of a doc around what we're planning on for a 4.0
>> before we pick a target release date etc. (feels like cart before the
>> horse).
>>
>> But it's a weak preference.
>>
>> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li 
>> wrote:
>>
>>> Thanks for starting the vote.
>>>
>>> I do have a concern about the target release date of Spark 4.0.
>>>
>>> On Mon, Jun 12, 2023 at 11:09, L. C. Hsieh  wrote:
>>>
 +1

 On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
 wrote:
 >
 > +1
 >
 > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun <
 dongj...@apache.org> wrote:
 >>
 >> +1
 >>
 >> Dongjoon
 >>
 >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
 >> > Please vote on the release plan for Apache Spark 4.0.0.
 >> >
 >> > The vote is open until June 16th 1AM (PST) and passes if a
 majority +1 PMC
 >> > votes are cast, with a minimum of 3 +1 votes.
 >> >
 >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
 >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
 >> >
 >> > ===
 >> > Apache Spark 4.0.0 Release Plan
 >> > ===
 >> >
 >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
 branch.
 >> >
 >> > 2. Creating `branch-4.0` on April 1st, 2024.
 >> >
 >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
 >> >
 >> > 4. Apache Spark 4.0.0 Release in June, 2024.
 >> >
 >>
 >>
 -
 >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >>


 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>>
>


Re: Add user as a contributor

2023-06-14 Thread Hyukjin Kwon
You can open a PR first. When it's merged, the ticket will be assigned to
you along with contributor access.

On Thu, Jun 15, 2023 at 1:07 PM Aman Raj 
wrote:

> Hi team,
>
> Can someone please help giving contributor access to amanraj2520 username.
> I have raised a Spark Ticket : issues.apache.org/jira/browse/SPARK-44058.
> I am not able to assign this to myself.
>
> Thanks,
> Aman.
>


Re: [DISCUSS] SPIP: Add PySpark Test Framework

2023-06-13 Thread Hyukjin Kwon
Yeah, I have been thinking about this too, and Holden did some work here
that this SPIP will reuse. I support this.
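
For illustration, a minimal sketch of the two pieces being discussed: a base
class that shares one SparkSession across the tests in a class, and a naive
equality util. The names here (SparkTestCase, assert_df_equal) are
placeholders, not the proposed API:

import unittest

from pyspark.sql import SparkSession


class SparkTestCase(unittest.TestCase):
    """Shares a single local SparkSession across all tests in a class."""

    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[2]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()


def assert_df_equal(actual, expected):
    """Naive check: same schema and same rows, ignoring row order."""
    assert actual.schema == expected.schema, "schemas differ"
    assert sorted(actual.collect()) == sorted(expected.collect()), "rows differ"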

On Wed, 14 Jun 2023 at 08:10, Amanda Liu 
wrote:

> Hi all,
>
> I'd like to start a discussion about implementing an official PySpark test
> framework. Currently, there's no official test framework, but only various
> open-source repos and blog posts.
>
> Many of these open-source resources are very popular, which demonstrates
> user demand for PySpark testing capabilities. spark-testing-base
>  has 1.4k stars, and chispa
>  has 532k downloads/month. However,
> it can be confusing for users to piece together disparate resources to
> write their own PySpark tests (see The Elephant in the Room: How to Write
> PySpark Tests
> 
> ).
>
> We can streamline and simplify the testing process by incorporating test
> features, such as a PySpark Test Base class (which allows tests to share
> Spark sessions) and test util functions (for example, asserting dataframe
> and schema equality).
>
> Please see the SPIP document attached:
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
> And the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44042
>
> I would appreciate it if you could share your thoughts on this proposal.
>
> Thank you!
> Amanda Liu
>


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-31 Thread Hyukjin Kwon
Thanks all. I created a JIRA at
https://issues.apache.org/jira/browse/SPARK-43907.

On Mon, 29 May 2023 at 09:12, Hyukjin Kwon  wrote:

> Yes, some were cases like you mentioned.
> But I found myself explaining that reasoning to a lot of people, not only
> developers but also users - I was asked at a conference, over email, on
> Slack, internally and externally.
> Then I realised that maybe we're doing something wrong. This is based on my
> experience, so I wanted to open a discussion and see what others think about
> this :-).
>
>
>
>
> On Sat, 27 May 2023 at 00:19, Maciej  wrote:
>
>> Weren't some of these functions provided only for compatibility  and
>> intentionally left out of the language APIs?
>>
>> --
>> Best regards,
>> Maciej
>>
>> On 5/25/23 23:21, Hyukjin Kwon wrote:
>>
>> I don't think it'd be a release blocker .. I think we can implement them
>> across multiple releases.
>>
>> On Fri, May 26, 2023 at 1:01 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for the proposal.
>>>
>>> I'm wondering if we are going to consider them as release blockers or
>>> not.
>>>
>>> In general, I don't think those SQL functions should be available in all
>>> languages as release blockers.
>>> (Especially in R or new Spark Connect languages like Go and Rust).
>>>
>>> If they are not release blockers, we may allow some existing or future
>>> community PRs only before feature freeze (= branch cut).
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Wed, May 24, 2023 at 7:09 PM Jia Fan  wrote:
>>>
>>>> +1
>>>> It is important that different APIs can be used to call the same
>>>> function
>>>>
>>>>> On Thu, May 25, 2023 at 01:48, Ryan Berti  wrote:
>>>>
>>>>> During my recent experience developing functions, I found that
>>>>> identifying locations (sql + connect functions.scala + functions.py,
>>>>> FunctionRegistry, + whatever is required for R) and standards for adding
>>>>> function signatures was not straight forward (should you use optional args
>>>>> or overload functions? which col/lit helpers should be used when?). Are
>>>>> there docs describing all of the locations + standards for defining a
>>>>> function? If not, that'd be great to have too.
>>>>>
>>>>> Ryan Berti
>>>>>
>>>>> Senior Data Engineer  |  Ads DE
>>>>>
>>>>> M 7023217573
>>>>>
>>>>> 5808 W Sunset Blvd  |  Los Angeles, CA 90028
>>>>> <https://www.google.com/maps/search/5808+W+Sunset+Blvd%C2%A0+%7C%C2%A0+Los+Angeles,+CA+90028?entry=gmail=g>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 24, 2023 at 12:44 AM Enrico Minack 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Functions available in SQL (more general in one API) should be
>>>>>> available in all APIs. I am very much in favor of this.
>>>>>>
>>>>>> Enrico
>>>>>>
>>>>>>
>>>>>> Am 24.05.23 um 09:41 schrieb Hyukjin Kwon:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like to discuss adding all SQL functions into Scala, Python
>>>>>> and R API.
>>>>>> We have SQL functions that do not exist in Scala, Python and R around
>>>>>> 175.
>>>>>> For example, we don’t have pyspark.sql.functions.percentile but you
>>>>>> can invoke
>>>>>> it as a SQL function, e.g., SELECT percentile(...).
>>>>>>
>>>>>> The reason why we do not have all functions in the first place is
>>>>>> that we want to
>>>>>> only add commonly used functions, see also
>>>>>> https://github.com/apache/spark/pull/21318 (which I agreed at that
>>>>>> time)
>>>>>>
>>>>>> However, this has been raised multiple times over years, from the OSS
>>>>>> community, dev mailing list, JIRAs, stackoverflow, etc.
>>>>>> Seems it’s confusing about which function is available or not.
>>>>>>
>>>>>> Yes, we have a workaround. We can call all expressions by expr("...")
>>>>>>  or call_udf("...", Columns ...)
>>>>>> But still it seems that it’s not very user-friendly because they
>>>>>> expect them available under the functions namespace.
>>>>>>
>>>>>> Therefore, I would like to propose adding all expressions into all
>>>>>> languages so that Spark is simpler and less confusing, e.g., which API is
>>>>>> in functions or not.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>>
>>>>>>
>>


Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Hyukjin Kwon
While I support going forward with a higher version, actually using Scala
2.13 by default is a big deal, especially given that:

   - Users would likely download the built-in version assuming that it’s
   backward binary compatible.
   - PyPI doesn't allow specifying the Scala version, meaning that users
   wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.

I wonder if it’s safer to do it in Spark 4 (which I believe will be
discussed soon).


On Mon, 29 May 2023 at 13:21, Jia Fan  wrote:

> Thanks Dongjoon!
> There are some tickets I want to share.
> SPARK-39420 Support ANALYZE TABLE on v2 tables
> SPARK-42750 Support INSERT INTO by name
> SPARK-43521 Support CREATE TABLE LIKE FILE
>
> On Mon, May 29, 2023 at 08:42, Dongjoon Hyun  wrote:
>
>> Hi, All.
>>
>> Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and
>> currently a few notable things are under discussions in the mailing list.
>>
>> I believe it's a good time to share a short summary list (containing both
>> completed and in-progress items) to give a highlight in advance and to
>> collect your targets too.
>>
>> Please share your expectations or working items if you want to prioritize
>> them more in the community in Apache Spark 3.5.0 timeframe.
>>
>> (Sorted by ID)
>> SPARK-40497 Upgrade Scala 2.13.11
>> SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
>> SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
>> 1.12.316)
>> SPARK-43024 Upgrade Pandas to 2.0.0
>> SPARK-43200 Remove Hadoop 2 reference in docs
>> SPARK-43347 Remove Python 3.7 Support
>> SPARK-43348 Support Python 3.8 in PyPy3
>> SPARK-43351 Add Spark Connect Go prototype code and example
>> SPARK-43379 Deprecate old Java 8 versions prior to 8u371
>> SPARK-43394 Upgrade to Maven 3.8.8
>> SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
>> SPARK-43446 Upgrade to Apache Arrow 12.0.0
>> SPARK-43447 Support R 4.3.0
>> SPARK-43489 Remove protobuf 2.5.0
>> SPARK-43519 Bump Parquet to 1.13.1
>> SPARK-43581 Upgrade kubernetes-client to 6.6.2
>> SPARK-43588 Upgrade to ASM 9.5
>> SPARK-43600 Update K8s doc to recommend K8s 1.24+
>> SPARK-43738 Upgrade to DropWizard Metrics 4.2.18
>> SPARK-43831 Build and Run Spark on Java 21
>> SPARK-43832 Upgrade to Scala 2.12.18
>> SPARK-43836 Make Scala 2.13 as default in Spark 3.5
>> SPARK-43842 Upgrade gcs-connector to 2.2.14
>> SPARK-43844 Update to ORC 1.9.0
>> UMBRELLA: Add SQL functions into Scala, Python and R API
>>
>> Thanks,
>> Dongjoon.
>>
>> PS. The above is not a list of release blockers. Instead, it could be a
>> nice-to-have from someone's perspective.
>>
>


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-28 Thread Hyukjin Kwon
Yes, some were cases like you mentioned.
But I found myself explaining that reasoning to a lot of people, not only
developers but also users - I was asked at a conference, over email, on
Slack, internally and externally.
Then I realised that maybe we're doing something wrong. This is based on my
experience, so I wanted to open a discussion and see what others think about
this :-).




On Sat, 27 May 2023 at 00:19, Maciej  wrote:

> Weren't some of these functions provided only for compatibility  and
> intentionally left out of the language APIs?
>
> --
> Best regards,
> Maciej
>
> On 5/25/23 23:21, Hyukjin Kwon wrote:
>
> I don't think it'd be a release blocker .. I think we can implement them
> across multiple releases.
>
> On Fri, May 26, 2023 at 1:01 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the proposal.
>>
>> I'm wondering if we are going to consider them as release blockers or not.
>>
>> In general, I don't think those SQL functions should be available in all
>> languages as release blockers.
>> (Especially in R or new Spark Connect languages like Go and Rust).
>>
>> If they are not release blockers, we may allow some existing or future
>> community PRs only before feature freeze (= branch cut).
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Wed, May 24, 2023 at 7:09 PM Jia Fan  wrote:
>>
>>> +1
>>> It is important that different APIs can be used to call the same function
>>>
>>> On Thu, May 25, 2023 at 01:48, Ryan Berti  wrote:
>>>
>>>> During my recent experience developing functions, I found that
>>>> identifying locations (sql + connect functions.scala + functions.py,
>>>> FunctionRegistry, + whatever is required for R) and standards for adding
>>>> function signatures was not straight forward (should you use optional args
>>>> or overload functions? which col/lit helpers should be used when?). Are
>>>> there docs describing all of the locations + standards for defining a
>>>> function? If not, that'd be great to have too.
>>>>
>>>> Ryan Berti
>>>>
>>>> Senior Data Engineer  |  Ads DE
>>>>
>>>> M 7023217573
>>>>
>>>> 5808 W Sunset Blvd  |  Los Angeles, CA 90028
>>>> <https://www.google.com/maps/search/5808+W+Sunset+Blvd%C2%A0+%7C%C2%A0+Los+Angeles,+CA+90028?entry=gmail=g>
>>>>
>>>>
>>>>
>>>> On Wed, May 24, 2023 at 12:44 AM Enrico Minack 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Functions available in SQL (more general in one API) should be
>>>>> available in all APIs. I am very much in favor of this.
>>>>>
>>>>> Enrico
>>>>>
>>>>>
>>>>> Am 24.05.23 um 09:41 schrieb Hyukjin Kwon:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to discuss adding all SQL functions into Scala, Python
>>>>> and R API.
>>>>> We have SQL functions that do not exist in Scala, Python and R around
>>>>> 175.
>>>>> For example, we don’t have pyspark.sql.functions.percentile but you
>>>>> can invoke
>>>>> it as a SQL function, e.g., SELECT percentile(...).
>>>>>
>>>>> The reason why we do not have all functions in the first place is that
>>>>> we want to
>>>>> only add commonly used functions, see also
>>>>> https://github.com/apache/spark/pull/21318 (which I agreed at that
>>>>> time)
>>>>>
>>>>> However, this has been raised multiple times over years, from the OSS
>>>>> community, dev mailing list, JIRAs, stackoverflow, etc.
>>>>> Seems it’s confusing about which function is available or not.
>>>>>
>>>>> Yes, we have a workaround. We can call all expressions by expr("...")
>>>>>  or call_udf("...", Columns ...)
>>>>> But still it seems that it’s not very user-friendly because they
>>>>> expect them available under the functions namespace.
>>>>>
>>>>> Therefore, I would like to propose adding all expressions into all
>>>>> languages so that Spark is simpler and less confusing, e.g., which API is
>>>>> in functions or not.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>>
>>>>>
>


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-25 Thread Hyukjin Kwon
I don't think it'd be a release blocker .. I think we can implement them
across multiple releases.

On Fri, May 26, 2023 at 1:01 AM Dongjoon Hyun 
wrote:

> Thank you for the proposal.
>
> I'm wondering if we are going to consider them as release blockers or not.
>
> In general, I don't think those SQL functions should be available in all
> languages as release blockers.
> (Especially in R or new Spark Connect languages like Go and Rust).
>
> If they are not release blockers, we may allow some existing or future
> community PRs only before feature freeze (= branch cut).
>
> Thanks,
> Dongjoon.
>
>
> On Wed, May 24, 2023 at 7:09 PM Jia Fan  wrote:
>
>> +1
>> It is important that different APIs can be used to call the same function
>>
>>> On Thu, May 25, 2023 at 01:48, Ryan Berti  wrote:
>>
>>> During my recent experience developing functions, I found that
>>> identifying locations (sql + connect functions.scala + functions.py,
>>> FunctionRegistry, + whatever is required for R) and standards for adding
>>> function signatures was not straight forward (should you use optional args
>>> or overload functions? which col/lit helpers should be used when?). Are
>>> there docs describing all of the locations + standards for defining a
>>> function? If not, that'd be great to have too.
>>>
>>> Ryan Berti
>>>
>>> Senior Data Engineer  |  Ads DE
>>>
>>> M 7023217573
>>>
>>> 5808 W Sunset Blvd  |  Los Angeles, CA 90028
>>> <https://www.google.com/maps/search/5808+W+Sunset+Blvd%C2%A0+%7C%C2%A0+Los+Angeles,+CA+90028?entry=gmail=g>
>>>
>>>
>>>
>>> On Wed, May 24, 2023 at 12:44 AM Enrico Minack 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Functions available in SQL (more general in one API) should be
>>>> available in all APIs. I am very much in favor of this.
>>>>
>>>> Enrico
>>>>
>>>>
>>>> Am 24.05.23 um 09:41 schrieb Hyukjin Kwon:
>>>>
>>>> Hi all,
>>>>
>>>> I would like to discuss adding all SQL functions into Scala, Python and
>>>> R API.
>>>> We have SQL functions that do not exist in Scala, Python and R around
>>>> 175.
>>>> For example, we don’t have pyspark.sql.functions.percentile but you
>>>> can invoke
>>>> it as a SQL function, e.g., SELECT percentile(...).
>>>>
>>>> The reason why we do not have all functions in the first place is that
>>>> we want to
>>>> only add commonly used functions, see also
>>>> https://github.com/apache/spark/pull/21318 (which I agreed at that
>>>> time)
>>>>
>>>> However, this has been raised multiple times over years, from the OSS
>>>> community, dev mailing list, JIRAs, stackoverflow, etc.
>>>> Seems it’s confusing about which function is available or not.
>>>>
>>>> Yes, we have a workaround. We can call all expressions by expr("...")
>>>>  or call_udf("...", Columns ...)
>>>> But still it seems that it’s not very user-friendly because they expect
>>>> them available under the functions namespace.
>>>>
>>>> Therefore, I would like to propose adding all expressions into all
>>>> languages so that Spark is simpler and less confusing, e.g., which API is
>>>> in functions or not.
>>>>
>>>> Any thoughts?
>>>>
>>>>
>>>>


[DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-24 Thread Hyukjin Kwon
Hi all,

I would like to discuss adding all SQL functions into the Scala, Python and
R APIs.
We have around 175 SQL functions that do not exist in Scala, Python and R.
For example, we don’t have pyspark.sql.functions.percentile, but you can
invoke it as a SQL function, e.g., SELECT percentile(...).

The reason why we do not have all functions in the first place is that we
wanted to add only commonly used functions; see also
https://github.com/apache/spark/pull/21318 (which I agreed with at the time).

However, this has been raised multiple times over the years, from the OSS
community, the dev mailing list, JIRAs, Stack Overflow, etc.
It seems confusing which functions are available and which are not.

Yes, we have a workaround: we can call all expressions via expr("...")
or call_udf("...", Columns ...).
But still, it seems that this is not very user-friendly, because users
expect these functions to be available under the functions namespace.
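
As a concrete illustration, percentile is reachable today only through the
SQL parser, while the direct form (commented out below) is what users
expect; the direct form is hypothetical until implemented:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "value")

# Works today: route through the SQL expression parser.
df.select(F.expr("percentile(value, 0.5)").alias("median")).show()

# What users expect, but which doesn't exist yet (hypothetical):
# df.select(F.percentile("value", 0.5).alias("median")).show()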

Therefore, I would like to propose adding all expressions into all
languages so that Spark is simpler and less confusing, e.g., which API is
in functions or not.

Any thoughts?


Re: [CONNECT] New Clients for Go and Rust

2023-05-24 Thread Hyukjin Kwon
I think we can just start this with a separate repo.
I am fine with the second option too, but in that case we would have to
triage which languages to add to the main repo.

On Fri, 19 May 2023 at 22:28, Maciej  wrote:

> Hi,
>
> Personally, I'm strongly against the second option and have some
> preference towards the third one (or maybe a mix of the first one and the
> third one).
>
> The project is already pretty large as-is and, with an extremely
> conservative approach towards removal of APIs, it only tends to grow over
> time. Making it even larger is not going to make things more maintainable
> and is likely to create an entry barrier for new contributors (that's
> similar to Jia's arguments).
>
> Moreover, we've seen quite a few different language clients over the years
> and all but one or two survived while none is particularly active, as far
> as I'm aware.  Taking responsibility for more clients, without being sure
> that we have resources to maintain them and there is enough community
> around them to make such effort worthwhile, doesn't seem like a good idea.
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>
>
> On 5/19/23 14:57, Jia Fan wrote:
>
> Hi,
>
> Thanks for the contribution!
> I prefer (1). There are some reasons:
>
> 1. Separate repositories can maintain independent versions, different
> release times, and faster bug-fix releases.
>
> 2. Different languages have different build tools. Putting them in one
> repository will make the main repository more and more complicated, and it
> will become extremely difficult to perform a complete build in the main
> repository.
>
> 3. Separate repositories will make CI configuration and execution easier, and
> the PR and commit lists will be clearer.
>
> 4. Other projects also govern different clients in separate repositories, like
> ClickHouse, which uses different repositories for JDBC, ODBC, and C++. Please refer to:
> https://github.com/ClickHouse/clickhouse-java
> https://github.com/ClickHouse/clickhouse-odbc
> https://github.com/ClickHouse/clickhouse-cpp
>
> PS: I'm looking forward to the javascript connect client!
>
> Thanks Regards
> Jia Fan
>
On Fri, May 19, 2023 at 20:03, Martin Grund  wrote:
>
>> Hi folks,
>>
>> When Bo (thanks for the time and contribution) started the work on
>> https://github.com/apache/spark/pull/41036 he started the Go client
>> directly in the Spark repository. In the meantime, I was approached by
>> other engineers who are willing to contribute to working on a Rust client
>> for Spark Connect.
>>
>> Now one of the key questions is where should these connectors live and
>> how we manage expectations most effectively.
>>
>> At the high level, there are two approaches:
>>
>> (1) "3rd party" (non-JVM / Python) clients should live in separate
>> repositories owned and governed by the Apache Spark community.
>>
>> (2) All clients should live in the main Apache Spark repository in the
>> `connector/connect/client` directory.
>>
>> (3) Non-native (Python, JVM) Spark Connect clients should not be part of
>> the Apache Spark repository and governance rules.
>>
>> Before we iron out exactly how we mark these clients as experimental and
>> how we align their release process etc. with Spark, my suggestion would be
>> to get consensus on this first question.
>>
>> Personally, I'm fine with (1) and (2) with a preference for (2).
>>
>> Would love to get feedback from other members of the community!
>>
>> Thanks
>> Martin
>>
>>
>>
>>
>


Re: PR builder broken

2023-05-10 Thread Hyukjin Kwon
I think this is happening globally:
https://www.githubstatus.com/

On Thu, May 11, 2023 at 6:50 AM Xingbo Jiang  wrote:

> Hi dev,
>
> I've seen multiple PR builder failures like below since this morning:
> ```
> TypeError: Cannot read properties of undefined (reading 'head_sha')
> at eval (eval at callAsyncFunction
> (/home/runner/work/_actions/actions/github-script/v6/dist/index.js:15143:16),
> :81:22)
> Error: Unhandled error: TypeError: Cannot read properties of undefined
> (reading 'head_sha')
> at processTicksAndRejections (node:internal/process/task_queues:96:5)
> at async main
> (/home/runner/work/_actions/actions/github-script/v6/dist/index.js:15236:20)
> ```
> (Example links:
> https://github.com/apache/spark/actions/runs/4940984520/jobs/8833154761?pr=40690,
>
> https://github.com/apache/spark/actions/runs/4939269706/jobs/8829852985?pr=41123
> )
>
> It may be related to github, could someone help take a look?
>
> Thanks,
> Xingbo
>


Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Hyukjin Kwon
+1

On Tue, 11 Apr 2023 at 11:04, Ruifeng Zheng  wrote:

> +1 (non-binding)
>
> Thank you for driving this release!
>
> --
> Ruifeng  Zheng
> ruife...@foxmail.com
>
> 
>
>
>
> -- Original --
> *From:* "Yuming Wang" ;
> *Date:* Tue, Apr 11, 2023 09:56 AM
> *To:* "Mridul Muralidharan";
> *Cc:* "huaxin gao";"Chao Sun" >;"yangjie01";"Dongjoon Hyun";"Sean
> Owen";"dev@spark.apache.org";
> *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)
>
> +1.
>
> On Tue, Apr 11, 2023 at 12:17 AM Mridul Muralidharan 
> wrote:
>
>> +1
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>>
>> Regards,
>> Mridul
>>
>>
>> On Mon, Apr 10, 2023 at 10:34 AM huaxin gao 
>> wrote:
>>
>>> +1
>>>
>>> On Mon, Apr 10, 2023 at 8:17 AM Chao Sun  wrote:
>>>
 +1 (non-binding)

 On Mon, Apr 10, 2023 at 7:07 AM yangjie01  wrote:

> +1 (non-binding)
>
>
>
> *From:* Sean Owen
> *Date:* Monday, April 10, 2023 21:19
> *To:* Dongjoon Hyun
> *Cc:* "dev@spark.apache.org"
> *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)
>
>
>
> +1 from me
>
>
>
> On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun 
> wrote:
>
> I'll start with my +1.
>
> I verified the checksum, signatures of the artifacts, and
> documentations.
> Also, ran the tests with YARN and K8s modules.
>
> Dongjoon.
>
> On 2023/04/09 23:46:10 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark
> version
> > 3.2.4.
> >
> > The vote is open until April 13th 1AM (PST) and passes if a majority
> +1 PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.2.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> https://spark.apache.org/
> 
> >
> > The tag to be voted on is v3.2.4-rc1 (commit
> > 0ae10ac18298d1792828f1d59b652ef17462d76e)
> > https://github.com/apache/spark/tree/v3.2.4-rc1
> 
> >
> > The release files, including signatures, digests, etc. can be found
> at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-bin/
> 
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> >
> > The staging repository for this release can be found at:
> >
> https://repository.apache.org/content/repositories/orgapachespark-1442/
> 
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-docs/
> 
> >
> > The list of bug fixes going into 3.2.4 can be found at the following
> URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12352607
> 
> >
> > This release is using the release script of the tag v3.2.4-rc1.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate,
> then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the
> Java/Scala
> > you can add the staging repository to your projects resolvers and
> test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up 

Re: [VOTE] Release Apache Spark 3.4.0 (RC6)

2023-04-06 Thread Hyukjin Kwon
Merged the fix.

On Fri, 7 Apr 2023 at 10:07, Xinrong Meng  wrote:

> Thanks @yangjie01. I marked SPARK-39696 as a blocker.
>
> On Thu, Apr 6, 2023 at 4:35 PM yangjie01  wrote:
>
>> -1 for me because this RC does not include the fix for SPARK-39696.
>> SPARK-39696 fixes a data race in access to TaskMetrics.externalAccums when
>> using Scala 2.13.8, and this issue causes high-frequency executor crashes
>> when using the Scala 2.13 distribution, according to the user's description (
>> https://github.com/apache/spark/pull/37206#issuecomment-1486861885).
>>
>>
>>
>> So I suggest waiting for https://github.com/apache/spark/pull/40663 to
>> be merged to solve this issue, although SPARK-39696 was not set as a blocker
>> when reported.
>>
>>
>>
>> Yang Jie
>>
>>
>>
>> *From:* Xinrong Meng
>> *Date:* Friday, April 7, 2023 05:27
>> *To:* dev
>> *Subject:* [VOTE] Release Apache Spark 3.4.0 (RC6)
>>
>>
>>
>> Please vote on releasing the following candidate(RC6) as Apache Spark
>> version 3.4.0.
>>
>> The vote is open until 11:59pm Pacific time *April 11th* and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>> 
>>
>> The tag to be voted on is *v3.4.0-rc6* (commit
>> 28d0723beb3579c17df84bb22c98a487d7a72023):
>> https://github.com/apache/spark/tree/v3.4.0-rc6
>> 
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc6-bin/
>> 
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1440
>> 
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc6-docs/
>> 
>>
>> The list of bug fixes going into 3.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>> 
>>
>> This release is using the release script of the tag v3.4.0-rc6.
>>
>>
>>
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.4.0?
>> ===
>> The current list of open tickets targeted at 3.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK
>> 
>>  and
>> search for "Target Version/s" = 3.4.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>>
>>
>


Re: Apache Spark 3.2.4 EOL Release?

2023-04-04 Thread Hyukjin Kwon
+1

On Wed, 5 Apr 2023 at 07:31, Mridul Muralidharan  wrote:

>
> +1
> Sounds good to me.
>
> Thanks,
> Mridul
>
>
> On Tue, Apr 4, 2023 at 1:39 PM huaxin gao  wrote:
>
>> +1
>>
>> On Tue, Apr 4, 2023 at 11:17 AM Chao Sun  wrote:
>>
>>> +1
>>>
>>> On Tue, Apr 4, 2023 at 11:12 AM Holden Karau 
>>> wrote:
>>>
 +1

 On Tue, Apr 4, 2023 at 11:04 AM L. C. Hsieh  wrote:

> +1
>
> Sounds good and thanks Dongjoon for driving this.
>
> On 2023/04/04 17:24:54 Dongjoon Hyun wrote:
> > Hi, All.
> >
> > Since Apache Spark 3.2.0 passed RC7 vote on October 12, 2021,
> branch-3.2
> > has been maintained and served well until now.
> >
> > - https://github.com/apache/spark/releases/tag/v3.2.0 (tagged on
> Oct 6,
> > 2021)
> > - https://lists.apache.org/thread/jslhkh9sb5czvdsn7nz4t40xoyvznlc7
> >
> > As of today, branch-3.2 has 62 additional patches after v3.2.3 and
> reaches
> > the end-of-life this month according to the Apache Spark release
> cadence. (
> > https://spark.apache.org/versioning-policy.html)
> >
> > $ git log --oneline v3.2.3..HEAD | wc -l
> > 62
> >
> > With the upcoming Apache Spark 3.4, I hope the users can get a
> chance to
> > have these last bits of Apache Spark 3.2.x, and I'd like to propose
> to have
> > Apache Spark 3.2.4 EOL Release next week and volunteer as the release
> > manager. WDTY? Please let me know if you need more patches on
> branch-3.2.
> >
> > Thanks,
> > Dongjoon.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>


Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Hyukjin Kwon
BTW doing another RC isn't a very big deal (compared to what I did before
:-) ) since it's not a canonical release yet.

On Fri, Mar 10, 2023 at 7:58 AM Hyukjin Kwon  wrote:

> I guess directly tagging is fine too.
> I don't mind cutting the RC4 right away either if that's what you prefer.
>
> On Fri, Mar 10, 2023 at 7:06 AM Xinrong Meng 
> wrote:
>
>> Hi All,
>>
>> Thank you all for catching that. Unfortunately, the release script failed
>> to push the release tag v3.4.0-rc3 to branch-3.4. Sorry about the issue.
>>
>> Shall we cut v3.4.0-rc4 immediately or wait until March 14th?
>>
>> On Fri, Mar 10, 2023 at 5:34 AM Sean Owen  wrote:
>>
>>> If the issue were just tags, then you can simply delete the tag and
>>> re-tag the right commit. That doesn't change a commit log.
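>>>
>>> A minimal sketch of the re-tag, where <commit> and the remote name
>>> "apache" are placeholders:
>>>
>>> $ git tag -d v3.4.0-rc3
>>> $ git tag v3.4.0-rc3 <commit>
>>> $ git push --force apache v3.4.0-rc3
>>>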
>>> But is the issue that the relevant commits aren't in branch-3.4? Like I
>>> don't see the usual release commits in
>>> https://github.com/apache/spark/commits/branch-3.4
>>> Yeah OK that needs a re-do.
>>>
>>> We can still test this release.
>>> It works for me, except that I still get the weird infinite-compile-loop
>>> issue that doesn't seem to be related to Spark. The Spark Connect parts
>>> seem to work.
>>>
>>> On Thu, Mar 9, 2023 at 3:25 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> No~ We can't with the commit log in its as-is state, because it's
>>>> already screwed up, as Emil wrote.
>>>> Did you check the branch-3.4 commit log, Sean?
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Mar 9, 2023 at 11:42 AM Sean Owen  wrote:
>>>>
>>>>> We can just push the tags onto the branches as needed right? No need
>>>>> to roll a new release
>>>>>
>>>>> On Thu, Mar 9, 2023, 1:36 PM Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>>> Yes, I also confirmed that the v3.4.0-rc3 tag is invalid.
>>>>>>
>>>>>> I guess we need RC4.
>>>>>>
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Thu, Mar 9, 2023 at 7:13 AM Emil Ejbyfeldt
>>>>>>  wrote:
>>>>>>
>>>>>> It might be caused by the v3.4.0-rc3 tag not being part of the
>>>>>>> 3.4
>>>>>>> branch branch-3.4:
>>>>>>>
>>>>>>> $ git log --pretty='format:%d %h' --graph origin/branch-3.4
>>>>>>> v3.4.0-rc3
>>>>>>> | head -n 10
>>>>>>> *  (HEAD, origin/branch-3.4) e38e619946
>>>>>>> *  f3e69a1fe2
>>>>>>> *  74cf1a32b0
>>>>>>> *  0191a5bde0
>>>>>>> *  afced91348
>>>>>>> | *  (tag: v3.4.0-rc3) b9be9ce15a
>>>>>>> |/
>>>>>>> *  006e838ede
>>>>>>> *  fc29b07a31
>>>>>>> *  8655dfe66d
>>>>>>>
>>>>>>>
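>>>>>>> An equivalent yes/no check, assuming both refs are fetched locally:
>>>>>>>
>>>>>>> $ git merge-base --is-ancestor v3.4.0-rc3 origin/branch-3.4 \
>>>>>>>   && echo "tag is on branch-3.4" || echo "tag is NOT on branch-3.4"
>>>>>>>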
>>>>>>> Best,
>>>>>>> Emil
>>>>>>>
>>>>>>> On 09/03/2023 15:50, yangjie01 wrote:
>>>>>>> > Hi, all
>>>>>>> >
>>>>>>> > I can't check out the v3.4.0-rc3 tag. At the same time,
>>>>>>> there is
>>>>>>> > the following information on the Github page.
>>>>>>> >
>>>>>>> > Does anyone else have the same problem?
>>>>>>> >
>>>>>>> > Yang Jie
>>>>>>> >
>>>>>>> > *From:* Xinrong Meng 
>>>>>>> > *Date:* Thursday, March 9, 2023, 20:05
>>>>>>> > *To:* dev 
>>>>>>> > *Subject:* [VOTE] Release Apache Spark 3.4.0 (RC3)
>>>>>>> >
>>>>>>> > Please vote on releasing the following candidate (RC3) as Apache
>>>>>>> Spark
>>>>>>> > version 3.4.0.
>>>>>>> >
>>>>>>> > The vote is open until 11:59pm Pacific time *March 14th* and
>>>>>>> passes if a
>>>>>>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>> >
>>>>>>> > [ ] +1 Release this package as Apache Spark 3.4.0
>>>>>>> > [ ] -1 Do not release this package because ...
>>>>>>>

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Hyukjin Kwon
I guess directly tagging is fine too.
I don't mind cutting the RC4 right away either if that's what you prefer.

On Fri, Mar 10, 2023 at 7:06 AM Xinrong Meng 
wrote:

> Hi All,
>
> Thank you all for catching that. Unfortunately, the release script failed
> to push the release tag v3.4.0-rc3 to branch-3.4. Sorry about the issue.
>
> Shall we cut v3.4.0-rc4 immediately or wait until March 14th?
>
> On Fri, Mar 10, 2023 at 5:34 AM Sean Owen  wrote:
>
>> If the issue were just tags, then you can simply delete the tag and
>> re-tag the right commit. That doesn't change a commit log.
>> But is the issue that the relevant commits aren't in branch-3.4? Like I
>> don't see the usual release commits in
>> https://github.com/apache/spark/commits/branch-3.4
>> Yeah OK that needs a re-do.
>>
>> We can still test this release.
>> It works for me, except that I still get the weird infinite-compile-loop
>> issue that doesn't seem to be related to Spark. The Spark Connect parts
>> seem to work.
>>
>> On Thu, Mar 9, 2023 at 3:25 PM Dongjoon Hyun 
>> wrote:
>>
>>> No~ We can't with the commit log in its as-is state, because it's
>>> already screwed up, as Emil wrote.
>>> Did you check the branch-3.4 commit log, Sean?
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, Mar 9, 2023 at 11:42 AM Sean Owen  wrote:
>>>
 We can just push the tags onto the branches as needed right? No need to
 roll a new release

 On Thu, Mar 9, 2023, 1:36 PM Dongjoon Hyun 
 wrote:

> Yes, I also confirmed that the v3.4.0-rc3 tag is invalid.
>
> I guess we need RC4.
>
> Dongjoon.
>
> On Thu, Mar 9, 2023 at 7:13 AM Emil Ejbyfeldt
>  wrote:
>
>> It might be caused by the v3.4.0-rc3 tag not being part of the 3.4
>> branch branch-3.4:
>>
>> $ git log --pretty='format:%d %h' --graph origin/branch-3.4
>> v3.4.0-rc3
>> | head -n 10
>> *  (HEAD, origin/branch-3.4) e38e619946
>> *  f3e69a1fe2
>> *  74cf1a32b0
>> *  0191a5bde0
>> *  afced91348
>> | *  (tag: v3.4.0-rc3) b9be9ce15a
>> |/
>> *  006e838ede
>> *  fc29b07a31
>> *  8655dfe66d
>>
>>
>> Best,
>> Emil
>>
>> On 09/03/2023 15:50, yangjie01 wrote:
>> > Hi, all
>> >
>> > I can't check out the v3.4.0-rc3 tag. At the same time,
>> there is
>> > the following information on the Github page.
>> >
>> > Does anyone else have the same problem?
>> >
>> > Yang Jie
>> >
>> > *From:* Xinrong Meng 
>> > *Date:* Thursday, March 9, 2023, 20:05
>> > *To:* dev 
>> > *Subject:* [VOTE] Release Apache Spark 3.4.0 (RC3)
>> >
>> > Please vote on releasing the following candidate (RC3) as Apache
>> Spark
>> > version 3.4.0.
>> >
>> > The vote is open until 11:59pm Pacific time *March 14th* and passes
>> if a
>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.4.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >
>> > The tag to be voted on is *v3.4.0-rc3* (commit
>> > b9be9ce15a82b18cca080ee365d308c0820a29a9):
>> > https://github.com/apache/spark/tree/v3.4.0-rc3
>> >
>> > The release files, including signatures, digests, etc. can be found
>> at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc3-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
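>> >
>> > A typical verification sketch (the binary name below is just an example;
>> > any artifact from the -bin directory works the same way):
>> >
>> > $ curl -LO https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > $ gpg --import KEYS
>> > $ gpg --verify spark-3.4.0-bin-hadoop3.tgz.asc spark-3.4.0-bin-hadoop3.tgz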
>> >
>> > The staging repository for this release can be found at:
>> >
>> https://repository.apache.org/content/repositories/orgapachespark-1437
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc3-docs/
>> >
>> > The list of bug fixes going into 3.4.0 can be found at the
>> following URL:
>> > 

Re: [Question] Can't start Spark Connect

2023-03-08 Thread Hyukjin Kwon
Just doing a clean build with Maven, and running a test case like
`SparkConnectServiceSuite` in IntelliJ should work.
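
A rough sketch of that flow from the command line; the module path and suite
package below assume the 3.4-era layout, so adjust if they have moved:

$ ./build/mvn -DskipTests clean install
$ ./build/mvn -pl connector/connect/server test -Dtest=none \
    -DwildcardSuites=org.apache.spark.sql.connect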

On Wed, 8 Mar 2023 at 15:02, Jia Fan  wrote:

> Hi developers,
> I want to contribute some code for Spark Connect. Are there any docs for starters?
> I want to debug SimpleSparkConnectService but I can't start it with IDEA. I
> would appreciate any help.
>
> Thanks
>
> 
>
>
> Jia Fan
>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-26 Thread Hyukjin Kwon
This is already getting too long to follow the original topic. Would be
great if we can separate others to a different thread.

On Mon, Feb 27, 2023 at 7:41 AM Mich Talebzadeh 
wrote:

> If we are going to do it, we might as well do it all. It is more
> cost-effective, so to speak.
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 26 Feb 2023 at 21:48, Hyukjin Kwon  wrote:
>
>> Probably it's worthwhile discussing the order for others but I would keep
>> it separate from this thread to focus on Python as the default since that
>> can be done as an incremental improvement.
>>
>>
>> On Mon, Feb 27, 2023 at 3:36 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> To me, as I stated before, this is a tangential issue. However, if it has
>>> to be done, we ought to work out how much effort is needed. Besides, what
>>> about the other languages?
>>>
>>>
>>>
>>> What comes next in the order?
>>>
>>>
>>>
>>>1. Python
>>>2. Scala
>>>3. SQL
>>>4. Java
>>>5. R
>>>
>>> SQL is still very important, and with Spark becoming more relevant to ETL
>>> (and, to a lesser extent, DS), I see the above order as fair.
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 22 Feb 2023 at 21:00, Allan Folting 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to propose that we show Python code examples first in the
>>>> Spark documentation where we have multiple programming language examples.
>>>> An example is on the Quick Start page:
>>>> https://spark.apache.org/docs/latest/quick-start.html
>>>>
>>>> I propose this change because Python has become more popular than the
>>>> other languages supported in Apache Spark. There are a lot more users of
>>>> Spark in Python than Scala today and Python attracts a broader set of new
>>>> users.
>>>> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
>>>> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava
>>>> .
>>>>
>>>> Also, this change aligns with Python already being the first tab on our
>>>> home page:
>>>> https://spark.apache.org/
>>>>
>>>> Anyone who wants to use another language can still just click on the
>>>> other tabs.
>>>>
>>>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide
>>>> page as a first step:
>>>> https://github.com/apache/spark/pull/40087
>>>>
>>>>
>>>> I would appreciate it if you could share your thoughts on this proposal.
>>>>
>>>>
>>>> Thanks a lot,
>>>> Allan Folting
>>>>
>>>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-26 Thread Hyukjin Kwon
Probably it's worthwhile discussing the order for others but I would keep
it separate from this thread to focus on Python as the default since that
can be done as an incremental improvement.


On Mon, Feb 27, 2023 at 3:36 AM Mich Talebzadeh 
wrote:

>
> To me, as I stated before, this is a tangential issue. However, if it has to
> be done, we ought to work out how much effort is needed. Besides, what
> about the other languages?
>
>
>
> What comes next in the order?
>
>
>
>1. Python
>2. Scala
>3. SQL
>4. Java
>5. R
>
> SQL is still very important, and with Spark becoming more relevant to ETL
> (and, to a lesser extent, DS), I see the above order as fair.
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 22 Feb 2023 at 21:00, Allan Folting  wrote:
>
>> Hi all,
>>
>> I would like to propose that we show Python code examples first in the
>> Spark documentation where we have multiple programming language examples.
>> An example is on the Quick Start page:
>> https://spark.apache.org/docs/latest/quick-start.html
>>
>> I propose this change because Python has become more popular than the
>> other languages supported in Apache Spark. There are a lot more users of
>> Spark in Python than Scala today and Python attracts a broader set of new
>> users.
>> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
>> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>>
>> Also, this change aligns with Python already being the first tab on our
>> home page:
>> https://spark.apache.org/
>>
>> Anyone who wants to use another language can still just click on the
>> other tabs.
>>
>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide
>> page as a first step:
>> https://github.com/apache/spark/pull/40087
>>
>>
>> I would appreciate it if you could share your thoughts on this proposal.
>>
>>
>> Thanks a lot,
>> Allan Folting
>>
>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Hyukjin Kwon
That sounds good to have, especially given that it will allow more
flexibility to the users.
But I think that's slightly orthogonal, since this proposal
is more about the default (before users take an action).


On Fri, 24 Feb 2023 at 15:35, Santosh Pingale 
wrote:

> Very interesting and user focused discussion, thanks for the proposal.
>
> Would it be better if we rather let users set the preference about the
> language they want to see first in the code examples? This preference can
> be easily stored on the browser side and used to decide ordering. This is
> in line with the freedom users have with Spark today.
>
>
> On Fri, Feb 24, 2023, 4:46 AM Allan Folting  wrote:
>
>> I think this needs to be consistently done on all relevant pages and my
>> intent is to do that work in time for when it is first released.
>> I started with the "Spark SQL, DataFrames and Datasets Guide" page to
>> break it up into multiple, scoped PRs.
>> I should have made that clear before.
>>
>> I think it's a great idea to have an umbrella JIRA for this to outline
>> the full scope and track overall progress and I'm happy to create it.
>>
>> I can't speak on behalf of all Scala users, of course, but I don't think
>> this change makes Scala appear to be a second-class citizen, just as I
>> don't think of Python as a second-class citizen because it is not listed
>> first currently. It does, however, recognize that Python is more broadly
>> popular today.
>>
>> Thanks,
>> Allan
>>
>> On Thu, Feb 23, 2023 at 6:55 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you all.
>>>
>>> Yes, attracting more Python users and being more Python user-friendly is
>>> always good.
>>>
>>> Basically, SPARK-42493 is proposing to introduce intentional
>>> inconsistency to Apache Spark documentation.
>>>
>>> The inconsistency from SPARK-42493 might give Python users the following
>>> questions first.
>>>
>>> - Why not RDD pages which are the heart of Apache Spark? Is Python not
>>> good in RDD?
>>> - Why not ML and Structured Streaming pages when DATA+AI Summit focuses
>>> on ML heavily?
>>>
>>> Also, more questions to the Scala users.
>>> - Is Scala language stepping down to the 2nd citizen language?
>>> - What about Scala 3?
>>>
>>> Of course, I understand SPARK-42493 has specific scopes
>>> (SQL/Dataset/Dataframe) and didn't mean anything like the above at all.
>>> However, if SPARK-42493 is emphasized as "the first step" to introduce
>>> that inconsistency, I'm wondering
>>> - What direction we are heading?
>>> - What is the next target scope?
>>> - When it will be achieved (or completed)?
>>> - Or, is the goal to be permanently inconsistent in terms of the
>>> documentation?
>>>
>>> It's unclear even in the documentation-only scope. If we are expecting
>>> more and more subtasks during Apache Spark 3.5 timeframe, shall we have an
>>> umbrella JIRA?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Feb 23, 2023 at 6:15 PM Allan Folting 
>>> wrote:
>>>
>>>> Thanks a lot for the questions and comments/feedback!
>>>>
>>>> To address your questions Dongjoon, I do not intend for these updates
>>>> to the documentation to be tied to the potential changes/suggestions you
>>>> ask about.
>>>>
>>>> In other words, this proposal is only about adjusting the documentation
>>>> to target the majority of people reading it - namely the large and growing
>>>> number of Python users - and new users in particular as they are often
>>>> already familiar with and have a preference for Python when evaluating or
>>>> starting to use Spark.
>>>>
>>>> While we may want to strengthen support for Python in other ways, I
>>>> think such efforts should be tracked separately from this.
>>>>
>>>> Allan
>>>>
>>>> On Thu, Feb 23, 2023 at 1:44 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> If this is not just flip-flopping the documentation pages and involves
>>>>> other changes, then a proper impact analysis needs to be done to assess
>>>>> the effort involved. Personally, I don't think it really matters.
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>view 

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-23 Thread Hyukjin Kwon
Yes we should fix. I will take a look

On Thu, 23 Feb 2023 at 07:32, Jonathan Kelly  wrote:

> Thanks! I was wondering about that ClientE2ETestSuite failure today, so
> I'm glad to know that it's also being experienced by others.
>
> On a similar note, I am experiencing the following error when running the
> Python tests with Python 3.7:
>
> + ./python/run-tests --python-executables=python3
> Running PySpark tests. Output is in
> /home/ec2-user/spark/python/unit-tests.log
> Will test against the following Python executables: ['python3']
> Will test the following Python modules: ['pyspark-connect',
> 'pyspark-core', 'pyspark-errors', 'pyspark-ml', 'pyspark-mllib',
> 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql',
> 'pyspark-streaming']
> python3 python_implementation is CPython
> python3 version is: Python 3.7.16
> Starting test(python3): pyspark.ml.tests.test_feature (temp output:
> /home/ec2-user/spark/python/target/8ca9ab1a-05cc-4845-bf89-30d9001510bc/python3__pyspark.ml.tests.test_feature__kg6sseie.log)
> Starting test(python3): pyspark.ml.tests.test_base (temp output:
> /home/ec2-user/spark/python/target/f2264f3b-6b26-4e61-9452-8d6ddd7eb002/python3__pyspark.ml.tests.test_base__0902zf9_.log)
> Starting test(python3): pyspark.ml.tests.test_algorithms (temp output:
> /home/ec2-user/spark/python/target/d1dc4e07-e58c-4c03-abe5-09d8fab22e6a/python3__pyspark.ml.tests.test_algorithms__lh3wb2u8.log)
> Starting test(python3): pyspark.ml.tests.test_evaluation (temp output:
> /home/ec2-user/spark/python/target/3f42dc79-c945-4cf2-a1eb-83e72b40a9ee/python3__pyspark.ml.tests.test_evaluation__89idc7fa.log)
> Finished test(python3): pyspark.ml.tests.test_base (16s)
> Starting test(python3): pyspark.ml.tests.test_functions (temp output:
> /home/ec2-user/spark/python/target/5a3b90f0-216b-4edd-9d15-6619d3e03300/python3__pyspark.ml.tests.test_functions__g5u1290s.log)
> Traceback (most recent call last):
>   File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
> "__main__", mod_spec)
>   File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
> exec(code, run_globals)
>   File "/home/ec2-user/spark/python/pyspark/ml/tests/test_functions.py",
> line 21, in <module>
> from pyspark.ml.functions import predict_batch_udf
>   File "/home/ec2-user/spark/python/pyspark/ml/functions.py", line 38, in
> 
> from typing import Any, Callable, Iterator, List, Mapping, Protocol,
> TYPE_CHECKING, Tuple, Union
> ImportError: cannot import name 'Protocol' from 'typing'
> (/usr/lib64/python3.7/typing.py)
> Had test failures in pyspark.ml.tests.test_functions with python3; see
> logs.
>
> I know we should move on to a newer version of Python, but isn't Python
> 3.7 still officially supported?
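>
> The failing import is easy to reproduce in isolation, since typing.Protocol
> only exists on Python >= 3.8:
>
> $ python3.7 -c "from typing import Protocol"   # ImportError
> $ python3.8 -c "from typing import Protocol"   # ok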
>
> Thank you,
> Jonathan Kelly
>
> On Wed, Feb 22, 2023 at 1:47 PM Herman van Hovell
>  wrote:
>
>> Hi All,
>>
>> Thanks for testing the 3.4.0 RC! I apologize for the maven testing
>> failures for the Spark Connect Scala Client. We will try to get those
>> sorted as soon as possible.
>>
>> This is an artifact of having multiple build systems, and only running CI
>> for one (SBT). That, however, is a debate for another day :)...
>>
>> Cheers,
>> Herman
>>
>> On Wed, Feb 22, 2023 at 5:32 PM Bjørn Jørgensen 
>> wrote:
>>
>>> ./build/mvn clean package
>>>
>>> I'm using Ubuntu rolling, Python 3.11, and OpenJDK 17.
>>>
>>> CompatibilitySuite:
>>> - compatibility MiMa tests *** FAILED ***
>>>   java.lang.AssertionError: assertion failed: Failed to find the jar
>>> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>>>   at scala.Predef$.assert(Predef.scala:223)
>>>   at
>>> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>>>   at
>>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>>>   at
>>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>>>   at
>>> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
>>>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>>>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>>>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>>>   ...
>>> - compatibility API tests: Dataset *** FAILED ***
>>>   java.lang.AssertionError: assertion failed: Failed to find the jar
>>> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>>>   at scala.Predef$.assert(Predef.scala:223)
>>>   at
>>> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>>>   at
>>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>>>   at
>>> 

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Hyukjin Kwon
> 1. Does this suggestion imply Python API implementation will be the new
blocker in the future in terms of feature parity among languages? Until
now, Python API feature parity was one of the audit items because it's not
enforced. In other words, Scala and Java have had the full feature set because
they are the underlying main developer languages while Python/R/SQL
environments were the nice-to-have.

I think it wouldn't be treated as a blocker, but I do believe we have
added all new features on the Python side for the last couple of
releases. So I wouldn't worry about this at the moment - we have been
doing fine in terms of feature parity.

> 2. Does this suggestion assume that the Python environment is always easier
for users than Scala/Java? Given that we support Python 3.8 to 3.11, the
support matrix for Python library dependency is a problem for the Apache
Spark community to solve in order to claim that. As we say at SPARK-41454,
Python language also introduces breaking changes to us historically and we
have many `Pinned` python libraries issues.

Yes. In fact, regardless of this change, I do believe we should test more
versions, etc. - at least via scheduled jobs, like we already do for JDK and
Scala versions.


FWIW, my take about this change is: people use Python and PySpark more
(according to the chart and stats provided) so let's put those examples
first :-).


On Thu, 23 Feb 2023 at 10:27, Dongjoon Hyun  wrote:

> I have two questions to clarify the scope and boundaries.
>
> 1. Does this suggestion imply Python API implementation will be the new
> blocker in the future in terms of feature parity among languages? Until
> now, Python API feature parity was one of the audit items because it's not
> enforced. In other words, Scala and Java have had the full feature set because
> they are the underlying main developer languages while Python/R/SQL
> environments were the nice-to-have.
>
> 2. Does this suggestion assume that the Python environment is always easier
> for users than Scala/Java? Given that we support Python 3.8 to 3.11, the
> support matrix for Python library dependency is a problem for the Apache
> Spark community to solve in order to claim that. As we say at SPARK-41454,
> Python language also introduces breaking changes to us historically and we
> have many `Pinned` python libraries issues.
>
> Changing documentation is easy, but I hope we can give clear
> communication and direction in this effort because this is one of the most
> user-facing changes.
>
> Dongjoon.
>
> On Wed, Feb 22, 2023 at 5:26 PM 416161...@qq.com 
> wrote:
>
>> +1 LGTM
>>
>> --
>> Ruifeng Zheng
>> ruife...@foxmail.com
>>
>>
>>
>>
>> -- Original --
>> *From:* "Xinrong Meng" ;
>> *Date:* Thu, Feb 23, 2023 09:17 AM
>> *To:* "Allan Folting";
>> *Cc:* "dev";
>> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark
>> documentation
>>
>> +1 Good idea!
>>
>> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson 
>> wrote:
>>
>>> Good idea. At the company I work at, we discussed using Scala as our
>>> primary language because technically it is slightly stronger than Python,
>>> but we ultimately chose Python, as it's easier to onboard other devs to
>>> our platform and future hiring for the team, etc., would be easier.
>>>
>>> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon 
>>> wrote:
>>>
>>>> +1 I like this idea too.
>>>>
>>>> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to propose that we show Python code examples first in the
>>>>> Spark documentation where we have multiple programming language examples.
>>>>> An example is on the Quick Start page:
>>>>> https://spark.apache.org/docs/latest/quick-start.html
>>>>>
>>>>> I propose this change because Python has become more popular than the
>>>>> other languages supported in Apache Spark. There are a lot more users of
>>>>> Spark in Python than Scala today and Python attracts a broader set of new
>>>>> users.
>>>>> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
>>>>> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava
>>>>> .
>>>>>
>>>>> Also, this change aligns with Python already being the first tab on
>>>>> our home page:
>>>>> https://spark.apache.org/
>>>>>
>>>>> Anyone who wants to use another language can still just click on the
>>>>> other tabs.
>>>>>
>>>>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide
>>>>> page as a first step:
>>>>> https://github.com/apache/spark/pull/40087
>>>>>
>>>>>
>>>>> I would appreciate it if you could share your thoughts on this
>>>>> proposal.
>>>>>
>>>>>
>>>>> Thanks a lot,
>>>>> Allan Folting
>>>>>
>>>>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Hyukjin Kwon
+1 I like this idea too.

On Thu, Feb 23, 2023 at 6:00 AM Allan Folting  wrote:

> Hi all,
>
> I would like to propose that we show Python code examples first in the
> Spark documentation where we have multiple programming language examples.
> An example is on the Quick Start page:
> https://spark.apache.org/docs/latest/quick-start.html
>
> I propose this change because Python has become more popular than the
> other languages supported in Apache Spark. There are a lot more users of
> Spark in Python than Scala today and Python attracts a broader set of new
> users.
> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>
> Also, this change aligns with Python already being the first tab on our
> home page:
> https://spark.apache.org/
>
> Anyone who wants to use another language can still just click on the other
> tabs.
>
> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page
> as a first step:
> https://github.com/apache/spark/pull/40087
>
>
> I would appreciate it if you could share your thoughts on this proposal.
>
>
> Thanks a lot,
> Allan Folting
>


Re: [DISCUSS] Make release cadence predictable

2023-02-15 Thread Hyukjin Kwon
While the point about having less time is probably correct, yeah, if
something is half-done, we should keep it in the master branch and/or
not expose it to the end users (which I believe we usually do).
The good thing is that we can make the schedule predictable: if the
branch-cut date is pinned, then people can schedule/estimate their work
based on the pinned schedule :-).


On Thu, 16 Feb 2023 at 04:32, Sean Owen  wrote:

> I don't think there is a delay per se, because there is no hard release
> date to begin with, to delay with respect to. It's been driven by, "feels
> like enough stuff has gone in" and "someone is willing to roll a release",
> and that happens more like every 8-9 months. This would be a shift not only
> in expectation - lower the threshold for 'enough stuff has gone in' to
> probably match a 6 month cadence - but also a shift in policy to a release
> train-like process. If something isn't ready then it just waits another 6
> months.
>
> You're right, the problem is kind of: what if something is in progress in
> a half-baked state? You don't really want to release half a thing, nor do
> you want to develop it quite separately from the master branch.
> It is worth asking what prompts this, too. Is it just that we want to
> release earlier and more often?
>
> On Wed, Feb 15, 2023 at 1:19 PM Maciej  wrote:
>
>> Hi,
>>
>> Sorry for a silly question, but do we know what exactly caused these
>> delays? Are these avoidable?
>>
>> It is not a systematic observation, but my general impression is that we
>> rarely delay for the sake of individual features, unless there is some soft
>> consensus about their importance. Arguably, these could be postponed,
>> assuming we can adhere to the schedule.
>>
>> And then, we're left with large, multi-task features. A lot can be done
>> with proper timing and design, but in our current process there is no way
>> to guarantee that each of these can be delivered within a given time window.
>> How are we going to handle these? Delivering half-baked things is hardly a
>> satisfying solution, and a more rigid schedule can only increase pressure on
>> maintainers. Do we plan to introduce something like feature branches for
>> these, to isolate upcoming release in case of delay?
>>
>> On 2/14/23 19:53, Dongjoon Hyun wrote:
>>
>> +1 for Hyukjin and Sean's opinion.
>>
>> Thank you for initiating this discussion.
>>
>> If we have a fixed, predefined, regular 6-month cadence, I believe we can
>> more easily persuade incomplete features to wait for the next release.
>>
>> In addition, I want to add a requirement on the first RC1 date, because RC1
>> has always done a great job for us.
>>
>> I guess `branch-cut + 1M (no later than 1 month)` could be a reasonable
>> deadline.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Tue, Feb 14, 2023 at 6:33 AM Sean Owen  wrote:
>>
>>> I'm fine with shifting to a stricter cadence-based schedule. Sometimes,
>>> it'll mean some significant change misses a release rather than delays it.
>>> If people are OK with that discipline, sure.
>>> A hard 6-month cycle would mean the minor releases are more frequent and
>>> have less change in them. That's probably OK. We could also decide to
>>> choose a longer cadence like 9 months, but I don't know if that's better.
>>> I assume maintenance releases would still be as-needed, and major
>>> releases would also work differently - probably no 4.0 until next year at
>>> the earliest.
>>>
>>> On Tue, Feb 14, 2023 at 3:01 AM Hyukjin Kwon 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> *TL;DR*: Branch cut for every 6 months (January and July).
>>>>
>>>> I would like to discuss/propose to make our release cadence
>>>> predictable. In our documentation, we mention as follows:
>>>>
>>>> In general, feature (“minor”) releases occur about every 6 months.
>>>> Hence,
>>>> Spark 2.3.0 would generally be released about 6 months after 2.2.0.
>>>>
>>>> However, the reality is slightly different. Here is the time it took
>>>> for the recent releases:
>>>>
>>>>- Spark 3.3.0 took 8 months
>>>>- Spark 3.2.0 took 7 months
>>>>- Spark 3.1.0 took 9 months
>>>>
>>>> Here are problems caused by such delay:
>>>>
>>>>- The whole related schedules are affected in all downstream
>>>>projects, vendors, etc.
>>>>- It makes the release date unpredictable to the end users.
>>>>- Developers as well as the release managers have to rush because
>>>>of the delay, which prevents us from focusing on having a proper
>>>>regression-free release.
>>>>
>>>> My proposal is to cut the branch every 6 months (January and July, which
>>>> generally avoid the public holidays / vacation period) so the release can
>>>> happen twice
>>>> every year regardless of the actual release date.
>>>> I believe it both makes the release cadence predictable, and relaxes
>>>> the burden about making releases.
>>>>
>>>> WDYT?
>>>>
>>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>>
>>


Re: [VOTE][RESULT] Release Spark 3.3.2 (RC1)

2023-02-15 Thread Hyukjin Kwon
Awesome!

On Thu, 16 Feb 2023 at 06:39, Dongjoon Hyun  wrote:

> Great! Thank you, Liang-Chi!
>
> Dongjoon.
>
> On Wed, Feb 15, 2023 at 9:22 AM L. C. Hsieh  wrote:
>
>> The vote passes with 12 +1s (4 binding +1s).
>> Thanks to all who helped with the release!
>>
>> (* = binding)
>> +1:
>> - Mridul Muralidharan (*)
>> - Dongjoon Hyun (*)
>> - Sean Owen (*)
>> - Enrico Minack
>> - Bjørn Jørgensen
>> - Yikun Jiang
>> - Yang Jie
>> - Yuming Wang
>> - John Zhuge
>> - William Hyun
>> - Chao Sun
>> - L. C. Hsieh (*)
>>
>> +0: None
>>
>> -1: None
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Time for release v3.3.2

2023-01-30 Thread Hyukjin Kwon
+100!

On Tue, 31 Jan 2023 at 10:54, Chao Sun  wrote:

> +1, thanks Liang-Chi for volunteering!
>
> Chao
>
> On Mon, Jan 30, 2023 at 5:51 PM L. C. Hsieh  wrote:
> >
> > Hi Spark devs,
> >
> > As you know, it has been 4 months since Spark 3.3.1 was released on
> > 2022/10, it seems a good time to think about next maintenance release,
> > i.e. Spark 3.3.2.
> >
> > I'm thinking of the release of Spark 3.3.2 this Feb (2023/02).
> >
> > What do you think?
> >
> > I am willing to volunteer for Spark 3.3.2 if there is consensus about
> > this maintenance release.
> >
> > Thank you.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Time for Spark 3.4.0 release?

2023-01-24 Thread Hyukjin Kwon
Thanks Xinrong.

On Wed, 25 Jan 2023 at 12:01, Xinrong Meng  wrote:

> Hi All,
>
> Apache Spark 3.4 is cut as https://github.com/apache/spark/tree/branch-3.4
> .
>
> Thanks,
>
> Xinrong Meng
>
> On Wed, Jan 18, 2023 at 3:45 PM Hyukjin Kwon  wrote:
>
>> Yeah, these more look like something we should discuss around RC timing.
>> See "Spark 3.4 release window" in
>> https://spark.apache.org/versioning-policy.html
>>
>> On Wed, 18 Jan 2023 at 16:28, Enrico Minack 
>> wrote:
>>
>>> You are saying the RCs are cut from that branch at a later point? What
>>> is the estimated deadline for that?
>>>
>>> Enrico
>>>
>>>
>>> Am 18.01.23 um 07:59 schrieb Hyukjin Kwon:
>>>
>>> These look like we can fix it after the branch-cut so should be fine.
>>>
>>> On Wed, 18 Jan 2023 at 15:57, Enrico Minack 
>>> wrote:
>>>
>>>> Hi Xinrong,
>>>>
>>>> what about regression issue
>>>> https://issues.apache.org/jira/browse/SPARK-40819
>>>> and correctness issue https://issues.apache.org/jira/browse/SPARK-40885
>>>> ?
>>>>
>>>> The latter gets fixed by either
>>>> https://issues.apache.org/jira/browse/SPARK-41959 or
>>>> https://issues.apache.org/jira/browse/SPARK-42049.
>>>>
>>>> Are those considered important?
>>>>
>>>> Cheers,
>>>> Enrico
>>>>
>>>>
>>>> Am 18.01.23 um 04:29 schrieb Xinrong Meng:
>>>>
>>>> Hi All,
>>>>
>>>> Considering there are still important issues unresolved (some are as
>>>> shown below), I would suggest we be conservative and delay branch-3.4's
>>>> cut for one week.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-39375
>>>> https://issues.apache.org/jira/browse/SPARK-41589
>>>> https://issues.apache.org/jira/browse/SPARK-42075
>>>> https://issues.apache.org/jira/browse/SPARK-25299
>>>> https://issues.apache.org/jira/browse/SPARK-41053
>>>>
>>>> I plan to cut *branch-3.4* at *18:30 PT, January 24, 2023*. Please
>>>> ensure your changes for Apache Spark 3.4 to be ready by that time.
>>>>
>>>> Feel free to reply to the email if you have other ongoing big items for
>>>> Spark 3.4.
>>>>
>>>> Thanks,
>>>>
>>>> Xinrong Meng
>>>>
>>>> On Sat, Jan 7, 2023 at 9:16 AM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Thanks Xinrong.
>>>>>
>>>>> On Sat, Jan 7, 2023 at 9:18 AM Xinrong Meng 
>>>>> wrote:
>>>>>
>>>>>> The release window for Apache Spark 3.4.0 is updated per
>>>>>> https://github.com/apache/spark-website/pull/430.
>>>>>>
>>>>>> Thank you all!
>>>>>>
>>>>>> On Thu, Jan 5, 2023 at 2:10 PM Maxim Gekk 
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Thu, Jan 5, 2023 at 12:25 AM huaxin gao 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 Thanks!
>>>>>>>>
>>>>>>>> On Wed, Jan 4, 2023 at 10:19 AM L. C. Hsieh 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>>
>>>>>>>>> On Wed, Jan 4, 2023 at 9:13 AM Chao Sun 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1, thanks!
>>>>>>>>>>
>>>>>>>>>> Chao
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 4, 2023 at 1:56 AM Mridul Muralidharan <
>>>>>>>>>> mri...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +1, Thanks !
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Mridul
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 4, 2023 at 2:20 AM Gengliang Wang 
>>>>>>>>>>> wrote:

Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Hyukjin Kwon
Yeah, these more look like something we should discuss around RC timing.
See "Spark 3.4 release window" in
https://spark.apache.org/versioning-policy.html

On Wed, 18 Jan 2023 at 16:28, Enrico Minack  wrote:

> You are saying the RCs are cut from that branch at a later point? What is
> the estimated deadline for that?
>
> Enrico
>
>
> Am 18.01.23 um 07:59 schrieb Hyukjin Kwon:
>
> These look like we can fix it after the branch-cut so should be fine.
>
> On Wed, 18 Jan 2023 at 15:57, Enrico Minack 
> wrote:
>
>> Hi Xinrong,
>>
>> what about regression issue
>> https://issues.apache.org/jira/browse/SPARK-40819
>> and correctness issue https://issues.apache.org/jira/browse/SPARK-40885?
>>
>> The latter gets fixed by either
>> https://issues.apache.org/jira/browse/SPARK-41959 or
>> https://issues.apache.org/jira/browse/SPARK-42049.
>>
>> Are those considered important?
>>
>> Cheers,
>> Enrico
>>
>>
>> Am 18.01.23 um 04:29 schrieb Xinrong Meng:
>>
>> Hi All,
>>
>> Considering there are still important issues unresolved (some are as
>> shown below), I would suggest we be conservative and delay branch-3.4's
>> cut for one week.
>>
>> https://issues.apache.org/jira/browse/SPARK-39375
>> https://issues.apache.org/jira/browse/SPARK-41589
>> https://issues.apache.org/jira/browse/SPARK-42075
>> https://issues.apache.org/jira/browse/SPARK-25299
>> https://issues.apache.org/jira/browse/SPARK-41053
>>
>> I plan to cut *branch-3.4* at *18:30 PT, January 24, 2023*. Please
>> ensure your changes for Apache Spark 3.4 to be ready by that time.
>>
>> Feel free to reply to the email if you have other ongoing big items for
>> Spark 3.4.
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>> On Sat, Jan 7, 2023 at 9:16 AM Hyukjin Kwon  wrote:
>>
>>> Thanks Xinrong.
>>>
>>> On Sat, Jan 7, 2023 at 9:18 AM Xinrong Meng 
>>> wrote:
>>>
>>>> The release window for Apache Spark 3.4.0 is updated per
>>>> https://github.com/apache/spark-website/pull/430.
>>>>
>>>> Thank you all!
>>>>
>>>> On Thu, Jan 5, 2023 at 2:10 PM Maxim Gekk 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Thu, Jan 5, 2023 at 12:25 AM huaxin gao 
>>>>> wrote:
>>>>>
>>>>>> +1 Thanks!
>>>>>>
>>>>>> On Wed, Jan 4, 2023 at 10:19 AM L. C. Hsieh  wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> On Wed, Jan 4, 2023 at 9:13 AM Chao Sun  wrote:
>>>>>>>
>>>>>>>> +1, thanks!
>>>>>>>>
>>>>>>>> Chao
>>>>>>>>
>>>>>>>> On Wed, Jan 4, 2023 at 1:56 AM Mridul Muralidharan <
>>>>>>>> mri...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1, Thanks !
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mridul
>>>>>>>>>
>>>>>>>>> On Wed, Jan 4, 2023 at 2:20 AM Gengliang Wang 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1, thanks for driving the release!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Gengliang
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 3, 2023 at 10:55 PM Dongjoon Hyun <
>>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>> Dongjoon
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 3, 2023 at 9:44 PM Rui Wang 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 to cut the branch starting from a workday!
>>>>>>>>>>>>
>>>>>>>>>>>> Great to see this is happening!
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Xinrong!
>>>>>>>>>>>>
>>>>>>>>>>>> -Rui
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 3, 2023 at 9:21 PM 416161...@qq.com <
>>>>>>>>>>>> ruife...@foxmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1, thank you Xinrong for driving this release!
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ruifeng Zheng
>>>>>>>>>>>>> ruife...@foxmail.com
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Original --
>>>>>>>>>>>>> *From:* "Hyukjin Kwon" ;
>>>>>>>>>>>>> *Date:* Wed, Jan 4, 2023 01:15 PM
>>>>>>>>>>>>> *To:* "Xinrong Meng";
>>>>>>>>>>>>> *Cc:* "dev";
>>>>>>>>>>>>> *Subject:* Re: Time for Spark 3.4.0 release?
>>>>>>>>>>>>>
>>>>>>>>>>>>> SGTM +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng <
>>>>>>>>>>>>> xinrong.apa...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shall we cut *branch-3.4* on *January 16th, 2023*? We
>>>>>>>>>>>>>> proposed January 15th per
>>>>>>>>>>>>>> https://spark.apache.org/versioning-policy.html, but I would
>>>>>>>>>>>>>> suggest we postpone one day since January 15th is a Sunday.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to volunteer as the release manager for *Apache
>>>>>>>>>>>>>> Spark 3.4.0*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Xinrong Meng
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>
>


Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Hyukjin Kwon
These look like we can fix it after the branch-cut so should be fine.

On Wed, 18 Jan 2023 at 15:57, Enrico Minack  wrote:

> Hi Xinrong,
>
> what about regression issue
> https://issues.apache.org/jira/browse/SPARK-40819
> and correctness issue https://issues.apache.org/jira/browse/SPARK-40885?
>
> The latter gets fixed by either
> https://issues.apache.org/jira/browse/SPARK-41959 or
> https://issues.apache.org/jira/browse/SPARK-42049.
>
> Are those considered important?
>
> Cheers,
> Enrico
>
>
> Am 18.01.23 um 04:29 schrieb Xinrong Meng:
>
> Hi All,
>
> Considering there are still important issues unresolved (some are as shown
> below), I would suggest we be conservative and delay branch-3.4's cut
> for one week.
>
> https://issues.apache.org/jira/browse/SPARK-39375
> https://issues.apache.org/jira/browse/SPARK-41589
> https://issues.apache.org/jira/browse/SPARK-42075
> https://issues.apache.org/jira/browse/SPARK-25299
> https://issues.apache.org/jira/browse/SPARK-41053
>
> I plan to cut *branch-3.4* at *18:30 PT, January 24, 2023*. Please ensure
> your changes for Apache Spark 3.4 to be ready by that time.
>
> Feel free to reply to the email if you have other ongoing big items for
> Spark 3.4.
>
> Thanks,
>
> Xinrong Meng
>
> On Sat, Jan 7, 2023 at 9:16 AM Hyukjin Kwon  wrote:
>
>> Thanks Xinrong.
>>
>> On Sat, Jan 7, 2023 at 9:18 AM Xinrong Meng 
>> wrote:
>>
>>> The release window for Apache Spark 3.4.0 is updated per
>>> https://github.com/apache/spark-website/pull/430.
>>>
>>> Thank you all!
>>>
>>> On Thu, Jan 5, 2023 at 2:10 PM Maxim Gekk 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Jan 5, 2023 at 12:25 AM huaxin gao 
>>>> wrote:
>>>>
>>>>> +1 Thanks!
>>>>>
>>>>> On Wed, Jan 4, 2023 at 10:19 AM L. C. Hsieh  wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> On Wed, Jan 4, 2023 at 9:13 AM Chao Sun  wrote:
>>>>>>
>>>>>>> +1, thanks!
>>>>>>>
>>>>>>> Chao
>>>>>>>
>>>>>>> On Wed, Jan 4, 2023 at 1:56 AM Mridul Muralidharan 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> +1, Thanks !
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mridul
>>>>>>>>
>>>>>>>> On Wed, Jan 4, 2023 at 2:20 AM Gengliang Wang 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1, thanks for driving the release!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gengliang
>>>>>>>>>
>>>>>>>>> On Tue, Jan 3, 2023 at 10:55 PM Dongjoon Hyun <
>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>> Dongjoon
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 3, 2023 at 9:44 PM Rui Wang 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 to cut the branch starting from a workday!
>>>>>>>>>>>
>>>>>>>>>>> Great to see this is happening!
>>>>>>>>>>>
>>>>>>>>>>> Thanks Xinrong!
>>>>>>>>>>>
>>>>>>>>>>> -Rui
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 3, 2023 at 9:21 PM 416161...@qq.com <
>>>>>>>>>>> ruife...@foxmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1, thank you Xinrong for driving this release!
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ruifeng Zheng
>>>>>>>>>>>> ruife...@foxmail.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -- Original --
>>>>>>>>>>>> *From:* "Hyukjin Kwon" ;
>>>>>>>>>>>> *Date:* Wed, Jan 4, 2023 01:15 PM
>>>>>>>>>>>> *To:* "Xinrong Meng";
>>>>>>>>>>>> *Cc:* "dev";
>>>>>>>>>>>> *Subject:* Re: Time for Spark 3.4.0 release?
>>>>>>>>>>>>
>>>>>>>>>>>> SGTM +1
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng <
>>>>>>>>>>>> xinrong.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shall we cut *branch-3.4* on *January 16th, 2023*? We
>>>>>>>>>>>>> proposed January 15th per
>>>>>>>>>>>>> https://spark.apache.org/versioning-policy.html, but I would
>>>>>>>>>>>>> suggest we postpone one day since January 15th is a Sunday.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to volunteer as the release manager for *Apache
>>>>>>>>>>>>> Spark 3.4.0*.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Xinrong Meng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>


Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Hyukjin Kwon
+1. Thanks for driving this, Xinrong.

On Wed, 18 Jan 2023 at 12:31, Xinrong Meng  wrote:

> Hi All,
>
> Considering there are still important issues unresolved (some are as shown
> below), I would suggest we be conservative and delay branch-3.4's cut
> for one week.
>
> https://issues.apache.org/jira/browse/SPARK-39375
> https://issues.apache.org/jira/browse/SPARK-41589
> https://issues.apache.org/jira/browse/SPARK-42075
> https://issues.apache.org/jira/browse/SPARK-25299
> https://issues.apache.org/jira/browse/SPARK-41053
>
> I plan to cut *branch-3.4* at *18:30 PT, January 24, 2023*. Please ensure
> your changes for Apache Spark 3.4 to be ready by that time.
>
> Feel free to reply to the email if you have other ongoing big items for
> Spark 3.4.
>
> Thanks,
>
> Xinrong Meng
>
> On Sat, Jan 7, 2023 at 9:16 AM Hyukjin Kwon  wrote:
>
>> Thanks Xinrong.
>>
>> On Sat, Jan 7, 2023 at 9:18 AM Xinrong Meng 
>> wrote:
>>
>>> The release window for Apache Spark 3.4.0 is updated per
>>> https://github.com/apache/spark-website/pull/430.
>>>
>>> Thank you all!
>>>
>>> On Thu, Jan 5, 2023 at 2:10 PM Maxim Gekk 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Jan 5, 2023 at 12:25 AM huaxin gao 
>>>> wrote:
>>>>
>>>>> +1 Thanks!
>>>>>
>>>>> On Wed, Jan 4, 2023 at 10:19 AM L. C. Hsieh  wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> On Wed, Jan 4, 2023 at 9:13 AM Chao Sun  wrote:
>>>>>>
>>>>>>> +1, thanks!
>>>>>>>
>>>>>>> Chao
>>>>>>>
>>>>>>> On Wed, Jan 4, 2023 at 1:56 AM Mridul Muralidharan 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> +1, Thanks !
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mridul
>>>>>>>>
>>>>>>>> On Wed, Jan 4, 2023 at 2:20 AM Gengliang Wang 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1, thanks for driving the release!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gengliang
>>>>>>>>>
>>>>>>>>> On Tue, Jan 3, 2023 at 10:55 PM Dongjoon Hyun <
>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>> Dongjoon
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 3, 2023 at 9:44 PM Rui Wang 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 to cut the branch starting from a workday!
>>>>>>>>>>>
>>>>>>>>>>> Great to see this is happening!
>>>>>>>>>>>
>>>>>>>>>>> Thanks Xinrong!
>>>>>>>>>>>
>>>>>>>>>>> -Rui
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 3, 2023 at 9:21 PM 416161...@qq.com <
>>>>>>>>>>> ruife...@foxmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1, thank you Xinrong for driving this release!
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ruifeng Zheng
>>>>>>>>>>>> ruife...@foxmail.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -- Original --
>>>>>>>>>>>> *From:* "Hyukjin Kwon" ;
>>>>>>>>>>>> *Date:* Wed, Jan 4, 2023 01:15 PM
>>>>>>>>>>>> *To:* "Xinrong Meng";
>>>>>>>>>>>> *Cc:* "dev";
>>>>>>>>>>>> *Subject:* Re: Time for Spark 3.4.0 release?
>>>>>>>>>>>>
>>>>>>>>>>>> SGTM +1
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng <
>>>>>>>>>>>> xinrong.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shall we cut *branch-3.4* on *January 16th, 2023*? We
>>>>>>>>>>>>> proposed January 15th per
>>>>>>>>>>>>> https://spark.apache.org/versioning-policy.html, but I would
>>>>>>>>>>>>> suggest we postpone one day since January 15th is a Sunday.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to volunteer as the release manager for *Apache
>>>>>>>>>>>>> Spark 3.4.0*.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Xinrong Meng
>>>>>>>>>>>>>
>>>>>>>>>>>>>


SparkR build with AppVeyor, broken by external reason

2023-01-16 Thread Hyukjin Kwon
Hi all,

AppVeyor is currently broken, presumably due to the flaky GitHub authorization issue (
https://help.appveyor.com/discussions/problems/11287-the-build-phase-is-set-to-msbuild-mode-default-but-no-visual-studio-project-or-solution-files-were-found
).

The AppVeyor build is specific to SparkR (on Windows), so it can be ignored in
most cases for now.


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Hyukjin Kwon
+1

On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim 
wrote:

> bump for more visibility.
>
> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Hi dev,
>>
>> I'd like to propose the deprecation of DStream in Spark 3.4, in favor of
>> promoting Structured Streaming.
>> (Sorry for the late proposal; if we don't make the change in 3.4, we will
>> have to wait another 6 months.)
>>
>> We have been focusing on Structured Streaming for years (across multiple
>> major and minor versions), and during that time we haven't made any
>> improvements for DStream. Furthermore, recently we updated the DStream doc
>> to explicitly say DStream is a legacy project.
>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>>
>> The baseline for deprecation is that we don't see a particular use case
>> which only DStream solves. This is a different story from GraphX and MLlib,
>> as we don't have replacements for those.
>>
>> The proposal does not mean we will remove the API soon, as the Spark
>> project has a track record of deprecating public APIs long before removing
>> them. I don't intend to propose a target version for removal. The goal is
>> to guide users to refrain from building new workloads with DStream. We
>> might want to pursue removal in the future, but that would require a new
>> discussion thread at that time.
>>
>> What do you think?
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>


Re: Base Docker image caching broken in CI

2023-01-11 Thread Hyukjin Kwon
Seems like it's fixed now!

On Wed, 11 Jan 2023 at 15:58, Hyukjin Kwon  wrote:

> Hi all,
>
> ghcr is flaky now, so we will have to wait for a couple of days and see if
> it gets fixed up soon.
> See also
> https://github.com/apache/spark/pull/39490#issuecomment-1378190658
> Thanks Yikun for taking a look at this.
>


Base Docker image caching broken in CI

2023-01-10 Thread Hyukjin Kwon
Hi all,

ghcr is flaky now, so we will have to wait for a couple of days and see if
it gets fixed up soon.
See also https://github.com/apache/spark/pull/39490#issuecomment-1378190658
Thanks Yikun for taking a look at this.


Re: Time for Spark 3.4.0 release?

2023-01-06 Thread Hyukjin Kwon
Thanks Xinrong.

On Sat, Jan 7, 2023 at 9:18 AM Xinrong Meng 
wrote:

> The release window for Apache Spark 3.4.0 is updated per
> https://github.com/apache/spark-website/pull/430.
>
> Thank you all!
>
> On Thu, Jan 5, 2023 at 2:10 PM Maxim Gekk 
> wrote:
>
>> +1
>>
>> On Thu, Jan 5, 2023 at 12:25 AM huaxin gao 
>> wrote:
>>
>>> +1 Thanks!
>>>
>>> On Wed, Jan 4, 2023 at 10:19 AM L. C. Hsieh  wrote:
>>>
>>>> +1
>>>>
>>>> Thank you!
>>>>
>>>> On Wed, Jan 4, 2023 at 9:13 AM Chao Sun  wrote:
>>>>
>>>>> +1, thanks!
>>>>>
>>>>> Chao
>>>>>
>>>>> On Wed, Jan 4, 2023 at 1:56 AM Mridul Muralidharan 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> +1, Thanks !
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>> On Wed, Jan 4, 2023 at 2:20 AM Gengliang Wang 
>>>>>> wrote:
>>>>>>
>>>>>>> +1, thanks for driving the release!
>>>>>>>
>>>>>>>
>>>>>>> Gengliang
>>>>>>>
>>>>>>> On Tue, Jan 3, 2023 at 10:55 PM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>> Dongjoon
>>>>>>>>
>>>>>>>> On Tue, Jan 3, 2023 at 9:44 PM Rui Wang 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 to cut the branch starting from a workday!
>>>>>>>>>
>>>>>>>>> Great to see this is happening!
>>>>>>>>>
>>>>>>>>> Thanks Xinrong!
>>>>>>>>>
>>>>>>>>> -Rui
>>>>>>>>>
>>>>>>>>> On Tue, Jan 3, 2023 at 9:21 PM 416161...@qq.com <
>>>>>>>>> ruife...@foxmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1, thank you Xinrong for driving this release!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ruifeng Zheng
>>>>>>>>>> ruife...@foxmail.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -- Original --
>>>>>>>>>> *From:* "Hyukjin Kwon" ;
>>>>>>>>>> *Date:* Wed, Jan 4, 2023 01:15 PM
>>>>>>>>>> *To:* "Xinrong Meng";
>>>>>>>>>> *Cc:* "dev";
>>>>>>>>>> *Subject:* Re: Time for Spark 3.4.0 release?
>>>>>>>>>>
>>>>>>>>>> SGTM +1
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng <
>>>>>>>>>> xinrong.apa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Shall we cut *branch-3.4* on *January 16th, 2023*? We proposed
>>>>>>>>>>> January 15th per
>>>>>>>>>>> https://spark.apache.org/versioning-policy.html, but I would
>>>>>>>>>>> suggest we postpone one day since January 15th is a Sunday.
>>>>>>>>>>>
>>>>>>>>>>> I would like to volunteer as the release manager for *Apache
>>>>>>>>>>> Spark 3.4.0*.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Xinrong Meng
>>>>>>>>>>>
>>>>>>>>>>>


Re: Time for Spark 3.4.0 release?

2023-01-03 Thread Hyukjin Kwon
SGTM +1

On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng 
wrote:

> Hi All,
>
> Shall we cut *branch-3.4* on *January 16th, 2023*? We proposed January
> 15th per
> https://spark.apache.org/versioning-policy.html, but I would suggest we
> postpone one day since January 15th is a Sunday.
>
> I would like to volunteer as the release manager for *Apache Spark 3.4.0*.
>
> Thanks,
>
> Xinrong Meng
>
>


Re: maven build failing in spark sql w/BouncyCastleProvider CNFE

2022-12-05 Thread Hyukjin Kwon
Steve, does the lower version of the Scala plugin work for you? If that
solves it, we could temporarily downgrade for now.

On Mon, 5 Dec 2022 at 22:23, Steve Loughran 
wrote:

>  Trying to build Spark master w/ Hadoop trunk, and the scala-maven-plugin is
> failing. This doesn't happen with the 3.3.5 RC0.
>
> I note that the only mention of this anywhere was mine, in March.
>
> Clearly something in Hadoop trunk has changed in a way which is
> incompatible.
>
> Has anyone else tried such a build or seen this problem? Any suggestions of
> a fix?
>
> Created SPARK-41392 to cover this...
>
> [INFO]
> 
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile
> (scala-test-compile-first) on project spark-sql_2.12: Execution
> scala-test-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required
> class was missing while executing
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile:
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2
> [ERROR] strategy =
> org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
> [ERROR] urls[0] =
> file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar
> [ERROR] urls[1] =
> file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar
> [ERROR] urls[2] =
> file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar
> [ERROR] urls[3] =
> file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar
> [ERROR] urls[4] =
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar
> [ERROR] urls[5] =
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar
> [ERROR] urls[6] =
> file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar
> [ERROR] urls[7] =
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar
> [ERROR] urls[8] =
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar
> [ERROR] urls[9] =
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
> [ERROR] urls[10] =
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar
> [ERROR] urls[11] =
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar
> [ERROR] urls[12] =
> file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar
> [ERROR] urls[13] =
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar
> [ERROR] urls[14] =
> file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar
> [ERROR] urls[15] =
> file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar
> [ERROR] urls[16] =
> file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar
> [ERROR] urls[17] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar
> [ERROR] urls[18] =
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar
> [ERROR] urls[19] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar
> [ERROR] urls[20] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar
> [ERROR] urls[21] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar
> [ERROR] urls[22] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar
> [ERROR] urls[23] =
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar
> [ERROR] urls[24] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar
> [ERROR] urls[25] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar
> [ERROR] urls[26] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-persist-core-assembly/1.7.1/zinc-persist-core-assembly-1.7.1.jar
> [ERROR] urls[27] =
> file:/Users/stevel/.m2/repository/org/scala-lang/modules/scala-parallel-collections_2.13/0.2.0/scala-parallel-collections_2.13-0.2.0.jar
> [ERROR] urls[28] =
> file:/Users/stevel/.m2/repository/org/scala-sbt/io_2.13/1.7.0/io_2.13-1.7.0.jar
> [ERROR] urls[29] =
> file:/Users/stevel/.m2/repository/com/swoval/file-tree-views/2.1.9/file-tree-views-2.1.9.jar
> [ERROR] urls[30] =
> 

Re: Contributions needed: 4 higher order functions

2022-12-01 Thread Hyukjin Kwon
Yes, I can. Please send me an email that includes your preferred id and
your full name, see also https://infra.apache.org/jira-guidelines.html#who

On Fri, 2 Dec 2022 at 08:00, jason carlson  wrote:

> Can someone make me a jira account?
>
> Sent from my iPhone
>
> On Nov 30, 2022, at 5:35 AM, Hyukjin Kwon  wrote:
>
> 
> Hi all,
>
> There are four higher order functions in our backlog:
>
> - https://issues.apache.org/jira/browse/SPARK-41235
> - https://issues.apache.org/jira/browse/SPARK-41234
> - https://issues.apache.org/jira/browse/SPARK-41233
> - https://issues.apache.org/jira/browse/SPARK-41232
>
> This would be a great chance for new contributors to understand and get
> into the Catalyst optimizer and Spark SQL.
>
> Any help on these tickets would be much appreciated.
>
>


Re: [VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Hyukjin Kwon
+1

On Thu, 1 Dec 2022 at 12:39, Mridul Muralidharan  wrote:

>
> +1
>
> Regards,
> Mridul
>
> On Wed, Nov 30, 2022 at 8:55 PM Xingbo Jiang 
> wrote:
>
>> +1
>>
>> On Wed, Nov 30, 2022 at 5:59 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Starting with +1 from me.
>>>
>>> On Thu, Dec 1, 2022 at 10:54 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Hi all,

 I'd like to start the vote for SPIP: Asynchronous Offset Management in
 Structured Streaming.

 The high-level summary of the SPIP is that we propose a couple of
 improvements to offset management in microbatch execution to lower
 processing latency, which would help certain types of workloads.

 References:

- JIRA ticket 
- SPIP doc

 
- Discussion thread


 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks!
 Jungtaek Lim (HeartSaVioR)

>>>


Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Hyukjin Kwon
+1

On Thu, 1 Dec 2022 at 08:10, Shixiong Zhu  wrote:

> +1
>
> This is exciting. I agree with Jerry that this SPIP and continuous
> processing are orthogonal. This SPIP itself would be a great improvement
> and impact most Structured Streaming users.
>
> Best Regards,
> Shixiong
>
>
> On Wed, Nov 30, 2022 at 6:57 AM Mridul Muralidharan 
> wrote:
>
>>
>> Thanks for all the clarifications and details Jerry, Jungtaek :-)
>> This looks like an exciting improvement to Structured Streaming - looking
>> forward to it becoming part of Apache Spark !
>>
>> Regards,
>> Mridul
>>
>>
>> On Mon, Nov 28, 2022 at 8:40 PM Jerry Peng 
>> wrote:
>>
>>> Hi all,
>>>
>>> I will add my two cents. Improving the microbatch execution engine does
>>> not prevent us from working on or improving the continuous execution
>>> engine in the future. These are orthogonal issues. The new mode I am
>>> proposing for the microbatch execution engine intends to lower the
>>> latency of the execution engine that most people use today. We can view
>>> it as an incremental improvement on the existing engine. I see the
>>> continuous execution engine as a partially completed rewrite of Spark
>>> Streaming that may serve as the "future" engine powering Spark Streaming.
>>> Improving the "current" engine does not mean we cannot work on a "future"
>>> engine; the two are not mutually exclusive. I would like to focus the
>>> discussion on the merits of this feature with regard to the current
>>> micro-batch execution engine, not on the future of the continuous
>>> execution engine.
>>>
>>> Best,
>>>
>>> Jerry
>>>
>>>
>>> On Wed, Nov 23, 2022 at 3:17 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Hi Mridul,

 I'd like to make it clear to avoid any misunderstanding - the decision was
 not led by me. (I'm just one of the engineers in the team, not even the
 TL.) As you can see from the direction, there was an internal consensus
 not to revisit continuous mode. There are various reasons, which I think
 we already know. You may remember that I raised concerns about continuous
 mode, but note that was more than 2 years ago, and I still see no traction
 around the project. The main reason I abandoned that discussion was the
 promising effort to integrate push-based shuffle into continuous mode to
 support shuffle, but no such effort has materialized so far.

 The goal of this SPIP is to have an alternative approach to the same
 workload, given that we no longer have confidence in the success of
 continuous mode. But I also want to make clear that deprecating and
 eventually retiring continuous mode is not a goal of this project. If that
 happens eventually, it would be a side effect. Some may have concerns
 that we have two different projects aiming for a similar thing, but I'd
 rather see the two projects compete. Anyone willing to improve continuous
 mode can start making the effort right now. This SPIP does not block it.


 On Wed, Nov 23, 2022 at 5:29 PM Mridul Muralidharan 
 wrote:

>
> Hi Jungtaek,
>
>   Given that the goal of the SPIP is reducing latency for stateless apps,
> which should reasonably fit continuous mode's design goals, it feels odd
> not to support it in the proposal.
>
> I know you have raised concerns about continuous mode in the past on the
> dev@ list as well, and we are further ignoring it in this proposal (and
> possibly other enhancements in the past few releases).
>
> Do you want to revisit the discussion to support it and propose a vote
> on that ? And move it to deprecated ?
>
> I am much more comfortable not supporting this SPIP for CM if it was
> deprecated.
>
> Thoughts ?
>
> Regards,
> Mridul
>
>
>
>
> On Wed, Nov 23, 2022 at 1:16 AM Jerry Peng <
> jerry.boyang.p...@gmail.com> wrote:
>
>> Jungtaek,
>>
>> Thanks for taking up the role of shepherding this SPIP! Thank you also
>> for chiming in with your thoughts concerning continuous mode!
>>
>> Best,
>>
>> Jerry
>>
>> On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Just FYI, I'm shepherding this SPIP project.
>>>
>>> I think the major meta question would be, "why don't we spend
>>> effort on continuous mode rather than initiating another feature aiming 
>>> for
>>> the same workload?". Jerry already updated the doc to answer the 
>>> question,
>>> but I can also share my thoughts about it.
>>>
>>> I feel like the current "continuous mode" is a niche solution. (That's
>>> not to assign blame; if you have to deal with such a workload but can't
>>> rewrite the underlying engine from scratch, there are really few
>>> options.)
>>> Since the implementation went with a workaround to implement 

Contributions needed: 4 higher order functions

2022-11-30 Thread Hyukjin Kwon
Hi all,

There are four higher order functions in our backlog:

- https://issues.apache.org/jira/browse/SPARK-41235
- https://issues.apache.org/jira/browse/SPARK-41234
- https://issues.apache.org/jira/browse/SPARK-41233
- https://issues.apache.org/jira/browse/SPARK-41232

This would be a great chance for new contributors to understand and get
into the Catalyst optimizer and Spark SQL.

Any help on these tickets would be much appreciated.
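
For contributors new to this area: a higher-order function in Spark SQL takes
a lambda over array or map elements and evaluates it inside Catalyst rather
than as a UDF. Below is a minimal PySpark sketch using the existing transform
function to show the user-facing shape; the four ticketed functions are new,
and their exact names and signatures are specified in the JIRAs above, not
here.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([([1, 2, 3],)], ["xs"])

    # The lambda is turned into a Catalyst expression and evaluated per
    # array element inside the engine, not as a Python UDF.
    df.select(F.transform("xs", lambda x: x + 1).alias("ys")).show()
    # +---------+
    # |       ys|
    # +---------+
    # |[2, 3, 4]|
    # +---------+

    spark.stop()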


Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Hyukjin Kwon
Thanks, Yuming.

On Wed, 26 Oct 2022 at 16:01, L. C. Hsieh  wrote:

> Thank you for driving the release of Apache Spark 3.3.1, Yuming!
>
> On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun 
> wrote:
> >
> > It's great. Thank you so much, Yuming!
> >
> > Dongjoon
> >
> > On Tue, Oct 25, 2022 at 11:23 PM Yuming Wang  wrote:
> >>
> >> We are happy to announce the availability of Apache Spark 3.3.1!
> >>
> >> Spark 3.3.1 is a maintenance release containing stability fixes. This
> >> release is based on the branch-3.3 maintenance branch of Spark. We
> strongly
> >> recommend all 3.3 users to upgrade to this stable release.
> >>
> >> To download Spark 3.3.1, head over to the download page:
> >> https://spark.apache.org/downloads.html
> >>
> >> To view the release notes:
> >> https://spark.apache.org/releases/spark-release-3-3-1.html
> >>
> >> We would like to acknowledge all community members for contributing to
> this
> >> release. This release would not have been possible without you.
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Enforcing scalafmt on Spark Connect - connector/connect

2022-10-14 Thread Hyukjin Kwon
I personally like this idea. We already do this in PySpark, and it's
pretty nice that you can simply forget about formatting the code manually.

On Fri, 14 Oct 2022 at 16:37, Martin Grund
 wrote:

> Hi folks,
>
> I'm reaching out to ask to gather input / consensus on the following
> proposal: Since Spark Connect is effectively new code, I would like to
> enforce scalafmt explicitly *only* on this module by adding a check in
> `dev/lint-scala` that checks if there is a diff after running
>
>  ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -pl
> connector/connect
>
> I know that enforcing scalafmt is not desirable on the existing code base,
> but since the Spark Connect code is very new, I'm thinking it might reduce
> friction in code reviews and create a consistent style.
>
> In my previous code reviews where I applied scalafmt, I received
> feedback that the import grouping scalafmt produces differs from our
> default style. I've prepared a PR,
> https://github.com/apache/spark/pull/38252, to address this issue by
> setting the import grouping explicitly in the scalafmt options.
>
> Would you be supportive of enforcing scalafmt *only* on the Spark Connect
> module?
>
> Thanks
> Martin
>
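
For context on the mechanics: the proposed check boils down to "format, then
fail if anything changed". Below is a hedged sketch of that logic in Python;
the actual check was proposed as part of the dev/lint-scala shell script, so
only the maven command is quoted verbatim from the thread and everything else
is illustrative.

    import subprocess
    import sys

    # Format only the Spark Connect module (command quoted from the thread).
    subprocess.run(
        ["./build/mvn", "-Pscala-2.12", "scalafmt:format",
         "-Dscalafmt.skip=false", "-pl", "connector/connect"],
        check=True,
    )

    # If scalafmt rewrote any tracked file, the working tree is dirty and
    # the lint step should fail.
    diff = subprocess.run(
        ["git", "diff", "--name-only", "--", "connector/connect"],
        capture_output=True, text=True, check=True,
    )
    if diff.stdout.strip():
        print("scalafmt check failed; unformatted files:\n" + diff.stdout)
        sys.exit(1)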


Welcome Yikun Jiang as a Spark committer

2022-10-07 Thread Hyukjin Kwon
Hi all,

The Spark PMC recently added Yikun Jiang as a committer on the project.
Yikun is a major contributor to the infrastructure and GitHub Actions in
Apache Spark, as well as to Kubernetes support and PySpark.
He has put a lot of effort into stabilizing and optimizing the builds so we
can all work together on Apache Spark more
efficiently and effectively. He's also driving the SPIP for the Docker
Official Image in Apache Spark for users and developers.
Please join me in welcoming Yikun!


Re: [VOTE][RESULT] SPIP: Support Docker Official Image for Spark

2022-09-25 Thread Hyukjin Kwon
There was a typo in the result email.  I am resending now:

The vote passes with 14 +1s (4 binding +1s).

+1:
Hyukjin Kwon*
Ruifeng Zheng
Yikun Jiang
Qian Sun
Kent Yao
Rui Chen
Xiangrui Meng*
Gengliang Wang*
Martin Grigorov
Yang Jie
Ankit Gupta
Denny Lee
Bryan Cutler
Dongjoon Hyun*

0: None

-1: None

(* = binding)

On Sun, 25 Sept 2022 at 14:54, Hyukjin Kwon  wrote:

> The vote passes with 14 +1s (4 binding +1s).
>
> +1:
> Hyukjin Kwon*
> Ruifeng Zheng
> Yikun Jiang
> Qian Sun
> Kent Yao
> Rui Chen
> Xiangrui Meng*
> Gengliang Wang*
> Martin Grigorov
> Yang Jie
> Ankit Gupta
> Denny Lee
> Bryan Cutler
> Dongjoon Hyun*
>
> 0: None
> (Tom has voiced some architectural concerns)
>
> -1: None
>
> (* = binding)
>
>


[VOTE][RESULT] SPIP: Support Docker Official Image for Spark

2022-09-24 Thread Hyukjin Kwon
The vote passes with 14 +1s (4 binding +1s).

+1:
Hyukjin Kwon*
Ruifeng Zheng
Yikun Jiang
Qian Sun
Kent Yao
Rui Chen
Xiangrui Meng*
Gengliang Wang*
Martin Grigorov
Yang Jie
Ankit Gupta
Denny Lee
Bryan Cutler
Dongjoon Hyun*

0: None
(Tom has voiced some architectural concerns)

-1: None

(* = binding)


Re: [VOTE] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Hyukjin Kwon
Starting with my +1.

On Thu, 22 Sept 2022 at 10:41, Hyukjin Kwon  wrote:

> Hi all,
>
> I would like to start a vote for SPIP: "Support Docker Official Image for
> Spark"
>
> The goal of the SPIP is to add a Docker Official Image (DOI)
> <https://github.com/docker-library/official-images> to ensure the Spark
> Docker images meet the quality standards for Docker images, and to provide
> these Docker images for users who want to use Apache Spark via a Docker
> image.
>
> Please also refer to:
>
> - Previous discussion in dev mailing list: [DISCUSS] SPIP: Support Docker
> Official Image for Spark
> <https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3>
> - SPIP doc: SPIP: Support Docker Official Image for Spark
> <https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o>
> - JIRA: SPARK-40513 <https://issues.apache.org/jira/browse/SPARK-40513>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>


[VOTE] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Hyukjin Kwon
Hi all,

I would like to start a vote for SPIP: "Support Docker Official Image for
Spark"

The goal of the SPIP is to add a Docker Official Image (DOI)
 to ensure the Spark
Docker images meet the quality standards for Docker images, and to provide
these Docker images for users who want to use Apache Spark via a Docker
image.

Please also refer to:

- Previous discussion in dev mailing list: [DISCUSS] SPIP: Support Docker
Official Image for Spark

- SPIP doc: SPIP: Support Docker Official Image for Spark

- JIRA: SPARK-40513 

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …


Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Hyukjin Kwon
Given that support, I will start the vote officially.

On Thu, 22 Sept 2022 at 08:40, Yikun Jiang  wrote:

> @Ankit
>
> Thanks for your support! Your questions are very valuable, but this SPIP
> is just a starting point to cover existing apache/spark image features
> first. We will also set up a build/test/publish image workflow (to ensure
> image quality) and some helper scripts to help developers extend custom
> images more easily in the future.
>
> > How do we support deployments of spark-standalone clusters in case the
> users want to use the same image for spark-standalone clusters? Since that
> is also widely used.
> Yes, it's possible; it can be done by exposing some ports, but we still
> need to validate this and then document it in the standalone mode docs.
>
> > 2. I am not sure about the End of Support of Hadoop 2 with spark, but if
> that is not planned sooner, shouldn't we be making it configurable to be
> able to use spark prebuilt with hadoop 2?
> DOI requires a static Dockerfile, so it can't be configured at
> runtime. Of course, all published Spark releases can also be supported as
> separate images in principle. Regarding support for more distributions, we
> also plan to add some scripts to help generate the Dockerfiles.
>
> > 3. Also, don't we want to make it feasible for the users to be able to
> customize the base linux flavour?
> This is also a good point, but out of scope for this SPIP. Currently, we
> start with Ubuntu (Debian series, apt package manager). We might also
> consider supporting more OSes after this SPIP, such as the
> RHEL/CentOS/Rocky/openEuler series with the yum/dnf package manager. But
> as you know, different OSes have varying package versions and upgrade
> policies, so maintenance is perhaps not easy work, but I think it's
> possible.
>
> Regards,
> Yikun
>
>
> On Thu, Sep 22, 2022 at 3:43 AM Ankit Gupta  wrote:
>
>> Hi Yikun
>>
>> Thanks for all your efforts! This is very much needed. But I have the
>> below three questions:
>> 1. How do we support deployments of spark-standalone clusters in case the
>> users want to use the same image for spark-standalone clusters? Since
>> that is also widely used.
>> 2. I am not sure about the end of support for Hadoop 2 with Spark, but if
>> that is not planned soon, shouldn't we make it configurable so users can
>> use Spark prebuilt with Hadoop 2?
>> 3. Also, don't we want to make it feasible for users to be able to
>> customise the base Linux flavour?
>>
>> Thanks and Regards.
>>
>> Ankit Prakash Gupta
>>
>>
>> On Wed, Sep 21, 2022 at 9:19 PM Xiao Li  wrote:
>>
>>> +1
>>>
>>> On Wed, 21 Sep 2022 at 07:22, Yikun Jiang  wrote:
>>>
 Thanks for all your inputs! BTW, I also created a JIRA to track related
 work: https://issues.apache.org/jira/browse/SPARK-40513

 > can I be involved in this work?

 @qian Of course! Thanks!

 Regards,
 Yikun

 On Wed, Sep 21, 2022 at 7:31 PM Xinrong Meng 
 wrote:

> +1
>
> On Tue, Sep 20, 2022 at 11:08 PM Qian SUN 
> wrote:
>
>> +1.
>> It's valuable, can I be involved in this work?
>>
>> On Mon, 19 Sep 2022 at 08:15, Yikun Jiang  wrote:
>>
>>> Hi, all
>>>
>>> I would like to start the discussion for supporting Docker Official
>>> Image for Spark.
>>>
>>> This SPIP is proposed to add a Docker Official Image (DOI)
>>>  to ensure the
>>> Spark Docker images meet the quality standards for Docker images, and to
>>> provide these Docker images for users who want to use Apache Spark via a
>>> Docker image.
>>>
>>> There are also several Apache projects that release the Docker
>>> Official Images
>>> ,
>>> such as: flink , storm
>>> , solr
>>> , zookeeper
>>> , httpd
>>>  (with 50M+ to 1B+ downloads
>>> each). From the huge download statistics, we can see the real demand from
>>> users, and given the support from other Apache projects, we should also
>>> be able to do it.
>>>
>>> After support:
>>>
>>>    - The Dockerfile will still be maintained by the Apache Spark
>>>      community and reviewed by Docker.
>>>    - The images will be maintained by the Docker community to ensure
>>>      the quality standards for Docker images of the Docker community.
>>>
>>>
>>> It will also reduce the Apache Spark community's extra Docker image
>>> maintenance effort (such as frequent rebuilds and image security
>>> updates).
>>>
>>> See more in SPIP DOC:
>>> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o
>>>
>>> cc: Ruifeng (co-author) and Hyukjin (shepherd)
>>>

Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-18 Thread Hyukjin Kwon
+1

On Mon, 19 Sept 2022 at 09:15, Yikun Jiang  wrote:

> Hi, all
>
> I would like to start the discussion for supporting Docker Official Image
> for Spark.
>
> This SPIP is proposed to add a Docker Official Image (DOI)
>  to ensure the Spark
> Docker images meet the quality standards for Docker images, and to provide
> these Docker images for users who want to use Apache Spark via a Docker
> image.
>
> There are also several Apache projects that release the Docker Official
> Images ,
> such as: flink , storm
> , solr ,
> zookeeper , httpd
>  (with 50M+ to 1B+ downloads each).
> From the huge download statistics, we can see the real demand from users,
> and given the support from other Apache projects, we should also be able
> to do it.
>
> After support:
>
>    - The Dockerfile will still be maintained by the Apache Spark community
>      and reviewed by Docker.
>    - The images will be maintained by the Docker community to ensure the
>      quality standards for Docker images of the Docker community.
>
>
> It will also reduce the Apache Spark community's extra Docker image
> maintenance effort (such as frequent rebuilds and image security updates).
>
> See more in SPIP DOC:
> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o
>
> cc: Ruifeng (co-author) and Hyukjin (shepherd)
>
> Regards,
> Yikun
>


Creating a new component "Connect" in JIRA

2022-09-16 Thread Hyukjin Kwon
Hi all,

I created a new component called "Connect" temporarily for the Spark
Connect project
(see https://issues.apache.org/jira/browse/SPARK-39375), because a lot of
changes will be made in an isolated location, and the concept itself is
pretty isolated as a separate component.
In addition, this will be an alpha component; see also
https://spark.apache.org/versioning-policy.html.

While I don't think there is a particular problem, I wanted to see if
there are any concerns about this. Please let me know if you have any!

Otherwise, I will make some corresponding changes in other places such as
https://github.com/apache/spark/blob/master/.github/labeler.yml.

Thanks all!!


Re: Time for Spark 3.3.1 release?

2022-09-12 Thread Hyukjin Kwon
+1

On Tue, 13 Sept 2022 at 06:45, Gengliang Wang  wrote:

> +1.
> Thank you, Yuming!
>
> On Mon, Sep 12, 2022 at 12:10 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> Thanks Yuming!
>>
>> On Mon, Sep 12, 2022 at 11:50 AM Dongjoon Hyun 
>> wrote:
>> >
>> > +1
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> > On Mon, Sep 12, 2022 at 6:38 AM Yuming Wang  wrote:
>> >>
>> >> Hi, All.
>> >>
>> >>
>> >>
>> >> Since the Apache Spark 3.3.0 tag creation (Jun 10), 138 new patches,
>> including 7 correctness patches, have arrived at branch-3.3.
>> >>
>> >>
>> >>
>> >> Shall we make a new release, Apache Spark 3.3.1, as the second release
>> on branch-3.3? I'd like to volunteer as the release manager for Apache
>> Spark 3.3.1.
>> >>
>> >>
>> >>
>> >> All changes:
>> >>
>> >> https://github.com/apache/spark/compare/v3.3.0...branch-3.3
>> >>
>> >>
>> >>
>> >> Correctness issues:
>> >>
>> >> SPARK-40149: Propagate metadata columns through Project
>> >>
>> >> SPARK-40002: Don't push down limit through window using ntile
>> >>
>> >> SPARK-39976: ArrayIntersect should handle null in left expression
>> correctly
>> >>
>> >> SPARK-39833: Disable Parquet column index in DSv1 to fix a correctness
>> issue in the case of overlapping partition and data columns
>> >>
>> >> SPARK-39061: Set nullable correctly for Inline output attributes
>> >>
>> >> SPARK-39887: RemoveRedundantAliases should keep aliases that make the
>> output of projection nodes unique
>> >>
>> >> SPARK-38614: Don't push down limit through window that's using
>> percent_rank
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Contributions and help needed in SPARK-40005

2022-08-30 Thread Hyukjin Kwon
Oh, that's a mistake. Please just go ahead and reuse that JIRA :-).
You can just create a PR reusing the same JIRA ID for functions.py.

On Wed, 31 Aug 2022 at 01:18, Khalid Mammadov 
wrote:

> Hi @Hyukjin Kwon 
>
> I see you have resolved the JIRA, and I still have some more things to do
> in functions.py (only 50% done). So shall I create a new JIRA for each new
> PR, or is it OK to reuse this one?
>
> On Fri, 19 Aug 2022, 09:29 Khalid Mammadov, 
> wrote:
>
>> Will do, thanks!
>>
>> On Fri, 19 Aug 2022, 09:11 Hyukjin Kwon,  wrote:
>>
>>> Sure, that would be great.
>>>
>>> I did the first 25 functions in functions.py. Please go ahead with the
>>> rest of them.
>>> You can create a PR with the title such
>>> as [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions
>>> examples self-contained (part 2, 25 functions)
>>>
>>> Thanks!
>>>
>>> On Fri, 19 Aug 2022 at 16:50, Khalid Mammadov 
>>> wrote:
>>>
>>>> I am picking up "functions.py" if no one is already working on it
>>>>
>>>> On Fri, 19 Aug 2022, 07:56 Khalid Mammadov, 
>>>> wrote:
>>>>
>>>>> I thought it was all finished (I checked a few). Do you have a list of
>>>>> the remaining 50%?
>>>>> Happy to contribute 
>>>>>
>>>>> On Fri, 19 Aug 2022, 05:54 Hyukjin Kwon,  wrote:
>>>>>
>>>>>> We're halfway, roughly 50%. More contributions would be very helpful.
>>>>>> If the size of the file is too large, feel free to split it to
>>>>>> multiple parts (e.g., https://github.com/apache/spark/pull/37575)
>>>>>>
>>>>>> On Tue, 9 Aug 2022 at 12:26, Qian SUN  wrote:
>>>>>>
>>>>>>> Sure, I will do it. SPARK-40010
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-40010> is built to
>>>>>>> track progress.
>>>>>>>
>>>>>>> On Tue, 9 Aug 2022 at 10:58, Hyukjin Kwon gurwls...@gmail.com wrote:
>>>>>>>
>>>>>>> Please go ahead. Would be very appreciated.
>>>>>>>>
>>>>>>>> On Tue, 9 Aug 2022 at 11:58, Qian SUN 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Hyukjin
>>>>>>>>>
>>>>>>>>> I would like to do some work and pick up *Window.py *if possible.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Qian
>>>>>>>>>
>>>>>>>>> On Tue, 9 Aug 2022 at 10:41, Hyukjin Kwon  wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Khalid for taking a look.
>>>>>>>>>>
>>>>>>>>>> On Tue, 9 Aug 2022 at 00:37, Khalid Mammadov <
>>>>>>>>>> khalidmammad...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Hyukjin
>>>>>>>>>>> That's great initiative, here is a PR that address one of those
>>>>>>>>>>> issues that's waiting for review:
>>>>>>>>>>> https://github.com/apache/spark/pull/37408
>>>>>>>>>>>
>>>>>>>>>>> Perhaps, it would be also good to track these pending issues
>>>>>>>>>>> somewhere to avoid effort duplication.
>>>>>>>>>>>
>>>>>>>>>>> For example, I would like to pick up *union* and *union all* if
>>>>>>>>>>> no one has already.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Khalid
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 8, 2022 at 1:44 PM Hyukjin Kwon 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to improve PySpark documentation especially:
>>>>>>>>>>>>
>>>>>>>>>>>>- Make the examples self-contained, e.g.,
>>>>>>>>>>>>
>>>>>>>>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
>>>>>>>>>>>>- Document Parameters
>>>>>>>>>>>>
>>>>>>>>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
>>>>>>>>>>>>There are many APIs that miss parameters in PySpark, e.g., 
>>>>>>>>>>>> DataFrame.union
>>>>>>>>>>>>
>>>>>>>>>>>> Here is one example PR I am working on:
>>>>>>>>>>>> https://github.com/apache/spark/pull/37437
>>>>>>>>>>>> I can't do it all by myself. Any help, review, and
>>>>>>>>>>>> contributions would be welcome and appreciated.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you all in advance.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best!
>>>>>>>>> Qian SUN
>>>>>>>>>
>>>>>>>> --
>>>>>>> Best!
>>>>>>> Qian SUN
>>>>>>>
>>>>>>


Re: Contributions and help needed in SPARK-40005

2022-08-19 Thread Hyukjin Kwon
Sure, that would be great.

I did the first 25 functions in functions.py. Please go ahead with the rest
of them.
You can create a PR with the title such
as [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions
examples self-contained (part 2, 25 functions)

Thanks!

On Fri, 19 Aug 2022 at 16:50, Khalid Mammadov 
wrote:

> I am picking up "functions.py" if no one is already working on it
>
> On Fri, 19 Aug 2022, 07:56 Khalid Mammadov, 
> wrote:
>
>> I thought it was all finished (I checked a few). Do you have a list of the remaining 50%?
>> Happy to contribute 
>>
>> On Fri, 19 Aug 2022, 05:54 Hyukjin Kwon,  wrote:
>>
>>> We're halfway, roughly 50%. More contributions would be very helpful.
>>> If the size of the file is too large, feel free to split it to multiple
>>> parts (e.g., https://github.com/apache/spark/pull/37575)
>>>
>>> On Tue, 9 Aug 2022 at 12:26, Qian SUN  wrote:
>>>
>>>> Sure, I will do it. SPARK-40010
>>>> <https://issues.apache.org/jira/browse/SPARK-40010> is built to track
>>>> progress.
>>>>
>>>> On Tue, 9 Aug 2022 at 10:58, Hyukjin Kwon gurwls...@gmail.com wrote:
>>>>
>>>> Please go ahead. Would be very appreciated.
>>>>>
>>>>> On Tue, 9 Aug 2022 at 11:58, Qian SUN  wrote:
>>>>>
>>>>>> Hi Hyukjin
>>>>>>
>>>>>> I would like to do some work and pick up *Window.py *if possible.
>>>>>>
>>>>>> Thanks,
>>>>>> Qian
>>>>>>
>>>>> On Tue, 9 Aug 2022 at 10:41, Hyukjin Kwon  wrote:
>>>>>>
>>>>>>> Thanks Khalid for taking a look.
>>>>>>>
>>>>>>> On Tue, 9 Aug 2022 at 00:37, Khalid Mammadov <
>>>>>>> khalidmammad...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Hyukjin
>>>>>>>> That's a great initiative; here is a PR that addresses one of those
>>>>>>>> issues and is waiting for review:
>>>>>>>> https://github.com/apache/spark/pull/37408
>>>>>>>>
>>>>>>>> Perhaps it would also be good to track these pending issues
>>>>>>>> somewhere to avoid duplicated effort.
>>>>>>>>
>>>>>>>> For example, I would like to pick up *union* and *union all* if no
>>>>>>>> one has already.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Khalid
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Aug 8, 2022 at 1:44 PM Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am trying to improve PySpark documentation especially:
>>>>>>>>>
>>>>>>>>>- Make the examples self-contained, e.g.,
>>>>>>>>>
>>>>>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
>>>>>>>>>- Document Parameters
>>>>>>>>>
>>>>>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
>>>>>>>>>There are many APIs that miss parameters in PySpark, e.g., 
>>>>>>>>> DataFrame.union
>>>>>>>>>
>>>>>>>>> Here is one example PR I am working on:
>>>>>>>>> https://github.com/apache/spark/pull/37437
>>>>>>>>> I can't do it all by myself. Any help, review, and contributions
>>>>>>>>> would be welcome and appreciated.
>>>>>>>>>
>>>>>>>>> Thank you all in advance.
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best!
>>>>>> Qian SUN
>>>>>>
>>>>> --
>>>> Best!
>>>> Qian SUN
>>>>
>>>


Re: Contributions and help needed in SPARK-40005

2022-08-18 Thread Hyukjin Kwon
We're halfway, roughly 50%. More contributions would be very helpful.
If the size of the file is too large, feel free to split it to multiple
parts (e.g., https://github.com/apache/spark/pull/37575)

On Tue, 9 Aug 2022 at 12:26, Qian SUN  wrote:

> Sure, I will do it. SPARK-40010
> <https://issues.apache.org/jira/browse/SPARK-40010> is built to track
> progress.
>
> On Tue, 9 Aug 2022 at 10:58, Hyukjin Kwon gurwls...@gmail.com wrote:
>
> Please go ahead. Would be very appreciated.
>>
>> On Tue, 9 Aug 2022 at 11:58, Qian SUN  wrote:
>>
>>> Hi Hyukjin
>>>
>>> I would like to do some work and pick up *Window.py *if possible.
>>>
>>> Thanks,
>>> Qian
>>>
>>> On Tue, 9 Aug 2022 at 10:41, Hyukjin Kwon  wrote:
>>>
>>>> Thanks Khalid for taking a look.
>>>>
>>>> On Tue, 9 Aug 2022 at 00:37, Khalid Mammadov 
>>>> wrote:
>>>>
>>>>> Hi Hyukjin
>>>>> That's a great initiative; here is a PR that addresses one of those
>>>>> issues and is waiting for review: https://github.com/apache/spark/pull/37408
>>>>>
>>>>> Perhaps it would also be good to track these pending issues somewhere
>>>>> to avoid duplicated effort.
>>>>>
>>>>> For example, I would like to pick up *union* and *union all* if no
>>>>> one has already.
>>>>>
>>>>> Thanks,
>>>>> Khalid
>>>>>
>>>>>
>>>>> On Mon, Aug 8, 2022 at 1:44 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am trying to improve PySpark documentation especially:
>>>>>>
>>>>>>- Make the examples self-contained, e.g.,
>>>>>>
>>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
>>>>>>- Document Parameters
>>>>>>
>>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
>>>>>>There are many APIs that miss parameters in PySpark, e.g., 
>>>>>> DataFrame.union
>>>>>>
>>>>>> Here is one example PR I am working on:
>>>>>> https://github.com/apache/spark/pull/37437
>>>>>> I can't do it all by myself. Any help, review, and contributions
>>>>>> would be welcome and appreciated.
>>>>>>
>>>>>> Thank you all in advance.
>>>>>>
>>>>>
>>>
>>> --
>>> Best!
>>> Qian SUN
>>>
>> --
> Best!
> Qian SUN
>


Re: Welcoming three new PMC members

2022-08-09 Thread Hyukjin Kwon
Congrats everybody!

On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan  wrote:

>
> Congratulations !
> Great to have you join the PMC !!
>
> Regards,
> Mridul
>
> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan  wrote:
>
>> Congratulations
>>
>> On Tue, Aug 9, 2022, 11:40 AM Xiao Li  wrote:
>>
>>> Hi all,
>>>
>>> The Spark PMC recently voted to add three new PMC members. Join me in
>>> welcoming them to their new roles!
>>>
>>> New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk
>>>
>>> The Spark PMC
>>>
>>


Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Hyukjin Kwon
Hi all,

The Spark PMC recently added Xinrong Meng as a committer on the project.
Xinrong is a major contributor to PySpark, especially the Pandas API on Spark.
She has guided a lot of new contributors enthusiastically. Please join me
in welcoming Xinrong!


Re: Contributions and help needed in SPARK-40005

2022-08-08 Thread Hyukjin Kwon
Please go ahead. Would be very appreciated.

On Tue, 9 Aug 2022 at 11:58, Qian SUN  wrote:

> Hi Hyukjin
>
> I would like to do some work and pick up *Window.py *if possible.
>
> Thanks,
> Qian
>
> On Tue, 9 Aug 2022 at 10:41, Hyukjin Kwon  wrote:
>
>> Thanks Khalid for taking a look.
>>
>> On Tue, 9 Aug 2022 at 00:37, Khalid Mammadov 
>> wrote:
>>
>>> Hi Hyukjin
>>> That's a great initiative; here is a PR that addresses one of those
>>> issues and is waiting for review: https://github.com/apache/spark/pull/37408
>>>
>>> Perhaps it would also be good to track these pending issues somewhere
>>> to avoid duplicated effort.
>>>
>>> For example, I would like to pick up *union* and *union all* if no
>>> one has already.
>>>
>>> Thanks,
>>> Khalid
>>>
>>>
>>> On Mon, Aug 8, 2022 at 1:44 PM Hyukjin Kwon  wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to improve PySpark documentation especially:
>>>>
>>>>- Make the examples self-contained, e.g.,
>>>>https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
>>>>- Document Parameters
>>>>
>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
>>>>There are many APIs that miss parameters in PySpark, e.g., 
>>>> DataFrame.union
>>>>
>>>> Here is one example PR I am working on:
>>>> https://github.com/apache/spark/pull/37437
>>>> I can't do it all by myself. Any help, review, and contributions
>>>> would be welcome and appreciated.
>>>>
>>>> Thank you all in advance.
>>>>
>>>
>
> --
> Best!
> Qian SUN
>


Re: Contributions and help needed in SPARK-40005

2022-08-08 Thread Hyukjin Kwon
Thanks Khalid for taking a look.

On Tue, 9 Aug 2022 at 00:37, Khalid Mammadov 
wrote:

> Hi Hyukjin
> That's a great initiative; here is a PR that addresses one of those
> issues and is waiting for review: https://github.com/apache/spark/pull/37408
>
> Perhaps it would also be good to track these pending issues somewhere to
> avoid duplicated effort.
>
> For example, I would like to pick up *union* and *union all* if no
> one has already.
>
> Thanks,
> Khalid
>
>
> On Mon, Aug 8, 2022 at 1:44 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I am trying to improve PySpark documentation especially:
>>
>>- Make the examples self-contained, e.g.,
>>https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
>>- Document Parameters
>>
>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
>>There are many APIs that miss parameters in PySpark, e.g., 
>> DataFrame.union
>>
>> Here is one example PR I am working on:
>> https://github.com/apache/spark/pull/37437
>> I can't do it all by myself. Any help, review, and contributions would be
>> welcome and appreciated.
>>
>> Thank you all in advance.
>>
>


Contributions and help needed in SPARK-40005

2022-08-08 Thread Hyukjin Kwon
Hi all,

I am trying to improve the PySpark documentation, especially:

   - Make the examples self-contained, e.g.,
   https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
   - Document Parameters
   
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
   There are many APIs that miss parameters in PySpark, e.g., DataFrame.union
   (see the sketch below)

Here is one example PR I am working on:
https://github.com/apache/spark/pull/37437
I can't do it all by myself. Any help, review, and contributions would be
welcome and appreciated.

Thank you all in advance.
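
To make the target style concrete, here is a hedged sketch of a docstring
with both improvements applied: documented Parameters plus a self-contained,
copy-paste-runnable Examples section. The wording is illustrative only, not
the text that was merged.

    def union(self, other):
        """Return a new :class:`DataFrame` containing the union of rows
        in this and another :class:`DataFrame`.

        Parameters
        ----------
        other : :class:`DataFrame`
            Another :class:`DataFrame`, combined with this one by column
            position.

        Examples
        --------
        >>> df1 = spark.createDataFrame([(1, "a")], ["id", "v"])
        >>> df2 = spark.createDataFrame([(2, "b")], ["id", "v"])
        >>> df1.union(df2).show()
        +---+---+
        | id|  v|
        +---+---+
        |  1|  a|
        |  2|  b|
        +---+---+
        """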

