Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-15 Thread Yikun Jiang
+1

Regards,
Yikun


On Wed, Nov 15, 2023 at 4:26 PM huaxin gao  wrote:

> +1
>
> On Tue, Nov 14, 2023 at 10:45 AM Holden Karau 
> wrote:
>
>> +1
>>
>> On Tue, Nov 14, 2023 at 10:21 AM DB Tsai  wrote:
>>
>>> +1
>>>
>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>
>>> On Nov 14, 2023, at 10:14 AM, Vakaris Baškirov <
>>> vakaris.bashki...@gmail.com> wrote:
>>>
>>> +1 (non-binding)
>>>
>>>
>>> On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  wrote:
>>>
 +1

 On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh  wrote:
 >
 > +1
 >
 > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
 > >
 > > +1(Non-binding)
 > >
 > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh 
 wrote:
 > >>
 > >> Hi all,
 > >>
 > >> I’d like to start a vote for SPIP: An Official Kubernetes Operator
 for
 > >> Apache Spark.
 > >>
 > >> The proposal is to develop an official Java-based Kubernetes
 operator
 > >> for Apache Spark to automate the deployment and simplify the
 lifecycle
 > >> management and orchestration of Spark applications and Spark
 clusters
 > >> on k8s at prod scale.
 > >>
 > >> This aims to reduce the learning curve and operation overhead for
 > >> Spark users so they can concentrate on core Spark logic.
 > >>
 > >> Please also refer to:
 > >>
 > >>- Discussion thread:
 > >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
 > >>- JIRA ticket:
 https://issues.apache.org/jira/browse/SPARK-45923
 > >>- SPIP doc:
 https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
 > >>
 > >>
 > >> Please vote on the SPIP for the next 72 hours:
 > >>
 > >> [ ] +1: Accept the proposal as an official SPIP
 > >> [ ] +0
 > >> [ ] -1: I don’t think this is a good idea because …
 > >>
 > >>
 > >> Thank you!
 > >>
 > >> Liang-Chi Hsieh
 > >>
 > >>
 > >>
 > >
 > >
 > > --
 > >
 > > Zhou, Ye  周晔
 >
 >



>>>


Spark Docker Official image (Java 17) coming soon

2023-11-13 Thread Yikun Jiang
We added Java 17 support for the Apache Spark Docker Official Image at [1]
(thanks to @vakarisbk for the effort).

Once [2] is merged, the first Java 17 series Docker Official Images will be
available.

You can also try the ghcr test images:

all-in-one image: ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-python3-r-ubuntu
python3 image:    ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-python3-ubuntu
r image:          ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-r-ubuntu
scala image:      ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-ubuntu

Please let us know if you have any other questions or concerns; otherwise,
it will be published soon [2].

[1] https://github.com/apache/spark-docker/pull/56
[2] https://github.com/docker-library/official-images/pull/15697

Regards,
Yikun


Re: [VOTE] Updating documentation hosted for EOL and maintenance releases

2023-09-26 Thread Yikun Jiang
+1, I believe it is a wise choice to update the policy for documentation
hosted for EOL releases based on the real demands of community users.

Regards,
Yikun


On Tue, Sep 26, 2023 at 1:06 PM Ruifeng Zheng  wrote:

> +1
>
> On Tue, Sep 26, 2023 at 12:51 PM Hyukjin Kwon 
> wrote:
>
>> Hi all,
>>
>> I would like to start the vote for updating documentation hosted for EOL
>> and maintenance releases to improve the usability here, and in order for
>> end users to read the proper and correct documentation.
>>
>> For discussion thread, please refer to
>> https://lists.apache.org/thread/1675rzxx5x4j2x03t9x0kfph8tlys0cx.
>>
>> Here is one example:
>> - https://github.com/apache/spark/pull/42989
>> - https://github.com/apache/spark-website/pull/480
>>
>> Starting with my own +1.
>>
>


Re: Volcano in spark distro

2023-08-22 Thread Yikun Jiang
@Santosh

We tried to add this in v3.3.0 [1]. The main reasons for not adding it at
that time were:
1. Volcano did not support multi-arch before v1.7.0 (we have been on 1.7.0
since Spark 3.4.0).
2. Spark on K8s + Volcano was experimental (we have since removed the
experimental label [2]).

Considering that the Spark-Volcano integration has been running stably in
the Spark community CI (since Spark 3.4.0) [3] and in the Volcano community
CI (since Spark 3.3.0) [4] for a long time, I think it's stable enough.

So I believe we are in a position to enable the Volcano module in Apache
Spark now (master / maybe Apache Spark 4.0?).

[1] https://github.com/apache/spark/pull/35922
[2] https://github.com/apache/spark/pull/40152
[3]
https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L1090
[4]
https://github.com/volcano-sh/volcano/blob/master/.github/workflows/e2e_spark.yaml#L12
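For context, until it is packaged in the distro, users have to build with
the Volcano profile themselves; a rough sketch (the volcano profile is the
one already exercised in the CI linked above, other flags are illustrative):

$ ./build/mvn -DskipTests -Pkubernetes -Pvolcano clean package
# or build a distribution tarball with the module included
$ ./dev/make-distribution.sh --name with-volcano --tgz -Pkubernetes -Pvolcano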

Regards,
Yikun


On Tue, Aug 22, 2023 at 8:14 PM Santosh Pingale
 wrote:

> Hey all
>
> It would be useful to support Volcano in the Spark distro itself, just like
> YuniKorn. So I am wondering what the reason is behind the decision not to
> package it already.
> Running Spark on Kubernetes - Spark 3.4.1 Documentation (spark.apache.org)
>
> Can we package it to make it easily available and hence usable?
>
> Kind regards
> Santosh
>


Spark Docker Official Image is now available

2023-07-19 Thread Yikun Jiang
The Spark Docker Official Image is now available:
https://hub.docker.com/_/spark

$ docker run -it --rm spark /opt/spark/bin/spark-shell
$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
$ docker run -it --rm spark:r /opt/spark/bin/sparkR
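Beyond the shells, a minimal sketch of a local-mode smoke test with the
bundled SparkPi example (the examples jar path assumes the standard
/opt/spark layout inside the image; adjust if your tag differs):

$ docker run -it --rm spark bash -c \
    '/opt/spark/bin/spark-submit --master "local[*]" \
       --class org.apache.spark.examples.SparkPi \
       /opt/spark/examples/jars/spark-examples_*.jar 100'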

We had a longer review journey than we expected; if you are also interested
in this journey, you can see more in:

https://github.com/docker-library/official-images/pull/13089

Thanks to everyone who helps in the Docker and Apache Spark community!

Some background you might want to know:
- apache/spark (https://hub.docker.com/r/apache/spark): the Apache Spark
Docker image, published by the Apache Spark community when Apache Spark is
released, with no further updates afterwards.
- spark (https://hub.docker.com/_/spark): the Docker Official Image,
published by the Docker community, which keeps actively rebuilding it for
updates and security fixes.
- The source repo of both apache/spark and spark:
https://github.com/apache/spark-docker

See more in:
[1] [DISCUSS] SPIP: Support Docker Official Image for Spark:
https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3
[2] [VOTE] SPIP: Support Docker Official Image for Spark:
https://lists.apache.org/thread/ro6olodm1jzdffwjx4oc7ol7oh6kshbl
[3] https://github.com/docker-library/official-images/pull/13089
[4]
https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o/
[5] https://issues.apache.org/jira/browse/SPARK-40513

Regards,
Yikun


Re: [VOTE][SPIP] PySpark Test Framework

2023-06-24 Thread Yikun Jiang
+1

Regards,
Yikun


On Fri, Jun 23, 2023 at 6:17 AM L. C. Hsieh  wrote:

> +1
>
> On Thu, Jun 22, 2023 at 3:10 PM Xinrong Meng  wrote:
> >
> > +1
> >
> > Thanks for driving that!
> >
> > On Wed, Jun 21, 2023 at 10:25 PM Ruifeng Zheng 
> wrote:
> >>
> >> +1
> >>
> >> On Thu, Jun 22, 2023 at 1:11 PM Dongjoon Hyun 
> wrote:
> >>>
> >>> +1
> >>>
> >>> Dongjoon
> >>>
> >>> On Wed, Jun 21, 2023 at 8:56 PM Hyukjin Kwon 
> wrote:
> 
>  +1
> 
>  On Thu, 22 Jun 2023 at 02:20, Jacek Laskowski 
> wrote:
> >
> > +0
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > "The Internals Of" Online Books
> > Follow me on https://twitter.com/jaceklaskowski
> >
> >
> >
> > On Wed, Jun 21, 2023 at 5:11 PM Amanda Liu <
> amandastephanie...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I'd like to start the vote for SPIP: PySpark Test Framework.
> >>
> >> The high-level summary for the SPIP is that it proposes an official
> test framework for PySpark. Currently, there are only disparate open-source
> repos and blog posts for PySpark testing resources. We can streamline and
> simplify the testing process by incorporating test features, such as a
> PySpark Test Base class (which allows tests to share Spark sessions) and
> test util functions (for example, asserting dataframe and schema equality).
> >>
> >> SPIP doc:
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
> >>
> >> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44042
> >>
> >> Discussion thread:
> https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n
> >>
> >> Please vote on the SPIP for the next 72 hours:
> >> [ ] +1: Accept the proposal as an official SPIP
> >> [ ] +0
> >> [ ] -1: I don’t think this is a good idea because __.
> >>
> >> Thank you!
> >>
> >> Best,
> >> Amanda Liu
>
>
>


Re: [DISCUSS] Unified Apache Spark Docker image tag?

2023-05-09 Thread Yikun Jiang
As I said in my last mail, I am very sorry if anything was misleading.

As you know, it is a little bit complicated to take into account K8s, the
base image, standalone, the Docker Official Image, etc., as well as various
Docker image requirements such as the Java version and the image tag.
Of course, that is not an excuse.

I also think the most important thing right now is to let the community know
what's going on, identify where there is no consensus, and reach one.
For the latest-tag behavior changes (Java version, tag rules, image size),
I have also explained the original intention.
Collecting community feedback will also help us take the next step.

Let me add some more information to help us move forward:
- The Docker image has been published with both the previous tag (v3.4.0)
and the new tags (3.4.0 / python / r / all-in-one) for v3.4.0.
- The latest tag change is a user-facing behavior change compared to the
previous tag. It now points to the new python tag. In detail:
  * It points to the python image rather than the scala image, mainly
considering that PySpark usage is more common.
  * The python image is 490+ MB, while the scala image is 400+ MB.
  * The default Java version is Java 11, compared to Java 17 for the
previous tag (v3.4.0).
- The Docker image publish workflow was updated to the new workflow:
https://github.com/apache/spark-website/pull/458

From my perspective, the next steps are:
- Decide the default Java version for the latest tag
- Decide the default image for the latest tag
- Decide whether we should update the publish workflow
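For anyone who wants to check what a given tag currently ships while we
decide, a quick sketch (commands are illustrative only):

$ docker pull apache/spark:latest
$ docker run --rm apache/spark:latest java -version              # default Java of the tag
$ docker image inspect apache/spark:latest --format '{{.Size}}'  # image size in bytes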


On Wed, May 10, 2023 at 12:11 AM Dongjoon Hyun  wrote:

> May I ask why you think that sentence, "might need to deprecate ..." of
> SPIP, decided anything at that time?
>
> From my perspective,
> - `might need to` suggested only a possible necessity at some point in the
> future.
> - `deprecation` means no breaking change.
>
>
> Dongjoon
>
>
>
> On Tue, May 9, 2023 at 12:01 AM Yikun Jiang  wrote:
>
>> > It seems that your reply (the following) didn't reach out to the
>> mailing list correctly.
>>
>> Thanks! I'm not sure what happened before, thanks for your forward
>>
>> > Let me add my opinion. IIUC, the whole content of SPIP (Support Docker
>> Official Image for Spark) aims to add (1) newly, not to corrupt or destroy
>> the existing (2).
>>
>> - There were some description about how we should address the
>> apache/spark image after DOI support in doc:
>> "Considering that already had the apache/spark image, might need to
>> deprecate: spark/spark-py/spark-r `v3.3.0`, `v3.1.3`, `v3.2.1`, `v3.2.2`
>> tags, and *unified apache/spark image tags to docker official images
>> tags rule*, and also still keep apache/spark images and update
>> apache/spark images when released."
>> - I also post a mail
>> https://lists.apache.org/thread/zp550lt4f098zfpxgpc9bn360bwcfhs4 in Nov.
>> 2022, it's about Apache Spark official image, it's not for Docker official
>> image.
>>
>> So, it is not only for Docker official image (spark) but also for Apache
>> Spark official image (apache/spark).
>> Anyway, I am very sorry if there is any misleading, really many thanks
>> for your feedback and review.
>>
>> On Tue, May 9, 2023 at 12:37 PM Dongjoon Hyun 
>> wrote:
>>
>>> To Yikun,
>>>
>>> It seems that your reply (the following) didn't reach out to the mailing
>>> list correctly.
>>>
>>> > Just FYI, we also had a discussion about tag policy (latest/3.4.0) and
>>> also rough size estimation [1] in "SPIP: Support Docker Official Image for
>>> Spark".
>>> >
>>> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o/edit?disco=f2TyFr0
>>>
>>> Let me add my opinion. IIUC, the whole content of SPIP (Support Docker
>>> Official Image for Spark) aims to add (1) newly, not to corrupt or destroy
>>> the existing (2).
>>>
>>> (1) https://hub.docker.com/_/spark
>>> (2) https://hub.docker.com/r/apache/spark/tags
>>>
>>> The reference model repos were also documented like the followings.
>>>
>>> https://hub.docker.com/_/flink
>>> https://hub.docker.com/_/storm
>>> https://hub.docker.com/_/solr
>>> https://hub.docker.com/_/zookeeper
>>>
>>> In short, according to the SPIP's `Docker Official Image` definition,
>>> new images should go to (1) only in order to achieve `Support Docker
>>> Official Image for Spark`, shouldn't they?
>>>
>>> Dongjoon.
>>>
>>> On Mon, May 8, 2023 at 6:22 PM Yikun Jiang  wrote:
>>>
>>>> > 1. The size regression: `apache/

Re: [DISCUSS] Unified Apache Spark Docker image tag?

2023-05-09 Thread Yikun Jiang
> 1. The size regression: `apache/spark:3.4.0` tag which is claimed to be a
replacement of the existing `apache/spark:v3.4.0`. However, 3.4.0 is 500MB
while the original v3.4.0 is 405MB. 25% is huge in terms of the size.

> 2. Accidental overwrite: `apache/spark:latest` was accidentally
overwritten by `apache/spark:python3` image which has a bigger size due to
the additional python binary. This is a breaking change to enforce the
downstream users to change to something like `apache/spark:scala`.

Just FYI, we also had a discussion about tag policy (latest/3.4.0) and
also rough size estimation [1] in "SPIP: Support Docker Official Image for
Spark".

[1]
https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o/edit?disco=f2TyFr0

Regards,
Yikun


On Tue, May 9, 2023 at 5:03 AM Dongjoon Hyun  wrote:

> Thank you for initiating the discussion in the community. Yes, we need to
> give more context in the dev mailing list.
>
> This root cause is not about SPARK-40941 or SPARK-40513. Technically, this
> situation started 16 days ago due to SPARK-43148 because it made some
> breaking changes.
>
> https://github.com/apache/spark-docker/pull/33
> SPARK-43148 Add Apache Spark 3.4.0 Dockerfiles
>
> 1. The size regression: `apache/spark:3.4.0` tag which is claimed to be a
> replacement of the existing `apache/spark:v3.4.0`. However, 3.4.0 is 500MB
> while the original v3.4.0 is 405MB. 25% is huge in terms of the size.
>
> 2. Accidental overwrite: `apache/spark:latest` was accidentally
> overwritten by `apache/spark:python3` image which has a bigger size due to
> the additional python binary. This is a breaking change to enforce the
> downstream users to change to something like `apache/spark:scala`.
>
> I believe (1) and (2) were our mistakes. We had better recover them ASAP.
> For Java questions, I prefer to be consistent with Apache Spark repo's
> default.
>
> Dongjoon.
>
> On 2023/05/08 08:56:26 Yikun Jiang wrote:
> > This is a call for discussion for how we can unified Apache Spark Docker
> > image tag fluently.
> >
> > As you might know, there is an apache/spark-docker
> > <https://github.com/apache/spark-docker> repo to store the dockerfiles
> and
> > help to publish the docker images, also intended to replace the original
> > manually publish workflow.
> >
> > The scope of new images is to cover previous image cases (K8s / docker
> run)
> > and also cover base image, standalone, Docker Official Image.
> >
> > - (Previous) apache/spark:v3.4.0, apache/spark-py:v3.4.0,
> > apache/spark-r:v3.4.0
> >
> > * The image build from apache/spark spark on k8s dockerfiles
> > <
> https://github.com/apache/spark/tree/branch-3.4/resource-managers/kubernetes/docker/src/main/dockerfiles/spark
> >
> >
> > * Java version: Java 17 (It was Java 11 before v3.4.0, such as
> > v3.3.0/v3.3.1/v3.3.2), set Java 17 by default in SPARK-40941
> > <https://github.com/apache/spark/pull/38417>.
> >
> > * Support: K8s / docker run
> >
> > * See also: Time to start publishing Spark Docker Images
> > <https://lists.apache.org/thread/h729bxrf1o803l4wz7g8bngkjd56y6x8>
> >
> > * Link: https://hub.docker.com/r/apache/spark-py,
> > https://hub.docker.com/r/apache/spark-r,
> > https://hub.docker.com/r/apache/spark
> >
> > - (New) apache/spark:3.4.0-python3(3.4.0/latest), apache/spark:3.4.0-r,
> > apache/spark:3.4.0-scala, and also a all in one image:
> > apache/spark:3.4.0-scala2.12-java11-python3-r-ubuntu
> >
> > * The image build from apache/spark-docker dockerfiles
> > <https://github.com/apache/spark-docker/tree/master/3.4.0>
> >
> > * Java version: Java 11, Java17 is supported by SPARK-40513
> > <https://github.com/apache/spark-docker/pull/35> (under review)
> >
> > * Support: K8s / docker run / base image / standalone / Docker
> Official
> > Image
> >
> > * See detail in: Support Docker Official Image for Spark
> > <https://issues.apache.org/jira/browse/SPARK-40513>
> >
> > * About dropping prefix `v`:
> > https://github.com/docker-library/official-images/issues/14506
> >
> > * Link: https://hub.docker.com/r/apache/spark
> >
> > We had some initial discuss on spark-website#458
> > <
> https://github.com/apache/spark-website/pull/458#issuecomment-1522426236>,
> > the mainly discussion is around version tag and default Java version
> > behavior changes, so we’d like to hear your idea in here about below
> > questions:
> >
> > *#1.Which Java version should be used by default (latest t

Re: [DISCUSS] Unified Apache Spark Docker image tag?

2023-05-09 Thread Yikun Jiang
> It seems that your reply (the following) didn't reach out to the mailing
list correctly.

Thanks! I'm not sure what happened before, thanks for your forward

> Let me add my opinion. IIUC, the whole content of SPIP (Support Docker
Official Image for Spark) aims to add (1) newly, not to corrupt or destroy
the existing (2).

- There was some description in the doc about how we should address the
apache/spark image after DOI support:
"Considering that already had the apache/spark image, might need to
deprecate: spark/spark-py/spark-r `v3.3.0`, `v3.1.3`, `v3.2.1`, `v3.2.2`
tags, and unified apache/spark image tags to docker official images tags
rule, and also still keep apache/spark images and update apache/spark
images when released."
- I also posted a mail
(https://lists.apache.org/thread/zp550lt4f098zfpxgpc9bn360bwcfhs4) in Nov.
2022; it was about the Apache Spark official image, not the Docker Official
Image.

So, the work is not only for the Docker Official Image (spark) but also for
the Apache Spark official image (apache/spark).
Anyway, I am very sorry if there was any misleading; really, many thanks for
your feedback and review.

On Tue, May 9, 2023 at 12:37 PM Dongjoon Hyun  wrote:

> To Yikun,
>
> It seems that your reply (the following) didn't reach out to the mailing
> list correctly.
>
> > Just FYI, we also had a discussion about tag policy (latest/3.4.0) and
> also rough size estimation [1] in "SPIP: Support Docker Official Image for
> Spark".
> >
> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o/edit?disco=f2TyFr0
>
> Let me add my opinion. IIUC, the whole content of SPIP (Support Docker
> Official Image for Spark) aims to add (1) newly, not to corrupt or destroy
> the existing (2).
>
> (1) https://hub.docker.com/_/spark
> (2) https://hub.docker.com/r/apache/spark/tags
>
> The reference model repos were also documented like the followings.
>
> https://hub.docker.com/_/flink
> https://hub.docker.com/_/storm
> https://hub.docker.com/_/solr
> https://hub.docker.com/_/zookeeper
>
> In short, according to the SPIP's `Docker Official Image` definition, new
> images should go to (1) only in order to achieve `Support Docker Official
> Image for Spark`, shouldn't they?
>
> Dongjoon.
>
> On Mon, May 8, 2023 at 6:22 PM Yikun Jiang  wrote:
>
>> > 1. The size regression: `apache/spark:3.4.0` tag which is claimed to be
>> a replacement of the existing `apache/spark:v3.4.0`. However, 3.4.0 is
>> 500MB while the original v3.4.0 is 405MB. 25% is huge in terms of the size.
>>
>> > 2. Accidental overwrite: `apache/spark:latest` was accidentally
>> overwritten by `apache/spark:python3` image which has a bigger size due to
>> the additional python binary. This is a breaking change to enforce the
>> downstream users to change to something like `apache/spark:scala`.
>>
>> Just FYI, we also had a discussion about tag policy (latest/3.4.0) and
>> also rough size estimation [1] in "SPIP: Support Docker Official Image for
>> Spark".
>>
>> [1]
>> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o/edit?disco=f2TyFr0
>>
>> Regards,
>> Yikun
>>
>>
>> On Tue, May 9, 2023 at 5:03 AM Dongjoon Hyun  wrote:
>>
>>> Thank you for initiating the discussion in the community. Yes, we need
>>> to give more context in the dev mailing list.
>>>
>>> This root cause is not about SPARK-40941 or SPARK-40513. Technically,
>>> this situation started 16 days ago due to SPARK-43148 because it made some
>>> breaking changes.
>>>
>>> https://github.com/apache/spark-docker/pull/33
>>> SPARK-43148 Add Apache Spark 3.4.0 Dockerfiles
>>>
>>> 1. The size regression: `apache/spark:3.4.0` tag which is claimed to be
>>> a replacement of the existing `apache/spark:v3.4.0`. However, 3.4.0 is
>>> 500MB while the original v3.4.0 is 405MB. 25% is huge in terms of the size.
>>>
>>> 2. Accidental overwrite: `apache/spark:latest` was accidentally
>>> overwritten by `apache/spark:python3` image which has a bigger size due to
>>> the additional python binary. This is a breaking change to enforce the
>>> downstream users to change to something like `apache/spark:scala`.
>>>
>>> I believe (1) and (2) were our mistakes. We had better recover them ASAP.
>>> For Java questions, I prefer to be consistent with Apache Spark repo's
>>> default.
>>>
>>> Dongjoon.
>>>
>>> On 2023/05/08 08:56:26 Yikun Jiang wrote:
>>> > This is a call for discussion for how we can unified Apache Spark
>>> Docker
>

[DISCUSS] Unified Apache Spark Docker image tag?

2023-05-08 Thread Yikun Jiang
This is a call for discussion on how we can unify the Apache Spark Docker
image tags smoothly.

As you might know, there is an apache/spark-docker repo
(https://github.com/apache/spark-docker) to store the Dockerfiles and help
publish the Docker images; it is also intended to replace the original
manual publish workflow.

The scope of the new images is to cover the previous image use cases (K8s /
docker run) and also the base image, standalone, and Docker Official Image
use cases.

- (Previous) apache/spark:v3.4.0, apache/spark-py:v3.4.0,
apache/spark-r:v3.4.0

* The images are built from the apache/spark Spark-on-K8s dockerfiles:
https://github.com/apache/spark/tree/branch-3.4/resource-managers/kubernetes/docker/src/main/dockerfiles/spark

* Java version: Java 17 (it was Java 11 before v3.4.0, e.g.
v3.3.0/v3.3.1/v3.3.2); Java 17 was set as the default in SPARK-40941
(https://github.com/apache/spark/pull/38417).

* Support: K8s / docker run

* See also: Time to start publishing Spark Docker Images
(https://lists.apache.org/thread/h729bxrf1o803l4wz7g8bngkjd56y6x8)

* Link: https://hub.docker.com/r/apache/spark-py,
https://hub.docker.com/r/apache/spark-r,
https://hub.docker.com/r/apache/spark

- (New) apache/spark:3.4.0-python3 (aliased as 3.4.0/latest),
apache/spark:3.4.0-r, apache/spark:3.4.0-scala, and also an all-in-one
image: apache/spark:3.4.0-scala2.12-java11-python3-r-ubuntu

* The images are built from the apache/spark-docker dockerfiles:
https://github.com/apache/spark-docker/tree/master/3.4.0

* Java version: Java 11; Java 17 is supported by SPARK-40513
(https://github.com/apache/spark-docker/pull/35, under review)

* Support: K8s / docker run / base image / standalone / Docker Official
Image

* See detail in: Support Docker Official Image for Spark
(https://issues.apache.org/jira/browse/SPARK-40513)

* About dropping the prefix `v`:
https://github.com/docker-library/official-images/issues/14506

* Link: https://hub.docker.com/r/apache/spark
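To make the two tag generations concrete, the pull commands look like this
(illustrative; the alias of 3.4.0/latest to the python3 variant is per the
tag list above):

$ docker pull apache/spark:v3.4.0        # previous scheme, "v" prefix
$ docker pull apache/spark-py:v3.4.0
$ docker pull apache/spark:3.4.0         # new scheme, no prefix, python3 variant
$ docker pull apache/spark:3.4.0-scala
$ docker pull apache/spark:3.4.0-scala2.12-java11-python3-r-ubuntu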

We had some initial discussion on spark-website#458
(https://github.com/apache/spark-website/pull/458#issuecomment-1522426236);
the main discussion is around the version tag and the default Java version
behavior changes, so we'd like to hear your ideas here on the questions
below:

#1. Which Java version should be used by default (latest tag)? Java 8,
Java 11, Java 17, or any?

#2. Which tag should be used in apache/spark? v3.4.0 (with the prefix v),
3.4.0 (dropping the prefix v), both, or any?

Starting with my preference:

1. Java 8 or Java 17 are both OK to me (mainly considering the Java
maintenance cycle). BTW, other Apache projects: Flink (8/11, 11 as
default), Solr (11 as default for 8.x, 17 as default since Solr 9),
ZooKeeper (11 as default).

2. Only 3.4.0 (dropping the prefix v). It will help us transition to the
new tags with less confusion, and it also follows the DOI suggestions.

Please feel free to share your ideas.


Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-10 Thread Yikun Jiang
+1 (non-binding)

Also ran the docker image related test (signatures/standalone/k8s) with
rc7: https://github.com/apache/spark-docker/pull/32

Regards,
Yikun


On Tue, Apr 11, 2023 at 4:44 AM Jacek Laskowski  wrote:

> +1
>
> * Built fine with Scala 2.13
> and -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano
> * Ran some demos on Java 17
> * Mac mini / Apple M2 Pro / Ventura 13.3.1
>
> Pozdrawiam,
> Jacek Laskowski
> 
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Sat, Apr 8, 2023 at 1:30 AM Xinrong Meng 
> wrote:
>
>> Please vote on releasing the following candidate(RC7) as Apache Spark
>> version 3.4.0.
>>
>> The vote is open until 11:59pm Pacific time *April 12th* and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.4.0-rc7 (commit
>> 87a5442f7ed96b11051d8a9333476d080054e5a0):
>> https://github.com/apache/spark/tree/v3.4.0-rc7
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1441
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>>
>> The list of bug fixes going into 3.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>>
>> This release is using the release script of the tag v3.4.0-rc7.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.4.0?
>> ===
>> The current list of open tickets targeted at 3.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.4.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Thanks,
>> Xinrong Meng
>>
>


Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-12 Thread Yikun Jiang
+1, tested 3.3.2-rc1 with spark-docker:
- Downloaded the rc1 tgz and validated the key.
- Extracted the bin and built the image.
- Ran the K8s IT and the standalone tests for the R/Python/Scala/All images [1]

[1] https://github.com/apache/spark-docker/pull/29
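For reference, the key/signature check in the first step looks roughly like
this (a sketch; paths follow the usual RC dist layout and the binary file
name is illustrative):

$ curl -O https://dist.apache.org/repos/dist/dev/spark/KEYS
$ gpg --import KEYS
$ curl -O https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/spark-3.3.2-bin-hadoop3.tgz
$ curl -O https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/spark-3.3.2-bin-hadoop3.tgz.asc
$ gpg --verify spark-3.3.2-bin-hadoop3.tgz.asc spark-3.3.2-bin-hadoop3.tgz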

Regards,
Yikun


On Mon, Feb 13, 2023 at 10:25 AM yangjie01  wrote:

> Which Python version do you use for testing? When I use the latest Python
> 3.11, I can reproduce similar test failures (43 tests of sql module fail),
> but when I use python 3.10, they will succeed
>
>
>
> YangJie
>
>
>
> From: Bjørn Jørgensen 
> Date: Monday, February 13, 2023, 05:09
> To: Sean Owen 
> Cc: "L. C. Hsieh" , Spark dev list <dev@spark.apache.org>
> Subject: Re: [VOTE] Release Spark 3.3.2 (RC1)
>
>
>
> Tried it one more time and the same result.
>
>
>
> On another box with Manjaro
>
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [01:50
> min]
> [INFO] Spark Project Tags . SUCCESS [
> 17.359 s]
> [INFO] Spark Project Sketch ... SUCCESS [
> 12.517 s]
> [INFO] Spark Project Local DB . SUCCESS [
> 14.463 s]
> [INFO] Spark Project Networking ... SUCCESS [01:07
> min]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  9.013 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
>  8.184 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 10.454 s]
> [INFO] Spark Project Core . SUCCESS [23:58
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [
> 21.218 s]
> [INFO] Spark Project GraphX ... SUCCESS [01:24
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57
> min]
> [INFO] Spark Project Catalyst . SUCCESS [08:00
> min]
> [INFO] Spark Project SQL .. SUCCESS [
>  01:02 h]
> [INFO] Spark Project ML Library ... SUCCESS [14:38
> min]
> [INFO] Spark Project Tools  SUCCESS [
>  4.394 s]
> [INFO] Spark Project Hive . SUCCESS [53:43
> min]
> [INFO] Spark Project REPL . SUCCESS [01:16
> min]
> [INFO] Spark Project Assembly . SUCCESS [
>  2.186 s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
> 16.150 s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:34
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [32:55
> min]
> [INFO] Spark Project Examples . SUCCESS [
> 23.800 s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
>  7.301 s]
> [INFO] Spark Avro . SUCCESS [01:19
> min]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time:  03:31 h
> [INFO] Finished at: 2023-02-12T21:54:20+01:00
> [INFO]
> 
> [bjorn@amd7g spark-3.3.2]$  java -version
> openjdk version "17.0.6" 2023-01-17
> OpenJDK Runtime Environment (build 17.0.6+10)
> OpenJDK 64-Bit Server VM (build 17.0.6+10, mixed mode)
>
>
>
>
>
> :)
>
>
>
> So I'm +1
>
>
>
>
>
> søn. 12. feb. 2023 kl. 12:53 skrev Bjørn Jørgensen <
> bjornjorgen...@gmail.com>:
>
> I use ubuntu rolling
>
> $ java -version
> openjdk version "17.0.6" 2023-01-17
> OpenJDK Runtime Environment (build 17.0.6+10-Ubuntu-0ubuntu1)
> OpenJDK 64-Bit Server VM (build 17.0.6+10-Ubuntu-0ubuntu1, mixed mode,
> sharing)
>
>
>
> I have reboot now and restart ./build/mvn clean package
>
>
>
>
>
>
>
> søn. 12. feb. 2023 kl. 04:47 skrev Sean Owen :
>
> +1 The tests and all results were the same as ever for me (Java 11, Scala
> 2.13, Ubuntu 22.04)
>
> I also didn't see that issue ... maybe somehow locale related? which could
> still be a bug.
>
>
>
> On Sat, Feb 11, 2023 at 8:49 PM L. C. Hsieh  wrote:
>
> Thank you for testing it.
>
> I was going to run it again but still didn't see any errors.
>
> I also checked CI (and looked again now) on branch-3.3 before cutting RC.
>
> BTW, I didn't find an actual test failure (i.e. "- test_name ***
> FAILED ***") in the log file.
>
> Maybe it is due to the dev env? What dev env you're using to run the test?
>
>
> On Sat, Feb 11, 2023 at 8:58 AM Bjørn Jørgensen
>  wrote:
> >
> >
> > ./build/mvn clean package
> >
> > Run completed in 1 hour, 18 minutes, 29 seconds.
> > Total number of tests run: 11652
> > Suites: completed 516, aborted 0
> > Tests: succeeded 11609, failed 43, canceled 8, ignored 57, pending 0
> > *** 43 

Re: Publish Apache Spark official image under the new rules?

2022-11-10 Thread Yikun Jiang
BTW, you might want to try the new images. I published them to my personal
ghcr / Docker Hub; you could try:

- Try spark-shell / pyspark / sparkR:
docker run -ti ghcr.io/yikun/spark-docker/spark /opt/spark/bin/spark-shell
docker run -ti ghcr.io/yikun/spark-docker/spark /opt/spark/bin/pyspark
docker run -ti ghcr.io/yikun/spark-docker/spark:r /opt/spark/bin/sparkR

- Try standalone mode like this:
https://github.com/Yikun/spark-docker/blob/52152c1b6d70acc2e7c5e32bffe0265b55df7b6f/.github/workflows/main.yml#L113

- Try them on K8s with a local minikube like this (see also the sketch
after this list):
https://github.com/Yikun/spark-docker/blob/master/.github/workflows/main.yml#L161-L216

- All available image tags are listed here (ghcr):
https://github.com/Yikun/spark-docker/pkgs/container/spark-docker%2Fspark/versions?filters%5Bversion_type%5D=tagged
or here (Docker Hub):
https://hub.docker.com/repository/registry-1.docker.io/yikunkero/spark/tags?page=1=last_updated
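A rough sketch of the minikube case, for readers who don't want to open the
workflow (image tag, example jar path, and resource sizes are illustrative;
spark-submit comes from a locally extracted Spark distribution, and the
linked workflow is the authoritative setup):

$ minikube start --cpus 4 --memory 6g
$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit \
    --serviceaccount=default:spark --namespace=default
$ ./bin/spark-submit \
    --master k8s://$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}') \
    --deploy-mode cluster --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.container.image=ghcr.io/yikun/spark-docker/spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.3.1.jar 100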

Regards,
Yikun


On Thu, Nov 10, 2022 at 6:27 PM Yikun Jiang  wrote:

> Hi, all
>
> Last month the vote of "Support Docker Official Image for Spark
> <https://issues.apache.org/jira/browse/SPARK-40513>" passed.
>
> # Progress of SPIP:
>
> ## Completed:
> - A new github repo created: https://github.com/apache/spark-docker
> - Add "Spark Docker
> <https://issues.apache.org/jira/browse/SPARK-40969?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20%22Spark%20Docker%22>"
> component label in JIRA
> - Uploaded 3.3.0/3.3.1 dockerfiles: spark-docker#2
> <https://github.com/apache/spark-docker/pull/2> spark-docker#20
> <https://github.com/apache/spark-docker/pull/20>
> - Some fixes apply to dockerfiles to meet the DOI qualities requirements:
>   * spark-docker#11 <https://github.com/apache/spark-docker/pull/11> Use
> spark as username in official image (instead of magic number 185),
>   * spark-docker#14 <https://github.com/apache/spark-docker/pull/14>  Cleanup
> os download list cache to reduce image size.
>   * spark-docker#17 <https://github.com/apache/spark-docker/pull/17> Remove
> pip/setuptools dynamic upgrade to ensure image's repeatability
> - Support dockerfile template to help generate all kinds of Dockerfiles
> for specific version spark-docker#12
> <https://github.com/apache/spark-docker/pull/12>
> - Add workflow to help build/test dockerfile to ensure the Dockerfile's
> quality
>   * K8s Integration test spark-docker#9
> <https://github.com/apache/spark-docker/pull/9>
>   * Standalone test spark-docker#21
> <https://github.com/apache/spark-docker/pull/21> (Great job by
> @dcoliversun)
> - spark-website#424 <https://github.com/apache/spark-website/pull/424> Use
> docker image in the example of SQL/Scala/Java
> - INFRA-23882 <https://issues.apache.org/jira/browse/INFRA-23882> Add
> Docker hub secrets to spark-docker repo to help publish docker hub image
>
> ## Not merged yet:
> - spark-docker#23 <https://github.com/apache/spark-docker/pull/23> One
> click to publish "apache/spark" image
>   instead of the current Spark Docker Images publish step
> <https://github.com/wangyum/spark-website/blob/1c6b2ee13a1e22748ed416c5cc260c33795a76c8/release-process.md#create-and-upload-spark-docker-images>.
> It will also run K8s IT /standalone test first then publish.
> - docker-library/official-images#13089
> <https://github.com/docker-library/official-images/pull/13089> Add Apache
> Spark Docker Official Image,
>   waiting for review from docker side.
>
> After the above work, I think we almost reached the quality of DOI (might
> have some small fix according to docker
> side review in future maybe), but limited by the docker side review
> bandwith. The good news is that the PR are in
> the top of the review queue according to review history.
>
>
> # Next step?
>
> Should we publish the apache/spark image (3.3.0/3.3.1) according to
> new rules now?
>
> After publish, the apache/spark will add several new tags for v3.3.0 and
> v3.3.1 like:
>
> - apache/spark:python3
> - apache/spark:scala
> - apache/spark:r
> - apache/spark all in one
> * You can see the complete tag info in here
> <https://github.com/apache/spark-docker/pull/23/files#diff-2b39d33506bc7a34cef4b9ebf4cf8b1e3a5532f2131ceb37011b94261cec5f8c>
> .
>
> WDYT?
>
> Regards,
> Yikun
>


Publish Apache Spark official image under the new rules?

2022-11-10 Thread Yikun Jiang
Hi, all

Last month the vote of "Support Docker Official Image for Spark
" passed.

# Progress of SPIP:

## Completed:
- A new github repo created: https://github.com/apache/spark-docker
- Add "Spark Docker
"
component label in JIRA
- Uploaded 3.3.0/3.3.1 dockerfiles: spark-docker#2
 spark-docker#20

- Some fixes were applied to the dockerfiles to meet the DOI quality
requirements:
  * spark-docker#11 (https://github.com/apache/spark-docker/pull/11): use
spark as the username in the official image (instead of the magic number 185).
  * spark-docker#14 (https://github.com/apache/spark-docker/pull/14): clean
up the OS package list cache to reduce image size.
  * spark-docker#17 (https://github.com/apache/spark-docker/pull/17): remove
the dynamic pip/setuptools upgrade to ensure the image's repeatability.
- Supported a dockerfile template to help generate all kinds of Dockerfiles
for a specific version: spark-docker#12
(https://github.com/apache/spark-docker/pull/12)
- Added workflows to build/test the dockerfiles to ensure their quality:
  * K8s integration test: spark-docker#9
(https://github.com/apache/spark-docker/pull/9)
  * Standalone test: spark-docker#21
(https://github.com/apache/spark-docker/pull/21) (great job by @dcoliversun)
- spark-website#424 (https://github.com/apache/spark-website/pull/424): use
the Docker image in the SQL/Scala/Java examples
- INFRA-23882 (https://issues.apache.org/jira/browse/INFRA-23882): add
Docker Hub secrets to the spark-docker repo to help publish the Docker Hub
image

## Not merged yet:
- spark-docker#23 (https://github.com/apache/spark-docker/pull/23): one-click
publishing of the "apache/spark" image, instead of the current manual Spark
Docker Images publish step. It will also run the K8s IT / standalone tests
first and then publish.
- docker-library/official-images#13089
(https://github.com/docker-library/official-images/pull/13089): add the
Apache Spark Docker Official Image; waiting for review from the Docker side.

After the above work, I think we have almost reached the DOI quality bar
(there might be some small fixes based on the Docker-side review in the
future), but we are limited by the Docker-side review bandwidth. The good
news is that the PR is at the top of the review queue, judging from the
review history.


# Next step?

Should we publish the apache/spark image (3.3.0/3.3.1) according to the
new rules now?

After publishing, apache/spark will gain several new tags for v3.3.0 and
v3.3.1, like:

- apache/spark:python3
- apache/spark:scala
- apache/spark:r
- apache/spark all-in-one
* You can see the complete tag info here:
https://github.com/apache/spark-docker/pull/23/files#diff-2b39d33506bc7a34cef4b9ebf4cf8b1e3a5532f2131ceb37011b94261cec5f8c

WDYT?

Regards,
Yikun


Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-18 Thread Yikun Jiang
+1, the tests also passed with the spark-docker workflow (downloading the
rc4 tgz, extracting, building the image, running the K8s IT)

[1] https://github.com/Yikun/spark-docker/pull/9

Regards,
Yikun

On Wed, Oct 19, 2022 at 8:59 AM Wenchen Fan  wrote:

> +1
>
> On Wed, Oct 19, 2022 at 4:59 AM Chao Sun  wrote:
>
>> +1. Thanks Yuming!
>>
>> Chao
>>
>> On Tue, Oct 18, 2022 at 1:18 PM Thomas graves  wrote:
>> >
>> > +1. Ran internal test suite.
>> >
>> > Tom
>> >
>> > On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang  wrote:
>> > >
>> > > Please vote on releasing the following candidate as Apache Spark
>> version 3.3.1.
>> > >
>> > > The vote is open until 11:59pm Pacific time October 21th and passes
>> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Release this package as Apache Spark 3.3.1
>> > > [ ] -1 Do not release this package because ...
>> > >
>> > > To learn more about Apache Spark, please see https://spark.apache.org
>> > >
>> > > The tag to be voted on is v3.3.1-rc4 (commit
>> fbbcf9434ac070dd4ced4fb9efe32899c6db12a9):
>> > > https://github.com/apache/spark/tree/v3.3.1-rc4
>> > >
>> > > The release files, including signatures, digests, etc. can be found
>> at:
>> > > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin
>> > >
>> > > Signatures used for Spark RCs can be found in this file:
>> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > >
>> > > The staging repository for this release can be found at:
>> > >
>> https://repository.apache.org/content/repositories/orgapachespark-1430
>> > >
>> > > The documentation corresponding to this release can be found at:
>> > > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-docs
>> > >
>> > > The list of bug fixes going into 3.3.1 can be found at the following
>> URL:
>> > > https://s.apache.org/ttgz6
>> > >
>> > > This release is using the release script of the tag v3.3.1-rc4.
>> > >
>> > >
>> > > FAQ
>> > >
>> > > ==
>> > > What happened to v3.3.1-rc3?
>> > > ==
>> > > A performance regression(SPARK-40703) was found after tagging
>> v3.3.1-rc3, which the Iceberg community hopes Spark 3.3.1 could fix.
>> > > So we skipped the vote on v3.3.1-rc3.
>> > >
>> > > =
>> > > How can I help test this release?
>> > > =
>> > > If you are a Spark user, you can help us test this release by taking
>> > > an existing Spark workload and running on this release candidate, then
>> > > reporting any regressions.
>> > >
>> > > If you're working in PySpark you can set up a virtual env and install
>> > > the current RC and see if anything important breaks, in the Java/Scala
>> > > you can add the staging repository to your projects resolvers and test
>> > > with the RC (make sure to clean up the artifact cache before/after so
>> > > you don't end up building with a out of date RC going forward).
>> > >
>> > > ===
>> > > What should happen to JIRA tickets still targeting 3.3.1?
>> > > ===
>> > > The current list of open tickets targeted at 3.3.1 can be found at:
>> > > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.1
>> > >
>> > > Committers should look at those and triage. Extremely important bug
>> > > fixes, documentation, and API tweaks that impact compatibility should
>> > > be worked on immediately. Everything else please retarget to an
>> > > appropriate release.
>> > >
>> > > ==
>> > > But my bug isn't fixed?
>> > > ==
>> > > In order to make timely releases, we will typically not hold the
>> > > release unless the bug in question is a regression from the previous
>> > > release. That being said, if there is something which is a regression
>> > > that has not been correctly targeted please ping me or a committer to
>> > > help target the issue.
>> > >
>> > >
>> >
>> >
>>
>>
>>


Re: Enforcing scalafmt on Spark Connect - connector/connect

2022-10-14 Thread Yikun Jiang
+1, I also think it's a good idea.

BTW, we might also consider adding some notes about `lint-scala` in [1],
just like `lint-python` in pyspark [2].

[1] https://spark.apache.org/developer-tools.html
[2]
https://spark.apache.org/docs/latest/api/python/development/contributing.html


Regards,
Yikun


On Fri, Oct 14, 2022 at 4:51 PM Hyukjin Kwon  wrote:

> I personally like this idea. At least we now do this in PySpark, and it's
> pretty nice that you can just forget about formatting it manually by
> yourself.
>
> On Fri, 14 Oct 2022 at 16:37, Martin Grund
>  wrote:
>
>> Hi folks,
>>
>> I'm reaching out to ask to gather input / consensus on the following
>> proposal: Since Spark Connect is effectively new code, I would like to
>> enforce scalafmt explicitly *only* on this module by adding a check in
>> `dev/lint-scala` that checks if there is a diff after running
>>
>>  ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -pl
>> connector/connect
>>
>> I know that enforcing scalafmt is not desirable on the existing code base
>> but since the Spark Connect code is very new I'm thinking it might reduce
>> friction in the code reviews and create a consistent style.
>>
>> In my previous code reviews where I have applied scalafmt, I've
>> received feedback that the import grouping scalafmt produces is
>> different from our default style. I've prepared a PR
>> https://github.com/apache/spark/pull/38252 to address this issue by
>> explicitly setting it in the scalafmt options.
>>
>> Would you be supportive of enforcing scalafmt *only* on the Spark
>> Connect module?
>>
>> Thanks
>> Martin
>>
>
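For illustration, the kind of diff-based check proposed for dev/lint-scala
might look roughly like this (a sketch only, not the actual implementation):

./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -pl connector/connect
if ! git diff --quiet -- connector/connect; then
  echo "scalafmt check failed: run the command above and commit the reformatted files."
  exit 1
fi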


Re: Welcome Yikun Jiang as a Spark committer

2022-10-09 Thread Yikun Jiang
Thank you all!

Regards,
Yikun


On Mon, Oct 10, 2022 at 3:18 AM Chao Sun  wrote:

> Congratulations Yikun!
>
> On Sun, Oct 9, 2022 at 11:14 AM vaquar khan  wrote:
>
>> Congratulations.
>>
>> Regards,
>> Vaquar khan
>>
>> On Sun, Oct 9, 2022, 6:46 AM 叶先进  wrote:
>>
>>> Congrats
>>>
>>> On Oct 9, 2022, at 16:44, XiDuo You  wrote:
>>>
>>> Congratulations, Yikun !
>>>
>>> Maxim Gekk wrote on Sun, Oct 9, 2022 at 15:59:
>>>
>>>> Keep up the great work, Yikun!
>>>>
>>>> On Sun, Oct 9, 2022 at 10:52 AM Gengliang Wang 
>>>> wrote:
>>>>
>>>>> Congratulations, Yikun!
>>>>>
>>>>> On Sun, Oct 9, 2022 at 12:33 AM 416161...@qq.com 
>>>>> wrote:
>>>>>
>>>>>> Congrats, Yikun!
>>>>>>
>>>>>> --
>>>>>> Ruifeng Zheng
>>>>>> ruife...@foxmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- Original --
>>>>>> *From:* "Martin Grigorov" ;
>>>>>> *Date:* Sun, Oct 9, 2022 05:01 AM
>>>>>> *To:* "Hyukjin Kwon";
>>>>>> *Cc:* "dev";"Yikun Jiang";
>>>>>> *Subject:* Re: Welcome Yikun Jiang as a Spark committer
>>>>>>
>>>>>> Congratulations, Yikun!
>>>>>>
>>>>>> On Sat, Oct 8, 2022 at 7:41 AM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> The Spark PMC recently added Yikun Jiang as a committer on the
>>>>>>> project.
>>>>>>> Yikun is the major contributor of the infrastructure and GitHub
>>>>>>> Actions in Apache Spark as well as Kubernates and PySpark.
>>>>>>> He has put a lot of effort into stabilizing and optimizing the
>>>>>>> builds so we all can work together in Apache Spark more
>>>>>>> efficiently and effectively. He's also driving the SPIP for Docker
>>>>>>> official image in Apache Spark as well for users and developers.
>>>>>>> Please join me in welcoming Yikun!
>>>>>>>
>>>>>>>
>>>


Re: [VOTE] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang
+1 (non-binding)

Regards,
Yikun


On Thu, Sep 22, 2022 at 9:43 AM Hyukjin Kwon  wrote:

> Starting with my +1.
>
> On Thu, 22 Sept 2022 at 10:41, Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I would like to start a vote for SPIP: "Support Docker Official Image
>> for Spark"
>>
>> The goal of the SPIP is to add Docker Official Image(DOI)
>>  to ensure the Spark
>> Docker images
>> meet the quality standards for Docker images, to provide these Docker
>> images for users
>> who want to use Apache Spark via Docker image.
>>
>> Please also refer to:
>>
>> - Previous discussion in dev mailing list: [DISCUSS] SPIP: Support
>> Docker Official Image for Spark
>> 
>> - SPIP doc: SPIP: Support Docker Official Image for Spark
>> 
>> - JIRA: SPARK-40513 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>


Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang
@Ankit

Thanks for your support! Your questions are very valuable, but this SPIP is
just a starting point to cover the existing apache/spark image features
first. We will also set up a build/test/publish image workflow (to ensure
image quality) and some helper scripts to help developers extend custom
images more easily in the future.

> How do we support deployments of spark-standalone clusters in case the
users want to use the same image for spark-standalone clusters? Since that
is also widely used.
Yes, it's possible; it can be done by exposing some ports, but we still need
to validate this and then document it in the standalone mode docs (see the
sketch below).
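For example, a minimal two-container standalone sketch with the same image
could look like this (ports, names, and commands are illustrative, not yet a
documented or validated setup):

$ docker network create spark-net
$ docker run -d --name spark-master --hostname spark-master --network spark-net \
    -p 8080:8080 -p 7077:7077 \
    apache/spark /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
$ docker run -d --name spark-worker --network spark-net \
    apache/spark /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077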

> 2. I am not sure about the End of Support of Hadoop 2 with spark, but if
that is not planned sooner, shouldn't we be making it configurable to be
able to use spark prebuilt with hadoop 2?
DOI requires a static dockerfile, so this couldn't be configurable at
runtime. Of course, all published Spark releases can also be supported as
separate images in principle. As for supporting more distributions, we also
plan to add some scripts to help generate the dockerfiles.

> 3. Also, don't we want to make it feasible for the users to be able to
customize the base linux flavour?
This is also a good point, but it is out of scope for this SPIP. Currently,
we start with Ubuntu (the Debian series, with the apt package manager). We
might also consider supporting more OSes after this SPIP, such as the
RHEL/CentOS/Rocky/openEuler series with the yum/dnf package manager. But as
you know, different OSes have various package versions and upgrade policies,
so the maintenance work is perhaps not easy, but I think it's possible.

Regards,
Yikun


On Thu, Sep 22, 2022 at 3:43 AM Ankit Gupta  wrote:

> Hi Yikun
>
> Thanks for all your efforts! This is very much needed. But I have the
> below three questions:
> 1. How do we support deployments of spark-standalone clusters in case the
> users want to use the same image for spark-standalone clusters? Since
> that is also widely used.
> 2. I am not sure about the End of Support of Hadoop 2 with spark, but if
> that is not planned sooner, shouldn't we be making it configurable to be
> able to use spark prebuilt with hadoop 2?
> 3. Also, don't we want to make it feasible for the users to be able to
> customise the base linux flavour?
>
> Thanks and Regards.
>
> Ankit Prakash Gupta
>
>
> On Wed, Sep 21, 2022 at 9:19 PM Xiao Li  wrote:
>
>> +1
>>
>> Yikun Jiang wrote on Wed, Sep 21, 2022 at 07:22:
>>
>>> Thanks for all your inputs! BTW, I also create a JIRA to track related
>>> work: https://issues.apache.org/jira/browse/SPARK-40513
>>>
>>> > can I be involved in this work?
>>>
>>> @qian Of course! Thanks!
>>>
>>> Regards,
>>> Yikun
>>>
>>> On Wed, Sep 21, 2022 at 7:31 PM Xinrong Meng 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Sep 20, 2022 at 11:08 PM Qian SUN 
>>>> wrote:
>>>>
>>>>> +1.
>>>>> It's valuable, can I be involved in this work?
>>>>>
>>>>> Yikun Jiang wrote on Mon, Sep 19, 2022 at 08:15:
>>>>>
>>>>>> Hi, all
>>>>>>
>>>>>> I would like to start the discussion for supporting Docker Official
>>>>>> Image for Spark.
>>>>>>
>>>>>> This SPIP is proposed to add Docker Official Image(DOI)
>>>>>> <https://github.com/docker-library/official-images> to ensure the
>>>>>> Spark Docker images meet the quality standards for Docker images, to
>>>>>> provide these Docker images for users who want to use Apache Spark via
>>>>>> Docker image.
>>>>>>
>>>>>> There are also several Apache projects that release the Docker
>>>>>> Official Images
>>>>>> <https://hub.docker.com/search?q=apache_filter=official>, such
>>>>>> as: flink <https://hub.docker.com/_/flink>, storm
>>>>>> <https://hub.docker.com/_/storm>, solr
>>>>>> <https://hub.docker.com/_/solr>, zookeeper
>>>>>> <https://hub.docker.com/_/zookeeper>, httpd
>>>>>> <https://hub.docker.com/_/httpd> (with 50M+ to 1B+ download for
>>>>>> each). From the huge download statistics, we can see the real demands of
>>>>>> users, and from the support of other apache projects, we should also be
>>>>>> able to do it.
>>>>>>
>>>>>> After support:
>>>>>>
>>>>>>-
>>>>>>
>>>>>>The Dockerfile will still be maintained by the Apache Spark
>>>>>>community and reviewed by Docker.
>>>>>>-
>>>>>>
>>>>>>The images will be maintained by the Docker community to ensure
>>>>>>the quality standards for Docker images of the Docker community.
>>>>>>
>>>>>>
>>>>>> It will also reduce the extra docker images maintenance effort (such
>>>>>> as frequently rebuilding, image security update) of the Apache Spark
>>>>>> community.
>>>>>>
>>>>>> See more in SPIP DOC:
>>>>>> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o
>>>>>>
>>>>>> cc: Ruifeng (co-author) and Hyukjin (shepherd)
>>>>>>
>>>>>> Regards,
>>>>>> Yikun
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best!
>>>>> Qian SUN
>>>>>
>>>>


Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang
Thanks for all your inputs! BTW, I also created a JIRA to track the related
work: https://issues.apache.org/jira/browse/SPARK-40513

> can I be involved in this work?

@qian Of course! Thanks!

Regards,
Yikun

On Wed, Sep 21, 2022 at 7:31 PM Xinrong Meng 
wrote:

> +1
>
> On Tue, Sep 20, 2022 at 11:08 PM Qian SUN  wrote:
>
>> +1.
>> It's valuable, can I be involved in this work?
>>
>> Yikun Jiang wrote on Mon, Sep 19, 2022 at 08:15:
>>
>>> Hi, all
>>>
>>> I would like to start the discussion for supporting Docker Official
>>> Image for Spark.
>>>
>>> This SPIP is proposed to add Docker Official Image(DOI)
>>> <https://github.com/docker-library/official-images> to ensure the Spark
>>> Docker images meet the quality standards for Docker images, to provide
>>> these Docker images for users who want to use Apache Spark via Docker image.
>>>
>>> There are also several Apache projects that release the Docker Official
>>> Images <https://hub.docker.com/search?q=apache_filter=official>,
>>> such as: flink <https://hub.docker.com/_/flink>, storm
>>> <https://hub.docker.com/_/storm>, solr <https://hub.docker.com/_/solr>,
>>> zookeeper <https://hub.docker.com/_/zookeeper>, httpd
>>> <https://hub.docker.com/_/httpd> (with 50M+ to 1B+ download for each).
>>> From the huge download statistics, we can see the real demands of users,
>>> and from the support of other apache projects, we should also be able to do
>>> it.
>>>
>>> After support:
>>>
>>>-
>>>
>>>The Dockerfile will still be maintained by the Apache Spark
>>>community and reviewed by Docker.
>>>-
>>>
>>>The images will be maintained by the Docker community to ensure the
>>>quality standards for Docker images of the Docker community.
>>>
>>>
>>> It will also reduce the extra docker images maintenance effort (such as
>>> frequently rebuilding, image security update) of the Apache Spark community.
>>>
>>> See more in SPIP DOC:
>>> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o
>>>
>>> cc: Ruifeng (co-author) and Hyukjin (shepherd)
>>>
>>> Regards,
>>> Yikun
>>>
>>
>>
>> --
>> Best!
>> Qian SUN
>>
>


Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-19 Thread Yikun Jiang
Thanks for your support, @all!

> Count me in to help as well, eh?! :)

@Denny Sure, it would be great to have your help! I'm going to create a JIRA
and tasks if the SPIP vote passes.


On Mon, Sep 19, 2022 at 10:34 AM Denny Lee  wrote:

> +1 (non-binding).
>
> This is a great idea and we should definitely do this.  Count me in to
> help as well, eh?! :)
>
> On Sun, Sep 18, 2022 at 7:24 PM bo zhaobo 
> wrote:
>
>> +1 (non-binding)
>>
>> This will bring the good experience to customers. So excited about this.
>> ;-)
>>
>> Yuming Wang wrote on Mon, Sep 19, 2022 at 10:18:
>>
>>> +1.
>>>
>>> On Mon, Sep 19, 2022 at 9:44 AM Kent Yao  wrote:
>>>
>>>> +1
>>>>
>>>> Gengliang Wang wrote on Mon, Sep 19, 2022 at 09:23:
>>>> >
>>>> > +1, thanks for the work!
>>>> >
>>>> > On Sun, Sep 18, 2022 at 6:20 PM Hyukjin Kwon 
>>>> wrote:
>>>> >>
>>>> >> +1
>>>> >>
>>>> >> On Mon, 19 Sept 2022 at 09:15, Yikun Jiang 
>>>> wrote:
>>>> >>>
>>>> >>> Hi, all
>>>> >>>
>>>> >>>
>>>> >>> I would like to start the discussion for supporting Docker Official
>>>> Image for Spark.
>>>> >>>
>>>> >>>
>>>> >>> This SPIP is proposed to add Docker Official Image(DOI) to ensure
>>>> the Spark Docker images meet the quality standards for Docker images, to
>>>> provide these Docker images for users who want to use Apache Spark via
>>>> Docker image.
>>>> >>>
>>>> >>>
>>>> >>> There are also several Apache projects that release the Docker
>>>> Official Images, such as: flink, storm, solr, zookeeper, httpd (with 50M+
>>>> to 1B+ download for each). From the huge download statistics, we can see
>>>> the real demands of users, and from the support of other apache projects,
>>>> we should also be able to do it.
>>>> >>>
>>>> >>>
>>>> >>> After support:
>>>> >>>
>>>> >>> The Dockerfile will still be maintained by the Apache Spark
>>>> community and reviewed by Docker.
>>>> >>>
>>>> >>> The images will be maintained by the Docker community to ensure the
>>>> quality standards for Docker images of the Docker community.
>>>> >>>
>>>> >>>
>>>> >>> It will also reduce the extra docker images maintenance effort
>>>> (such as frequently rebuilding, image security update) of the Apache Spark
>>>> community.
>>>> >>>
>>>> >>>
>>>> >>> See more in SPIP DOC:
>>>> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o
>>>> >>>
>>>> >>>
>>>> >>> cc: Ruifeng (co-author) and Hyukjin (shepherd)
>>>> >>>
>>>> >>>
>>>> >>> Regards,
>>>> >>> Yikun
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


[DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-18 Thread Yikun Jiang
Hi, all

I would like to start the discussion for supporting Docker Official Image
for Spark.

This SPIP proposes to add a Docker Official Image (DOI)
to ensure the Spark
Docker images meet the quality standards for Docker images, and to provide
these images for users who want to use Apache Spark via a Docker image.

There are also several Apache projects that release Docker Official
Images, such as: flink, storm, solr, zookeeper, httpd (with 50M+ to 1B+
downloads each). From the huge download statistics, we can see the real
demand from users, and from the support of other Apache projects, we should
also be able to do it.

After support:

   - The Dockerfile will still be maintained by the Apache Spark community
     and reviewed by Docker.
   - The images will be maintained by the Docker community to ensure they meet
     the Docker community's quality standards for Docker images.

It will also reduce the extra Docker image maintenance effort (such as
frequent rebuilds and image security updates) for the Apache Spark community.
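
For illustration only, here is roughly how users would consume such an image
once it is published; the image name, tag and path below are assumptions for
this example rather than the final published coordinates:

  # pull the (hypothetical) official image and start spark-shell inside it
  docker pull spark:3.3.0
  docker run -it spark:3.3.0 /opt/spark/bin/spark-shell

The point is that a plain docker pull of a Docker-maintained image would be
all a user needs, just like with the other official Apache images mentioned
above.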

See more in SPIP DOC:
https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o

cc: Ruifeng (co-author) and Hyukjin (shepherd)

Regards,
Yikun


Re: Welcoming three new PMC members

2022-08-10 Thread Yikun Jiang
Congratulations!

Regards,
Yikun


On Wed, Aug 10, 2022 at 3:19 PM Maciej  wrote:

> Congratulations!
>
> On 8/10/22 08:14, Yi Wu wrote:
> > Congrats everyone!
> >
> >
> >
> > On Wed, Aug 10, 2022 at 11:33 AM Yuanjian Li  > > wrote:
> >
> > Congrats everyone!
> >
> > L. C. Hsieh mailto:vii...@gmail.com>>于2022年8月9
> > 日 周二19:01写道:
> >
> > Congrats!
> >
> > On Tue, Aug 9, 2022 at 5:38 PM Chao Sun  > > wrote:
> >  >
> >  > Congrats everyone!
> >  >
> >  > On Tue, Aug 9, 2022 at 5:36 PM Dongjoon Hyun
> > mailto:dongjoon.h...@gmail.com>>
> wrote:
> >  > >
> >  > > Congrat to all!
> >  > >
> >  > > Dongjoon.
> >  > >
> >  > > On Tue, Aug 9, 2022 at 5:13 PM Takuya UESHIN
> > mailto:ues...@happy-camper.st>> wrote:
> >  > > >
> >  > > > Congratulations!
> >  > > >
> >  > > > On Tue, Aug 9, 2022 at 4:57 PM Hyukjin Kwon
> > mailto:gurwls...@gmail.com>> wrote:
> >  > > >>
> >  > > >> Congrats everybody!
> >  > > >>
> >  > > >> On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan
> > mailto:mri...@gmail.com>> wrote:
> >  > > >>>
> >  > > >>>
> >  > > >>> Congratulations !
> >  > > >>> Great to have you join the PMC !!
> >  > > >>>
> >  > > >>> Regards,
> >  > > >>> Mridul
> >  > > >>>
> >  > > >>> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan
> > mailto:vaquar.k...@gmail.com>> wrote:
> >  > > 
> >  > >  Congratulations
> >  > > 
> >  > >  On Tue, Aug 9, 2022, 11:40 AM Xiao Li
> > mailto:gatorsm...@gmail.com>> wrote:
> >  > > >
> >  > > > Hi all,
> >  > > >
> >  > > > The Spark PMC recently voted to add three new PMC
> > members. Join me in welcoming them to their new roles!
> >  > > >
> >  > > > New PMC members: Huaxin Gao, Gengliang Wang and Maxim
> > Gekk
> >  > > >
> >  > > > The Spark PMC
> >  > > >
> >  > > >
> >  > > >
> >  > > > --
> >  > > > Takuya UESHIN
> >  > > >
> >  > >
> >  > >
> >
>  -
> >  > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > 
> >  > >
> >  >
> >  >
> >
>  -
> >  > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > 
> >  >
> >
> >
>  -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > 
> >
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>


Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Yikun Jiang
Congratulations!

Regards,
Yikun


On Tue, Aug 9, 2022 at 4:13 PM Hyukjin Kwon  wrote:

> Hi all,
>
> The Spark PMC recently added Xinrong Meng as a committer on the project.
> Xinrong is the major contributor of PySpark especially Pandas API on Spark.
> She has guided a lot of new contributors enthusiastically. Please join me
> in welcoming Xinrong!
>
>


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-07-14 Thread Yikun Jiang
With help from the community, the switch to cache-based jobs has been
completed!

* About the ghcr images:

You might notice that two kinds of images are generated under the apache GHCR
organization (plus per-user CI images in contributors' own namespaces):

- Image cache: spark/apache-spark-github-action-image-cache
<https://github.com/orgs/apache/packages/container/package/spark%2Fapache-spark-github-action-image-cache>:
This is the cache based on branches' dev/infra/Dockerfile.

- CI image: apache-spark-ci-image
<https://github.com/orgs/apache/packages/container/package/apache-spark-ci-image>:
This is for scheduled jobs. It builds an image just-in-time from the cache,
and then uses it to run the CI jobs.

- Distributed (User) CI image: such as yikun/apache-spark-ci-image
<https://github.com/Yikun/spark/pkgs/container/apache-spark-ci-image>: This
is for PR-triggered jobs. Again, built just-in-time from the cache and used
to execute the CI job(s) in the user's GitHub Actions space.

* About the job:

For Lint/PySpark/SparkR jobs, "Base image build" will do a just-in-time
build and generate a ci-image for each PR, and jobs use the image as the
job container image.

* About how to change the infra deps:

Currently, the CI image is just like a static image unless you change the
Dockerfile.

- If you want to change the version of a dependency of Lint/PySpark/SparkR
jobs, you could change the dev/infra/Dockerfile just like
https://github.com/apache/spark/pull/37175.

- If you want to trigger a full refresh you could just change the
FULL_REFRESH_DATE
in the Dockerfile
<https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/dev/infra/Dockerfile#L21>
.
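
As a rough illustration of this workflow (not an official procedure; the build
command is only a local sanity check and the tag name below is made up):

  # 1. edit dev/infra/Dockerfile: bump the dependency version, or change
  #    FULL_REFRESH_DATE to force a full image refresh
  # 2. optionally build the image locally to sanity-check the change
  docker build -t spark-ci-image-local -f dev/infra/Dockerfile dev/infra
  # 3. open a PR; the cache/CI images above are then rebuilt just-in-time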

FYI, I also updated the doc at
https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0
to help you understand.


Through this work, I could really feel the effort behind the previous maintenance! A
simple version bump of a dependency may lead to a lot of investigation!
Thanks to HyukjinKwon, Dongjoon and the whole community for always keeping the
infra deps up to date!

Feel free to ping me if you have any other concerns or ideas!

Regards,
Yikun


On Mon, Jun 27, 2022 at 12:05 AM Yikun Jiang  wrote:

> > There’s one last task to simply caching the Docker image (
> https://issues.apache.org/jira/browse/SPARK-39522).
> I will have to be less active for this week and next week because of the
> Spark Summit. Would appreciate if somebody
> finds some time to take a stab.
>
> I did some investigations on spark container jobs (pyspark/sparkr/lint)
> using cache, and draft a doc to help you guys understand #36980
> <https://github.com/apache/spark/pull/36980>:
>
> https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0
>
>
> > About a quick hallway meetup, I will be there after Holden’s talk at
> least to say hello to her :-).
>
> Something topic I was interesting about and related to build CI:
> - K8S integrations <https://github.com/apache/spark/pull/35830> test on
> GA:
> - To help various OS <https://github.com/apache/spark/pull/35142> and
> multi architecture/hardware (x86/arm64, gpu) integration support, what we
> can do to help improving.
> Please feel free to ping me if necessary. It's a little bit pity I
> couldn't have the opportunity to be there, I hope you guys have a fabulous
> meet on summit!
>
> Regards,
> Yikun
>
>
> On Fri, Jun 24, 2022 at 11:15 AM Dongjoon Hyun 
> wrote:
>
>> Yep, I'll be there too. Thank you for the adjustment. See you soon. :)
>>
>> Dongjoon.
>>
>> On Thu, Jun 23, 2022 at 4:59 PM Hyukjin Kwon  wrote:
>>
>>> Alright, I'll be there after Holden's talk Thursday
>>> https://databricks.com/dataaisummit/session/tools-assisted-apache-spark-version-migrations-21-32
>>> w/ Dongjoon (since he manages OSS Jenkins too).
>>> Let's have a quickie chat :-).
>>>
>>> On Thu, 23 Jun 2022 at 06:16, Hyukjin Kwon  wrote:
>>>
>>>> Oops, I was confused about the time and distance in the US. I won't
>>>> make it too.
>>>> Let me find another time slot that works for more ppl.
>>>>
>>>> On Thu, 23 Jun 2022 at 00:19, Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Thank you, Hyukjin! :)
>>>>>
>>>>> BTW, unfortunately, it seems that I cannot join that quick meeting.
>>>>> I have another schedule at South Bay around 7PM and need to leave San
>>>>> Francisco at least 5PM.
>>>>>
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Wed, Jun 22, 2022 at 3:39 AM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> (cc @Yikun Jiang  @Gengliang Wang
>>>>>>  @Maxim Gekk
>>>&g

Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-13 Thread Yikun Jiang
+1 (non-binding)

Checked out the tag, built it from source on Linux aarch64, and ran some basic
tests.
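
For reference, the build and smoke test were along these lines (a typical
from-source invocation; the exact profiles and the example run are illustrative):

  # build Spark from the checked-out v3.2.2-rc1 tag, skipping tests
  ./build/mvn -DskipTests clean package
  # simple smoke test
  ./bin/run-example SparkPi 10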


Regards,
Yikun


On Wed, Jul 13, 2022 at 5:54 AM Mridul Muralidharan 
wrote:

>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with "-Pyarn -Pmesos -Pkubernetes"
>
> As always, the test "SPARK-33084: Add jar support Ivy URI in SQL" in
> sql.SQLQuerySuite fails in my env; but other than that, the rest looks good.
>
> Regards,
> Mridul
>
>
> On Tue, Jul 12, 2022 at 3:17 AM Maxim Gekk
>  wrote:
>
>> +1
>>
>> On Tue, Jul 12, 2022 at 11:05 AM Yang,Jie(INF) 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> Yang Jie
>>>
>>>
>>>
>>>
>>>
>>> *发件人**: *Dongjoon Hyun 
>>> *日期**: *2022年7月12日 星期二 16:03
>>> *收件人**: *dev 
>>> *抄送**: *Cheng Su , "Yang,Jie(INF)" <
>>> yangji...@baidu.com>, Sean Owen 
>>> *主题**: *Re: [VOTE] Release Spark 3.2.2 (RC1)
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> On Mon, Jul 11, 2022 at 11:34 PM Cheng Su  wrote:
>>>
>>> +1 (non-binding). Built from source, and ran some scala unit tests on M1
>>> mac, with OpenJDK 8 and Scala 2.12.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Cheng Su
>>>
>>>
>>>
>>> On Mon, Jul 11, 2022 at 10:31 PM Yang,Jie(INF) 
>>> wrote:
>>>
>>> Does this happen when running all UTs? I ran this suite several times
>>> alone using OpenJDK(zulu) 8u322-b06 on my Mac, but no similar error
>>> occurred
>>>
>>>
>>>
>>> *发件人**: *Sean Owen 
>>> *日期**: *2022年7月12日 星期二 10:45
>>> *收件人**: *Dongjoon Hyun 
>>> *抄送**: *dev 
>>> *主题**: *Re: [VOTE] Release Spark 3.2.2 (RC1)
>>>
>>>
>>>
>>> Is anyone seeing this error? I'm on OpenJDK 8 on a Mac:
>>>
>>>
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGSEGV (0xb) at pc=0x000101ca8ace, pid=11962,
>>> tid=0x1603
>>> #
>>> # JRE version: OpenJDK Runtime Environment (8.0_322) (build
>>> 1.8.0_322-bre_2022_02_28_15_01-b00)
>>> # Java VM: OpenJDK 64-Bit Server VM (25.322-b00 mixed mode bsd-amd64
>>> compressed oops)
>>> # Problematic frame:
>>> # V  [libjvm.dylib+0x549ace]
>>> #
>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /private/tmp/spark-3.2.2/sql/core/hs_err_pid11962.log
>>> ColumnVectorSuite:
>>> - boolean
>>> - byte
>>> Compiled method (nm)  885897 75403 n 0
>>> sun.misc.Unsafe::putShort (native)
>>>  total in heap  [0x000102fdaa10,0x000102fdad48] = 824
>>>  relocation [0x000102fdab38,0x000102fdab78] = 64
>>>  main code  [0x000102fdab80,0x000102fdad48] = 456
>>> Compiled method (nm)  885897 75403 n 0
>>> sun.misc.Unsafe::putShort (native)
>>>  total in heap  [0x000102fdaa10,0x000102fdad48] = 824
>>>  relocation [0x000102fdab38,0x000102fdab78] = 64
>>>  main code  [0x000102fdab80,0x000102fdad48] = 456
>>>
>>>
>>>
>>> On Mon, Jul 11, 2022 at 4:58 PM Dongjoon Hyun 
>>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.2.2.
>>>
>>> The vote is open until July 15th 1AM (PST) and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.2
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>> 
>>>
>>> The tag to be voted on is v3.2.2-rc1 (commit
>>> 78a5825fe266c0884d2dd18cbca9625fa258d7f7):
>>> https://github.com/apache/spark/tree/v3.2.2-rc1
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/
>>> 
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> 
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1409/
>>> 
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/
>>> 
>>>
>>> The list of bug fixes going into 3.2.2 can be found at the 

Re: Apache Spark 3.2.2 Release?

2022-07-07 Thread Yikun Jiang
+1  (non-binding)

Thanks!

Regards,
Yikun


On Thu, Jul 7, 2022 at 1:57 PM Mridul Muralidharan  wrote:

> +1
>
> Thanks for driving this Dongjoon !
>
> Regards,
> Mridul
>
> On Thu, Jul 7, 2022 at 12:36 AM Gengliang Wang  wrote:
>
>> +1.
>> Thank you, Dongjoon.
>>
>> On Wed, Jul 6, 2022 at 10:21 PM Wenchen Fan  wrote:
>>
>>> +1
>>>
>>> On Thu, Jul 7, 2022 at 10:41 AM Xinrong Meng
>>>  wrote:
>>>
 +1

 Thanks!


 Xinrong Meng

 Software Engineer

 Databricks


 On Wed, Jul 6, 2022 at 7:25 PM Xiao Li  wrote:

> +1
>
> Xiao
>
> Cheng Su  于2022年7月6日周三 19:16写道:
>
>> +1 (non-binding)
>>
>> Thanks,
>> Cheng Su
>>
>> On Wed, Jul 6, 2022 at 6:01 PM Yuming Wang  wrote:
>>
>>> +1
>>>
>>> On Thu, Jul 7, 2022 at 5:53 AM Maxim Gekk
>>>  wrote:
>>>
 +1

 On Thu, Jul 7, 2022 at 12:26 AM John Zhuge 
 wrote:

> +1  Thanks for the effort!
>
> On Wed, Jul 6, 2022 at 2:23 PM Bjørn Jørgensen <
> bjornjorgen...@gmail.com> wrote:
>
>> +1
>>
>> ons. 6. jul. 2022, 23:05 skrev Hyukjin Kwon > >:
>>
>>> Yeah +1
>>>
>>> On Thu, Jul 7, 2022 at 5:40 AM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 Hi, All.

 Since Apache Spark 3.2.1 tag creation (Jan 19), new 197 patches
 including 11 correctness patches arrived at branch-3.2.

 Shall we make a new release, Apache Spark 3.2.2, as the third
 release
 at 3.2 line? I'd like to volunteer as the release manager for
 Apache
 Spark 3.2.2. I'm thinking about starting the first RC next week.

 $ git log --oneline v3.2.1..HEAD | wc -l
  197

 # Correctness issues

 SPARK-38075 Hive script transform with order by and limit
 will
 return fake rows
 SPARK-38204 All state operators are at a risk of
 inconsistency
 between state partitioning and operator partitioning
 SPARK-38309 SHS has incorrect percentiles for shuffle read
 bytes
 and shuffle total blocks metrics
 SPARK-38320 (flat)MapGroupsWithState can timeout groups
 which just
 received inputs in the same microbatch
 SPARK-38614 After Spark update, df.show() shows incorrect
 F.percent_rank results
 SPARK-38655 OffsetWindowFunctionFrameBase cannot find the
 offset
 row whose input is not null
 SPARK-38684 Stream-stream outer join has a possible
 correctness
 issue due to weakly read consistent on outer iterators
 SPARK-39061 Incorrect results or NPE when using Inline
 function
 against an array of dynamically created structs
 SPARK-39107 Silent change in regexp_replace's handling of
 empty strings
 SPARK-39259 Timestamps returned by now() and equivalent
 functions
 are not consistent in subqueries
 SPARK-39293 The accumulator of ArrayAggregate should copy
 the
 intermediate result if string, struct, array, or map

 Best,
 Dongjoon.


 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

 --
> John Zhuge
>



Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-26 Thread Yikun Jiang
> There’s one last task to simply caching the Docker image (
https://issues.apache.org/jira/browse/SPARK-39522).
I will have to be less active for this week and next week because of the
Spark Summit. Would appreciate if somebody
finds some time to take a stab.

I did some investigations on Spark container jobs (pyspark/sparkr/lint)
using the cache, and drafted a doc to help you understand #36980
<https://github.com/apache/spark/pull/36980>:
https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0


> About a quick hallway meetup, I will be there after Holden’s talk at
least to say hello to her :-).

Some topics I am interested in that relate to the build CI:
- K8S integration tests <https://github.com/apache/spark/pull/35830> on GA
- Supporting various OS <https://github.com/apache/spark/pull/35142> and multiple
architectures/hardware (x86/arm64, GPU) in the integration tests, and what we
can do to help improve this.
Please feel free to ping me if necessary. It's a bit of a pity that I won't
have the opportunity to be there; I hope you all have a fabulous meetup at the
summit!

Regards,
Yikun


On Fri, Jun 24, 2022 at 11:15 AM Dongjoon Hyun 
wrote:

> Yep, I'll be there too. Thank you for the adjustment. See you soon. :)
>
> Dongjoon.
>
> On Thu, Jun 23, 2022 at 4:59 PM Hyukjin Kwon  wrote:
>
>> Alright, I'll be there after Holden's talk Thursday
>> https://databricks.com/dataaisummit/session/tools-assisted-apache-spark-version-migrations-21-32
>> w/ Dongjoon (since he manages OSS Jenkins too).
>> Let's have a quickie chat :-).
>>
>> On Thu, 23 Jun 2022 at 06:16, Hyukjin Kwon  wrote:
>>
>>> Oops, I was confused about the time and distance in the US. I won't make
>>> it too.
>>> Let me find another time slot that works for more ppl.
>>>
>>> On Thu, 23 Jun 2022 at 00:19, Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you, Hyukjin! :)
>>>>
>>>> BTW, unfortunately, it seems that I cannot join that quick meeting.
>>>> I have another schedule at South Bay around 7PM and need to leave San
>>>> Francisco at least 5PM.
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Wed, Jun 22, 2022 at 3:39 AM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> (cc @Yikun Jiang  @Gengliang Wang
>>>>>  @Maxim Gekk
>>>>>  @Yang,Jie(INF)  FYI)
>>>>>
>>>>> On Wed, 22 Jun 2022 at 19:34, Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Couple of updates:
>>>>>>
>>>>>>-
>>>>>>
>>>>>>All builds passed now with all combinations we defined in the
>>>>>>GitHub Actions (e.g., branch-3.2, branch-3.3, JDK 11,
>>>>>>JDK 17 and Scala 2.13), see
>>>>>>https://github.com/apache/spark/actions cc @Tom Graves
>>>>>> @Dongjoon Hyun 
>>>>>> FYI
>>>>>>-
>>>>>>
>>>>>>except one test that is being failed due to OOM. That’s being
>>>>>>fixed at https://github.com/apache/spark/pull/36954, see
>>>>>>also
>>>>>>https://github.com/apache/spark/pull/36787#discussion_r901190636
>>>>>>-
>>>>>>
>>>>>>I am now adding PySpark, SparkR jobs to the scheduled builds at
>>>>>>https://github.com/apache/spark/pull/36940
>>>>>>and see if they pass. We might need a couple of more fixes there.
>>>>>>-
>>>>>>
>>>>>>There’s one last task to simply caching the Docker image (
>>>>>>https://issues.apache.org/jira/browse/SPARK-39522).
>>>>>>I will have to be less active for this week and next week because
>>>>>>of the Spark Summit. Would appreciate if somebody
>>>>>>finds some time to take a stab.
>>>>>>
>>>>>> About a quick hallway meetup, I will be there after Holden’s talk at
>>>>>> least to say hello to her :-).
>>>>>> Let’s have a quick chat about our CI. We still have some general
>>>>>> problems to cope with like the lack of resources in
>>>>>> GitHub Actions.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 21 Jun 2022 at 11:49, Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Just chatted offline - both

Re: Re: [VOTE][SPIP] Spark Connect

2022-06-15 Thread Yikun Jiang
+1 (non-binding)

A lighter client will definitely help other ecosystems integrate more
easily with Spark!

Regards,
Yikun


On Thu, Jun 16, 2022 at 12:54 AM Gengliang Wang  wrote:

> +1 (non-binding)
>
> On Wed, Jun 15, 2022 at 9:32 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> On Wed, Jun 15, 2022 at 9:22 AM Xiao Li  wrote:
>>
>>> +1
>>>
>>> Xiao
>>>
>>> beliefer  于2022年6月14日周二 03:35写道:
>>>
 +1
 Yeah, I tried to use Apache Livy, so as we can runing interactive
 query. But the Spark Driver in Livy looks heavy.

 The SPIP may resolve the issue.



 At 2022-06-14 18:11:21, "Wenchen Fan"  wrote:

 +1

 On Tue, Jun 14, 2022 at 9:38 AM Ruifeng Zheng 
 wrote:

> +1
>
>
> -- 原始邮件 --
> *发件人:* "huaxin gao" ;
> *发送时间:* 2022年6月14日(星期二) 上午8:47
> *收件人:* "L. C. Hsieh";
> *抄送:* "Spark dev list";
> *主题:* Re: [VOTE][SPIP] Spark Connect
>
> +1
>
> On Mon, Jun 13, 2022 at 5:42 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Mon, Jun 13, 2022 at 5:41 PM Chao Sun  wrote:
>> >
>> > +1 (non-binding)
>> >
>> > On Mon, Jun 13, 2022 at 5:11 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> On Tue, 14 Jun 2022 at 08:50, Yuming Wang 
>> wrote:
>> >>>
>> >>> +1.
>> >>>
>> >>> On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia <
>> matei.zaha...@gmail.com> wrote:
>> 
>>  +1, very excited about this direction.
>> 
>>  Matei
>> 
>>  On Jun 13, 2022, at 11:07 AM, Herman van Hovell
>>  wrote:
>> 
>>  Let me kick off the voting...
>> 
>>  +1
>> 
>>  On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell <
>> her...@databricks.com> wrote:
>> >
>> > Hi all,
>> >
>> > I’d like to start a vote for SPIP: "Spark Connect"
>> >
>> > The goal of the SPIP is to introduce a Dataframe based
>> client/server API for Spark
>> >
>> > Please also refer to:
>> >
>> > - Previous discussion in dev mailing list: [DISCUSS] SPIP:
>> Spark Connect - A client and server interface for Apache Spark.
>> > - Design doc: Spark Connect - A client and server interface for
>> Apache Spark.
>> > - JIRA: SPARK-39375
>> >
>> > Please vote on the SPIP for the next 72 hours:
>> >
>> > [ ] +1: Accept the proposal as an official SPIP
>> > [ ] +0
>> > [ ] -1: I don’t think this is a good idea because …
>> >
>> > Kind Regards,
>> > Herman
>> 
>> 
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-06 Thread Yikun Jiang
+1 (non-binding)

1. Verify binary checksums and signatures.
2. Check the Kubernetes and PySpark documentation.
3. Verify the K8S integration tests on aarch64.
4. Verify the customized scheduler and Volcano integration tests with Volcano
1.5.1.
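
For reference, item 1 above boils down to the usual artifact checks, roughly
like the following (the archive name is just an example for this RC):

  # import the release signing keys and verify the signature
  wget https://dist.apache.org/repos/dist/dev/spark/KEYS
  gpg --import KEYS
  gpg --verify spark-3.3.0-bin-hadoop3.tgz.asc spark-3.3.0-bin-hadoop3.tgz
  # compute the SHA-512 digest and compare it with the published .sha512 file
  shasum -a 512 spark-3.3.0-bin-hadoop3.tgz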


On Mon, Jun 6, 2022 at 3:43 PM Dongjoon Hyun 
wrote:

> +1.
>
> I double-checked the following additionally.
>
> - Run unit tests on Apple Silicon with Java 17/Python 3.9.11/R 4.1.2
> - Run unit tests on Linux with Java11/Scala 2.12/2.13
> - K8s integration test (including Volcano batch scheduler) on K8s v1.24
> - Check S3 read/write with spark-shell with Scala 2.13/Java17.
>
> So far, it looks good except one flaky test from the new `Row-level
> Runtime Filters` feature. Actually, this has been flaky in the previous RCs
> too.
>
> Since `Row-level Runtime Filters` feature is still disabled by default in
> Apache Spark 3.3.0, I filed it as a non-blocker flaky test bug.
>
> https://issues.apache.org/jira/browse/SPARK-39386
>
> If there is no other report on this test case, this could be my local
> environmental issue.
>
> I'm going to test RC5 more until the deadline (June 8th PST).
>
> Thanks,
> Dongjoon.
>
>
> On Sat, Jun 4, 2022 at 1:33 PM Sean Owen  wrote:
>
>> +1 looks good now on Scala 2.13
>>
>> On Sat, Jun 4, 2022 at 9:51 AM Maxim Gekk
>>  wrote:
>>
>>> Please vote on releasing the following candidate as
>>> Apache Spark version 3.3.0.
>>>
>>> The vote is open until 11:59pm Pacific time June 8th and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.3.0-rc5 (commit
>>> 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
>>> https://github.com/apache/spark/tree/v3.3.0-rc5
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1406
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-docs/
>>>
>>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>
>>> This release is using the release script of the tag v3.3.0-rc5.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.3.0?
>>> ===
>>> The current list of open tickets targeted at 3.3.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.3.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>


Re: Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-16 Thread Yikun Jiang
It's a pretty good idea, +1.

To be clear, on GitHub:

- For each PR title: [SPARK-XXX][PYTHON][PS] The pandas-on-Spark PR title
(*still keep [PYTHON]*, with [PS] newly added)

- For PR labels: newly added: `PANDAS API ON SPARK`; still keep: `PYTHON`,
`CORE`
(*still keep `PYTHON`, `CORE`*, with `PANDAS API ON SPARK` newly added)
https://github.com/apache/spark/pull/36574

Right?

Regards,
Yikun


On Tue, May 17, 2022 at 11:26 AM Hyukjin Kwon  wrote:

> Hi all,
>
> What about we introduce a component in JIRA "Pandas API on Spark", and use
> "PS"  (pandas-on-Spark) in PR titles? We already use "ps" in many places
> when we: import pyspark.pandas as ps.
> This is similar to "Structured Streaming" in JIRA, and "SS" in PR title.
>
> I think it'd be easier to track the changes here with that. Currently it's
> a bit difficult to identify it from pure PySpark changes.
>
>


Re: SIGMOD System Award for Apache Spark

2022-05-14 Thread Yikun Jiang
Awesome! Congrats to the whole community!

On Fri, May 13, 2022 at 3:44 AM Matei Zaharia 
wrote:

> Hi all,
>
> We recently found out that Apache Spark received
>  the SIGMOD System Award this
> year, given by SIGMOD (the ACM’s data management research organization) to
> impactful real-world and research systems. This puts Spark in good company
> with some very impressive previous recipients
> . This award is
> really an achievement by the whole community, so I wanted to say congrats
> to everyone who contributes to Spark, whether through code, issue reports,
> docs, or other means.
>
> Matei
>


Final recap: SPIP: Support Customized Kubernetes Scheduler

2022-03-24 Thread Yikun Jiang
Last month, I synced some progress on
"Support Customized Kubernetes Scheduler" [1] on 24 Feb 2022.

Another month has passed, and with the cut of the 3.3 release there are also
some changes to the SPIP that I'd like to share here.

TL;DR: the changes below were merged in the last month, before the 3.3 branch cut:

1. [Common] SPARK-38383 :
Support APP_ID and EXECUTOR_ID placeholders in annotations.
This will help Yunikorn set the app ID annotation.
Thanks @dongjoon-hyun @weiweiyang

2. [Common] SPARK-38561
: Add
doc for Customized Kubernetes Schedulers

3. [Volcano] SPARK-38455 :
Introduce a new
configuration: spark.kubernetes.scheduler.volcano.podGroupTemplateFile
to replace the original configuration design:
spark.kubernetes.job.[queue|minRes|priority] (see the example after this list)
Thanks @dongjoon-hyun

4. [Volcano] Add queue/priority/resource reservation (gang) scheduling
integration tests:
Queue scheduling: SPARK-38188

Priority scheduling: SPARK-38423

Resource reservation (Gang): SPARK-38187


5. [Volcano] Add doc for the Volcano scheduler: SPARK-38562
:

6. The Volcano community is adding a Spark + Volcano integration test:
https://github.com/volcano-sh/volcano/pull/2113
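
Putting items 1-3 together, a Volcano submission now looks roughly like the
following; this is only a sketch (the feature step class and template path
follow the docs added in SPARK-38561/SPARK-38562 and are illustrative here):

  spark-submit \
    --master k8s://https://<k8s-apiserver>:6443 \
    --conf spark.kubernetes.scheduler.name=volcano \
    --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
    --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
    --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml \
    ...

The queue, priority and minimum-resource settings are then expressed inside
the PodGroup template rather than as separate spark.kubernetes.job.* configs.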

I also completed a new slide deck as a final recap to help you understand the
above and everything we have done:
https://docs.google.com/presentation/d/1itSii7C4gkLhsTwO9aWHLbqSVgxEJkv1zcowS_8ynJc


Re: Apache Spark 3.3 Release

2022-03-15 Thread Yikun Jiang
> To make our release time more predictable, let us collect the PRs and
wait three more days before the branch cut?

For SPIP: Support Customized Kubernetes Schedulers:
#35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1


Three more days are fine for this, from my point of view.

Regards,
Yikun


Re: Apache Spark 3.3 Release

2022-03-04 Thread Yikun Jiang
@Maxim Thanks for driving the release!

> Not sure about SPARK-36057 since the current state.

@Igor Costa Thanks for your attention. As Dongjoon said, the basic framework
abilities of the custom scheduler are already supported, and we are also planning
to mark this as beta in 3.3.0. Of course, we will do more testing to make sure
it is stable, and we welcome more input to keep improving it.

> I don't think that could be a blocker for Apache Spark 3.2.0.

Yep, and v3.3.0, : )

Regards,
Yikun


Re: `running-on-kubernetes` page render bad in v3.2.1(latest) website

2022-03-03 Thread Yikun Jiang
It has already been fixed by: https://github.com/apache/spark/pull/35572

Sorry for the noise here; please just ignore my previous email.


`running-on-kubernetes` page render bad in v3.2.1(latest) website

2022-03-02 Thread Yikun Jiang
Looks like the `running-on-kubernetes` page encountered some problems when
it was published.

[1]
https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties

(You can see the broken formatting after #spark-properties.)

- I also checked the master branch (with a local env set up) and v3.2.0 (
https://spark.apache.org/docs/3.2.0/running-on-kubernetes.html); both render
well.
- But for the v3.2.1 tag, I couldn't install the doc dependencies due to a
dependency conflict.

I'm not very familiar with the doc infra tooling. Can anyone help take a look?

JIRA: https://issues.apache.org/jira/browse/SPARK-38403


Re: Recap on current status of "SPIP: Support Customized Kubernetes Schedulers"

2022-02-24 Thread Yikun Jiang
@dongjoon-hyun @yangwwei Thanks!

@Mich Thanks for testing it. I'm not very experienced with GKE, and I'm also
not quite sure how it differs from upstream K8S in configuration, internal
networking, or the scheduler implementation itself. As far as I know,
different K8S vendors also maintain their own optimizations in their
downstream products.

But you can see some basic integration test results based on upstream K8S
on x86/arm64:
- x86: https://github.com/apache/spark/pull/35422#issuecomment-1035901775
- Arm64: https://github.com/apache/spark/pull/35422#issuecomment-1037039764

As can be seen from the results, for a single job there is no big difference
between the default scheduler and Volcano.

Also, custom schedulers such as Volcano and Yunikorn are aimed more at the
overall scheduling of multiple jobs and at the utilization of the entire
K8S cluster.


Recap on current status of "SPIP: Support Customized Kubernetes Schedulers"

2022-02-23 Thread Yikun Jiang
First, many thanks for all your help (Spark/Volcano/Yunikorn community) in
making this SPIP happen!

Especially, @dongjoon-hyun @holdenk @william-wang @attilapiros @HyukjinKwon
@martin-g @yangwwei @tgravescs

The SPIP is near its final stage; it can be considered beta-available at the
basic level.

I also drafted a simple slide deck to show how to use it and to help you
understand what we have done:
https://docs.google.com/presentation/d/1XDsTWPcsBe4PQ-1MlBwd9pRl8mySdziE_dJE6iATNw8

Below is also a recap of the current implementation and the next steps for
the SPIP:

*# Existing work*
*## Basic part:*
- SPARK-36059  *New
configuration:* ability to specify "schedulerName" in driver/executor for
Spark on K8S
- SPARK-37331  *New
workflow:* ability to create pre-populated resources before the driver pod for
Spark on K8S
- SPARK-37145  *New
developer API:* support user feature step with configuration for Spark on
K8S
- *(reviewing)* *New Job Configurations* for Spark on K8S:
  - SPARK-38188 :
spark.kubernetes.job.queue
  - SPARK-38187 :
spark.kubernetes.job.[minCPU|minMemory]
  - SPARK-38189 :
spark.kubernetes.job.priorityClassName

*## Volcano Part:*
- SPARK-37258  *New
volcano extension* in kubernetes-client fabric8io/kubernetes-client#3579
- SPARK-36061  *New
profile: *-Pvolcano
- SPARK-36061  *New
Feature Step:* VolcanoFeatureStep
- SPARK-36061  *New
integration test:*
 *- Passed on x86 and Arm64 (Linux on Huawei Kunpeng 920 and MacOS on Apple
Silicon M1).*
 - Test basic volcano workflow
 - Run all existing tests on Volcano.
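
For anyone who wants to reproduce these locally, the integration tests can be
driven roughly like this (a sketch, not the verified command; see the K8s
integration test README for the exact flags, and a test cluster with Volcano
deployed is assumed):

  # requires a running test cluster (e.g. minikube) with Volcano installed
  build/sbt -Pkubernetes -Pkubernetes-integration-tests -Pvolcano \
    "kubernetes-integration-tests/test"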

*## Yunikorn Part:*
@yangwwei will also start work on the Yunikorn feature step module
this week.
I will help to complete the Yunikorn integration based on previous
experience.

*# Next Plan*
There are also 3 main tasks to be completed before v3.3 code freeze:
1. (reviewing) SPARK-38188
: Support queue
scheduling configuration
https://github.com/apache/spark/pull/35553
2. (reviewing) SPARK-38187
: Support resource
reservation (minCPU/minMemory configuration)
https://github.com/apache/spark/pull/35640
3. (reviewing) SPARK-38189
: Support priority
scheduling (priorityClass configuration):
https://issues.apache.org/jira/browse/SPARK-38189
https://github.com/apache/spark/pull/35639
4. (WIP) SPARK-37809 :
Yunikorn integration

Several misc work items are also to be completed before 3.3:
1. Integrate the Volcano deployment into the integration tests (x86 and arm)
- Add it to the Spark Kubernetes integration tests once cross-compile support lands:
https://github.com/volcano-sh/volcano/pull/1571
2. Complete the docs and test guidelines.

Please feel free to contact me if you have any other concerns! Thanks!

[1] https://issues.apache.org/jira/browse/SPARK-36057


[VOTE][RESULT] SPIP: Support Customized Kubernetes Schedulers Proposal

2022-01-20 Thread Yikun Jiang
Hi all,

The vote passed with the following 14 +1 votes and no -1 or +0 votes:

Bowen Li
Weiwei Yang
Chenya Zhang
Chaoran Yu
William Wang
Holden Karau *
bo yang
Mich Talebzadeh
John Zhuge
Thomas Graves *
Kent Yao
Mridul Muralidharan *
Ryan Blue
Yikun Jiang

* = binding

Thank you guys all for your feedback and votes.

Regards,
Yikun


Re: [VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-20 Thread Yikun Jiang
+1 (non-binding)

Also, a late +1 vote from myself.

Regards,
Yikun


Ryan Blue  于2022年1月13日周四 02:36写道:

> +1 (non-binding)
>
> On Wed, Jan 12, 2022 at 10:29 AM Mridul Muralidharan 
> wrote:
>
>>
>> +1 (binding)
>> This should be a great improvement !
>>
>> Regards,
>> Mridul
>>
>> On Wed, Jan 12, 2022 at 4:04 AM Kent Yao  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thomas Graves  于2022年1月12日周三 11:52写道:
>>>
>>>> +1 (binding).
>>>>
>>>> One minor note since I haven't had time to look at the implementation
>>>> details is please make sure resource aware scheduling and the stage
>>>> level scheduling still work or any caveats are documented. Feel free
>>>> to ping me if questions in these areas.
>>>>
>>>> Tom
>>>>
>>>> On Wed, Jan 5, 2022 at 7:07 PM Yikun Jiang  wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I’d like to start a vote for SPIP: "Support Customized Kubernetes
>>>> Schedulers Proposal"
>>>> >
>>>> > The SPIP is to support customized Kubernetes schedulers in Spark on
>>>> Kubernetes.
>>>> >
>>>> > Please also refer to:
>>>> >
>>>> > - Previous discussion in dev mailing list: [DISCUSSION] SPIP: Support
>>>> Volcano/Alternative Schedulers Proposal
>>>> > - Design doc: [SPIP] Spark-36057 Support Customized Kubernetes
>>>> Schedulers Proposal
>>>> > - JIRA: SPARK-36057
>>>> >
>>>> > Please vote on the SPIP:
>>>> >
>>>> > [ ] +1: Accept the proposal as an official SPIP
>>>> > [ ] +0
>>>> > [ ] -1: I don’t think this is a good idea because …
>>>> >
>>>> > Regards,
>>>> > Yikun
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>
> --
> Ryan Blue
> Tabular
>


Re: Tries on migrating Spark Linux arm64 Job from Jenkins to GitHub Actions

2022-01-08 Thread Yikun Jiang
BTW, this is not intended to be in opposition to the Apache Spark Infra 2022
plan that Dongjoon mentioned in "Apache Spark Jenkins Infra 2022"; it is just
to share a possible approach for the Linux arm64 scheduled job.

Also, I think we should reach a final conclusion on the Spark community's
stance on self-hosted actions, for future reference.

Regards,
Yikun

Yikun Jiang  于2022年1月9日周日 11:33写道:

> Hi, all
>
> I tried to verify the possibility of *Linux arm64 scheduled job *using
> self-hosted action, below is some progress and I would like to hear
> suggestion from you in the next step (continue or stop).
>
> Related JIRA: SPARK-35607
> <https://issues.apache.org/jira/browse/SPARK-35607>
>
> *## About self-hosted Github Action:*
> Currently, self-hosted action supported x64(Linux, macOS, Windows),
> ARM64(Linux only), ARM32(Linux only)
> <https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#architectures>
> .
>
> There is guidance on self-hosted runners from Apache Infra
> <https://cwiki.apache.org/confluence/display/INFRA/GitHub+-+self-hosted+runners>.
> The gap to enable self-hosted runner on Apache repo is resource security
> considerations, specifically, it's to prevent the self-hosted runner from
> being accessed by unallow users' PR. As info and suggestion from ASF, the
> apache/airflow team maintained a custom runner
> <https://github.com/ashb/runner/tree/releases/pr-security-options>, and
> it's also used by apache/airflow in their CI. So, we could just use this
> directly.
>
> TLDR, what we needed is setup resource with custom runner, then enable
> these resources in self-hosted action.
>
> *## Test on self-hosted Github Action with custom runner:*
> Here is some tries on my local repo:
> 1. Spark Maven/SBT test:
> PR: https://github.com/apache/spark/pull/35088
> TEST: https://github.com/Yikun/spark/pull/51
> 2. PySpark test:
> PR: https://github.com/apache/spark/pull/35049
> TEST: https://github.com/Yikun/spark/pull/53
> 3. Pull request test on unallow user:
> TEST: https://github.com/Yikun/spark/pull/60
> The self-hosted runner will prevent the PR access the runner due to
> "Running job on worker spark-github-runner-0001 disallowed by security
> policy".
>
> *## Pros of self-hosted github aciton:*
> - Satisfy the simple demands of Linux arm64 sheduled jobs.
> - Reuse the main workflow of github action.
> - All changes are visible on github is easy to review.
> - Easy to migrate when official GA arm64 support ready.
>
> *## What's the next step:*
> * If we can also consider self-hosted action as optional, I will submit a
> JIRA on Apache Infra to request the token to continue, like:
> https://issues.apache.org/jira/browse/INFRA-21305
> * If we certainly think that self-hosted action is not a wise choice, I
> will try to find other way.
>
> There are also some initial discusson, just FYI:
> https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/pull/6
>
> Regards,
> Yikun
>


Tries on migrating Spark Linux arm64 Job from Jenkins to GitHub Actions

2022-01-08 Thread Yikun Jiang
Hi, all

I tried to verify the feasibility of a *Linux arm64 scheduled job* using
self-hosted actions. Below is some progress, and I would like to hear your
suggestions on the next step (continue or stop).

Related JIRA: SPARK-35607


*## About self-hosted GitHub Actions:*
Currently, self-hosted runners support x64 (Linux, macOS, Windows),
ARM64 (Linux only), and ARM32 (Linux only)
<https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#architectures>.

There is guidance on self-hosted runners from Apache Infra
<https://cwiki.apache.org/confluence/display/INFRA/GitHub+-+self-hosted+runners>.
The gap in enabling self-hosted runners on an Apache repo is resource security,
specifically preventing the self-hosted runner from being accessed by PRs from
disallowed users. Per info and suggestions from the ASF, the apache/airflow team
maintains a custom runner
<https://github.com/ashb/runner/tree/releases/pr-security-options>, which is
also used by apache/airflow in their CI, so we could just use it directly.

TL;DR: what we need is to set up machines with the custom runner, and then
enable those resources as self-hosted runners in the workflows.
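
As a very rough sketch of that setup (hypothetical; the exact flags depend on
the custom runner build, and the token/labels here are placeholders):

  # on the arm64 machine: register the (custom) runner against the repo and start it
  ./config.sh --url https://github.com/apache/spark \
      --token <REGISTRATION_TOKEN> \
      --labels self-hosted,linux,arm64
  ./run.sh
  # a scheduled workflow job would then target it via: runs-on: [self-hosted, linux, arm64]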

*## Tests of self-hosted GitHub Actions with the custom runner:*
Here are some trials on my local repo:
1. Spark Maven/SBT test:
PR: https://github.com/apache/spark/pull/35088
TEST: https://github.com/Yikun/spark/pull/51
2. PySpark test:
PR: https://github.com/apache/spark/pull/35049
TEST: https://github.com/Yikun/spark/pull/53
3. Pull request test with a disallowed user:
TEST: https://github.com/Yikun/spark/pull/60
The self-hosted runner prevents the PR from accessing the runner, failing with
"Running job on worker spark-github-runner-0001 disallowed by security
policy".

*## Pros of self-hosted GitHub Actions:*
- Satisfies the simple demands of Linux arm64 scheduled jobs.
- Reuses the main GitHub Actions workflow.
- All changes are visible on GitHub and easy to review.
- Easy to migrate once official GitHub Actions arm64 support is ready.

*## What's the next step:*
* If we agree that self-hosted actions can be considered as an option, I will
submit a JIRA with Apache Infra to request the token and continue, e.g.:
https://issues.apache.org/jira/browse/INFRA-21305
* If we conclude that self-hosted actions are not a wise choice, I will try
to find another way.

There is also some initial discussion, just FYI:
https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/pull/6

Regards,
Yikun


[VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-05 Thread Yikun Jiang
Hi all,

I’d like to start a vote for SPIP: "Support Customized Kubernetes
Schedulers Proposal"

The SPIP is to support customized Kubernetes schedulers in Spark on
Kubernetes.

Please also refer to:

- Previous discussion in dev mailing list: [DISCUSSION] SPIP: Support
Volcano/Alternative Schedulers Proposal

- Design doc: [SPIP] Spark-36057 Support Customized Kubernetes Schedulers
Proposal

- JIRA: SPARK-36057 

Please vote on the SPIP:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Regards,
Yikun


Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2022-01-05 Thread Yikun Jiang
>>>>
>>>>  Events:
>>>>
>>>>   Type Reason Age   From
>>>> Message
>>>>
>>>>    --   
>>>> ---
>>>>
>>>>   Warning  FailedScheduling   17m   default-scheduler   
>>>> 0/3 nodes are available: 3 Insufficient memory.
>>>>
>>>>   Warning  FailedScheduling   17m   default-scheduler   
>>>> 0/3 nodes are available: 3 Insufficient memory.
>>>>
>>>>   Normal   NotTriggerScaleUp  2m28s (x92 over 17m)  cluster-autoscaler  
>>>> pod didn't trigger scale-up:
>>>>
>>>> Obviously this is far from ideal and this model although works is not
>>>> efficient.
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction
>>>>
>>>> of data or any other property which may arise from relying on this
>>>> email's technical content is explicitly disclaimed.
>>>>
>>>> The author will in no case be liable for any monetary damages arising
>>>> from such
>>>>
>>>> loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 5 Jan 2022 at 03:55, William Wang 
>>>> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> Here are parts of performance indications in Volcano.
>>>>> 1. Scheduler throughput: 1.5k pod/s (default scheduler: 100 Pod/s)
>>>>> 2. Spark application performance improved 30%+ with minimal resource
>>>>> reservation feature in case of insufficient resource.(tested with TPC-DS)
>>>>>
>>>>> We are still working on more optimizations. Besides the performance,
>>>>> Volcano is continuously enhanced in below four directions to provide
>>>>> abilities that users care about.
>>>>> - Full lifecycle management for jobs
>>>>> - Scheduling policies for high-performance workloads(fair-share,
>>>>> topology, sla, reservation, preemption, backfill etc)
>>>>> - Support for heterogeneous hardware
>>>>> - Performance optimization for high-performance workloads
>>>>>
>>>>> Thanks
>>>>> LeiBo
>>>>>
>>>>> Mich Talebzadeh  于2022年1月4日周二 18:12写道:
>>>>>
>>>> Interesting,thanks
>>>>>>
>>>>>> Do you have any indication of the ballpark figure (a rough numerical
>>>>>> estimate) of adding Volcano as an alternative scheduler is going to
>>>>>> improve Spark on k8s performance?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction
>>>>>>
>>>>>> of data or any other property which may arise from relying on this
>>>>>> email's technical content is explicitly disclaimed.
>>>>>>
>>>>>> The author will in no case be liable for any monetary damages arising
>>>>>> from such
>>>>>>
>>>>>> loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>&

Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2022-01-04 Thread Yikun Jiang
> Any guidance on how to best contribute?

@Agarwal Thanks for the feedback.

- It would be very good if you could share your ideas and suggestions on
native scheduler support in SPARK-36057
<https://issues.apache.org/jira/browse/SPARK-36057>; they would be considered
as part of this feature or as follow-up improvements.
- You could also feel free to help review the existing PRs.

Anyway, you can regard the scope of this feature as enabling the basic
ability to integrate a customized scheduler and to help with job-level
scheduling at some level. It's just a start; if you have any other concerns,
feel free to leave comments.

Regards,
Yikun


Agarwal, Janak  于2022年1月5日周三 02:05写道:

> Hello Folks, Happy new year to one and all.
>
>
>
> I’m from the EMR on EKS <https://aws.amazon.com/emr/features/eks/> team.
> We help customers to run Spark workloads on Kubernetes.
>
> My team had similar ideas, and we have also sourced requirements from
> customers who use EMR on EKS / Spark on EKS. Would love to participate in
> the design to help solve the problem for the vast majority of Spark on
> Kubernetes users.
>
>
>
> Any guidance on how to best contribute?
>
>
>
> Best,
>
> Janak
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Tuesday, January 4, 2022 2:12 AM
> *To:* Yikun Jiang 
> *Cc:* dev ; Weiwei Yang ; Holden
> Karau ; wang.platf...@gmail.com; Prasad Paravatha <
> prasad.parava...@gmail.com>; John Zhuge ; Chenya Zhang
> ; Chaoran Yu ;
> Wilfred Spiegelenburg ; Klaus Ma <
> klaus1982...@gmail.com>
> *Subject:* RE: [EXTERNAL] [DISCUSSION] SPIP: Support Volcano/Alternative
> Schedulers Proposal
>
>
>
> *CAUTION*: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> Interesting,thanks
>
>
>
> Do you have any indication of the ballpark figure (a rough numerical
> estimate) of adding Volcano as an alternative scheduler is going to
> improve Spark on k8s performance?
>
>
>
> Thanks
>
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Tue, 4 Jan 2022 at 09:43, Yikun Jiang  wrote:
>
> Hi, folks! Wishing you all the best in 2022.
>
>
>
> I'd like to share the current status on "Support Customized K8S Scheduler
> in Spark".
>
>
> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg/edit#heading=h.1quyr1r2kr5n
>
>
>
> Framework/Common support
>
> - Volcano and Yunikorn team join the discussion and complete the initial
> doc on framework/common part.
>
> - SPARK-37145 <https://issues.apache.org/jira/browse/SPARK-37145> (under
> reviewing): We proposed to extend the customized scheduler by just using a
> custom feature step, it will meet the requirement of customized scheduler
> after it gets merged. After this, the user can enable featurestep and
> scheduler like:
>
> spark-submit \
>
> --conf spark.kubernete.scheduler.name volcano \
>
> --conf spark.kubernetes.driver.pod.featureSteps
> org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep
>
> --conf spark.kubernete.job.queue xxx
>
> (such as above, the VolcanoFeatureStep will help to set the the spark
> scheduler queue according user specified conf)
>
> - SPARK-37331 <https://issues.apache.org/jira/browse/SPARK-37331>: Added
> the ability to create kubernetes resources before driver pod creation.
>
> - SPARK-36059 <https://issues.apache.org/jira/browse/SPARK-36059>: Add
> the ability to specify a scheduler in driver/executor
>
> After above all, the framework/common support would be ready for most of
> customized schedulers
>
>
>
> Volcano part:
>
> - SPARK-37258 <https://issues.apache.org/jira/browse/SPARK-37258>:
> Upgrade kubernetes-client to 5.11.1 to add volcano scheduler API support.
>
> - SPARK-36061 <https://issues.apache.org/jira/browse/SPARK-36061>: Add a
> VolcanoFeatureStep to help users to create a PodGroup with user specified
> minimum resources required, there is also a WIP commit to show the
> preview of this
> <https://github.com/Yikun/spark/pull/45/commits/81bf6f98edb5c00ebd0662dc172bc73f980b6a34>
> .
>
>
&g

Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2022-01-04 Thread Yikun Jiang
Hi, folks! Wishing you all the best in 2022.

I'd like to share the current status on "Support Customized K8S Scheduler
in Spark".

https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg/edit#heading=h.1quyr1r2kr5n

Framework/Common support

- The Volcano and Yunikorn teams have joined the discussion and completed the
initial doc on the framework/common part.

- SPARK-37145 <https://issues.apache.org/jira/browse/SPARK-37145> (under
review): We proposed to extend the customized scheduler by just using a
custom feature step; it will meet the requirements of customized schedulers
once it gets merged. After this, the user can enable the feature step and
scheduler like this:

spark-submit \
  --conf spark.kubernetes.scheduler.name=volcano \
  --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep \
  --conf spark.kubernetes.job.queue=xxx

(as above, the VolcanoFeatureStep will help set the Spark scheduler queue
according to the user-specified conf)

- SPARK-37331 <https://issues.apache.org/jira/browse/SPARK-37331>: Added
the ability to create kubernetes resources before driver pod creation.

- SPARK-36059 <https://issues.apache.org/jira/browse/SPARK-36059>: Add the
ability to specify a scheduler in driver/executor

After all of the above, the framework/common support will be ready for most
customized schedulers.

Volcano part:

- SPARK-37258 <https://issues.apache.org/jira/browse/SPARK-37258>: Upgrade
kubernetes-client to 5.11.1 to add volcano scheduler API support.

- SPARK-36061 <https://issues.apache.org/jira/browse/SPARK-36061>: Add a
VolcanoFeatureStep to help users create a PodGroup with the user-specified
minimum resources required; there is also a WIP commit showing a preview
of this
<https://github.com/Yikun/spark/pull/45/commits/81bf6f98edb5c00ebd0662dc172bc73f980b6a34>
.

Yunikorn part:

- @WeiweiYang is completing the doc for the Yunikorn part and implementing
it.

Regards,
Yikun


Weiwei Yang  于2021年12月2日周四 02:00写道:

> Thank you Yikun for the info, and thanks for inviting me to a meeting to
> discuss this.
> I appreciate your effort to put these together, and I agree that the
> purpose is to make Spark easy/flexible enough to support other K8s
> schedulers (not just for Volcano).
> As discussed, could you please help to abstract out the things in common
> and allow Spark to plug different implementations? I'd be happy to work
> with you guys on this issue.
>
>
> On Tue, Nov 30, 2021 at 6:49 PM Yikun Jiang  wrote:
>
>> @Weiwei @Chenya
>>
>> > Thanks for bringing this up. This is quite interesting, we definitely
>> should participate more in the discussions.
>>
>> Thanks for your reply and welcome to join the discussion, I think the
>> input from Yunikorn is very critical.
>>
>> > The main thing here is, the Spark community should make Spark pluggable
>> in order to support other schedulers, not just for Volcano. It looks like
>> this proposal is pushing really hard for adopting PodGroup, which isn't
>> part of K8s yet, that to me is problematic.
>>
>> Definitely yes, we are on the same page.
>>
>> I think we have the same goal: propose a general and reasonable mechanism
>> to make spark on k8s with a custom scheduler more usable.
>>
>> But for the PodGroup, just allow me to do a brief introduction:
>> - The PodGroup definition has been approved by Kubernetes officially in
>> KEP-583. [1]
>> - It can be regarded as a general concept/standard in Kubernetes rather
>> than a specific concept in Volcano, there are also others to implement it,
>> such as [2][3].
>> - Kubernetes recommends using CRD to do more extension to implement what
>> they want. [4]
>> - Volcano as extension provides an interface to maintain the life cycle
>> PodGroup CRD and use volcano-scheduler to complete the scheduling.
>>
>> [1]
>> https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling
>> [2]
>> https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling#podgroup
>> [3] https://github.com/kubernetes-sigs/kube-batch
>> [4]
>> https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
>>
>> Regards,
>> Yikun
>>
>>
>> Weiwei Yang  于2021年12月1日周三 上午5:57写道:
>>
>>> Hi Chenya
>>>
>>> Thanks for bringing this up. This is quite interesting, we definitely
>>> should participate more in the discussions.
>>> The main thing here is, the Spark community should make Spark pluggable
>>> in order to support other schedulers, not just for Volcano. It looks like
>>> this pro

Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2021-12-01 Thread Yikun Jiang
> Thank you Yikun for the info, and thanks for inviting me to a meeting to
discuss this.
> I appreciate your effort to put these together, and I agree that the
purpose is to make Spark easy/flexible enough to support other K8s
schedulers (not just for Volcano).
> As discussed, could you please help to abstract out the things in common
and allow Spark to plug different implementations? I'd be happy to work
with you guys on this issue.

Thanks for the support from the Yunikorn side.

As @weiwei mentioned, yesterday we had an initial meeting which went well,
and we have reached an initial consensus.
We will also abstract out the common part to make clear what is shared, and
provide a way to allow a variety of schedulers to do custom
extensions.

Regards,
Yikun


Weiwei Yang  于2021年12月2日周四 上午2:00写道:

> Thank you Yikun for the info, and thanks for inviting me to a meeting to
> discuss this.
> I appreciate your effort to put these together, and I agree that the
> purpose is to make Spark easy/flexible enough to support other K8s
> schedulers (not just for Volcano).
> As discussed, could you please help to abstract out the things in common
> and allow Spark to plug different implementations? I'd be happy to work
> with you guys on this issue.
>
>
> On Tue, Nov 30, 2021 at 6:49 PM Yikun Jiang  wrote:
>
>> @Weiwei @Chenya
>>
>> > Thanks for bringing this up. This is quite interesting, we definitely
>> should participate more in the discussions.
>>
>> Thanks for your reply and welcome to join the discussion, I think the
>> input from Yunikorn is very critical.
>>
>> > The main thing here is, the Spark community should make Spark pluggable
>> in order to support other schedulers, not just for Volcano. It looks like
>> this proposal is pushing really hard for adopting PodGroup, which isn't
>> part of K8s yet, that to me is problematic.
>>
>> Definitely yes, we are on the same page.
>>
>> I think we have the same goal: propose a general and reasonable mechanism
>> to make spark on k8s with a custom scheduler more usable.
>>
>> But for the PodGroup, just allow me to do a brief introduction:
>> - The PodGroup definition has been approved by Kubernetes officially in
>> KEP-583. [1]
>> - It can be regarded as a general concept/standard in Kubernetes rather
>> than a specific concept in Volcano, there are also others to implement it,
>> such as [2][3].
>> - Kubernetes recommends using CRD to do more extension to implement what
>> they want. [4]
>> - Volcano as extension provides an interface to maintain the life cycle
>> PodGroup CRD and use volcano-scheduler to complete the scheduling.
>>
>> [1]
>> https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling
>> [2]
>> https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling#podgroup
>> [3] https://github.com/kubernetes-sigs/kube-batch
>> [4]
>> https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
>>
>> Regards,
>> Yikun
>>
>>
>> Weiwei Yang  于2021年12月1日周三 上午5:57写道:
>>
>>> Hi Chenya
>>>
>>> Thanks for bringing this up. This is quite interesting, we definitely
>>> should participate more in the discussions.
>>> The main thing here is, the Spark community should make Spark pluggable
>>> in order to support other schedulers, not just for Volcano. It looks like
>>> this proposal is pushing really hard for adopting PodGroup, which isn't
>>> part of K8s yet, that to me is problematic.
>>>
>>> On Tue, Nov 30, 2021 at 9:21 AM Prasad Paravatha <
>>> prasad.parava...@gmail.com> wrote:
>>>
>>>> This is a great feature/idea.
>>>> I'd love to get involved in some form (testing and/or documentation).
>>>> This could be my 1st contribution to Spark!
>>>>
>>>> On Tue, Nov 30, 2021 at 10:46 PM John Zhuge  wrote:
>>>>
>>>>> +1 Kudos to Yikun and the community for starting the discussion!
>>>>>
>>>>> On Tue, Nov 30, 2021 at 8:47 AM Chenya Zhang <
>>>>> chenyazhangche...@gmail.com> wrote:
>>>>>
>>>>>> Thanks folks for bringing up the topic of natively integrating
>>>>>> Volcano and other alternative schedulers into Spark!
>>>>>>
>>>>>> +Weiwei, Wilfred, Chaoran. We would love to contribute to the
>>>>>> discussion as well.
>>>>>>
>>>>>> From our side, we have been using and improving on one alternative
&

Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2021-11-30 Thread Yikun Jiang
@Weiwei @Chenya

> Thanks for bringing this up. This is quite interesting, we definitely
should participate more in the discussions.

Thanks for your reply, and welcome to the discussion. I think the input
from YuniKorn is critical.

> The main thing here is, the Spark community should make Spark pluggable
in order to support other schedulers, not just for Volcano. It looks like
this proposal is pushing really hard for adopting PodGroup, which isn't
part of K8s yet, that to me is problematic.

Definitely yes, we are on the same page.

I think we have the same goal: propose a general and reasonable mechanism
to make Spark on K8s with a custom scheduler more usable.

As for PodGroup, allow me to give a brief introduction:
- The PodGroup definition has been approved by Kubernetes officially in
KEP-583. [1]
- It can be regarded as a general concept/standard in Kubernetes rather
than a Volcano-specific concept; there are also other implementations of it,
such as [2][3].
- Kubernetes recommends using CRDs to extend the API with the resources
users need. [4]
- Volcano, as one such extension, provides an interface to maintain the
lifecycle of the PodGroup CRD and uses volcano-scheduler to complete the
scheduling.

[1]
https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling
[2]
https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling#podgroup
[3] https://github.com/kubernetes-sigs/kube-batch
[4]
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
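
To make the PodGroup idea above more concrete, here is a rough sketch
(illustration only, not code from the SPIP) of creating a Volcano PodGroup
object with the official Kubernetes Python client; the spec values
(minMember, minResources, queue) are made-up examples:

# Illustrative sketch: create a Volcano PodGroup CRD object.
# group/version/plural follow Volcano's published CRD; spec values are examples.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

pod_group = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "PodGroup",
    "metadata": {"name": "spark-pi-pg", "namespace": "default"},
    "spec": {
        "minMember": 3,                                 # gang size, e.g. driver + 2 executors
        "minResources": {"cpu": "3", "memory": "6Gi"},  # nothing starts until this fits
        "queue": "default",
    },
}

api.create_namespaced_custom_object(
    group="scheduling.volcano.sh",
    version="v1beta1",
    namespace="default",
    plural="podgroups",
    body=pod_group,
)

Driver/executor pods that should be gang-scheduled then reference this group
(via the pod's schedulerName and Volcano's group-name annotation), which is
roughly the part the proposed feature step would automate.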

Regards,
Yikun


Weiwei Yang  于2021年12月1日周三 上午5:57写道:

> Hi Chenya
>
> Thanks for bringing this up. This is quite interesting, we definitely
> should participate more in the discussions.
> The main thing here is, the Spark community should make Spark pluggable in
> order to support other schedulers, not just for Volcano. It looks like this
> proposal is pushing really hard for adopting PodGroup, which isn't part of
> K8s yet, that to me is problematic.
>
> On Tue, Nov 30, 2021 at 9:21 AM Prasad Paravatha <
> prasad.parava...@gmail.com> wrote:
>
>> This is a great feature/idea.
>> I'd love to get involved in some form (testing and/or documentation).
>> This could be my 1st contribution to Spark!
>>
>> On Tue, Nov 30, 2021 at 10:46 PM John Zhuge  wrote:
>>
>>> +1 Kudos to Yikun and the community for starting the discussion!
>>>
>>> On Tue, Nov 30, 2021 at 8:47 AM Chenya Zhang <
>>> chenyazhangche...@gmail.com> wrote:
>>>
>>>> Thanks folks for bringing up the topic of natively integrating Volcano
>>>> and other alternative schedulers into Spark!
>>>>
>>>> +Weiwei, Wilfred, Chaoran. We would love to contribute to the
>>>> discussion as well.
>>>>
>>>> From our side, we have been using and improving on one alternative
>>>> resource scheduler, Apache YuniKorn (https://yunikorn.apache.org/),
>>>> for Spark on Kubernetes in production at Apple with solid results in the
>>>> past year. It is capable of supporting Gang scheduling (similar to
>>>> PodGroups), multi-tenant resource queues (similar to YARN), FIFO, and other
>>>> handy features like bin packing to enable efficient autoscaling, etc.
>>>>
>>>> Natively integrating with Spark would provide more flexibility for
>>>> users and reduce the extra cost and potential inconsistency of maintaining
>>>> different layers of resource strategies. One interesting topic we hope to
>>>> discuss more about is dynamic allocation, which would benefit from native
>>>> coordination between Spark and resource schedulers in K8s &
>>>> cloud environment for an optimal resource efficiency.
>>>>
>>>>
>>>> On Tue, Nov 30, 2021 at 8:10 AM Holden Karau 
>>>> wrote:
>>>>
>>>>> Thanks for putting this together, I’m really excited for us to add
>>>>> better batch scheduling integrations.
>>>>>
>>>>> On Tue, Nov 30, 2021 at 12:46 AM Yikun Jiang 
>>>>> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I'd like to start a discussion on "Support Volcano/Alternative
>>>>>> Schedulers Proposal".
>>>>>>
>>>>>> This SPIP is proposed to make spark k8s schedulers provide more YARN
>>>>>> like features (such as queues and minimum resources before scheduling 
>>>>>> jobs)
>>>>>> that many folks want on Kubernetes.
>>>>>>
>>>>>> The goal of this SPIP is to improve current spark k8s scheduler
>>>>>> implementations, add the ability of batch scheduling and support volcano as
>>>>>> one of implementations.

[DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2021-11-30 Thread Yikun Jiang
Hey everyone,

I'd like to start a discussion on "Support Volcano/Alternative Schedulers
Proposal".

This SPIP proposes to make Spark's K8s schedulers provide more YARN-like
features (such as queues and minimum resources before scheduling jobs) that
many folks want on Kubernetes.

The goal of this SPIP is to improve the current Spark K8s scheduler
implementation, add the ability to do batch scheduling, and support Volcano
as one of the implementations.

Design doc:
https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
JIRA: https://issues.apache.org/jira/browse/SPARK-36057
Part of PRs:
Ability to create resources https://github.com/apache/spark/pull/34599
Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456
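
For illustration, this is roughly how a PySpark job might opt in on the user
side once the above lands. The configuration keys and the feature step class
name below are assumptions sketched for discussion, not the final API from
the design doc:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc")
    .appName("gang-scheduled-pi")
    # assumed knob: which Kubernetes scheduler places the driver/executor pods
    .config("spark.kubernetes.scheduler.name", "volcano")
    # assumed knob: an extra driver feature step that creates/references the PodGroup
    .config("spark.kubernetes.driver.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .getOrCreate()
)

The idea is that queue and minimum-resource settings live in scheduler-side
objects such as the PodGroup, so existing applications should not need code
changes.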

Regards,
Yikun


Re: Spark on Kubernetes scheduler variety

2021-06-29 Thread Yikun Jiang
> Is this the correct link for integrating Volcano with Spark?

Yes, that is the Kubernetes operator style of integrating Volcano. If you
want to use the spark-submit style to submit a job with native support
instead, you can see [1] as a reference.

[1]
https://github.com/huawei-cloudnative/spark/commit/6c1f37525f026353eaead34216d47dad653f13a4

Regards,
Yikun


Mich Talebzadeh  于2021年6月28日周一 下午6:03写道:

> Hi Yikun,
>
> Is this the correct link for integrating Volcano with Spark?
>
> spark-on-k8s-operator/volcano-integration.md at master ·
> GoogleCloudPlatform/spark-on-k8s-operator · GitHub
> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>
>
> Thanks
>
>
> Mich
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 25 Jun 2021 at 09:45, Yikun Jiang  wrote:
>
>> Oops, sorry for the error link, it should be:
>>
>> We will also prepare to propose an initial design and POC[3] on a shared
>> branch (based on spark master branch) where we can collaborate on it, so I
>> created the spark-volcano[1] org in github to make it happen.
>>
>> [3]
>> https://github.com/huawei-cloudnative/spark/commit/6c1f37525f026353eaead34216d47dad653f13a4
>>
>>
>> And
>> Regards,
>> Yikun
>>
>>
>> Yikun Jiang  于2021年6月25日周五 上午11:53写道:
>>
>>> Hi, folks.
>>>
>>> As @Klaus mentioned, We have some work on Spark on k8s with volcano
>>> native support. Also, there were also some production deployment validation
>>> from our partners in China, like JingDong, XiaoHongShu, VIPshop.
>>>
>>> We will also prepare to propose an initial design and POC[3] on a shared
>>> branch (based on spark master branch) where we can collaborate on it, so I
>>> created the spark-volcano[1] org in github to make it happen.
>>>
>>> Pls feel free to comment on it [2] if you guys have any questions or
>>> concerns.
>>>
>>> [1] https://github.com/spark-volcano
>>> [2] https://github.com/spark-volcano/spark/issues/1
>>> [3]
>>> https://github.com/huawei-cloudnative/spark/commit/6c1f37525f026353eaead34216d47dad653f13a4
>>>
>>>
>>
>>
>>> Regards,
>>> Yikun
>>>
>>> Holden Karau  于2021年6月25日周五 上午12:00写道:
>>>
>>>> Hi Mich,
>>>>
>>>> I certainly think making Spark on Kubernetes run well is going to be a
>>>> challenge. However I think, and I could be wrong about this as well, that
>>>> in terms of cluster managers Kubernetes is likely to be our future. Talking
>>>> with people I don't hear about new standalone, YARN or mesos deployments of
>>>> Spark, but I do hear about people trying to migrate to Kubernetes.
>>>>
>>>> To be clear I certainly agree that we need more work on structured
>>>> streaming, but its important to remember that the Spark developers are not
>>>> all fully interchangeable, we work on the things that we're interested in
>>>> pursuing so even if structured streaming needs more love if I'm not super
>>>> interested in structured streaming I'm less likely to work on it. That
>>>> being said I am certainly spinning up a bit more in the Spark SQL area
>>>> especially around our data source/connectors because I can see the need
>>>> there too.
>>>>
>>>> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> Please allow me to be diverse and express a different point of view on
>>>>> this roadmap.
>>>>>
>>>>>
>>>>> I believe from a technical point of view spending time and effort plus
>>>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>>>> may say I doubt whether such an approach and the so-called democratization
>>>>> of Spark on whatever platform is really should be of great focus.
>>>>>
>>>>> Having worked on Google Dataproc <https://cloud.google.com/dataproc>
>>>>> 

Re: Spark on Kubernetes scheduler variety

2021-06-25 Thread Yikun Jiang
Oops, sorry for the wrong link; it should be:

We will also prepare to propose an initial design and POC[3] on a shared
branch (based on spark master branch) where we can collaborate on it, so I
created the spark-volcano[1] org in github to make it happen.

[3]
https://github.com/huawei-cloudnative/spark/commit/6c1f37525f026353eaead34216d47dad653f13a4


Regards,
Yikun


Yikun Jiang  于2021年6月25日周五 上午11:53写道:

> Hi, folks.
>
> As @Klaus mentioned, We have some work on Spark on k8s with volcano native
> support. Also, there were also some production deployment validation from
> our partners in China, like JingDong, XiaoHongShu, VIPshop.
>
> We will also prepare to propose an initial design and POC[3] on a shared
> branch (based on spark master branch) where we can collaborate on it, so I
> created the spark-volcano[1] org in github to make it happen.
>
> Pls feel free to comment on it [2] if you guys have any questions or
> concerns.
>
> [1] https://github.com/spark-volcano
> [2] https://github.com/spark-volcano/spark/issues/1
> [3]
> https://github.com/huawei-cloudnative/spark/commit/6c1f37525f026353eaead34216d47dad653f13a4
>
>


> Regards,
> Yikun
>
> Holden Karau  于2021年6月25日周五 上午12:00写道:
>
>> Hi Mich,
>>
>> I certainly think making Spark on Kubernetes run well is going to be a
>> challenge. However I think, and I could be wrong about this as well, that
>> in terms of cluster managers Kubernetes is likely to be our future. Talking
>> with people I don't hear about new standalone, YARN or mesos deployments of
>> Spark, but I do hear about people trying to migrate to Kubernetes.
>>
>> To be clear I certainly agree that we need more work on structured
>> streaming, but its important to remember that the Spark developers are not
>> all fully interchangeable, we work on the things that we're interested in
>> pursuing so even if structured streaming needs more love if I'm not super
>> interested in structured streaming I'm less likely to work on it. That
>> being said I am certainly spinning up a bit more in the Spark SQL area
>> especially around our data source/connectors because I can see the need
>> there too.
>>
>> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>>
>>> Please allow me to be diverse and express a different point of view on
>>> this roadmap.
>>>
>>>
>>> I believe from a technical point of view spending time and effort plus
>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>> may say I doubt whether such an approach and the so-called democratization
>>> of Spark on whatever platform is really should be of great focus.
>>>
>>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A 
>>> fully
>>> managed and highly scalable service for running Apache Spark, Hadoop and
>>> more recently other artefacts) for that past two years, and Spark on
>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>> beast that that one can fully commoditize it much like one can do with
>>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>>> effortlessly on these commercial platforms with whatever as a Service.
>>>
>>>
>>> Moreover, Spark (and I stand corrected) from the ground up has already a
>>> lot of resiliency and redundancy built in. It is truly an enterprise class
>>> product (requires enterprise class support) that will be difficult to
>>> commoditize with Kubernetes and expect the same performance. After all,
>>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>>> for the mass market. In short I can see commercial enterprises will work on
>>> these platforms ,but may be the great talents on dev team should focus on
>>> stuff like the perceived limitation of SSS in dealing with chain of
>>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>>
>>>
>>> These are my opinions and they are not facts, just opinions so to speak
>>> :)
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technic

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Yikun Jiang
Hi, folks.

As @Klaus mentioned, we have some work on Spark on K8s with Volcano native
support. There has also been some production deployment validation from
our partners in China, like JingDong, XiaoHongShu, and VIPshop.

We will also prepare to propose an initial design and POC[3] on a shared
branch (based on spark master branch) where we can collaborate on it, so I
created the spark-volcano[1] org in github to make it happen.

Please feel free to comment on it [2] if you have any questions or
concerns.

[1] https://github.com/spark-volcano
[2] https://github.com/spark-volcano/spark/issues/1
[3] https://github.com/spark-volcano-wip/spark-3-volcano

Regards,
Yikun

Holden Karau  于2021年6月25日周五 上午12:00写道:

> Hi Mich,
>
> I certainly think making Spark on Kubernetes run well is going to be a
> challenge. However I think, and I could be wrong about this as well, that
> in terms of cluster managers Kubernetes is likely to be our future. Talking
> with people I don't hear about new standalone, YARN or mesos deployments of
> Spark, but I do hear about people trying to migrate to Kubernetes.
>
> To be clear I certainly agree that we need more work on structured
> streaming, but its important to remember that the Spark developers are not
> all fully interchangeable, we work on the things that we're interested in
> pursuing so even if structured streaming needs more love if I'm not super
> interested in structured streaming I'm less likely to work on it. That
> being said I am certainly spinning up a bit more in the Spark SQL area
> especially around our data source/connectors because I can see the need
> there too.
>
> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc  (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on say Volcano as part of  Cloud Native
 Computing Foundation 

Re: UPDATE: Apache Spark 3.2 Release

2021-06-17 Thread Yikun Jiang
- Apache Hadoop 3.3.2 becomes the default Hadoop profile for Apache Spark
3.2 via SPARK-29250 today. We are observing big improvements in S3 use
cases. Please try it and share your experience.

It should be Apache Hadoop 3.3.1 [1]. : )

Note that Apache Hadoop 3.3.0 was the first Hadoop release to support both
x86 and aarch64, and 3.3.1 does as well. Very happy to see 3.3.1 become the
default dependency of Spark 3.2.0.

[1] https://hadoop.apache.org/release/3.3.1.html
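
If you want to check which Hadoop version your Spark build actually bundles
before trying it, here is a small sketch; it goes through the driver-side
JVM gateway, which is a private API, so treat it as a quick check only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# VersionInfo is Hadoop's own version holder; this only works on the driver.
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print("Bundled Hadoop version:", hadoop_version)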

Regards,
Yikun


Dongjoon Hyun  于2021年6月17日周四 上午5:58写道:

> This is a continuation of the previous thread, `Apache Spark 3.2
> Expectation`, in order to give you updates.
>
> -
> https://lists.apache.org/thread.html/r61897da071729913bf586ddd769311ce8b5b068e7156c352b51f7a33%40%3Cdev.spark.apache.org%3E
>
> First of all, the AS-IS schedule is here
>
> - https://spark.apache.org/versioning-policy.html
>
>   July 1st Code freeze. Release branch cut.
>   Mid July QA period. Focus on bug fixes, tests, stability and docs.
> Generally, no new features merged.
>   August   Release candidates (RC), voting, etc. until final release passes
>
> Second, Gengliang Wang volunteered as a release manager and started to
> work as a release manager. Thank you! He shared the on-going issues and I
> want to piggy-back the followings to his list.
>
>
> # Languages
>
> - Scala 2.13 Support: Although SPARK-25075 is almost done and we have
> Scala 2.13 Jenkins job on master branch, we do not support Scala 2.13.6. We
> should document it if Scala 2.13.7 is not arrived on time.
>   Please see https://github.com/scala/scala/pull/9641 (Milestone Scala
> 2.13.7).
>
> - SparkR CRAN publishing: Apache SparkR 3.1.2 is in CRAN as of today, but
> we get policy violation warnings for cache directory. The fix deadline is
> 2021-06-28. If that's going to be removed again, we need to retry via
> Apache Spark 3.2.0 after making some fix.
>   https://cran.r-project.org/web/packages/SparkR/index.html
>
>
> # Dependencies
>
> - Apache Hadoop 3.3.2 becomes the default Hadoop profile for Apache Spark
> 3.2 via SPARK-29250 today. We are observing big improvements in S3 use
> cases. Please try it and share your experience.
>
> - Apache Hive 2.3.9 becomes the built-in Hive library with more HMS
> compatibility fixes recently. We need re-evaluate the previous HMS
> incompatibility reports.
>
> - K8s 1.21 is released May 12th. K8s Client 5.4.1 supports it in Apache
> Spark 3.2. In addition, public cloud vendors start to support K8s 1.20.
> Please note that this is a breaking K8s API change from K8s Client 4.x to
> 5.x.
>
> - SPARK-33913 upgraded Apache Kafka Client dependency to 2.8.0 and Kafka
> community is considering the deprecation of Scala 2.12 support at Apache
> Kafka 3.0.
>
> - SPARK-34542 upgraded Apache Parquet dependency to 1.12.0. However, we
> need SPARK-34859 to fix column index issue before release. In addition,
> Apache Parquet encryption is added as a developer API. Custom KMS client
> should be implemented.
>
> - SPARK-35489 upgraded Apache ORC dependency to 1.6.8. We still need
> ORC-804 for better masking feature additionally.
>
> - SPARK-34651 improved ZStandard support with ZStandard 1.4.9 and we are
> currently evaluating newly arrived ZStandard 1.5.0 additionally. Currently,
> JDK11 performance is under investigation. In addition, SPARK-35181 (Use
> zstd for spark.io.compression.codec by default) is still on the way
> seperately.
>
>
> # Newly arrived items
>
> - SPARK-35779 Dynamic filtering for Data Source V2
>
> - SPARK-35781 Support Spark on Apple Silicon on macOS natively
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

2021-05-04 Thread Yikun Jiang
@Saurabh @Mr.Powers Thanks for the input.

I personally prefer to introduce `withColumns` because it brings a more
friendly development experience than the select("*") approach.

This is the PR to add `withColumns`:
https://github.com/apache/spark/pull/32431
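
For comparison, here is a small runnable sketch of the select("*") style
that works today versus the withColumns-style call this PR aims to expose;
the withColumns signature in the comment is only a sketch, not necessarily
the final API:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["key1", "key2"])

# Today: a single projection adds both columns at once, but is less discoverable.
out = df.select("*",
                (F.col("key1") + 1).alias("key3"),
                (F.col("key2") * 2).alias("key4"))

# Proposed (sketch only): one call, one py4j round trip, one extra projection.
# out = df.withColumns({"key3": F.col("key1") + 1, "key4": F.col("key2") * 2})
out.show()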

Regards,
Yikun


Saurabh Chawla  于2021年4月30日周五 下午1:13写道:

> Hi All,
>
> I also had a scenario where at runtime, I needed to loop through a
> dataframe to use withColumn many times.
>
>  For the safer side I used the reflection to access the withColumns to
> prevent any java.lang.StackOverflowError.
>
> val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
> val newConfigurationMethod =
>   dataSetClass.getMethod("withColumns", classOf[Seq[String]], 
> classOf[Seq[Column]])
> newConfigurationMethod.invoke(
>   baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]
>
> It would be great if we use the "withColumns" rather than using the
> reflection code like this.
> or
> make changes in the code to merge the project with existing project in the
> plan, instead of adding the new project every time we call the "
> withColumn".
>
> +1 for exposing the *withColumns*
>
> Regards
> Saurabh Chawla
>
> On Thu, Apr 22, 2021 at 1:03 PM Yikun Jiang  wrote:
>
>> Hi, all
>>
>> *Background:*
>>
>> Currently, there is a withColumns
>> <https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402>[1]
>> method to help users/devs add/replace multiple columns at once.
>> But this method is private and isn't exposed as a public API interface,
>> that means it cannot be used by the user directly, and also it is not
>> supported in PySpark API.
>>
>> As the dataframe user, I can only call withColumn() multiple times:
>>
>> df.withColumn("key1", col("key1")).withColumn("key2", 
>> col("key2")).withColumn("key3", col("key3"))
>>
>> rather than:
>>
>> df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
>> col("key3")])
>>
>> Multiple calls bring some higher cost on developer experience and
>> performance. Especially in a PySpark related scenario, multiple calls mean
>> multiple py4j calls.
>>
>> As mentioned
>> <https://github.com/apache/spark/pull/32276#issuecomment-824461143> from
>> @Hyukjin, there were some previous discussions on  SPARK-12225
>> <https://issues.apache.org/jira/browse/SPARK-12225> [2] .
>>
>> [1]
>> https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
>> [2] https://issues.apache.org/jira/browse/SPARK-12225
>>
>> *Potential solution:*
>> Looks like there are 2 potential solutions if we want to support it:
>>
>> 1. Introduce a *withColumns *api for Scala/Python.
>> A separate public withColumns API will be added in scala/python api.
>>
>> 2. Make withColumn can receive *single col *and also the* list of cols*.
>> I did some experimental try on PySpark on
>> https://github.com/apache/spark/pull/32276
>> Just like Maciej said
>> <https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>
>> it will bring some confusion with naming.
>>
>>
>> Thanks for your reading, feel free to reply if you have any other
>> concerns or suggestions!
>>
>>
>> Regards,
>> Yikun
>>
>


[DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

2021-04-22 Thread Yikun Jiang
Hi, all

*Background:*

Currently, there is a withColumns [1] method to help users/devs add/replace
multiple columns at once.
But this method is private and isn't exposed as a public API, which means it
cannot be used directly by users, and it is also not supported in the
PySpark API.

As a DataFrame user, I can only call withColumn() multiple times:

df.withColumn("key1", col("key1")).withColumn("key2",
col("key2")).withColumn("key3", col("key3"))

rather than:

df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), col("key3")])

Multiple calls bring a higher cost in developer experience and performance.
Especially in a PySpark scenario, multiple calls mean multiple py4j calls.

As mentioned by @Hyukjin
(https://github.com/apache/spark/pull/32276#issuecomment-824461143), there
were some previous discussions on SPARK-12225 [2].

[1]
https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
[2] https://issues.apache.org/jira/browse/SPARK-12225

*Potential solution:*
Looks like there are 2 potential solutions if we want to support it:

1. Introduce a *withColumns* API for Scala/Python.
A separate public withColumns API would be added to the Scala/Python APIs.

2. Make withColumn accept either a *single col* or a *list of cols*.
I did an experimental try in PySpark:
https://github.com/apache/spark/pull/32276
Just as Maciej said
(https://github.com/apache/spark/pull/32276#pullrequestreview-641280217), it
will bring some confusion with naming.
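
To make the cost argument above concrete, here is a minimal sketch (assuming
a tiny DataFrame) that compares the plans: each withColumn() call adds
another Project node to the analyzed plan and one py4j round trip from
Python, while a single projection keeps the plan flat. The optimizer does
collapse adjacent projections later, so the overhead is mostly analysis
time, plan depth, and the per-call round trips:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["key1", "key2", "key3"])

chained = df
for c in ["key1", "key2", "key3"]:
    # one analyzed Project node and one py4j call per column
    chained = chained.withColumn(c, F.col(c) + 1)

# one Project node total, regardless of how many columns are replaced
single = df.select(*[(F.col(c) + 1).alias(c) for c in df.columns])

chained.explain(True)  # analyzed plan: three stacked Projects
single.explain(True)   # analyzed plan: a single Project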


Thanks for reading; feel free to reply if you have any other concerns
or suggestions!


Regards,
Yikun


Re: please read: current state and the future of the apache spark build system

2021-04-15 Thread Yikun Jiang
Many thanks for your work on the infra, @Shane. In particular, we
(@huangtianhua and I) got a lot of help from you when making the Arm CI
work. [1]

> prepare jenkins worker ansible configs and stick in the spark repo

I took a quick glance at https://github.com/apache/spark/pull/32178; it
seems it doesn't contain any Arm node setup or config related code.

*Do you have any plan to update the existing code to cover the Arm node
setup and configuration?* Even just pointing us to some existing scripts
would also be okay.

*Do you have any special plan for the Arm node migration?* If needed, I will
help with the Arm-related node setup and config in the new infra to make
sure the Spark Arm CI keeps working.

BTW, we are also considering moving the Arm build from Jenkins to GitHub
Actions (using self-hosted runners or a cloud deployment, see
https://github.com/actions/starter-workflows/tree/main/ci); some pre-work is
being done by our team, see the PoC in [2] (cc @mgrigorov). Maybe it could
bring some ideas for the future infrastructure.

[1] https://amplab.cs.berkeley.edu/jenkins/label/spark-arm/
[2]
https://martin-grigorov.medium.com/githubactions-build-and-test-on-huaweicloud-arm64-af9d5c97b766

Regards,
Yikun


Holden Karau  于2021年4月15日周四 上午8:29写道:

> Thanks Shane for keeping the build infra structure running for all of
> these years :)
>
> I've got some Kubernetes infra on AS399306 down in HE in Fremont but
> it's also perhaps not of the newest variety, but so far no disk
> failures or anything like that (knock on wood of course). The catch is
> it's on a 15 amp circuit and frankly I'm still learning how BGP works.
>
> Maybe we could expirement with
> https://github.com/lazybit-ch/actions-runner/tree/master/actions-runner
> and try nested MiniKube (which I know is... not great but might make
> things more portable)?
>
> Would the community (and or some of our corporate contributors) be
> open to contributing some hardware + power money or cloud credits?
>
> On Wed, Apr 14, 2021 at 5:13 PM Hyukjin Kwon  wrote:
> >
> > Thanks Shane!!
> >
> > On Thu, 15 Apr 2021, 09:03 shane knapp ☠,  wrote:
> >>>
> >>> medium term (in 6 months):
> >>> * prepare jenkins worker ansible configs and stick in the spark repo
> >>>   - nothing fancy, but enough to config ubuntu workers
> >>>   - could be used to create docker containers for testing in
> THE CLOUD
> >>>
> >> fwiw, i just decided to bang this out today:
> >> https://github.com/apache/spark/pull/32178
> >>
> >> shane
> >> --
> >> Shane Knapp
> >> Computer Guy / Voice of Reason
> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
> >> https://rise.cs.berkeley.edu
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: K8s Integration test is unable to run because of the unavailable libs

2021-03-22 Thread Yikun Jiang
hey, Yi Wu

Looks like it's just an apt installation problem; we should run apt-get
update to refresh the local package cache before we install "gnupg".

I opened an issue on JIRA [1] and am trying to fix it in [2]; hope this helps.

[1] https://issues.apache.org/jira/browse/SPARK-34820
[2] https://github.com/apache/spark/pull/31923

Regards,
Yikun


Yi Wu  于2021年3月22日周一 下午2:15写道:

> Hi devs,
>
> It seems like the K8s Integration test is unable to run recently because
> of the unavailable libs:
>
> Err:20 http://security.debian.org/debian-security buster/updates/main amd64 
> libldap-common all 2.4.47+dfsg-3+deb10u4
>   404  Not Found [IP: 151.101.194.132 80]
> Err:21 http://security.debian.org/debian-security buster/updates/main amd64 
> libldap-2.4-2 amd64 2.4.47+dfsg-3+deb10u4
>   404  Not Found [IP: 151.101.194.132 80]
> E: Failed to fetch 
> http://security.debian.org/debian-security/pool/updates/main/o/openldap/libldap-common_2.4.47+dfsg-3+deb10u4_all.deb
>   404  Not Found [IP: 151.101.194.132 80]
> E: Failed to fetch 
> http://security.debian.org/debian-security/pool/updates/main/o/openldap/libldap-2.4-2_2.4.47+dfsg-3+deb10u4_amd64.deb
>   404  Not Found [IP: 151.101.194.132 80]
>
>
> I alreay saw the error is many places, e.g.,
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40840/console
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40837/console
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40715/console
>
>
> Could someone familiar with K8s please take a look?
>
>
> Thanks,
>
> Yi
>
>
>


Re: [VOTE] Release Spark 3.1.1 (RC2)

2021-02-09 Thread Yikun Jiang
+1. Tested the build and basic features in an aarch64 (ARM64) environment.

Regards,
Yikun


Yuming Wang  于2021年2月9日周二 下午8:24写道:

> +1. Tested a batch of queries with YARN client mode.
>
> On Tue, Feb 9, 2021 at 2:57 PM 郑瑞峰  wrote:
>
>> +1 (non-binding)
>>
>> Thank you, Hyukjin
>>
>>
>> -- Original message --
>> *From:* "Gengliang Wang" ;
>> *Sent:* Tuesday, February 9, 2021, 1:50 PM
>> *To:* "Sean Owen";
>> *Cc:* "Hyukjin Kwon";"Yuming Wang"> >;"dev";
>> *Subject:* Re: [VOTE] Release Spark 3.1.1 (RC2)
>>
>> +1
>>
>> On Tue, Feb 9, 2021 at 1:39 PM Sean Owen  wrote:
>>
>>> Same result as last time for me, +1. Tested with Java 11.
>>> I fixed the two issues without assignee; one was WontFix though.
>>>
>>> On Mon, Feb 8, 2021 at 7:43 PM Hyukjin Kwon  wrote:
>>>
 Let's set the assignees properly then. Shouldn't be a problem for the
 release.

 On Tue, 9 Feb 2021, 10:40 Yuming Wang,  wrote:

>
> Many tickets do not have correct assignee:
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20in%20(3.1.0%2C%203.1.1)%20AND%20(assignee%20is%20EMPTY%20or%20assignee%20%3D%20apachespark)
>
>
> On Tue, Feb 9, 2021 at 9:05 AM Hyukjin Kwon 
> wrote:
>
>> +1 (binding) from myself too.
>>
>> 2021년 2월 9일 (화) 오전 9:28, Kent Yao 님이 작성:
>>
>>>
>>> +1
>>>
>>> *Kent Yao *
>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>>> *a spark enthusiast*
>>> *kyuubi is a unified
>>> multi-tenant JDBC interface for large-scale data processing and 
>>> analytics,
>>> built on top of Apache Spark .*
>>> *spark-authorizer A
>>> Spark SQL extension which provides SQL Standard Authorization for 
>>> **Apache
>>> Spark .*
>>> *spark-postgres  A
>>> library for reading data from and transferring data to Postgres / 
>>> Greenplum
>>> with Spark SQL and DataFrames, 10~100x faster.*
>>> *spark-func-extras A
>>> library that brings excellent and useful functions from various modern
>>> database management systems to Apache Spark .*
>>>
>>>
>>>
>>> On 02/9/2021 08:24,Hyukjin Kwon
>>>  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 3.1.1.
>>>
>>> The vote is open until February 15th 5PM PST and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> Note that it is 7 days this time because it is a holiday season in
>>> several countries including South Korea (where I live), China etc., and 
>>> I
>>> would like to make sure people do not miss it because it is a holiday
>>> season.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.1.1-rc2 (commit
>>> cf0115ac2d60070399af481b14566f33d22ec45e):
>>> https://github.com/apache/spark/tree/v3.1.1-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> 
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1365
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-docs/
>>>
>>> The list of bug fixes going into 3.1.1 can be found at the following
>>> URL:
>>> https://s.apache.org/41kf2
>>>
>>> This release is using the release script of the tag v3.1.1-rc2.
>>>
>>> FAQ
>>>
>>> ===
>>> What happened to 3.1.0?
>>> ===
>>>
>>> There was a technical issue during Apache Spark 3.1.0 preparation,
>>> and it was discussed and decided to skip 3.1.0.
>>> Please see
>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>>> more details.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate,
>>> then
>>> reporting any regressions.

Re: [DISCUSS] Add RocksDB StateStore

2021-02-07 Thread Yikun Jiang
I have done some work on RocksDB multi-arch support and version upgrades in
Kafka/Storm/Flink [1][2][3]. To avoid these issues happening in Spark again,
I want to give some input here on RocksDB version selection from a
multi-arch support point of view. Hope it helps.

RocksDB added Arm64 support [4] in version 6.4.6, and also backported all
the Arm64-related commits to 5.18.4 and released it as a version that
supports all platforms.

So, from a multi-arch support point of view, the better RocksDB version is
v6.4.6 or later; on the 5.x line, it is v5.18.4.

[1] https://issues.apache.org/jira/browse/STORM-3599
[2] https://github.com/apache/kafka/pull/8284
[3] https://issues.apache.org/jira/browse/FLINK-13598
[4] https://github.com/facebook/rocksdb/pull/6250
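
As a practical check (stdlib-only sketch; the jar path is just an example),
you can verify whether a given rocksdbjni jar actually bundles the aarch64
native library before picking a version:

import zipfile

jar = "rocksdbjni-6.4.6.jar"  # example path to the jar you plan to depend on
with zipfile.ZipFile(jar) as zf:
    # the JNI jar bundles one native library per supported platform
    natives = [n for n in zf.namelist()
               if n.endswith((".so", ".jnilib", ".dll"))]
    print("\n".join(natives))
    # an entry like librocksdbjni-linux-aarch64.so indicates Arm64 support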

Regards,
Yikun

Liang-Chi Hsieh  于2021年2月2日周二 下午4:32写道:

> Hi devs,
>
> In Spark structured streaming, we need state store for state management for
> stateful operators such streaming aggregates, joins, etc. We have one and
> only one state store implementation now. It is in-memory hashmap which was
> backed up in HDFS complaint file system at the end of every micro-batch.
>
> As it basically uses in-memory map to store states, memory consumption is a
> serious issue and state store size is limited by the size of the executor
> memory. Moreover, state store using more memory means it may impact the
> performance of task execution that requires memory too.
>
> Internally we see more streaming applications that requires large state in
> stateful operations. For such requirements, we need a StateStore not rely
> on
> memory to store states.
>
> This seems to be also true externally as several other major streaming
> frameworks already use RocksDB for state management. RocksDB is an embedded
> DB and streaming engines can use it to store state instead of memory
> storage.
>
> So seems to me, it is proven to be good choice for large state usage. But
> Spark SS still lacks of a built-in state store for the requirement.
>
> Previously there was one attempt SPARK-28120 to add RocksDB StateStore into
> Spark SS. IIUC, it was pushed back due to two concerns: extra code
> maintenance cost and it introduces RocksDB dependency.
>
> For the first concern, as more users require to use the feature, it should
> be highly used code in SS and more developers will look at it. For second
> one, we propose (SPARK-34198) to add it as an external module to relieve
> the
> dependency concern.
>
> Because it was pushed back previously, I'm going to raise this discussion
> to
> know what people think about it now, in advance of submitting any code.
>
> I think there might be some possible opinions:
>
> 1. okay to add RocksDB StateStore into sql core module
> 2. not okay for 1, but okay to add RocksDB StateStore as external module
> 3. either 1 or 2 is okay
> 4. not okay to add RocksDB StateStore, no matter into sql core or as
> external module
>
> Please let us know if you have some thoughts.
>
> Thank you.
>
> Liang-Chi Hsieh
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>