[ANNOUNCE] Apache Spark 3.0.3 released

2021-06-24 Thread Yi Wu
We are happy to announce the availability of Spark 3.0.3!

Spark 3.0.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.0 maintenance branch of Spark. We strongly
recommend that all 3.0 users upgrade to this stable release.

To download Spark 3.0.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-0-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Yi


Re: [DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-24 Thread Gengliang Wang
+1 for targeting the renaming for Apache Spark 3.3 at the current phase.

On Fri, Jun 25, 2021 at 6:55 AM DB Tsai  wrote:

> +1 on renaming.
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Jun 24, 2021, at 11:41 AM, Chao Sun  wrote:
>
> Hi,
>
> As Spark master has upgraded to Hadoop-3.3.1, the current Maven profile
> name hadoop-3.2 is no longer accurate, and it may confuse Spark users when
> they realize the actual version is not Hadoop 3.2.x. Therefore, I created
> https://issues.apache.org/jira/browse/SPARK-33880 to change the profile
> name to hadoop-3 and hadoop-2 respectively. What do you think? Is this
> something worth doing as part of Spark 3.2.0 release?
>
> Best,
> Chao
>
>
>


Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
Thanks Yikun!

On Thu, Jun 24, 2021 at 8:54 PM Yikun Jiang  wrote:

> Hi, folks.
>
> As @Klaus mentioned, we have some work on Spark on k8s with Volcano native
> support. There has also been some production deployment validation from
> our partners in China, like JingDong, XiaoHongShu, and VIPshop.
>
> We are also preparing to propose an initial design and POC[3] on a shared
> branch (based on the Spark master branch) where we can collaborate, so I
> created the spark-volcano[1] org on GitHub to make this happen.
>
> Please feel free to comment on it [2] if you have any questions or
> concerns.
>
> [1] https://github.com/spark-volcano
> [2] https://github.com/spark-volcano/spark/issues/1
> [3] https://github.com/spark-volcano-wip/spark-3-volcano
>
> Regards,
> Yikun
>
> Holden Karau  于2021年6月25日周五 上午12:00写道:
>
>> Hi Mich,
>>
>> I certainly think making Spark on Kubernetes run well is going to be a
>> challenge. However I think, and I could be wrong about this as well, that
>> in terms of cluster managers Kubernetes is likely to be our future. Talking
>> with people I don't hear about new standalone, YARN or mesos deployments of
>> Spark, but I do hear about people trying to migrate to Kubernetes.
>>
>> To be clear I certainly agree that we need more work on structured
>> streaming, but it's important to remember that the Spark developers are not
>> all fully interchangeable; we work on the things that we're interested in
>> pursuing, so even if structured streaming needs more love, if I'm not
>> personally interested in it I'm less likely to work on it. That
>> being said I am certainly spinning up a bit more in the Spark SQL area
>> especially around our data source/connectors because I can see the need
>> there too.
>>
>> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>>
>>> Please allow me to be diverse and express a different point of view on
>>> this roadmap.
>>>
>>>
>>> I believe from a technical point of view spending time and effort plus
>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>> may say so, I doubt whether such an approach and the so-called
>>> democratization of Spark on whatever platform really should be the main
>>> focus.
>>>
>>> Having worked on Google Dataproc (a fully
>>> managed and highly scalable service for running Apache Spark, Hadoop and,
>>> more recently, other artefacts) for the past two years, and Spark on
>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>> beast that one can fully commoditize, much like one can do with
>>> Zookeeper, Kafka, etc. There is always a struggle to make some niche areas
>>> of Spark, like Spark Structured Streaming (SSS), work seamlessly and
>>> effortlessly on these commercial platforms with whatever-as-a-Service.
>>>
>>>
>>> Moreover, Spark (and I stand corrected) already has a lot of resiliency
>>> and redundancy built in from the ground up. It is truly an enterprise-class
>>> product (requiring enterprise-class support) that will be difficult to
>>> commoditize with Kubernetes while expecting the same performance. After all,
>>> Kubernetes is aimed at efficient resource sharing and potential cost savings
>>> for the mass market. In short, I can see commercial enterprises working on
>>> these platforms, but maybe the great talent on the dev team should focus on
>>> stuff like the perceived limitation of SSS in dealing with chains of
>>> aggregation (if I am correct, it is not yet supported on streaming datasets).
>>>
>>>
>>> These are my opinions and they are not facts, just opinions so to speak
>>> :)
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>>
 I think these approaches are good, but there are limitations (eg
 dynamic scaling) without us making changes inside of the Spark Kube
 scheduler.

 Certainly whichever scheduler extensions we add support for we should
 collaborate with the people developing those extensions insofar as they are
 interested. My first place that I checked was #sig-scheduling which is
 fairly quiet on the Kubernetes slack, but if there are more places to look
 for folks interested in batch scheduling on Kubernetes we should definitely
 give it a shot :)

 On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi,
>
> Regarding your point and I quote
>
> "..  I know that one of the 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Yikun Jiang
Hi, folks.

As @Klaus mentioned, we have some work on Spark on k8s with Volcano native
support. There has also been some production deployment validation from
our partners in China, like JingDong, XiaoHongShu, and VIPshop.

We are also preparing to propose an initial design and POC[3] on a shared
branch (based on the Spark master branch) where we can collaborate, so I
created the spark-volcano[1] org on GitHub to make this happen.

Please feel free to comment on it [2] if you have any questions or
concerns.

[1] https://github.com/spark-volcano
[2] https://github.com/spark-volcano/spark/issues/1
[3] https://github.com/spark-volcano-wip/spark-3-volcano

Regards,
Yikun
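
For readers unfamiliar with Volcano, the "native support" being discussed roughly means handing a Spark job's pods to Volcano for gang scheduling instead of the default Kubernetes scheduler. A minimal sketch follows; the apiVersion, annotation key, image tag, and field values reflect my understanding of Volcano's conventions and should be treated as assumptions to verify against the Volcano documentation, not Spark's actual integration:

```yaml
# Hypothetical sketch: gang-schedule a Spark job's pods via Volcano.
# The PodGroup asks the scheduler not to start any pod until minMember
# pods (driver + executors) can all be placed at once.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-job-pg
spec:
  minMember: 4                 # e.g. 1 driver + 3 executors
---
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver
  annotations:
    scheduling.k8s.io/group-name: spark-job-pg   # join the PodGroup above
spec:
  schedulerName: volcano       # delegate scheduling to Volcano
  containers:
    - name: spark-kubernetes-driver
      image: spark:3.1.2       # illustrative image tag
```

Gang scheduling of this kind avoids the deadlock where a driver starts but its executors can never be placed, which is one of the batch-scheduling gaps in the default scheduler mentioned in this thread.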

Holden Karau  于2021年6月25日周五 上午12:00写道:

> Hi Mich,
>
> I certainly think making Spark on Kubernetes run well is going to be a
> challenge. However I think, and I could be wrong about this as well, that
> in terms of cluster managers Kubernetes is likely to be our future. Talking
> with people I don't hear about new standalone, YARN or mesos deployments of
> Spark, but I do hear about people trying to migrate to Kubernetes.
>
> To be clear I certainly agree that we need more work on structured
> streaming, but it's important to remember that the Spark developers are not
> all fully interchangeable; we work on the things that we're interested in
> pursuing, so even if structured streaming needs more love, if I'm not
> personally interested in it I'm less likely to work on it. That
> being said I am certainly spinning up a bit more in the Spark SQL area
> especially around our data source/connectors because I can see the need
> there too.
>
> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say so, I doubt whether such an approach and the so-called
>> democratization of Spark on whatever platform really should be the main
>> focus.
>>
>> Having worked on Google Dataproc (a fully
>> managed and highly scalable service for running Apache Spark, Hadoop and,
>> more recently, other artefacts) for the past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that one can fully commoditize, much like one can do with
>> Zookeeper, Kafka, etc. There is always a struggle to make some niche areas
>> of Spark, like Spark Structured Streaming (SSS), work seamlessly and
>> effortlessly on these commercial platforms with whatever-as-a-Service.
>>
>>
>> Moreover, Spark (and I stand corrected) already has a lot of resiliency
>> and redundancy built in from the ground up. It is truly an enterprise-class
>> product (requiring enterprise-class support) that will be difficult to
>> commoditize with Kubernetes while expecting the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost savings
>> for the mass market. In short, I can see commercial enterprises working on
>> these platforms, but maybe the great talent on the dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chains of
>> aggregation (if I am correct, it is not yet supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quiet on the Kubernetes slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on, say, Volcano as part of the Cloud Native
 Computing Foundation 

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-24 Thread L . C . Hsieh
Thanks Anton. I volunteered to be the shepherd of this SPIP. It's also my
first time shepherding a SPIP, so please let me know if there is anything I can improve.

These look like great features, and the rationale in the proposal makes
sense. These operations are getting more common and more important in big data
workloads. Instead of individual data sources building custom extensions, it
makes more sense to support the API in Spark.

Please provide your thoughts about the proposal and the design. Appreciate your 
feedback. Thank you!

On 2021/06/24 23:53:32, Anton Okolnychyi  wrote: 
> Hey everyone,
> 
> I'd like to start a discussion on adding support for executing row-level
> operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
> execution should be the same across data sources and the best way to do
> that is to implement it in Spark.
> 
> Right now, Spark can only parse and to some extent analyze DELETE, UPDATE,
> MERGE commands. Data sources that support row-level changes have to build
> custom Spark extensions to execute such statements. The goal of this effort
> is to come up with a flexible and easy-to-use API that will work across
> data sources.
> 
> Design doc:
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
> 
> PR for handling DELETE statements:
> https://github.com/apache/spark/pull/33008
> 
> Any feedback is more than welcome.
> 
> Liang-Chi was kind enough to shepherd this effort. Thanks!
> 
> - Anton
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-24 Thread Jungtaek Lim
Meta question: this doesn't target Spark 3.2, right? Many folks have been
working on the branch cut for Spark 3.2, so they might be less active in
jumping into new feature proposals right now.

On Fri, Jun 25, 2021 at 9:00 AM Holden Karau  wrote:

> I took an initial look at the PRs this morning and I’ll go through the
> design doc in more detail but I think these features look great. It’s
> especially important with the CA regulation changes to make this easier for
> folks to implement.
>
> On Thu, Jun 24, 2021 at 4:54 PM Anton Okolnychyi 
> wrote:
>
>> Hey everyone,
>>
>> I'd like to start a discussion on adding support for executing row-level
>> operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
>> execution should be the same across data sources and the best way to do
>> that is to implement it in Spark.
>>
>> Right now, Spark can only parse and to some extent analyze DELETE,
>> UPDATE, MERGE commands. Data sources that support row-level changes have to
>> build custom Spark extensions to execute such statements. The goal of this
>> effort is to come up with a flexible and easy-to-use API that will work
>> across data sources.
>>
>> Design doc:
>>
>> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>>
>> PR for handling DELETE statements:
>> https://github.com/apache/spark/pull/33008
>>
>> Any feedback is more than welcome.
>>
>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>
>> - Anton
>>
>>
>>
>>
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-24 Thread Holden Karau
I took an initial look at the PRs this morning and I’ll go through the
design doc in more detail but I think these features look great. It’s
especially important with the CA regulation changes to make this easier for
folks to implement.

On Thu, Jun 24, 2021 at 4:54 PM Anton Okolnychyi 
wrote:

> Hey everyone,
>
> I'd like to start a discussion on adding support for executing row-level
> operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
> execution should be the same across data sources and the best way to do
> that is to implement it in Spark.
>
> Right now, Spark can only parse and to some extent analyze DELETE, UPDATE,
> MERGE commands. Data sources that support row-level changes have to build
> custom Spark extensions to execute such statements. The goal of this effort
> is to come up with a flexible and easy-to-use API that will work across
> data sources.
>
> Design doc:
>
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>
> PR for handling DELETE statements:
> https://github.com/apache/spark/pull/33008
>
> Any feedback is more than welcome.
>
> Liang-Chi was kind enough to shepherd this effort. Thanks!
>
> - Anton
>
>
>
>
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


[DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-24 Thread Anton Okolnychyi
Hey everyone,

I'd like to start a discussion on adding support for executing row-level
operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
execution should be the same across data sources and the best way to do
that is to implement it in Spark.

Right now, Spark can only parse and to some extent analyze DELETE, UPDATE,
MERGE commands. Data sources that support row-level changes have to build
custom Spark extensions to execute such statements. The goal of this effort
is to come up with a flexible and easy-to-use API that will work across
data sources.

Design doc:
https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/

PR for handling DELETE statements:
https://github.com/apache/spark/pull/33008

Any feedback is more than welcome.

Liang-Chi was kind enough to shepherd this effort. Thanks!

- Anton
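
As background for readers less familiar with these commands, the matched/not-matched semantics of MERGE (and the row-filtering semantics of DELETE) can be sketched in a few lines of plain Python. This is only an illustration of the row-level semantics under discussion, not Spark's implementation or the API proposed in the design doc:

```python
# Illustrative sketch of MERGE-style semantics (not Spark's implementation):
# for each source row, update the matching target row if the key exists
# (WHEN MATCHED THEN UPDATE), otherwise insert it (WHEN NOT MATCHED THEN INSERT).
def merge(target, source, key):
    by_key = {row[key]: dict(row) for row in target}  # copy target rows
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)              # matched: update in place
        else:
            by_key[row[key]] = dict(row)              # not matched: insert
    return list(by_key.values())

# DELETE-style semantics: drop every row matching the predicate.
def delete_where(rows, pred):
    return [r for r in rows if not pred(r)]

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(merge(target, source, "id"))
# → [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'B'}, {'id': 3, 'v': 'c'}]
```

The point of the SPIP is that each v2 data source currently has to reimplement this kind of logic in its own Spark extension, whereas one shared execution path in Spark would give every source the same behavior.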


Re: [DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-24 Thread DB Tsai
+1 on renaming.

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

> On Jun 24, 2021, at 11:41 AM, Chao Sun  wrote:
> 
> Hi,
> 
> As Spark master has upgraded to Hadoop-3.3.1, the current Maven profile name 
> hadoop-3.2 is no longer accurate, and it may confuse Spark users when they 
> realize the actual version is not Hadoop 3.2.x. Therefore, I created 
> https://issues.apache.org/jira/browse/SPARK-33880 
>  to change the profile 
> name to hadoop-3 and hadoop-2 respectively. What do you think? Is this 
> something worth doing as part of Spark 3.2.0 release?
> 
> Best,
> Chao



Re: [DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-24 Thread Dongjoon Hyun
For renaming, I'd target Apache Spark 3.3 instead of Apache Spark 3.2,
because 3.2 is the first release using Apache Hadoop 3.3.1 and we may
need to revert the Apache Hadoop 3.3.1 upgrade during the RC period.

Dongjoon.

On Thu, Jun 24, 2021 at 12:24 PM Sean Owen  wrote:

> The downside here is that it would break downstream builds that set
> hadoop-3.2 if it's now called hadoop-3. That's not a huge deal. We can
> retain dummy profiles under the old names that do nothing, but that would
> be a quieter 'break'. I suppose this naming is only of importance to
> developers, who might realize that hadoop-3.2 means "hadoop-3.2 or later".
> And maybe the current naming leaves the possibility for a "hadoop-3.5" or
> something if that needed to be different.
>
> I don't feel strongly but would default to leaving it, very slightly.
>
> On Thu, Jun 24, 2021 at 1:42 PM Chao Sun  wrote:
>
>> Hi,
>>
>> As Spark master has upgraded to Hadoop-3.3.1, the current Maven profile
>> name hadoop-3.2 is no longer accurate, and it may confuse Spark users when
>> they realize the actual version is not Hadoop 3.2.x. Therefore, I created
>> https://issues.apache.org/jira/browse/SPARK-33880 to change the profile
>> name to hadoop-3 and hadoop-2 respectively. What do you think? Is this
>> something worth doing as part of Spark 3.2.0 release?
>>
>> Best,
>> Chao
>>
>


Re: [DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-24 Thread Sean Owen
The downside here is that it would break downstream builds that set
hadoop-3.2 if it's now called hadoop-3. That's not a huge deal. We can
retain dummy profiles under the old names that do nothing, but that would
be a quieter 'break'. I suppose this naming is only of importance to
developers, who might realize that hadoop-3.2 means "hadoop-3.2 or later".
And maybe the current naming leaves the possibility for a "hadoop-3.5" or
something if that needed to be different.

I don't feel strongly but would default to leaving it, very slightly.
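
For concreteness, the "dummy profiles under the old names" idea could be sketched roughly like this in the root pom.xml. The profile ids follow the proposal in this thread, but the empty alias body and the property values shown are assumptions for illustration, not the actual Spark build definition:

```xml
<!-- Hypothetical sketch: keep the old profile id as a no-op alias so that
     downstream builds invoking -Phadoop-3.2 do not break after the rename. -->
<profile>
  <id>hadoop-3.2</id>
  <!-- intentionally empty: real settings live in the renamed profile below -->
</profile>
<profile>
  <id>hadoop-3</id>
  <properties>
    <!-- illustrative value only -->
    <hadoop.version>3.3.1</hadoop.version>
  </properties>
</profile>
```

The trade-off Sean describes is visible here: the alias keeps `-Phadoop-3.2` resolving, but silently no longer applies any settings, which is the "quieter 'break'".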

On Thu, Jun 24, 2021 at 1:42 PM Chao Sun  wrote:

> Hi,
>
> As Spark master has upgraded to Hadoop-3.3.1, the current Maven profile
> name hadoop-3.2 is no longer accurate, and it may confuse Spark users when
> they realize the actual version is not Hadoop 3.2.x. Therefore, I created
> https://issues.apache.org/jira/browse/SPARK-33880 to change the profile
> name to hadoop-3 and hadoop-2 respectively. What do you think? Is this
> something worth doing as part of Spark 3.2.0 release?
>
> Best,
> Chao
>


[DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-24 Thread Chao Sun
Hi,

As Spark master has been upgraded to Hadoop 3.3.1, the current Maven profile
name hadoop-3.2 is no longer accurate, and it may confuse Spark users when
they realize the actual version is not Hadoop 3.2.x. Therefore, I created
https://issues.apache.org/jira/browse/SPARK-33880 to change the profile
names to hadoop-3 and hadoop-2, respectively. What do you think? Is this
something worth doing as part of the Spark 3.2.0 release?

Best,
Chao


Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Hi Holden,

Thank you for your points. I guess, coming from the corporate world, I had
overlooked how an open source project like Spark leverages resources
and interest :).

As @KlausMa kindly volunteered, it would be good to hear scheduling ideas on
Spark on Kubernetes, and of course, as I am sure you have some inroads/ideas
on this subject as well, truly I guess love would be in the air for
Kubernetes

HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 24 Jun 2021 at 16:59, Holden Karau  wrote:

> Hi Mich,
>
> I certainly think making Spark on Kubernetes run well is going to be a
> challenge. However I think, and I could be wrong about this as well, that
> in terms of cluster managers Kubernetes is likely to be our future. Talking
> with people I don't hear about new standalone, YARN or mesos deployments of
> Spark, but I do hear about people trying to migrate to Kubernetes.
>
> To be clear I certainly agree that we need more work on structured
> streaming, but it's important to remember that the Spark developers are not
> all fully interchangeable; we work on the things that we're interested in
> pursuing, so even if structured streaming needs more love, if I'm not
> personally interested in it I'm less likely to work on it. That
> being said I am certainly spinning up a bit more in the Spark SQL area
> especially around our data source/connectors because I can see the need
> there too.
>
> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say so, I doubt whether such an approach and the so-called
>> democratization of Spark on whatever platform really should be the main
>> focus.
>>
>> Having worked on Google Dataproc (a fully
>> managed and highly scalable service for running Apache Spark, Hadoop and,
>> more recently, other artefacts) for the past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that one can fully commoditize, much like one can do with
>> Zookeeper, Kafka, etc. There is always a struggle to make some niche areas
>> of Spark, like Spark Structured Streaming (SSS), work seamlessly and
>> effortlessly on these commercial platforms with whatever-as-a-Service.
>>
>>
>> Moreover, Spark (and I stand corrected) already has a lot of resiliency
>> and redundancy built in from the ground up. It is truly an enterprise-class
>> product (requiring enterprise-class support) that will be difficult to
>> commoditize with Kubernetes while expecting the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost savings
>> for the mass market. In short, I can see commercial enterprises working on
>> these platforms, but maybe the great talent on the dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chains of
>> aggregation (if I am correct, it is not yet supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quiet on the Kubernetes slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
Hi Mich,

I certainly think making Spark on Kubernetes run well is going to be a
challenge. However I think, and I could be wrong about this as well, that
in terms of cluster managers Kubernetes is likely to be our future. Talking
with people I don't hear about new standalone, YARN or mesos deployments of
Spark, but I do hear about people trying to migrate to Kubernetes.

To be clear I certainly agree that we need more work on structured
streaming, but it's important to remember that the Spark developers are not
all fully interchangeable; we work on the things that we're interested in
pursuing, so even if structured streaming needs more love, if I'm not
personally interested in it I'm less likely to work on it. That
being said I am certainly spinning up a bit more in the Spark SQL area
especially around our data source/connectors because I can see the need
there too.

On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
wrote:

>
>
> Please allow me to be diverse and express a different point of view on
> this roadmap.
>
>
> I believe from a technical point of view spending time and effort plus
> talent on batch scheduling on Kubernetes could be rewarding. However, if I
> may say so, I doubt whether such an approach and the so-called
> democratization of Spark on whatever platform really should be the main
> focus.
>
> Having worked on Google Dataproc (a fully
> managed and highly scalable service for running Apache Spark, Hadoop and,
> more recently, other artefacts) for the past two years, and Spark on
> Kubernetes on-premise, I have come to the conclusion that Spark is not a
> beast that one can fully commoditize, much like one can do with
> Zookeeper, Kafka, etc. There is always a struggle to make some niche areas
> of Spark, like Spark Structured Streaming (SSS), work seamlessly and
> effortlessly on these commercial platforms with whatever-as-a-Service.
>
>
> Moreover, Spark (and I stand corrected) already has a lot of resiliency
> and redundancy built in from the ground up. It is truly an enterprise-class
> product (requiring enterprise-class support) that will be difficult to
> commoditize with Kubernetes while expecting the same performance. After all,
> Kubernetes is aimed at efficient resource sharing and potential cost savings
> for the mass market. In short, I can see commercial enterprises working on
> these platforms, but maybe the great talent on the dev team should focus on
> stuff like the perceived limitation of SSS in dealing with chains of
> aggregation (if I am correct, it is not yet supported on streaming datasets).
>
>
> These are my opinions and they are not facts, just opinions so to speak :)
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>
>> I think these approaches are good, but there are limitations (eg dynamic
>> scaling) without us making changes inside of the Spark Kube scheduler.
>>
>> Certainly whichever scheduler extensions we add support for we should
>> collaborate with the people developing those extensions insofar as they are
>> interested. My first place that I checked was #sig-scheduling which is
>> fairly quiet on the Kubernetes slack, but if there are more places to look
>> for folks interested in batch scheduling on Kubernetes we should definitely
>> give it a shot :)
>>
>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Regarding your point and I quote
>>>
>>> "..  I know that one of the Spark on Kube operators
>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>> start exploring..."
>>>
>>> There seems to be ongoing work on, say, Volcano as part of the Cloud Native
>>> Computing Foundation (CNCF), for example through
>>> https://github.com/volcano-sh/volcano
>>>
>> 
>>>
>>> There may be value-add in collaborating with such groups through CNCF in
>>> order to have a collective approach to such work. There also seems to be
>>> some work on the integration of Spark with Volcano for batch scheduling.
>>> 
>>>
>>>
>>>
>>> What is not very clear is the degree of progress of these projects. You
>>> may be kind enough to elaborate on the KPIs for each of these projects and where
>>> you think your contributions are going to be.
>>>
>>>
>>> HTH,
>>>
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
That's awesome! I'm just starting to get context around Volcano, but maybe
we can schedule an initial meeting for all of us interested in pursuing
this to get on the same page.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm the kube-batch/Volcano founder, and I'm excited to hear that the Spark
> community also has such requirements :)
>
> Volcano provides several features for batch workloads, e.g. fair-share,
> queues, reservation, preemption/reclaim and so on.
> It has been used in several production environments with Spark; if necessary,
> I can give an overall introduction to Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say so, I doubt whether such an approach and the so-called
>> democratization of Spark on whatever platform really should be the main
>> focus.
>>
>> Having worked on Google Dataproc (a fully
>> managed and highly scalable service for running Apache Spark, Hadoop and,
>> more recently, other artefacts) for the past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that one can fully commoditize, much like one can do with
>> Zookeeper, Kafka, etc. There is always a struggle to make some niche areas
>> of Spark, like Spark Structured Streaming (SSS), work seamlessly and
>> effortlessly on these commercial platforms with whatever-as-a-Service.
>>
>>
>> Moreover, Spark (and I stand corrected) already has a
>> lot of resiliency and redundancy built in from the ground up. It is truly an
>> enterprise-class product (requiring enterprise-class support) that will be
>> difficult to commoditize with Kubernetes while expecting the same performance.
>> After all, Kubernetes is aimed at efficient resource sharing and potential
>> cost saving for the mass market. In short, I can see commercial enterprises
>> working on these platforms, but maybe the great talents on the dev team should
>> focus on areas such as the perceived limitation of SSS in dealing with chained
>> aggregations (if I am correct, these are not yet supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (e.g. dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly, whichever scheduler extensions we add support for, we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. The first place I checked was #sig-scheduling, which is
>>> fairly quiet on the Kubernetes Slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
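The integration point alluded to above already exists at the Kubernetes API level: every pod carries a `spec.schedulerName` field, and a custom batch scheduler such as Volcano only handles pods that name it. The sketch below shows how a Spark-on-K8s scheduler extension might populate a driver pod accordingly. The `spark.kubernetes.scheduler.name` conf key and the pod-group wiring are illustrative assumptions for this sketch, not an existing Spark API at the time of this thread.

```python
# Sketch of the driver pod spec a Spark-on-K8s scheduler extension
# might emit. `schedulerName` is a real Kubernetes pod field; the
# conf key and the pod-group annotation below are hypothetical
# placeholders, not guaranteed Spark or Volcano behavior.

def driver_pod_spec(app_name, conf):
    scheduler = conf.get("spark.kubernetes.scheduler.name", "default-scheduler")
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{app_name}-driver",
            "annotations": {},
        },
        "spec": {
            # Hand the pod to the named scheduler instead of kube-scheduler.
            "schedulerName": scheduler,
            "containers": [{
                "name": "spark-driver",
                "image": conf["spark.kubernetes.container.image"],
            }],
        },
    }
    if scheduler == "volcano":
        # Group driver and executors so they are scheduled as one unit
        # (gang scheduling), rather than the driver starving executors.
        pod["metadata"]["annotations"]["scheduling.k8s.io/group-name"] = (
            f"{app_name}-podgroup"
        )
    return pod

spec = driver_pod_spec("pi", {
    "spark.kubernetes.scheduler.name": "volcano",
    "spark.kubernetes.container.image": "apache/spark:v3.1.2",
})
print(spec["spec"]["schedulerName"])  # -> volcano
```

The interesting design question, and the reason this likely needs changes inside Spark itself, is that dynamic allocation creates executor pods on the fly, so the pod group's minimum-member count has to be kept in sync by Spark, not by a static manifest.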
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on, say, Volcano as part of the Cloud Native
 Computing Foundation (CNCF), for example through
 https://github.com/volcano-sh/volcano

>>> 

 There may be value-add in collaborating with such groups through CNCF
 in order to have a collective approach to such work. There also seems to be
 some work on Integration of Spark with Volcano for Batch Scheduling.
 



 What is not very clear is the degree of progress of these projects. You
 may be kind enough to elaborate on KPIs for each of these projects and where
 you think your contributions are going to be.


 HTH,


 Mich



Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
Thanks Klaus! I am interested in more details.


Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Thanks Klaus. That will be great.

It would also be helpful if you could elaborate on the need for this feature
in light of the limitations of the current batch workload support.

Regards,

Mich






