Re: SPIP: Spark on Kubernetes

2017-09-02 Thread Erik Erlandson
We have started discussions about upstreaming and merge strategy at our
weekly meetings. The associated github issue is:
https://github.com/apache-spark-on-k8s/spark/issues/441

There is general consensus that breaking it up into smaller components will
be important for upstream review. Our current focus is on identifying a
factoring that minimizes "artificial" deconstruction of existing work (for
example, undoing features purely to present smaller initial PRs, only to
re-add them later).


On Fri, Sep 1, 2017 at 10:27 PM, Reynold Xin  wrote:

> Anirudh (or somebody else familiar with spark-on-k8s),
>
> Can you create a short plan on how we would integrate and do code review
> to merge the project? If the diff is too large it'd be difficult to review
> and merge in one shot. Once we have a plan we can create subtickets to
> track the progress.
>
>
>
> On Thu, Aug 31, 2017 at 5:21 PM, Anirudh Ramanathan <
> ramanath...@google.com> wrote:
>
>> The proposal is in the process of being updated to include the details on
>> testing that Imran pointed out. Please expect an update on SPARK-18278.
>>
>> Mridul had a couple of points as well, about exposing an SPI, and we've
>> been exploring that to ascertain the effort involved.
>> That effort is separate and fairly long-term, and we should have a working
>> group of representatives from all cluster managers to make progress on it.
>> A proposal regarding this will be in SPARK-19700.
>>
>> This vote has passed.
>> So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no
>> -1 votes.
>>
>> Thanks all!
>>
>> +1 votes (binding):
>> Reynold Xin
>> Matei Zaharia
>> Marcelo Vanzin
>> Mark Hamstra
>>
>> +1 votes (non-binding):
>> Anirudh Ramanathan
>> Erik Erlandson
>> Ilan Filonenko
>> Sean Suchter
>> Kimoon Kim
>> Timothy Chen
>> Will Benton
>> Holden Karau
>> Seshu Adunuthula
>> Daniel Imberman
>> Shubham Chopra
>> Jiri Kremser
>> Yinan Li
>> Andrew Ash
>> 李书明
>> Gary Lucas
>> Ismael Mejia
>> Jean-Baptiste Onofré
>> Alexander Bezzubov
>> duyanghao
>> elmiko
>> Sudarshan Kadambi
>> Varun Katta
>> Matt Cheah
>> Edward Zhang
>> Vaquar Khan
>>
>>
>>
>>
>>
>> On Wed, Aug 30, 2017 at 10:42 PM, Reynold Xin 
>> wrote:
>>
>>> This has passed, hasn't it?
>>>
>>> On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan 
>>> wrote:
>>>
 Spark on Kubernetes effort has been developed separately in a fork, and
 linked back from the Apache Spark project as an experimental backend.
 We're ~6 months in, have had 5 releases.

- 2 Spark versions maintained (2.1, and 2.2)
- Extensive integration testing and refactoring efforts to maintain
code quality
- Developer and user-facing documentation
- 10+ consistent code contributors from different organizations involved
in actively maintaining and using the project, with several more members
involved in testing and providing feedback.
- The community has delivered several talks on Spark-on-Kubernetes
generating lots of feedback from users.
- In addition to these, we've seen efforts spawn off such as:
   - HDFS on Kubernetes with Locality and Performance Experiments
   - Kerberized access to HDFS from Spark running on Kubernetes

 *Following the SPIP process, I'm putting this SPIP up for a vote.*

- +1: Yeah, let's go forward and implement the SPIP.
- +0: Don't really care.
- -1: I don't think this is a good idea because of the following
technical reasons.

 If there is any further clarification desired, on the design or the
 implementation, please feel free to ask questions or provide feedback.


 SPIP: Kubernetes as A Native Cluster Manager

 Full Design Doc: link
 

 JIRA: https://issues.apache.org/jira/browse/SPARK-18278

 Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377

 Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
 Cheah,

 Ilan Filonenko, Sean Suchter, Kimoon Kim
 B

Re: SPIP: Spark on Kubernetes

2017-09-01 Thread Reynold Xin
Anirudh (or somebody else familiar with spark-on-k8s),

Can you create a short plan on how we would integrate and do code review to
merge the project? If the diff is too large it'd be difficult to review and
merge in one shot. Once we have a plan we can create subtickets to track
the progress.



On Thu, Aug 31, 2017 at 5:21 PM, Anirudh Ramanathan 
wrote:

> The proposal is in the process of being updated to include the details on
> testing that Imran pointed out. Please expect an update on SPARK-18278.
>
> Mridul had a couple of points as well, about exposing an SPI, and we've
> been exploring that to ascertain the effort involved.
> That effort is separate and fairly long-term, and we should have a working
> group of representatives from all cluster managers to make progress on it.
> A proposal regarding this will be in SPARK-19700.
>
> This vote has passed.
> So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no
> -1 votes.
>
> Thanks all!
>
> +1 votes (binding):
> Reynold Xin
> Matei Zaharia
> Marcelo Vanzin
> Mark Hamstra
>
> +1 votes (non-binding):
> Anirudh Ramanathan
> Erik Erlandson
> Ilan Filonenko
> Sean Suchter
> Kimoon Kim
> Timothy Chen
> Will Benton
> Holden Karau
> Seshu Adunuthula
> Daniel Imberman
> Shubham Chopra
> Jiri Kremser
> Yinan Li
> Andrew Ash
> 李书明
> Gary Lucas
> Ismael Mejia
> Jean-Baptiste Onofré
> Alexander Bezzubov
> duyanghao
> elmiko
> Sudarshan Kadambi
> Varun Katta
> Matt Cheah
> Edward Zhang
> Vaquar Khan
>
>
>
>
>
> On Wed, Aug 30, 2017 at 10:42 PM, Reynold Xin  wrote:
>
>> This has passed, hasn't it?
>>
>> On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan 
>> wrote:
>>
>>> Spark on Kubernetes effort has been developed separately in a fork, and
>>> linked back from the Apache Spark project as an experimental backend.
>>> We're ~6 months in, have had 5 releases.
>>>
>>>- 2 Spark versions maintained (2.1, and 2.2)
>>>- Extensive integration testing and refactoring efforts to maintain
>>>code quality
>>>- Developer and user-facing documentation
>>>- 10+ consistent code contributors from different organizations involved
>>>in actively maintaining and using the project, with several more members
>>>involved in testing and providing feedback.
>>>- The community has delivered several talks on Spark-on-Kubernetes
>>>generating lots of feedback from users.
>>>- In addition to these, we've seen efforts spawn off such as:
>>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>>
>>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>>
>>>- +1: Yeah, let's go forward and implement the SPIP.
>>>- +0: Don't really care.
>>>- -1: I don't think this is a good idea because of the following
>>>technical reasons.
>>>
>>> If there is any further clarification desired, on the design or the
>>> implementation, please feel free to ask questions or provide feedback.
>>>
>>>
>>> SPIP: Kubernetes as A Native Cluster Manager
>>>
>>> Full Design Doc: link
>>> 
>>>
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>>
>>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>>
>>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>>> Cheah,
>>>
>>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>>> Background and Motivation
>>>
>>> Containerization and cluster management technologies are constantly
>>> evolving in the cluster computing world. Apache Spark currently implements
>>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>>> its own standalone cluster manager. In 2014, Google announced development
>>> of Kubernetes  which has its own unique feature
>>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>>> seen contributions from over 1300 contributors with over 5 commits.
>>> Kubernetes has cemented itself as a core player in the cluster computing
>>> world, and cloud-computing providers such as Google Container Engine,
>>> Google Compute Engine, Amazon Web S

Re: SPIP: Spark on Kubernetes

2017-08-31 Thread Anirudh Ramanathan
The proposal is in the process of being updated to include the details on
testing that Imran pointed out. Please expect an update on SPARK-18278.

Mridul had a couple of points as well, about exposing an SPI, and we've been
exploring that to ascertain the effort involved.
That effort is separate and fairly long-term, and we should have a working
group of representatives from all cluster managers to make progress on it.
A proposal regarding this will be in SPARK-19700.

This vote has passed.
So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no
-1 votes.

Thanks all!

+1 votes (binding):
Reynold Xin
Matei Zaharia
Marcelo Vanzin
Mark Hamstra

+1 votes (non-binding):
Anirudh Ramanathan
Erik Erlandson
Ilan Filonenko
Sean Suchter
Kimoon Kim
Timothy Chen
Will Benton
Holden Karau
Seshu Adunuthula
Daniel Imberman
Shubham Chopra
Jiri Kremser
Yinan Li
Andrew Ash
李书明
Gary Lucas
Ismael Mejia
Jean-Baptiste Onofré
Alexander Bezzubov
duyanghao
elmiko
Sudarshan Kadambi
Varun Katta
Matt Cheah
Edward Zhang
Vaquar Khan





On Wed, Aug 30, 2017 at 10:42 PM, Reynold Xin  wrote:

> This has passed, hasn't it?
>
> On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan 
> wrote:
>
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend.
>> We're ~6 months in, have had 5 releases.
>>
>>- 2 Spark versions maintained (2.1, and 2.2)
>>- Extensive integration testing and refactoring efforts to maintain
>>code quality
>>- Developer and user-facing documentation
>>- 10+ consistent code contributors from different organizations involved
>>in actively maintaining and using the project, with several more members
>>involved in testing and providing feedback.
>>- The community has delivered several talks on Spark-on-Kubernetes
>>generating lots of feedback from users.
>>- In addition to these, we've seen efforts spawn off such as:
>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>- +1: Yeah, let's go forward and implement the SPIP.
>>- +0: Don't really care.
>>- -1: I don't think this is a good idea because of the following
>>technical reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>> Full Design Doc: link
>> 
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah,
>>
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes  which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors with over 5 commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes and their computational Spark
>> applications. Although there is existing support for running a Spark
>> standalone cluster on 

Re: SPIP: Spark on Kubernetes

2017-08-30 Thread Reynold Xin
This has passed, hasn't it?

On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan 
wrote:

> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend.
> We're ~6 months in, have had 5 releases.
>
>- 2 Spark versions maintained (2.1, and 2.2)
>- Extensive integration testing and refactoring efforts to maintain
>code quality
>- Developer and user-facing documentation
>- 10+ consistent code contributors from different organizations involved
>in actively maintaining and using the project, with several more members
>involved in testing and providing feedback.
>- The community has delivered several talks on Spark-on-Kubernetes
>generating lots of feedback from users.
>- In addition to these, we've seen efforts spawn off such as:
>   - HDFS on Kubernetes with Locality and Performance Experiments
>   - Kerberized access to HDFS from Spark running on Kubernetes
>
> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>
>- +1: Yeah, let's go forward and implement the SPIP.
>- +0: Don't really care.
>- -1: I don't think this is a good idea because of the following
>technical reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
> Full Design Doc: link
> 
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
> Background and Motivation
>
> Containerization and cluster management technologies are constantly
> evolving in the cluster computing world. Apache Spark currently implements
> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
> its own standalone cluster manager. In 2014, Google announced development
> of Kubernetes  which has its own unique feature
> set and differentiates itself from YARN and Mesos. Since its debut, it has
> seen contributions from over 1300 contributors with over 5 commits.
> Kubernetes has cemented itself as a core player in the cluster computing
> world, and cloud-computing providers such as Google Container Engine,
> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
> running Kubernetes clusters.
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes,
> there are still major advantages and significant interest in having native
> execution support. For example, this integration provides better support
> for multi-tenancy and dynamic resource allocation. It also allows users to
> run applications of different Spark versions of their choice in the same
> cluster.
>
> The feature is being developed in a separate fork
>  in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use
> cases, we have seen cluster setups that use 1000+ cores. We are also seeing
> growing interest in this project from more and more organizations.
>
> While it is easy to bootstrap the project in a forked repository, it is
> hard to maintain it in the long run because of the tricky process of
> rebasing onto the upstream and lack of awaren

Re: SPIP: Spark on Kubernetes

2017-08-30 Thread vaquar khan
+1 (non-binding)

Regards,
Vaquar khan

On Mon, Aug 28, 2017 at 5:09 PM, Erik Erlandson  wrote:

>
> In addition to the engineering & software aspects of the native Kubernetes
> community project, we have also worked at building out the community, with
> the goal of providing the foundation for sustaining engineering on the
> Kubernetes scheduler back-end.  That said, I agree 100% with your point
> that adding committers with kube-specific experience is a good strategy for
> increasing review bandwidth to help service PRs from this community.
>
> On Mon, Aug 28, 2017 at 2:16 PM, Mark Hamstra 
> wrote:
>
>> In my opinion, the fact that there are nearly no changes to spark-core,
>>> and most of our changes are additive should go to prove that this adds
>>> little complexity to the workflow of the committers.
>>
>>
>> Actually (and somewhat perversely), the otherwise praiseworthy isolation
>> of the Kubernetes code does mean that it adds complexity to the workflow of
>> the existing Spark committers. I'll reiterate Imran's concerns: The
>> existing Spark committers familiar with Spark's scheduler code have
>> adequate knowledge of the Standalone and Yarn implementations, and still
>> not sufficient coverage of Mesos. Adding k8s code to Spark would mean that
>> the progression of that code would start seeing the issues that the Mesos
>> code in Spark currently sees: Reviews and commits tend to languish because
>> we don't have currently active committers with sufficient knowledge and
>> cycles to deal with the Mesos PRs. Some of this is because the PMC needs to
>> get back to addressing the issue of adding new Spark committers who do have
>> the needed Mesos skills, but that isn't as simple as we'd like because
>> ideally a Spark committer has demonstrated skills across a significant
>> portion of the Spark code, not just tightly focused on one area (such as
>> Mesos or k8s integration.) In short, adding Kubernetes support directly
>> into Spark isn't likely (at least in the short-term) to be entirely
>> positive for the spark-on-k8s project, since merging of PRs to the
>> spark-on-k8s is very likely to be quite slow at least until such time as we
>> have k8s-focused Spark committers. If this project does end up getting
>> pulled into the Spark codebase, then the PMC will need to start looking at
>> bringing in one or more new committers who meet our requirements for such a
>> role and responsibility, and who also have k8s skills. The success and pace
>> of development of the spark-on-k8s will depend in large measure on the
>> PMC's ability to find such new committers.
>>
>> All that said, I'm +1 if those currently responsible for the
>> spark-on-k8s project still want to bring the code into Spark.
>>
>>
>> On Mon, Aug 21, 2017 at 11:48 AM, Anirudh Ramanathan <
>> ramanath...@google.com.invalid> wrote:
>>
>>> Thank you for your comments Imran.
>>>
>>> Regarding integration tests,
>>>
>>> What you inferred from the documentation is correct -
>>> Integration tests do not require any prior setup or a Kubernetes cluster
>>> to run. Minikube is a single binary that brings up a one-node cluster and
>>> exposes the full Kubernetes API. It is actively maintained and kept up to
>>> date with the rest of the project. These local integration tests on Jenkins
>>> (like the ones with spark-on-yarn), should allow for the committers to
>>> merge changes with a high degree of confidence.
>>> I will update the proposal to include more information about the extent
>>> and kinds of testing we do.
>>>
>>> As for (b), people on this thread and the set of contributors on our
>>> fork are a fairly wide community of contributors and committers who would
>>> be involved in the maintenance long-term. It was one of the reasons behind
>>> developing separately as a fork. In my opinion, the fact that there are
>>> nearly no changes to spark-core, and most of our changes are additive
>>> should go to prove that this adds little complexity to the workflow of the
>>> committers.
>>>
>>> Separating out the cluster managers (into an as yet undecided new home)
>>> appears far more disruptive and a high risk change for the short term.
>>> However, when there is enough community support behind that effort, tracked
>>> in SPARK-19700; and if
>>> that is realized in the future, it wouldn't be difficult to switch over
>>> Kubernetes, YARN and Mesos to using the pluggable API. Currently, in my
>>> opinion, with the integration tests, active users, and a community of
>>> maintainers, Spark-on-Kubernetes would add minimal overhead and benefit a
>>> large (and growing) class of users.
>>>
>>> Lastly, the RSS is indeed separate and a value-add that we would love to
>>> share with other cluster managers as well.
>>>
>>> On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid 
>>> wrote:
>>>
 Overall this looks like a good proposal.  I do have some concerns which
 I'd like to discuss -- please understand I'm t

Re: SPIP: Spark on Kubernetes

2017-08-28 Thread Erik Erlandson
In addition to the engineering & software aspects of the native Kubernetes
community project, we have also worked at building out the community, with
the goal of providing the foundation for sustaining engineering on the
Kubernetes scheduler back-end.  That said, I agree 100% with your point
that adding committers with kube-specific experience is a good strategy for
increasing review bandwidth to help service PRs from this community.

On Mon, Aug 28, 2017 at 2:16 PM, Mark Hamstra 
wrote:

> In my opinion, the fact that there are nearly no changes to spark-core,
>> and most of our changes are additive should go to prove that this adds
>> little complexity to the workflow of the committers.
>
>
> Actually (and somewhat perversely), the otherwise praiseworthy isolation
> of the Kubernetes code does mean that it adds complexity to the workflow of
> the existing Spark committers. I'll reiterate Imran's concerns: The
> existing Spark committers familiar with Spark's scheduler code have
> adequate knowledge of the Standalone and Yarn implementations, and still
> not sufficient coverage of Mesos. Adding k8s code to Spark would mean that
> the progression of that code would start seeing the issues that the Mesos
> code in Spark currently sees: Reviews and commits tend to languish because
> we don't have currently active committers with sufficient knowledge and
> cycles to deal with the Mesos PRs. Some of this is because the PMC needs to
> get back to addressing the issue of adding new Spark committers who do have
> the needed Mesos skills, but that isn't as simple as we'd like because
> ideally a Spark committer has demonstrated skills across a significant
> portion of the Spark code, not just tightly focused on one area (such as
> Mesos or k8s integration.) In short, adding Kubernetes support directly
> into Spark isn't likely (at least in the short-term) to be entirely
> positive for the spark-on-k8s project, since merging of PRs to the
> spark-on-k8s is very likely to be quite slow at least until such time as we
> have k8s-focused Spark committers. If this project does end up getting
> pulled into the Spark codebase, then the PMC will need to start looking at
> bringing in one or more new committers who meet our requirements for such a
> role and responsibility, and who also have k8s skills. The success and pace
> of development of the spark-on-k8s will depend in large measure on the
> PMC's ability to find such new committers.
>
> All that said, I'm +1 if those currently responsible for the
> spark-on-k8s project still want to bring the code into Spark.
>
>
> On Mon, Aug 21, 2017 at 11:48 AM, Anirudh Ramanathan <
> ramanath...@google.com.invalid> wrote:
>
>> Thank you for your comments Imran.
>>
>> Regarding integration tests,
>>
>> What you inferred from the documentation is correct -
>> Integration tests do not require any prior setup or a Kubernetes cluster
>> to run. Minikube is a single binary that brings up a one-node cluster and
>> exposes the full Kubernetes API. It is actively maintained and kept up to
>> date with the rest of the project. These local integration tests on Jenkins
>> (like the ones with spark-on-yarn), should allow for the committers to
>> merge changes with a high degree of confidence.
>> I will update the proposal to include more information about the extent
>> and kinds of testing we do.
>>
>> As for (b), people on this thread and the set of contributors on our fork
>> are a fairly wide community of contributors and committers who would be
>> involved in the maintenance long-term. It was one of the reasons behind
>> developing separately as a fork. In my opinion, the fact that there are
>> nearly no changes to spark-core, and most of our changes are additive
>> should go to prove that this adds little complexity to the workflow of the
>> committers.
>>
>> Separating out the cluster managers (into an as yet undecided new home)
>> appears far more disruptive and a high risk change for the short term.
>> However, when there is enough community support behind that effort, tracked
>> in SPARK-19700; and if
>> that is realized in the future, it wouldn't be difficult to switch over
>> Kubernetes, YARN and Mesos to using the pluggable API. Currently, in my
>> opinion, with the integration tests, active users, and a community of
>> maintainers, Spark-on-Kubernetes would add minimal overhead and benefit a
>> large (and growing) class of users.
>>
>> Lastly, the RSS is indeed separate and a value-add that we would love to
>> share with other cluster managers as well.
>>
>> On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid 
>> wrote:
>>
>>> Overall this looks like a good proposal.  I do have some concerns which
>>> I'd like to discuss -- please understand I'm taking a "devil's advocate"
>>> stance here for discussion, not that I'm giving a -1.
>>>
>>> My primary concern is about testing and maintenance.  My concerns might
>>> be addressed if the doc i

Re: SPIP: Spark on Kubernetes

2017-08-28 Thread Mark Hamstra
>
> In my opinion, the fact that there are nearly no changes to spark-core,
> and most of our changes are additive should go to prove that this adds
> little complexity to the workflow of the committers.


Actually (and somewhat perversely), the otherwise praiseworthy isolation of
the Kubernetes code does mean that it adds complexity to the workflow of
the existing Spark committers. I'll reiterate Imran's concerns: The
existing Spark committers familiar with Spark's scheduler code have
adequate knowledge of the Standalone and Yarn implementations, and still
not sufficient coverage of Mesos. Adding k8s code to Spark would mean that
the progression of that code would start seeing the issues that the Mesos
code in Spark currently sees: Reviews and commits tend to languish because
we don't have currently active committers with sufficient knowledge and
cycles to deal with the Mesos PRs. Some of this is because the PMC needs to
get back to addressing the issue of adding new Spark committers who do have
the needed Mesos skills, but that isn't as simple as we'd like because
ideally a Spark committer has demonstrated skills across a significant
portion of the Spark code, not just tightly focused on one area (such as
Mesos or k8s integration.) In short, adding Kubernetes support directly
into Spark isn't likely (at least in the short-term) to be entirely
positive for the spark-on-k8s project, since merging of PRs to the
spark-on-k8s is very likely to be quite slow at least until such time as we
have k8s-focused Spark committers. If this project does end up getting
pulled into the Spark codebase, then the PMC will need to start looking at
bringing in one or more new committers who meet our requirements for such a
role and responsibility, and who also have k8s skills. The success and pace
of development of the spark-on-k8s will depend in large measure on the
PMC's ability to find such new committers.

All that said, I'm +1 if those currently responsible for the
spark-on-k8s project still want to bring the code into Spark.


On Mon, Aug 21, 2017 at 11:48 AM, Anirudh Ramanathan <
ramanath...@google.com.invalid> wrote:

> Thank you for your comments Imran.
>
> Regarding integration tests,
>
> What you inferred from the documentation is correct -
> Integration tests do not require any prior setup or a Kubernetes cluster
> to run. Minikube is a single binary that brings up a one-node cluster and
> exposes the full Kubernetes API. It is actively maintained and kept up to
> date with the rest of the project. These local integration tests on Jenkins
> (like the ones with spark-on-yarn), should allow for the committers to
> merge changes with a high degree of confidence.
> I will update the proposal to include more information about the extent
> and kinds of testing we do.
>
> As for (b), people on this thread and the set of contributors on our fork
> are a fairly wide community of contributors and committers who would be
> involved in the maintenance long-term. It was one of the reasons behind
> developing separately as a fork. In my opinion, the fact that there are
> nearly no changes to spark-core, and most of our changes are additive
> should go to prove that this adds little complexity to the workflow of the
> committers.
>
> Separating out the cluster managers (into an as yet undecided new home)
> appears far more disruptive and a high risk change for the short term.
> However, when there is enough community support behind that effort, tracked
> in SPARK-19700; and if that
> is realized in the future, it wouldn't be difficult to switch over
> Kubernetes, YARN and Mesos to using the pluggable API. Currently, in my
> opinion, with the integration tests, active users, and a community of
> maintainers, Spark-on-Kubernetes would add minimal overhead and benefit a
> large (and growing) class of users.
>
> Lastly, the RSS is indeed separate and a value-add that we would love to
> share with other cluster managers as well.
>
> On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid 
> wrote:
>
>> Overall this looks like a good proposal.  I do have some concerns which
>> I'd like to discuss -- please understand I'm taking a "devil's advocate"
>> stance here for discussion, not that I'm giving a -1.
>>
>> My primary concern is about testing and maintenance.  My concerns might
>> be addressed if the doc included a section on testing that might just be
>> this: https://github.com/apache-spark-on-k8s/spark/blob/branch-2.
>> 2-kubernetes/resource-managers/kubernetes/README.md#
>> running-the-kubernetes-integration-tests
>>
>> but without the concerning warning "Note that the integration test
>> framework is currently being heavily revised and is subject to change".
>> I'd like the proposal to clearly indicate that some baseline testing can be
>> done by devs and in spark's regular jenkins builds without special access
>> to kubernetes clusters.
>>
>> It's worth noting that there *are* advantages

Re: SPIP: Spark on Kubernetes

2017-08-23 Thread Chen YongHua
Sorry, I read the Dockerfile; the example's path is inside the image. I'll try
it again.
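
For anyone else hitting the same thing, here is a minimal sketch of referencing
the examples jar at its path inside the driver image. This is illustrative
only: the local:// URI convention for in-image resources, the jar path, the API
server URL, and the use of SparkLauncher are all assumptions for illustration,
not taken from the Dockerfile or the fork's documentation.

    // Illustrative sketch only: point the app resource at the jar inside the
    // driver image via a local:// URI instead of a path on the submitting
    // machine, to avoid the "Could not find or load main class" failure quoted
    // below. Path, master URL, and submission mechanism are assumptions.
    import org.apache.spark.launcher.SparkLauncher

    object SubmitSparkPi {
      def main(args: Array[String]): Unit = {
        val driver = new SparkLauncher()
          .setMaster("k8s://https://kubernetes.example.com:6443") // assumed API server
          .setDeployMode("cluster")
          .setMainClass("org.apache.spark.examples.SparkPi")
          .setAppResource("local:///opt/spark/examples/jars/spark-examples.jar")
          .launch()
        driver.waitFor()
      }
    }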

Get Outlook for iOS<https://aka.ms/o0ukef>

From: Chen YongHua 
Sent: Wednesday, August 23, 2017 4:40:55 PM
To: yonzhang2012; dev@spark.apache.org
Subject: Re: SPIP: Spark on Kubernetes

I ran the example in a k8s environment based on Vagrant. The driver pod
reported an error:
Error: Could not find or load main class org.apache.spark.examples.SparkPi.

Use local mode.

Get Outlook for iOS<https://aka.ms/o0ukef>

From: yonzhang2012 
Sent: Wednesday, August 23, 2017 5:47:16 AM
To: dev@spark.apache.org
Subject: Re: SPIP: Spark on Kubernetes

+1 (non-binding)

I am specifically interested in setting up a testing environment for my
company's Spark use, and I am also expecting more comprehensive documentation
on getting a development environment set up for bug fixes or new feature
development; it is currently only briefly documented in
https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/README.md.
I tried running this SPIP on my local box with one of the integration tests on
Minikube, and was able to remotely debug the KubernetesClusterManager-related
code in the Spark driver pod, which means the whole development lifecycle can
be done locally, but I hope the developer guide can provide more information.

Thanks
Edward Zhang



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22210.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Spark on Kubernetes

2017-08-23 Thread Chen YongHua
I ran the example in a k8s environment based on Vagrant. The driver pod
reported an error:
Error: Could not find or load main class org.apache.spark.examples.SparkPi.

Use local mode.

Get Outlook for iOS<https://aka.ms/o0ukef>

From: yonzhang2012 
Sent: Wednesday, August 23, 2017 5:47:16 AM
To: dev@spark.apache.org
Subject: Re: SPIP: Spark on Kubernetes

+1 (non-binding)

I am specifically interested in setting up a testing environment for my
company's Spark use, and I am also expecting more comprehensive documentation
on getting a development environment set up for bug fixes or new feature
development; it is currently only briefly documented in
https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/README.md.
I tried running this SPIP on my local box with one of the integration tests on
Minikube, and was able to remotely debug the KubernetesClusterManager-related
code in the Spark driver pod, which means the whole development lifecycle can
be done locally, but I hope the developer guide can provide more information.

Thanks
Edward Zhang



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22210.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Spark on Kubernetes

2017-08-22 Thread yonzhang2012
+1 (non-binding)

I am specifically interested in setting up a testing environment for my
company's Spark use, and I am also expecting more comprehensive documentation
on getting a development environment set up for bug fixes or new feature
development; it is currently only briefly documented in
https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/README.md.
I tried running this SPIP on my local box with one of the integration tests on
Minikube, and was able to remotely debug the KubernetesClusterManager-related
code in the Spark driver pod, which means the whole development lifecycle can
be done locally, but I hope the developer guide can provide more information.
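
As a rough illustration of the remote-debugging workflow described above (not
from the README; the port number and the pod port-forwarding step are
assumptions), the driver JVM can be made to wait for a debugger over JDWP via
the standard spark.driver.extraJavaOptions setting:

    import org.apache.spark.SparkConf

    // The driver JVM suspends on startup until a debugger attaches on port 5005.
    // Reaching that port inside the driver pod (e.g. via a port-forward) is assumed.
    object DriverDebugConf {
      val conf: SparkConf = new SparkConf()
        .set("spark.driver.extraJavaOptions",
          "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005")
    }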

Thanks
Edward Zhang



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22210.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Spark on Kubernetes

2017-08-21 Thread Anirudh Ramanathan
Thank you for your comments Imran.

Regarding integration tests,

What you inferred from the documentation is correct -
Integration tests do not require any prior setup or a Kubernetes cluster to
run. Minikube is a single binary that brings up a one-node cluster and
exposes the full Kubernetes API. It is actively maintained and kept up to
date with the rest of the project. These local integration tests on Jenkins
(like the ones with spark-on-yarn), should allow for the committers to
merge changes with a high degree of confidence.
I will update the proposal to include more information about the extent and
kinds of testing we do.
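
For readers unfamiliar with this setup, below is a minimal sketch of the kind
of local check such a Minikube-backed run enables. It is illustrative only and
not the project's actual harness; it assumes the fabric8 kubernetes-client
library and that "minikube start" has written credentials to the local
kubeconfig.

    import io.fabric8.kubernetes.client.DefaultKubernetesClient

    // Illustrative only: verify the single-node Minikube cluster is reachable
    // before running integration tests against it. The client auto-configures
    // from ~/.kube/config, which "minikube start" populates.
    object MinikubeSmokeCheck {
      def main(args: Array[String]): Unit = {
        val client = new DefaultKubernetesClient()
        try {
          val nodes = client.nodes().list().getItems
          require(!nodes.isEmpty, "expected at least the single Minikube node")
          println(s"Cluster reachable with ${nodes.size()} node(s)")
        } finally {
          client.close()
        }
      }
    }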

As for (b), people on this thread and the set of contributors on our fork
are a fairly wide community of contributors and committers who would be
involved in the maintenance long-term. It was one of the reasons behind
developing separately as a fork. In my opinion, the fact that there are
nearly no changes to spark-core, and most of our changes are additive
should go to prove that this adds little complexity to the workflow of the
committers.

Separating out the cluster managers (into an as yet undecided new home)
appears far more disruptive and a high risk change for the short term.
However, when there is enough community support behind that effort, tracked
in SPARK-19700; and if that
is realized in the future, it wouldn't be difficult to switch over
Kubernetes, YARN and Mesos to using the pluggable API. Currently, in my
opinion, with the integration tests, active users, and a community of
maintainers, Spark-on-Kubernetes would add minimal overhead and benefit a
large (and growing) class of users.

Lastly, the RSS is indeed separate and a value-add that we would love to
share with other cluster managers as well.

On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid  wrote:

> Overall this looks like a good proposal.  I do have some concerns which
> I'd like to discuss -- please understand I'm taking a "devil's advocate"
> stance here for discussion, not that I'm giving a -1.
>
> My primary concern is about testing and maintenance.  My concerns might be
> addressed if the doc included a section on testing that might just be this:
> https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/
> resource-managers/kubernetes/README.md#running-the-
> kubernetes-integration-tests
>
> but without the concerning warning "Note that the integration test
> framework is currently being heavily revised and is subject to change".
> I'd like the proposal to clearly indicate that some baseline testing can be
> done by devs and in spark's regular jenkins builds without special access
> to kubernetes clusters.
>
> It's worth noting that there *are* advantages to keeping it outside Spark:
> * when making changes to spark's scheduler, we do *not* have to worry
> about how those changes impact kubernetes.  This simplifies things for
> those making changes to spark
> * making changes to the kubernetes integration is not blocked by
> getting enough attention from spark's committers
>
> or in other words, each community of experts can maintain its focus.  I
> have these concerns based on past experience with the mesos integration --
> mesos contributors are blocked on committers reviewing their changes, and
> then committers have no idea how to test that the changes are correct, and
> find it hard to even learn the ins and outs of that code without access to
> a mesos cluster.
>
> The same could be said for the yarn integration, but I think it's helped
> that (a) spark-on-yarn *does* have local tests for testing basic
> integration and (b) there is a sufficient community of contributors and
> committers for spark-on-yarn.   I realize (b) is a chicken-and-egg problem,
> but I'd like to be sure that at least (a) is addressed.  (and maybe even
> spark-on-yarn shouldn't be inside spark itself, as Mridul said, but it's
> not clear what the other home should be.)
>
> At some point, this is just a judgement call about the value it brings to
> the spark community vs the added complexity.  I'm willing to believe that
> kubernetes will bring enough value to make this worthwhile, just voicing my
> concerns.
>
> Secondary concern:
> the RSS doesn't seem necessary for kubernetes support, or specific to it.
> If it's nice to have, and you want to add it to kubernetes first before
> other cluster managers, fine, but seems separate from this proposal.
>
>
>
> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
> fox...@google.com.invalid> wrote:
>
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend.
>> We're ~6 months in, have had 5 releases.
>>
>>- 2 Spark versions maintained (2.1, and 2.2)
>>- Extensive integration tes

Re: SPIP: Spark on Kubernetes

2017-08-21 Thread Erik Erlandson
Speaking to integration testing: the integration tests can either attach to
an existing cluster, or they can spin up their own minikube cluster to run
themselves against.

Spark-on-kube can definitely operate without the RSS, as long as spark can
find the files it needs using some other established mechanism.  We have
also previously considered reworking the RSS into a more generic kube tool.
Now that we have some experience built up with the RSS, we might revisit
this idea and discuss the potential trade-offs.


On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid  wrote:

> Overall this looks like a good proposal.  I do have some concerns which
> I'd like to discuss -- please understand I'm taking a "devil's advocate"
> stance here for discussion, not that I'm giving a -1.
>
> My primary concern is about testing and maintenance.  My concerns might be
> addressed if the doc included a section on testing that might just be this:
> https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/
> resource-managers/kubernetes/README.md#running-the-
> kubernetes-integration-tests
>
> but without the concerning warning "Note that the integration test
> framework is currently being heavily revised and is subject to change".
> I'd like the proposal to clearly indicate that some baseline testing can be
> done by devs and in spark's regular jenkins builds without special access
> to kubernetes clusters.
>
> It's worth noting that there *are* advantages to keeping it outside Spark:
> * when making changes to spark's scheduler, we do *not* have to worry
> about how those changes impact kubernetes.  This simplifies things for
> those making changes to spark
> * making changes to the kubernetes integration is not blocked by
> getting enough attention from spark's committers
>
> or in other words, each community of experts can maintain its focus.  I
> have these concerns based on past experience with the mesos integration --
> mesos contributors are blocked on committers reviewing their changes, and
> then committers have no idea how to test that the changes are correct, and
> find it hard to even learn the ins and outs of that code without access to
> a mesos cluster.
>
> The same could be said for the yarn integration, but I think it's helped
> that (a) spark-on-yarn *does* have local tests for testing basic
> integration and (b) there is a sufficient community of contributors and
> committers for spark-on-yarn.   I realize (b) is a chicken-and-egg problem,
> but I'd like to be sure that at least (a) is addressed.  (and maybe even
> spark-on-yarn shouldn't be inside spark itself, as Mridul said, but it's
> not clear what the other home should be.)
>
> At some point, this is just a judgement call about the value it brings to
> the spark community vs the added complexity.  I'm willing to believe that
> kubernetes will bring enough value to make this worthwhile, just voicing my
> concerns.
>
> Secondary concern:
> the RSS doesn't seem necessary for kubernetes support, or specific to it.
> If it's nice to have, and you want to add it to kubernetes first before
> other cluster managers, fine, but seems separate from this proposal.
>
>
>
> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
> fox...@google.com.invalid> wrote:
>
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend.
>> We're ~6 months in, have had 5 releases.
>>
>>- 2 Spark versions maintained (2.1, and 2.2)
>>- Extensive integration testing and refactoring efforts to maintain
>>code quality
>>- Developer and user-facing documentation
>>- 10+ consistent code contributors from different organizations involved
>>in actively maintaining and using the project, with several more members
>>involved in testing and providing feedback.
>>- The community has delivered several talks on Spark-on-Kubernetes
>>generating lots of feedback from users.
>>- In addition to these, we've seen efforts spawn off such as:
>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>- +1: Yeah, let's go forward and implement the SPIP.
>>- +0: Don't really care.
>>- -1: I don't think this is a good idea because of the follow

Re: SPIP: Spark on Kubernetes

2017-08-21 Thread Imran Rashid
Overall this looks like a good proposal.  I do have some concerns which I'd
like to discuss -- please understand I'm taking a "devil's advocate" stance
here for discussion, not that I'm giving a -1.

My primary concern is about testing and maintenance.  My concerns might be
addressed if the doc included a section on testing that might just be this:
https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/README.md#running-the-kubernetes-integration-tests

but without the concerning warning "Note that the integration test
framework is currently being heavily revised and is subject to change".
I'd like the proposal to clearly indicate that some baseline testing can be
done by devs and in spark's regular jenkins builds without special access
to kubernetes clusters.

It's worth noting that there *are* advantages to keeping it outside Spark:
* when making changes to spark's scheduler, we do *not* have to worry about
how those changes impact kubernetes.  This simplifies things for those
making changes to spark
* making changes to the kubernetes integration is not blocked by
getting enough attention from spark's committers

or in other words, each community of experts can maintain its focus.  I
have these concerns based on past experience with the mesos integration --
mesos contributors are blocked on committers reviewing their changes, and
then committers have no idea how to test that the changes are correct, and
find it hard to even learn the ins and outs of that code without access to
a mesos cluster.

The same could be said for the yarn integration, but I think it's helped
that (a) spark-on-yarn *does* have local tests for testing basic
integration and (b) there is a sufficient community of contributors and
committers for spark-on-yarn.   I realize (b) is a chicken-and-egg problem,
but I'd like to be sure that at least (a) is addressed.  (and maybe even
spark-on-yarn shouldn't be inside spark itself, as Mridul said, but it's
not clear what the other home should be.)

At some point, this is just a judgement call about the value it brings to the
spark community vs the added complexity.  I'm willing to believe that
kubernetes will bring enough value to make this worthwhile, just voicing my
concerns.

Secondary concern:
the RSS doesn't seem necessary for kubernetes support, or specific to it.
If its nice to have, and you want to add it to kubernetes first before
other cluster managers, fine, but seems separate from this proposal.



On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
fox...@google.com.invalid> wrote:

> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend.
> We're ~6 months in, have had 5 releases.
>
>- 2 Spark versions maintained (2.1, and 2.2)
>- Extensive integration testing and refactoring efforts to maintain
>code quality
>- Developer and user-facing documentation
>- 10+ consistent code contributors from different organizations involved
>in actively maintaining and using the project, with several more members
>involved in testing and providing feedback.
>- The community has delivered several talks on Spark-on-Kubernetes
>generating lots of feedback from users.
>- In addition to these, we've seen efforts spawn off such as:
>   - HDFS on Kubernetes with Locality and Performance Experiments
>   - Kerberized access to HDFS from Spark running on Kubernetes
>
> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>
>- +1: Yeah, let's go forward and implement the SPIP.
>- +0: Don't really care.
>- -1: I don't think this is a good idea because of the following
>technical reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
> Full Design Doc: link
> 
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
> Background and Motivation
>
> Containeriza

Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Sudarshan Kadambi
+1 (non-binding) 

We are evaluating Kubernetes for a variety of data processing workloads.
Spark is the natural choice for some of these workloads. Native Spark on
Kubernetes is of interest to us as it brings in dynamic allocation, resource
isolation and improved notions of security. 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22197.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Matt Cheah
The problem with maintaining this scheduler separately right now is that the 
scheduler backend is dependent upon the CoarseGrainedSchedulerBackend class, 
which is not as much a stable API as it is an internal class with components 
that currently need to be shared by all of the scheduler backends. This makes 
it such that maintaining this scheduler requires not just maintaining a single 
small module, but an entire fork of the project as well, so that the cluster 
manager specific scheduler backend can keep up with the changes to 
CoarseGrainedSchedulerBackend. If we wanted to avoid forking the entire project 
and only provide the scheduler backend as a pluggable module, we would need a 
fully pluggable scheduler backend with a stable API, as Erik mentioned. We also 
needed to change the spark-submit code to recognize Kubernetes mode and be able 
to delegate to the Kubernetes submission client, so that would need to be 
pluggable as well.

 

More discussion on fully pluggable scheduler backends is at 
https://issues.apache.org/jira/browse/SPARK-19700.
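
To make the shape of this concrete, here is a rough sketch of the kind of
internal hook a cluster-manager backend plugs into today. ExternalClusterManager
is a private[spark] trait discovered via java.util.ServiceLoader rather than a
stable public SPI, which is the gap discussed there; the signatures below are
reproduced from memory and should be treated as illustrative, not as the fork's
actual code.

    package org.apache.spark.scheduler.cluster.k8s

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend,
      TaskScheduler, TaskSchedulerImpl}

    // Illustrative sketch of a Kubernetes ExternalClusterManager implementation.
    private[spark] class KubernetesClusterManager extends ExternalClusterManager {

      // Claim master URLs of the form k8s://https://<apiserver-host>:<port>.
      override def canCreate(masterURL: String): Boolean =
        masterURL.startsWith("k8s://")

      override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
        new TaskSchedulerImpl(sc)

      override def createSchedulerBackend(
          sc: SparkContext,
          masterURL: String,
          scheduler: TaskScheduler): SchedulerBackend = {
        // A real backend would subclass CoarseGrainedSchedulerBackend here and
        // request executor pods from the Kubernetes API server -- the internal
        // coupling described above.
        ???
      }

      override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
        scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
    }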

 

-Matt Cheah

 

From: Erik Erlandson 
Date: Friday, August 18, 2017 at 8:34 AM
To: "dev@spark.apache.org" 
Subject: Re: SPIP: Spark on Kubernetes

 

There are a fair number of people (myself included) who have interest in making 
scheduler back-ends fully pluggable.  That will represent a significant impact 
to core spark architecture, with corresponding risk. Adding the kubernetes 
back-end in a manner similar to the other three back-ends has had a very small 
impact on spark core, which allowed it to be developed in parallel and easily 
stay re-based on successive spark releases while we were developing it and 
building up community support.

 

On Thu, Aug 17, 2017 at 7:14 PM, Mridul Muralidharan  wrote:

While I definitely support the idea of Apache Spark being able to
leverage kubernetes, IMO it is better for long term evolution of spark
to expose appropriate SPI such that this support need not necessarily
live within Apache Spark code base.
It will allow for multiple backends to evolve, decoupled from spark core.
In this case, it would have made maintaining the apache-spark-on-k8s repo
easier, just as it would allow for supporting other backends -
open source (Nomad, for example) and proprietary.

In retrospect directly integrating yarn support into spark, while
mirroring mesos support at that time, was probably an incorrect design
choice on my part.


Regards,
Mridul

On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
 wrote:

> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend. We're
> ~6 months in, have had 5 releases.
>
> 2 Spark versions maintained (2.1, and 2.2)
> Extensive integration testing and refactoring efforts to maintain code
> quality
> Developer and user-facing documentation
> 10+ consistent code contributors from different organizations involved in
> actively maintaining and using the project, with several more members
> involved in testing and providing feedback.
> The community has delivered several talks on Spark-on-Kubernetes generating
> lots of feedback from users.
> In addition to these, we've seen efforts spawn off such as:
>
> HDFS on Kubernetes with Locality and Performance Experiments
> Kerberized access to HDFS from Spark running on Kubernetes
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
>
> Full Design Doc: link
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
>
> Background and Motivation
>
> Containerization and cluster management technologies are constantly evolving
> in the cluster computing world. Apache Spark currently implements support
> for Apache Hadoop YARN and Apache Mesos, in addition to providing its own
> standalone cluster manager. In 2014, Google announced development of
> Kubernetes which has its own unique feature set and differentiates itself
> from YARN and Mesos. Since its debut, it has seen contributions from over
> 1300 contributors with over 5 commits. Kubernetes has cemented itself as
> a core player in the cluster computing world, and cloud-computing providers
> s

Re: SPIP: Spark on Kubernetes

2017-08-18 Thread varunkatta
+1 (non-binding)



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22195.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Erik Erlandson
There are a fair number of people (myself included) who have interest in
making scheduler back-ends fully pluggable.  That will represent a
significant impact to core spark architecture, with corresponding risk.
Adding the kubernetes back-end in a manner similar to the other three
back-ends has had a very small impact on spark core, which allowed it to be
developed in parallel and easily stay re-based on successive spark releases
while we were developing it and building up community support.

On Thu, Aug 17, 2017 at 7:14 PM, Mridul Muralidharan 
wrote:

> While I definitely support the idea of Apache Spark being able to
> leverage kubernetes, IMO it is better for long term evolution of spark
> to expose appropriate SPI such that this support need not necessarily
> live within Apache Spark code base.
> It will allow for multiple backends to evolve, decoupled from spark core.
> In this case, it would have made maintaining the apache-spark-on-k8s repo
> easier, just as it would allow for supporting other backends -
> open source (Nomad, for example) and proprietary.
>
> In retrospect directly integrating yarn support into spark, while
> mirroring mesos support at that time, was probably an incorrect design
> choice on my part.
>
>
> Regards,
> Mridul
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
>  wrote:
> > Spark on Kubernetes effort has been developed separately in a fork, and
> > linked back from the Apache Spark project as an experimental backend.
> We're
> > ~6 months in, have had 5 releases.
> >
> > 2 Spark versions maintained (2.1, and 2.2)
> > Extensive integration testing and refactoring efforts to maintain code
> > quality
> > Developer and user-facing documentation
> > 10+ consistent code contributors from different organizations involved in
> > actively maintaining and using the project, with several more members
> > involved in testing and providing feedback.
> > The community has delivered several talks on Spark-on-Kubernetes
> generating
> > lots of feedback from users.
> > In addition to these, we've seen efforts spawn off such as:
> >
> > HDFS on Kubernetes with Locality and Performance Experiments
> > Kerberized access to HDFS from Spark running on Kubernetes
> >
> > Following the SPIP process, I'm putting this SPIP up for a vote.
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following technical
> > reasons.
> >
> > If there is any further clarification desired, on the design or the
> > implementation, please feel free to ask questions or provide feedback.
> >
> >
> > SPIP: Kubernetes as A Native Cluster Manager
> >
> >
> > Full Design Doc: link
> >
> > JIRA: https://issues.apache.org/jira/browse/SPARK-18278
> >
> > Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
> >
> >
> > Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> > Cheah,
> >
> > Ilan Filonenko, Sean Suchter, Kimoon Kim
> >
> > Background and Motivation
> >
> > Containerization and cluster management technologies are constantly
> evolving
> > in the cluster computing world. Apache Spark currently implements support
> > for Apache Hadoop YARN and Apache Mesos, in addition to providing its own
> > standalone cluster manager. In 2014, Google announced development of
> > Kubernetes which has its own unique feature set and differentiates itself
> > from YARN and Mesos. Since its debut, it has seen contributions from over
> > 1300 contributors with over 5 commits. Kubernetes has cemented
> itself as
> > a core player in the cluster computing world, and cloud-computing
> providers
> > such as Google Container Engine, Google Compute Engine, Amazon Web
> Services,
> > and Microsoft Azure support running Kubernetes clusters.
> >
> >
> > This document outlines a proposal for integrating Apache Spark with
> > Kubernetes in a first class way, adding Kubernetes to the list of cluster
> > managers that Spark can be used with. Doing so would allow users to share
> > their computing resources and containerization framework between their
> > existing applications on Kubernetes and their computational Spark
> > applications. Although there is existing support for running a Spark
> > standalone cluster on Kubernetes, there are still major advantages and
> > significant interest in having native execution support. For example,
> this
> > integration provides better support for multi-tenancy and dynamic
> resource
> > allocation. It also allows users to run applications of different Spark
> > versions of their choices in the same cluster.
> >
> >
> > The feature is being developed in a separate fork in order to minimize
> risk
> > to the main project during development. Since the start of the
> development
> > in November of 2016, it has received over 100 commits from over 20
> > contributors and supports two releases based on Spark 2.1 and 2.2
> > respectively. Documentation is also being actively worked on 

Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Mridul Muralidharan
While I definitely support the idea of Apache Spark being able to
leverage Kubernetes, IMO it is better for the long-term evolution of Spark
to expose an appropriate SPI so that this support need not necessarily
live within the Apache Spark code base.
It will allow multiple backends to evolve, decoupled from Spark core.
In this case, it would have made maintaining the apache-spark-on-k8s repo
easier, just as it would allow for supporting other backends, both
open source (Nomad, for example) and proprietary.

In retrospect directly integrating yarn support into spark, while
mirroring mesos support at that time, was probably an incorrect design
choice on my part.


Regards,
Mridul

On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
 wrote:
> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend. We're
> ~6 months in, have had 5 releases.
>
> 2 Spark versions maintained (2.1, and 2.2)
> Extensive integration testing and refactoring efforts to maintain code
> quality
> Developer and user-facing documentation
> 10+ consistent code contributors from different organizations involved in
> actively maintaining and using the project, with several more members
> involved in testing and providing feedback.
> The community has delivered several talks on Spark-on-Kubernetes generating
> lots of feedback from users.
> In addition to these, we've seen efforts spawn off such as:
>
> HDFS on Kubernetes with Locality and Performance Experiments
> Kerberized access to HDFS from Spark running on Kubernetes
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
>
> Full Design Doc: link
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
>
> Background and Motivation
>
> Containerization and cluster management technologies are constantly evolving
> in the cluster computing world. Apache Spark currently implements support
> for Apache Hadoop YARN and Apache Mesos, in addition to providing its own
> standalone cluster manager. In 2014, Google announced development of
> Kubernetes which has its own unique feature set and differentiates itself
> from YARN and Mesos. Since its debut, it has seen contributions from over
> 1300 contributors with over 5 commits. Kubernetes has cemented itself as
> a core player in the cluster computing world, and cloud-computing providers
> such as Google Container Engine, Google Compute Engine, Amazon Web Services,
> and Microsoft Azure support running Kubernetes clusters.
>
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes, there are still major advantages and
> significant interest in having native execution support. For example, this
> integration provides better support for multi-tenancy and dynamic resource
> allocation. It also allows users to run applications of different Spark
> versions of their choices in the same cluster.
>
>
> The feature is being developed in a separate fork in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use
> cases, we have seen cluster setup that uses 1000+ cores. We are also seeing
> growing interests on this project from more and more organizations.
>
>
> While it is easy to bootstrap the project in a forked repository, it is hard
> to maintain it in the long run because of the tricky process of rebasing
> onto the upstream and lack of awareness in the large Spark community. It
> would be beneficial to both the Spark and Kubernetes community seeing this
> feature being merged upstream. On one hand, it gives Spark users the option
> of running their Spark workloads along with other workloads that may already
> be runni

Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Chris Fregly
@reynold:

Databricks runs their proprietary product on Kubernetes. How about 
contributing some of that work back to the open source community?

—

Chris Fregly
Founder and Research Engineer @ PipelineAI <http://pipeline.io/>
Founder @ Advanced Spark and TensorFlow Meetup 
<http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup>
San Francisco - Chicago - Washington DC - London

> On Aug 17, 2017, at 10:55 AM, Reynold Xin  wrote:
> 
> +1 on adding Kubernetes support in Spark (as a separate module similar to how 
> YARN is done)
> 
> I talk with a lot of developers and teams that operate cloud services, and 
> k8s in the last year has definitely become one of the key projects, if not 
> the one with the strongest momentum in this space. I'm not 100% sure we can 
> make it into 2.3 but IMO based on the activities in the forked repo and 
> claims that certain deployments are already running in production, this could 
> already be a solid project and will have everlasting positive impact.
> 
> 
> 
> On Wed, Aug 16, 2017 at 10:24 AM, Alexander Bezzubov  <mailto:b...@apache.org>> wrote:
> +1 (non-binding)
> 
> 
> Looking forward using it as part of Apache Spark release, instead of 
> Standalone cluster deployed on top of k8s.
> 
> 
> --
> Alex
> 
> On Wed, Aug 16, 2017 at 11:11 AM, Ismaël Mejía  <mailto:ieme...@gmail.com>> wrote:
> +1 (non-binding)
> 
> This is something really great to have. More schedulers and runtime
> environments are a HUGE win for the Spark ecosystem.
> Amazing work, Big kudos for the guys who created and continue working on this.
> 
> On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com 
> <mailto:lucas.g...@gmail.com>
> mailto:lucas.g...@gmail.com>> wrote:
> > From our perspective, we have invested heavily in Kubernetes as our cluster
> > manager of choice.
> >
> > We also make quite heavy use of spark.  We've been experimenting with using
> > these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
> > already 'paid the price' to operate Kubernetes in AWS it seems rational to
> > move our jobs over to spark on k8s.  Having this project merged into the
> > master will significantly ease keeping our Data Munging toolchain primarily
> > on Spark.
> >
> >
> > Gary Lucas
> > Data Ops Team Lead
> > Unbounce
> >
> > On 15 August 2017 at 15:52, Andrew Ash  > <mailto:and...@andrewash.com>> wrote:
> >>
> >> +1 (non-binding)
> >>
> >> We're moving large amounts of infrastructure from a combination of open
> >> source and homegrown cluster management systems to unify on Kubernetes and
> >> want to bring Spark workloads along with us.
> >>
> >> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  >> <mailto:liyinan...@gmail.com>> wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
> >>>  
> >>> <http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html>
> >>> Sent from the Apache Spark Developers List mailing list archive at
> >>> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> >>> <mailto:dev-unsubscr...@spark.apache.org>
> >>>
> >>
> >
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> 
> 
> 



Fwd: SPIP: Spark on Kubernetes

2017-08-17 Thread Timothy Chen
-- Forwarded message --
From: Timothy Chen 
Date: Thu, Aug 17, 2017 at 2:48 PM
Subject: Re: SPIP: Spark on Kubernetes
To: Marcelo Vanzin 


Hi Marcelo,

Agree with your points, and I had the same thought around the Resource
Staging Server; I'd like to share that with Spark on Mesos (once this
is, or can be, merged).

For your last point, I would love to get a more concerted effort going
around abstracting out the cluster manager more cleanly in Spark after
this SPIP.

Tim

On Thu, Aug 17, 2017 at 2:40 PM, Marcelo Vanzin  wrote:
> I have just some very high level knowledge of kubernetes, so I can't
> really comment on the details of the proposal that relate to it. But I
> have some comments about other areas of the linked documents:
>
> - It's good to know that there's a community behind this effort and
> mentions of lots of testing. As Reynold mentioned on jira, this is a
> part of Spark that needs very good testing. Even YARN doesn't have
> comprehensive testing built into the Spark test suite, it mostly
> relies on the fact that a lot of Spark developers use YARN so that we
> get test coverage for things like security.
>
> - The "Resource Staging Server" is something that can be useful also
> for standalone and Mesos (YARN has its own thing). It would be nice to
> keep it generic enough that it could be used or embedded in other
> cluster managers.
>
> - It would be good to get more details about the security model here;
> how do applications authenticate to the RSS above, how
> are shared secrets distributed (so you can set up encryption securely
> for individual Spark apps), things like that.
>
> - Same concerns above apply to the kubernetes-specific shuffle
> service; I believe the base shuffle service today doesn't have very
> strong security (single shared secret that all apps need to know about IIRC),
> and only the YARN implementation has proper application isolation.
>
> - I see there's some talk about accessing Kerberos-secured services
> (explicitly mentions HDFS but I'll treat it as "generic
> Hadoop+Kerberos" support). There's already ongoing effort to make
> Spark-on-Mesos support Kerberos (SPARK-16742), which has been going on
> by mostly making the existing YARN Kerberos integration more generic.
> It would be good if this project followed that instead of trying to
> create its own way of dealing with Kerberos.
>
> - I was hoping that as part of this we'd see some effort into
> modularizing SparkSubmit somehow; that's a pretty hairy piece of code
> to navigate, and adding more cluster manager-specific code will
> probably not make that better.
>
> That being said, I don't see any of those as blockers for an initial
> version. So adding my +1.
>
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
>  wrote:
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend. We're
>> ~6 months in, have had 5 releases.
>>
>> 2 Spark versions maintained (2.1, and 2.2)
>> Extensive integration testing and refactoring efforts to maintain code
>> quality
>> Developer and user-facing documentation
>> 10+ consistent code contributors from different organizations involved in
>> actively maintaining and using the project, with several more members
>> involved in testing and providing feedback.
>> The community has delivered several talks on Spark-on-Kubernetes generating
>> lots of feedback from users.
>> In addition to these, we've seen efforts spawn off such as:
>>
>> HDFS on Kubernetes with Locality and Performance Experiments
>> Kerberized access to HDFS from Spark running on Kubernetes
>>
>> Following the SPIP process, I'm putting this SPIP up for a vote.
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>>
>> Full Design Doc: link
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah,
>>
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>>
>> Background and Motivation
>>
>> Containerization and cluster managemen

Re: SPIP: Spark on Kubernetes

2017-08-17 Thread michael mccune

+1 (non-binding)

peace o/




Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Marcelo Vanzin
I have just some very high level knowledge of kubernetes, so I can't
really comment on the details of the proposal that relate to it. But I
have some comments about other areas of the linked documents:

- It's good to know that there's a community behind this effort and
mentions of lots of testing. As Reynold mentioned on jira, this is a
part of Spark that needs very good testing. Even YARN doesn't have
comprehensive testing built into the Spark test suite; it mostly
relies on the fact that a lot of Spark developers use YARN, so we
get test coverage for things like security.

- The "Resource Staging Server" is something that can be useful also
for standalone and Mesos (YARN has its own thing). It would be nice to
keep it generic enough that it could be used or embedded in other
cluster managers.

- It would be good to get more details about the security model here;
how do applications authenticate to the RSS above, how
are shared secrets distributed (so you can set up encryption securely
for individual Spark apps), things like that.

- Same concerns above apply to the kubernetes-specific shuffle
service; I believe the base shuffle service today doesn't have very
strong security (single shared secret that all apps need to know about IIRC),
and only the YARN implementation has proper application isolation.

- I see there's some talk about accessing Kerberos-secured services
(explicitly mentions HDFS but I'll treat it as "generic
Hadoop+Kerberos" support). There's already ongoing effort to make
Spark-on-Mesos support Kerberos (SPARK-16742), which has been going on
by mostly making the existing YARN Kerberos integration more generic.
It would be good if this project followed that instead of trying to
create its own way of dealing with Kerberos.

- I was hoping that as part of this we'd see some effort into
modularizing SparkSubmit somehow; that's a pretty hairy piece of code
to navigate, and adding more cluster manager-specific code will
probably not make that better.

That being said, I don't see any of those as blockers for an initial
version. So adding my +1.
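
On the shared-secret point: for concreteness, the existing model is configured
with standard Spark properties along these lines (the secret value below is a
placeholder, and nothing in this sketch is specific to the proposed Kubernetes
shuffle service):

    import org.apache.spark.SparkConf

    // Minimal sketch of today's shared-secret authentication. Outside of YARN,
    // every application (and the external shuffle service) has to be configured
    // with the same secret, which is the isolation weakness noted above.
    val conf = new SparkConf()
      .set("spark.authenticate", "true")                  // enable SASL authentication
      .set("spark.authenticate.secret", "cluster-secret") // one cluster-wide shared secret
      .set("spark.shuffle.service.enabled", "true")       // use the external shuffle service
      .set("spark.network.crypto.enabled", "true")        // optionally encrypt RPC with that secret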


On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
 wrote:
> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend. We're
> ~6 months in, have had 5 releases.
>
> 2 Spark versions maintained (2.1, and 2.2)
> Extensive integration testing and refactoring efforts to maintain code
> quality
> Developer and user-facing documentation
> 10+ consistent code contributors from different organizations involved in
> actively maintaining and using the project, with several more members
> involved in testing and providing feedback.
> The community has delivered several talks on Spark-on-Kubernetes generating
> lots of feedback from users.
> In addition to these, we've seen efforts spawn off such as:
>
> HDFS on Kubernetes with Locality and Performance Experiments
> Kerberized access to HDFS from Spark running on Kubernetes
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
>
> Full Design Doc: link
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
>
> Background and Motivation
>
> Containerization and cluster management technologies are constantly evolving
> in the cluster computing world. Apache Spark currently implements support
> for Apache Hadoop YARN and Apache Mesos, in addition to providing its own
> standalone cluster manager. In 2014, Google announced development of
> Kubernetes which has its own unique feature set and differentiates itself
> from YARN and Mesos. Since its debut, it has seen contributions from over
> 1300 contributors with over 5 commits. Kubernetes has cemented itself as
> a core player in the cluster computing world, and cloud-computing providers
> such as Google Container Engine, Google Compute Engine, Amazon Web Services,
> and Microsoft Azure support running Kubernetes clusters.
>
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes, ther

Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Matei Zaharia
+1 from me as well.

Matei

> On Aug 17, 2017, at 10:55 AM, Reynold Xin  wrote:
> 
> +1 on adding Kubernetes support in Spark (as a separate module similar to how 
> YARN is done)
> 
> I talk with a lot of developers and teams that operate cloud services, and 
> k8s in the last year has definitely become one of the key projects, if not 
> the one with the strongest momentum in this space. I'm not 100% sure we can 
> make it into 2.3 but IMO based on the activities in the forked repo and 
> claims that certain deployments are already running in production, this could 
> already be a solid project and will have everlasting positive impact.
> 
> 
> 
> On Wed, Aug 16, 2017 at 10:24 AM, Alexander Bezzubov  wrote:
> +1 (non-binding)
> 
> 
> Looking forward using it as part of Apache Spark release, instead of 
> Standalone cluster deployed on top of k8s.
> 
> 
> --
> Alex
> 
> On Wed, Aug 16, 2017 at 11:11 AM, Ismaël Mejía  wrote:
> +1 (non-binding)
> 
> This is something really great to have. More schedulers and runtime
> environments are a HUGE win for the Spark ecosystem.
> Amazing work, Big kudos for the guys who created and continue working on this.
> 
> On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
>  wrote:
> > From our perspective, we have invested heavily in Kubernetes as our cluster
> > manager of choice.
> >
> > We also make quite heavy use of spark.  We've been experimenting with using
> > these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
> > already 'paid the price' to operate Kubernetes in AWS it seems rational to
> > move our jobs over to spark on k8s.  Having this project merged into the
> > master will significantly ease keeping our Data Munging toolchain primarily
> > on Spark.
> >
> >
> > Gary Lucas
> > Data Ops Team Lead
> > Unbounce
> >
> > On 15 August 2017 at 15:52, Andrew Ash  wrote:
> >>
> >> +1 (non-binding)
> >>
> >> We're moving large amounts of infrastructure from a combination of open
> >> source and homegrown cluster management systems to unify on Kubernetes and
> >> want to bring Spark workloads along with us.
> >>
> >> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
> >>> Sent from the Apache Spark Developers List mailing list archive at
> >>> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>
> >
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 
> 





Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Reynold Xin
+1 on adding Kubernetes support in Spark (as a separate module similar to
how YARN is done)

I talk with a lot of developers and teams that operate cloud services, and
k8s in the last year has definitely become one of the key projects, if not
the one with the strongest momentum in this space. I'm not 100% sure we can
make it into 2.3 but IMO based on the activities in the forked repo and
claims that certain deployments are already running in production, this
could already be a solid project and will have everlasting positive impact.



On Wed, Aug 16, 2017 at 10:24 AM, Alexander Bezzubov  wrote:

> +1 (non-binding)
>
>
> Looking forward using it as part of Apache Spark release, instead of
> Standalone cluster deployed on top of k8s.
>
>
> --
> Alex
>
> On Wed, Aug 16, 2017 at 11:11 AM, Ismaël Mejía  wrote:
>
>> +1 (non-binding)
>>
>> This is something really great to have. More schedulers and runtime
>> environments are a HUGE win for the Spark ecosystem.
>> Amazing work, Big kudos for the guys who created and continue working on
>> this.
>>
>> On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
>>  wrote:
>> > From our perspective, we have invested heavily in Kubernetes as our
>> cluster
>> > manager of choice.
>> >
>> > We also make quite heavy use of spark.  We've been experimenting with
>> using
>> > these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
>> > already 'paid the price' to operate Kubernetes in AWS it seems rational
>> to
>> > move our jobs over to spark on k8s.  Having this project merged into the
>> > master will significantly ease keeping our Data Munging toolchain
>> primarily
>> > on Spark.
>> >
>> >
>> > Gary Lucas
>> > Data Ops Team Lead
>> > Unbounce
>> >
>> > On 15 August 2017 at 15:52, Andrew Ash  wrote:
>> >>
>> >> +1 (non-binding)
>> >>
>> >> We're moving large amounts of infrastructure from a combination of open
>> >> source and homegrown cluster management systems to unify on Kubernetes
>> and
>> >> want to bring Spark workloads along with us.
>> >>
>> >> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926 
>> wrote:
>> >>>
>> >>> +1 (non-binding)
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>> SPIP-Spark-on-Kubernetes-tp22147p22164.html
>> >>> Sent from the Apache Spark Developers List mailing list archive at
>> >>> Nabble.com.
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>> >>
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: SPIP: Spark on Kubernetes

2017-08-16 Thread Alexander Bezzubov
+1 (non-binding)


Looking forward using it as part of Apache Spark release, instead of
Standalone cluster deployed on top of k8s.


--
Alex

On Wed, Aug 16, 2017 at 11:11 AM, Ismaël Mejía  wrote:

> +1 (non-binding)
>
> This is something really great to have. More schedulers and runtime
> environments are a HUGE win for the Spark ecosystem.
> Amazing work, Big kudos for the guys who created and continue working on
> this.
>
> On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
>  wrote:
> > From our perspective, we have invested heavily in Kubernetes as our
> cluster
> > manager of choice.
> >
> > We also make quite heavy use of spark.  We've been experimenting with
> using
> > these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
> > already 'paid the price' to operate Kubernetes in AWS it seems rational
> to
> > move our jobs over to spark on k8s.  Having this project merged into the
> > master will significantly ease keeping our Data Munging toolchain
> primarily
> > on Spark.
> >
> >
> > Gary Lucas
> > Data Ops Team Lead
> > Unbounce
> >
> > On 15 August 2017 at 15:52, Andrew Ash  wrote:
> >>
> >> +1 (non-binding)
> >>
> >> We're moving large amounts of infrastructure from a combination of open
> >> source and homegrown cluster management systems to unify on Kubernetes
> and
> >> want to bring Spark workloads along with us.
> >>
> >> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926 
> wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-developers-list.1001551.n3.
> nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
> >>> Sent from the Apache Spark Developers List mailing list archive at
> >>> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: SPIP: Spark on Kubernetes

2017-08-16 Thread Jean-Baptiste Onofré
+1 as well.

Regards
JB

On Aug 16, 2017, at 10:12, "Ismaël Mejía"  wrote:
>+1 (non-binding)
>
>This is something really great to have. More schedulers and runtime
>environments are a HUGE win for the Spark ecosystem.
>Amazing work, Big kudos for the guys who created and continue working
>on this.
>
>On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
> wrote:
>> From our perspective, we have invested heavily in Kubernetes as our
>cluster
>> manager of choice.
>>
>> We also make quite heavy use of spark.  We've been experimenting with
>using
>> these builds (2.1 with pyspark enabled) quite heavily.  Given that
>we've
>> already 'paid the price' to operate Kubernetes in AWS it seems
>rational to
>> move our jobs over to spark on k8s.  Having this project merged into
>the
>> master will significantly ease keeping our Data Munging toolchain
>primarily
>> on Spark.
>>
>>
>> Gary Lucas
>> Data Ops Team Lead
>> Unbounce
>>
>> On 15 August 2017 at 15:52, Andrew Ash  wrote:
>>>
>>> +1 (non-binding)
>>>
>>> We're moving large amounts of infrastructure from a combination of
>open
>>> source and homegrown cluster management systems to unify on
>Kubernetes and
>>> want to bring Spark workloads along with us.
>>>
>>> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926 
>wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>>
>http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
>>>> Sent from the Apache Spark Developers List mailing list archive at
>>>> Nabble.com.
>>>>
>>>>
>-
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>
>>
>
>-
>To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: SPIP: Spark on Kubernetes

2017-08-16 Thread Ismaël Mejía
+1 (non-binding)

This is something really great to have. More schedulers and runtime
environments are a HUGE win for the Spark ecosystem.
Amazing work, Big kudos for the guys who created and continue working on this.

On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
 wrote:
> From our perspective, we have invested heavily in Kubernetes as our cluster
> manager of choice.
>
> We also make quite heavy use of spark.  We've been experimenting with using
> these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
> already 'paid the price' to operate Kubernetes in AWS it seems rational to
> move our jobs over to spark on k8s.  Having this project merged into the
> master will significantly ease keeping our Data Munging toolchain primarily
> on Spark.
>
>
> Gary Lucas
> Data Ops Team Lead
> Unbounce
>
> On 15 August 2017 at 15:52, Andrew Ash  wrote:
>>
>> +1 (non-binding)
>>
>> We're moving large amounts of infrastructure from a combination of open
>> source and homegrown cluster management systems to unify on Kubernetes and
>> want to bring Spark workloads along with us.
>>
>> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  wrote:
>>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>




Re: SPIP: Spark on Kubernetes

2017-08-15 Thread lucas.g...@gmail.com
From our perspective, we have invested heavily in Kubernetes as our cluster
manager of choice.

We also make quite heavy use of spark.  We've been experimenting with using
these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
already 'paid the price' to operate Kubernetes in AWS it seems rational to
move our jobs over to spark on k8s.  Having this project merged into the
master will significantly ease keeping our Data Munging toolchain primarily
on Spark.


Gary Lucas
Data Ops Team Lead
Unbounce

On 15 August 2017 at 15:52, Andrew Ash  wrote:

> +1 (non-binding)
>
> We're moving large amounts of infrastructure from a combination of open
> source and homegrown cluster management systems to unify on Kubernetes and
> want to bring Spark workloads along with us.
>
> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  wrote:
>
>> +1 (non-binding)
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: SPIP: Spark on Kubernetes

2017-08-15 Thread 李书明
+1






On 2017-08-16 04:53, Jiri Kremser wrote:
+1 (non-binding)



On Tue, Aug 15, 2017 at 10:19 PM, Shubham Chopra  
wrote:

+1 (non-binding)


~Shubham.


On Tue, Aug 15, 2017 at 2:11 PM, Erik Erlandson  wrote:



Kubernetes has evolved into an important container orchestration platform; it 
has a large and growing user base and an active ecosystem.  Users of Apache 
Spark who are also deploying applications on Kubernetes (or are planning to) 
will have convergence-related motivations for migrating their Spark 
applications to Kubernetes as well. It avoids the need for deploying separate 
cluster infra for Spark workloads and allows Spark applications to take full 
advantage of inhabiting the same orchestration environment as other 
applications.  In this respect, native Kubernetes support for Spark represents 
a way to optimize uptake and retention of Apache Spark among the members of the 
expanding Kubernetes community.



On Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson  wrote:

+1 (non-binding)




On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan  wrote:

Spark on Kubernetes effort has been developed separately in a fork, and linked 
back from the Apache Spark project as an experimental backend. We're ~6 months 
in, have had 5 releases. 

2 Spark versions maintained (2.1, and 2.2)
Extensive integration testing and refactoring efforts to maintain code quality
Developer and user-facing documentation
10+ consistent code contributors from different organizations involved in 
actively maintaining and using the project, with several more members involved 
in testing and providing feedback.
The community has delivered several talks on Spark-on-Kubernetes generating 
lots of feedback from users.
In addition to these, we've seen efforts spawn off such as:

HDFS on Kubernetes with Locality and Performance Experiments

Kerberized access to HDFS from Spark running on Kubernetes

Following the SPIP process, I'm putting this SPIP up for a vote.

+1: Yeah, let's go forward and implement the SPIP.

+0: Don't really care.

-1: I don't think this is a good idea because of the following technical 
reasons.
If there is any further clarification desired, on the design or the 
implementation, please feel free to ask questions or provide feedback.




SPIP: Kubernetes as A Native Cluster Manager




Full Design Doc: link


JIRA: https://issues.apache.org/jira/browse/SPARK-18278

Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377




Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah,

Ilan Filonenko, Sean Suchter, Kimoon Kim

Background and Motivation

Containerization and cluster management technologies are constantly evolving in 
the cluster computing world. Apache Spark currently implements support for 
Apache Hadoop YARN and Apache Mesos, in addition to providing its own 
standalone cluster manager. In 2014, Google announced development of Kubernetes 
which has its own unique feature set and differentiates itself from YARN and 
Mesos. Since its debut, it has seen contributions from over 1300 contributors 
with over 5 commits. Kubernetes has cemented itself as a core player in the 
cluster computing world, and cloud-computing providers such as Google Container 
Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azure support 
running Kubernetes clusters.




This document outlines a proposal for integrating Apache Spark with Kubernetes 
in a first class way, adding Kubernetes to the list of cluster managers that 
Spark can be used with. Doing so would allow users to share their computing 
resources and containerization framework between their existing applications on 
Kubernetes and their computational Spark applications. Although there is 
existing support for running a Spark standalone cluster on Kubernetes, there 
are still major advantages and significant interest in having native execution 
support. For example, this integration provides better support for 
multi-tenancy and dynamic resource allocation. It also allows users to run 
applications of different Spark versions of their choices in the same cluster.
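
To ground the dynamic resource allocation point, these are the standard Spark 
properties a native backend has to honor; the values are illustrative, and the 
Kubernetes-specific property names themselves are documented in the fork rather 
than assumed here. For example, in spark-shell:

    import org.apache.spark.SparkConf

    // Standard dynamic-allocation settings: executors are requested and
    // released based on pending work. Values here are placeholders.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "50")
      // An external shuffle service lets executors be removed without losing
      // shuffle data, which is why the design also discusses a
      // Kubernetes-specific shuffle service.
      .set("spark.shuffle.service.enabled", "true")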




The feature is being developed in a separate fork in order to minimize risk to 
the main project during development. Since the start of the development in 
November of 2016, it has received over 100 commits from over 20 contributors 
and supports two releases based on Spark 2.1 and 2.2 respectively. 
Documentation is also being actively worked on both in the main project 
repository and also in the repository 
https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use 
cases, we have seen cluster setup that uses 1000+ cores. We are also seeing 
growing interests on this project from more and more organizations.




While it is easy to bootstrap the project in a forked repository, it is hard to 
maintain it in the long run because of the tricky process of rebasing onto the 
upstream and lack of awareness in the large Spark com

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Andrew Ash
+1 (non-binding)

We're moving large amounts of infrastructure from a combination of open
source and homegrown cluster management systems to unify on Kubernetes and
want to bring Spark workloads along with us.

On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  wrote:

> +1 (non-binding)
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/SPIP-Spark-on-
> Kubernetes-tp22147p22164.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: SPIP: Spark on Kubernetes

2017-08-15 Thread liyinan926
+1 (non-binding)






Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Jiri Kremser
+1 (non-binding)

On Tue, Aug 15, 2017 at 10:19 PM, Shubham Chopra 
wrote:

> +1 (non-binding)
>
> ~Shubham.
>
> On Tue, Aug 15, 2017 at 2:11 PM, Erik Erlandson 
> wrote:
>
>>
>> Kubernetes has evolved into an important container orchestration
>> platform; it has a large and growing user base and an active ecosystem.
>> Users of Apache Spark who are also deploying applications on Kubernetes (or
>> are planning to) will have convergence-related motivations for migrating
>> their Spark applications to Kubernetes as well. It avoids the need for
>> deploying separate cluster infra for Spark workloads and allows Spark
>> applications to take full advantage of inhabiting the same orchestration
>> environment as other applications.  In this respect, native Kubernetes
>> support for Spark represents a way to optimize uptake and retention of
>> Apache Spark among the members of the expanding Kubernetes community.
>>
>> On Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan 
>>> wrote:
>>>
 Spark on Kubernetes effort has been developed separately in a fork, and
 linked back from the Apache Spark project as an experimental backend.
 We're ~6 months in, have had 5 releases.

- 2 Spark versions maintained (2.1, and 2.2)
- Extensive integration testing and refactoring efforts to maintain
code quality
- Developer and user-facing documentation
- 10+ consistent code contributors from different organizations involved
in actively maintaining and using the project, with several more members
involved in testing and providing feedback.
- The community has delivered several talks on Spark-on-Kubernetes
generating lots of feedback from users.
- In addition to these, we've seen efforts spawn off such as:
   - HDFS on Kubernetes with Locality and Performance Experiments
   - Kerberized access to HDFS from Spark running on Kubernetes

 *Following the SPIP process, I'm putting this SPIP up for a vote.*

- +1: Yeah, let's go forward and implement the SPIP.
- +0: Don't really care.
- -1: I don't think this is a good idea because of the following
technical reasons.

 If there is any further clarification desired, on the design or the
 implementation, please feel free to ask questions or provide feedback.


 SPIP: Kubernetes as A Native Cluster Manager

 Full Design Doc: link
 

 JIRA: https://issues.apache.org/jira/browse/SPARK-18278

 Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377

 Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
 Cheah,

 Ilan Filonenko, Sean Suchter, Kimoon Kim
 Background and Motivation

 Containerization and cluster management technologies are constantly
 evolving in the cluster computing world. Apache Spark currently implements
 support for Apache Hadoop YARN and Apache Mesos, in addition to providing
 its own standalone cluster manager. In 2014, Google announced development
 of Kubernetes  which has its own unique
 feature set and differentiates itself from YARN and Mesos. Since its debut,
 it has seen contributions from over 1300 contributors with over 5
 commits. Kubernetes has cemented itself as a core player in the cluster
 computing world, and cloud-computing providers such as Google Container
 Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azure
 support running Kubernetes clusters.

 This document outlines a proposal for integrating Apache Spark with
 Kubernetes in a first class way, adding Kubernetes to the list of cluster
 managers that Spark can be used with. Doing so would allow users to share
 their computing resources and containerization framework between their
 existing applications on Kubernetes and their computational Spark
 applications. Although there is existing support for running a Spark
 standalone cluster on Kubernetes
 

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Shubham Chopra
+1 (non-binding)

~Shubham.

On Tue, Aug 15, 2017 at 2:11 PM, Erik Erlandson  wrote:

>
> Kubernetes has evolved into an important container orchestration platform;
> it has a large and growing user base and an active ecosystem.  Users of
> Apache Spark who are also deploying applications on Kubernetes (or are
> planning to) will have convergence-related motivations for migrating their
> Spark applications to Kubernetes as well. It avoids the need for deploying
> separate cluster infra for Spark workloads and allows Spark applications to
> take full advantage of inhabiting the same orchestration environment as
> other applications.  In this respect, native Kubernetes support for Spark
> represents a way to optimize uptake and retention of Apache Spark among the
> members of the expanding Kubernetes community.
>
> On Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson 
> wrote:
>
>> +1 (non-binding)
>>
>>
>> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan 
>> wrote:
>>
>>> Spark on Kubernetes effort has been developed separately in a fork, and
>>> linked back from the Apache Spark project as an experimental backend.
>>> We're ~6 months in, have had 5 releases.
>>>
>>>- 2 Spark versions maintained (2.1, and 2.2)
>>>- Extensive integration testing and refactoring efforts to maintain
>>>code quality
>>>- Developer and user-facing documentation
>>>- 10+ consistent code contributors from different organizations involved
>>>in actively maintaining and using the project, with several more members
>>>involved in testing and providing feedback.
>>>- The community has delivered several talks on Spark-on-Kubernetes
>>>generating lots of feedback from users.
>>>- In addition to these, we've seen efforts spawn off such as:
>>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>>
>>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>>
>>>- +1: Yeah, let's go forward and implement the SPIP.
>>>- +0: Don't really care.
>>>- -1: I don't think this is a good idea because of the following
>>>technical reasons.
>>>
>>> If there is any further clarification desired, on the design or the
>>> implementation, please feel free to ask questions or provide feedback.
>>>
>>>
>>> SPIP: Kubernetes as A Native Cluster Manager
>>>
>>> Full Design Doc: link
>>> 
>>>
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>>
>>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>>
>>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>>> Cheah,
>>>
>>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>>> Background and Motivation
>>>
>>> Containerization and cluster management technologies are constantly
>>> evolving in the cluster computing world. Apache Spark currently implements
>>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>>> its own standalone cluster manager. In 2014, Google announced development
>>> of Kubernetes  which has its own unique feature
>>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>>> seen contributions from over 1300 contributors with over 5 commits.
>>> Kubernetes has cemented itself as a core player in the cluster computing
>>> world, and cloud-computing providers such as Google Container Engine,
>>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>>> running Kubernetes clusters.
>>>
>>> This document outlines a proposal for integrating Apache Spark with
>>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>>> managers that Spark can be used with. Doing so would allow users to share
>>> their computing resources and containerization framework between their
>>> existing applications on Kubernetes and their computational Spark
>>> applications. Although there is existing support for running a Spark
>>> standalone cluster on Kubernetes,
>>> there are still major advantages and significant interest in having native
>>> execution support. For example, this integration provides better support
>>> 

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Erik Erlandson
Kubernetes has evolved into an important container orchestration platform;
it has a large and growing user base and an active ecosystem.  Users of
Apache Spark who are also deploying applications on Kubernetes (or are
planning to) will have convergence-related motivations for migrating their
Spark applications to Kubernetes as well. It avoids the need for deploying
separate cluster infra for Spark workloads and allows Spark applications to
take full advantage of inhabiting the same orchestration environment as
other applications.  In this respect, native Kubernetes support for Spark
represents a way to optimize uptake and retention of Apache Spark among the
members of the expanding Kubernetes community.

On Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson  wrote:

> +1 (non-binding)
>
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan 
> wrote:
>
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend.
>> We're ~6 months in, have had 5 releases.
>>
>>- 2 Spark versions maintained (2.1, and 2.2)
>>- Extensive integration testing and refactoring efforts to maintain
>>code quality
>>- Developer and user-facing documentation
>>- 10+ consistent code contributors from different organizations involved
>>in actively maintaining and using the project, with several more members
>>involved in testing and providing feedback.
>>- The community has delivered several talks on Spark-on-Kubernetes
>>generating lots of feedback from users.
>>- In addition to these, we've seen efforts spawn off such as:
>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>- +1: Yeah, let's go forward and implement the SPIP.
>>- +0: Don't really care.
>>- -1: I don't think this is a good idea because of the following
>>technical reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>> Full Design Doc: link
>> 
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah,
>>
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes  which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors with over 5 commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes and their computational Spark
>> applications. Although there is existing support for running a Spark
>> standalone cluster on Kubernetes,
>> there are still major advantages and significant interest in having native
>> execution support. For example, this integration provides better support
>> for multi-tenancy and dynamic resource allocation. It also allows users to
>> run applications of different Spark versions of their choices in the same
>> cluster.
>>
>> The feature is being developed

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Daniel Imberman
+1 (non-binding)

Glad to see this moving forward :D

On Tue, Aug 15, 2017 at 10:10 AM Holden Karau  wrote:

> +1 (non-binding)
>
> I (personally) think that Kubernetes as a scheduler backend should
> eventually get merged in and there is clearly a community interested in the
> work required to maintain it.
>
> On Tue, Aug 15, 2017 at 9:51 AM William Benton  wrote:
>
>> +1 (non-binding)
>>
>> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
>> fox...@google.com.invalid> wrote:
>>
>>> Spark on Kubernetes effort has been developed separately in a fork, and
>>> linked back from the Apache Spark project as an experimental backend.
>>> We're ~6 months in, have had 5 releases.
>>>
>>>- 2 Spark versions maintained (2.1, and 2.2)
>>>- Extensive integration testing and refactoring efforts to maintain
>>>code quality
>>>- Developer and user-facing documentation
>>>- 10+ consistent code contributors from different organizations involved
>>>in actively maintaining and using the project, with several more members
>>>involved in testing and providing feedback.
>>>- The community has delivered several talks on Spark-on-Kubernetes
>>>generating lots of feedback from users.
>>>- In addition to these, we've seen efforts spawn off such as:
>>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>>
>>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>>
>>>- +1: Yeah, let's go forward and implement the SPIP.
>>>- +0: Don't really care.
>>>- -1: I don't think this is a good idea because of the following
>>>technical reasons.
>>>
>>> If there is any further clarification desired, on the design or the
>>> implementation, please feel free to ask questions or provide feedback.
>>>
>>>
>>> SPIP: Kubernetes as A Native Cluster Manager
>>>
>>> Full Design Doc: link
>>> 
>>>
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>>
>>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>>
>>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>>> Cheah,
>>>
>>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>>> Background and Motivation
>>>
>>> Containerization and cluster management technologies are constantly
>>> evolving in the cluster computing world. Apache Spark currently implements
>>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>>> its own standalone cluster manager. In 2014, Google announced development
>>> of Kubernetes  which has its own unique feature
>>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>>> seen contributions from over 1300 contributors with over 5 commits.
>>> Kubernetes has cemented itself as a core player in the cluster computing
>>> world, and cloud-computing providers such as Google Container Engine,
>>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>>> running Kubernetes clusters.
>>>
>>> This document outlines a proposal for integrating Apache Spark with
>>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>>> managers that Spark can be used with. Doing so would allow users to share
>>> their computing resources and containerization framework between their
>>> existing applications on Kubernetes and their computational Spark
>>> applications. Although there is existing support for running a Spark
>>> standalone cluster on Kubernetes,
>>> there are still major advantages and significant interest in having native
>>> execution support. For example, this integration provides better support
>>> for multi-tenancy and dynamic resource allocation. It also allows users to
>>> run applications of different Spark versions of their choices in the same
>>> cluster.
>>>
>>> The feature is being developed in a separate fork
>>>  in order to minimize
>>> risk to the main project during development. Since the start of the
>>> development in November of 2016, it has received over 100 commits from over
>>> 20 contributors and supports two rele

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Holden Karau
+1 (non-binding)

I (personally) think that Kubernetes as a scheduler backend should
eventually get merged in and there is clearly a community interested in the
work required to maintain it.

On Tue, Aug 15, 2017 at 9:51 AM William Benton  wrote:

> +1 (non-binding)
>
> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
> fox...@google.com.invalid> wrote:
>
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend.
>> We're ~6 months in, have had 5 releases.
>>
>>- 2 Spark versions maintained (2.1, and 2.2)
>>- Extensive integration testing and refactoring efforts to maintain
>>code quality
>>- Developer and user-facing documentation
>>- 10+ consistent code contributors from different organizations involved
>>in actively maintaining and using the project, with several more members
>>involved in testing and providing feedback.
>>- The community has delivered several talks on Spark-on-Kubernetes
>>generating lots of feedback from users.
>>- In addition to these, we've seen efforts spawn off such as:
>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>- +1: Yeah, let's go forward and implement the SPIP.
>>- +0: Don't really care.
>>- -1: I don't think this is a good idea because of the following
>>technical reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>> Full Design Doc: link
>> 
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah,
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes, which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors and tens of thousands of commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes and their computational Spark
>> applications. Although there is existing support for running a Spark
>> standalone cluster on Kubernetes,
>> there are still major advantages and significant interest in having native
>> execution support. For example, this integration provides better support
>> for multi-tenancy and dynamic resource allocation. It also allows users to
>> run applications of different Spark versions of their choice in the same
>> cluster.
>>
>> The feature is being developed in a separate fork
>>  in order to minimize risk
>> to the main project during development. Since the start of the development
>> in November of 2016, it has received over 100 commits from over 20
>> contributors and supports two releases based on Spark 2.1 and 2.2
>> respectively. Documentation is also being actively worked on both in the
>> main project repository and also in the repository
>> https://github.com/apache-spark-on-k8s/userdocs. R

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread William Benton
+1 (non-binding)

On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
fox...@google.com.invalid> wrote:

> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend.
> We're ~6 months in, have had 5 releases.
>
>- 2 Spark versions maintained (2.1, and 2.2)
>- Extensive integration testing and refactoring efforts to maintain
>code quality
>- Developer and user-facing documentation
>- 10+ consistent code contributors from different organizations involved
>in actively maintaining and using the project, with several more members
>involved in testing and providing feedback.
>- The community has delivered several talks on Spark-on-Kubernetes
>generating lots of feedback from users.
>- In addition to these, we've seen efforts spawn off such as:
>   - HDFS on Kubernetes with Locality and Performance Experiments
>   - Kerberized access to HDFS from Spark running on Kubernetes
>
> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>
>- +1: Yeah, let's go forward and implement the SPIP.
>- +0: Don't really care.
>- -1: I don't think this is a good idea because of the following
>technical reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
> Full Design Doc: link
> 
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah,
> Ilan Filonenko, Sean Suchter, Kimoon Kim
> Background and Motivation
>
> Containerization and cluster management technologies are constantly
> evolving in the cluster computing world. Apache Spark currently implements
> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
> its own standalone cluster manager. In 2014, Google announced development
> of Kubernetes, which has its own unique feature
> set and differentiates itself from YARN and Mesos. Since its debut, it has
> seen contributions from over 1300 contributors and tens of thousands of commits.
> Kubernetes has cemented itself as a core player in the cluster computing
> world, and cloud-computing providers such as Google Container Engine,
> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
> running Kubernetes clusters.
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes,
> there are still major advantages and significant interest in having native
> execution support. For example, this integration provides better support
> for multi-tenancy and dynamic resource allocation. It also allows users to
> run applications of different Spark versions of their choice in the same
> cluster.
>
> The feature is being developed in a separate fork
>  in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use
> cases, we have seen cluster setups that use 1000+ cores. We are also seeing
> growing interest in this project from more and more organizations.
>
> While it is easy to bootstrap the project in a forked repository, it is
> hard to maintain it in the long run because of the tricky process of
> rebasing onto the upstream 

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Timothy Chen
+1 (non-binding)

Tim

On Tue, Aug 15, 2017 at 9:20 AM, Kimoon Kim  wrote:
> +1 (non-binding)
>
> Thanks,
> Kimoon
>
> On Tue, Aug 15, 2017 at 9:19 AM, Sean Suchter 
> wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>>
>




Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Kimoon Kim
+1 (non-binding)

Thanks,
Kimoon

On Tue, Aug 15, 2017 at 9:19 AM, Sean Suchter 
wrote:

> +1 (non-binding)
>
>
>
>
>


Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Sean Suchter
+1 (non-binding)






Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Erik Erlandson
+1 (non-binding)

On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan 
wrote:

> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend.
> We're ~6 months in, have had 5 releases.
>
>- 2 Spark versions maintained (2.1, and 2.2)
>- Extensive integration testing and refactoring efforts to maintain
>code quality
>- Developer and user-facing documentation
>- 10+ consistent code contributors from different organizations involved
>in actively maintaining and using the project, with several more members
>involved in testing and providing feedback.
>- The community has delivered several talks on Spark-on-Kubernetes
>generating lots of feedback from users.
>- In addition to these, we've seen efforts spawn off such as:
>   - HDFS on Kubernetes with Locality and Performance Experiments
>   - Kerberized access to HDFS from Spark running on Kubernetes
>
> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>
>- +1: Yeah, let's go forward and implement the SPIP.
>- +0: Don't really care.
>- -1: I don't think this is a good idea because of the following
>technical reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
> Full Design Doc: link
> 
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah,
> Ilan Filonenko, Sean Suchter, Kimoon Kim
> Background and Motivation
>
> Containerization and cluster management technologies are constantly
> evolving in the cluster computing world. Apache Spark currently implements
> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
> its own standalone cluster manager. In 2014, Google announced development
> of Kubernetes, which has its own unique feature
> set and differentiates itself from YARN and Mesos. Since its debut, it has
> seen contributions from over 1300 contributors and tens of thousands of commits.
> Kubernetes has cemented itself as a core player in the cluster computing
> world, and cloud-computing providers such as Google Container Engine,
> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
> running Kubernetes clusters.
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes,
> there are still major advantages and significant interest in having native
> execution support. For example, this integration provides better support
> for multi-tenancy and dynamic resource allocation. It also allows users to
> run applications of different Spark versions of their choice in the same
> cluster.
>
> The feature is being developed in a separate fork
>  in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use
> cases, we have seen cluster setups that use 1000+ cores. We are also seeing
> growing interest in this project from more and more organizations.
>
> While it is easy to bootstrap the project in a forked repository, it is
> hard to maintain it in the long run because of the tricky process of
> rebasing onto the upstream and lack of awareness in the