[VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
Hi all,

Following the SPIP process, I'm putting this SPIP up for a vote.

The current data source API doesn't work well because of several limitations:
no partitioning/bucketing support, no columnar read, and difficulty supporting
more operator push-down, etc.

I'm proposing a Data Source API V2 to address these problems, please read
the full document at
https://issues.apache.org/jira/secure/attachment/12882332/SPIP%20Data%20Source%20API%20V2.pdf

Since this SPIP is mostly about APIs, I also created a prototype and put
Javadocs on these interfaces, so that it's easier to review and discuss
them: https://github.com/cloud-fan/spark/pull/10/files
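The "operator push down" style the proposal describes is usually expressed as optional mix-in interfaces that a source may implement, letting the engine push column pruning and filters down to the source. The sketch below only illustrates that general pattern; every name in it is hypothetical, not the actual interfaces in the SPIP document or prototype PR:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Base reader: every source returns rows.
interface Reader {
    List<Map<String, Object>> read();
}

// Optional mix-ins: the engine checks `instanceof` and pushes work down only
// when the source supports it.
interface SupportsColumnPruning {
    void pruneColumns(List<String> required);
}

interface SupportsFilterPushDown {
    void pushFilter(String column, Predicate<Object> predicate);
}

// Toy in-memory source implementing both optional capabilities.
class InMemoryReader implements Reader, SupportsColumnPruning, SupportsFilterPushDown {
    private final List<Map<String, Object>> rows;
    private List<String> required;          // null means all columns
    private String filterColumn;
    private Predicate<Object> predicate;

    InMemoryReader(List<Map<String, Object>> rows) { this.rows = rows; }

    @Override public void pruneColumns(List<String> required) { this.required = required; }

    @Override public void pushFilter(String column, Predicate<Object> predicate) {
        this.filterColumn = column;
        this.predicate = predicate;
    }

    @Override public List<Map<String, Object>> read() {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            // The filter runs at the source, before any row reaches the engine.
            if (predicate != null && !predicate.test(row.get(filterColumn))) continue;
            Map<String, Object> projected = new LinkedHashMap<>();
            for (String col : required != null ? required : new ArrayList<>(row.keySet()))
                projected.put(col, row.get(col));
            out.add(projected);
        }
        return out;
    }
}

public class DataSourceV2Sketch {
    public static void main(String[] args) {
        InMemoryReader reader = new InMemoryReader(List.of(
            Map.<String, Object>of("id", 1, "name", "a", "age", 30),
            Map.<String, Object>of("id", 2, "name", "b", "age", 20)));
        reader.pushFilter("age", v -> (Integer) v >= 25);  // WHERE age >= 25
        reader.pruneColumns(List.of("name"));              // SELECT name
        System.out.println(reader.read());                 // [{name=a}]
    }
}
```

The mix-in approach means new capabilities (bucketing, columnar read, more push-downs) can be added as new interfaces without breaking existing sources.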

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following
technical reasons.

Thanks!


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
adding my own +1 (binding)

On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan  wrote:



Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Matei Zaharia
+1 from me as well.

Matei

> On Aug 17, 2017, at 10:55 AM, Reynold Xin  wrote:
> 
> +1 on adding Kubernetes support in Spark (as a separate module similar to how 
> YARN is done)
> 
> I talk with a lot of developers and teams that operate cloud services, and 
> k8s in the last year has definitely become one of the key projects, if not 
> the one with the strongest momentum in this space. I'm not 100% sure we can 
> make it into 2.3 but IMO based on the activities in the forked repo and 
> claims that certain deployments are already running in production, this could 
> already be a solid project and will have everlasting positive impact.
> 
> 
> 
> On Wed, Aug 16, 2017 at 10:24 AM, Alexander Bezzubov  wrote:
> +1 (non-binding)
> 
> 
> Looking forward to using it as part of an Apache Spark release, instead of a
> Standalone cluster deployed on top of k8s.
> 
> 
> --
> Alex
> 
> On Wed, Aug 16, 2017 at 11:11 AM, Ismaël Mejía  wrote:
> +1 (non-binding)
> 
> This is something really great to have. More schedulers and runtime
> environments are a HUGE win for the Spark ecosystem.
> Amazing work, and big kudos to those who created and continue to work on this.
> 
> On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
>  wrote:
> > From our perspective, we have invested heavily in Kubernetes as our cluster
> > manager of choice.
> >
> > We also make quite heavy use of spark.  We've been experimenting with using
> > these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
> > already 'paid the price' to operate Kubernetes in AWS it seems rational to
> > move our jobs over to spark on k8s.  Having this project merged into the
> > master will significantly ease keeping our Data Munging toolchain primarily
> > on Spark.
> >
> >
> > Gary Lucas
> > Data Ops Team Lead
> > Unbounce
> >
> > On 15 August 2017 at 15:52, Andrew Ash  wrote:
> >>
> >> +1 (non-binding)
> >>
> >> We're moving large amounts of infrastructure from a combination of open
> >> source and homegrown cluster management systems to unify on Kubernetes and
> >> want to bring Spark workloads along with us.
> >>
> >> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926  wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22164.html
> >>> Sent from the Apache Spark Developers List mailing list archive at
> >>> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>
> >
> 
> 
> 
> 





Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Marcelo Vanzin
I have just some very high level knowledge of kubernetes, so I can't
really comment on the details of the proposal that relate to it. But I
have some comments about other areas of the linked documents:

- It's good to know that there's a community behind this effort and
mentions of lots of testing. As Reynold mentioned on jira, this is a
part of Spark that needs very good testing. Even YARN doesn't have
comprehensive testing built into the Spark test suite; it mostly
relies on the fact that a lot of Spark developers use YARN, so we
get test coverage for things like security.

- The "Resource Staging Server" is something that can be useful also
for standalone and Mesos (YARN has its own thing). It would be nice to
keep it generic enough that it could be used or embedded in other
cluster managers.

- It would be good to get more details about the security model here;
how do applications authenticate to the RSS above, how
are shared secrets distributed (so you can set up encryption securely
for individual Spark apps), things like that.

- Same concerns above apply to the kubernetes-specific shuffle
service; I believe the base shuffle service today doesn't have very
strong security (single shared secret that all apps need to know about IIRC),
and only the YARN implementation has proper application isolation.

- I see there's some talk about accessing Kerberos-secured services
(explicitly mentions HDFS but I'll treat it as "generic
Hadoop+Kerberos" support). There's already ongoing effort to make
Spark-on-Mesos support Kerberos (SPARK-16742), which has been going on
by mostly making the existing YARN Kerberos integration more generic.
It would be good if this project followed that instead of trying to
create its own way of dealing with Kerberos.

- I was hoping that as part of this we'd see some effort into
modularizing SparkSubmit somehow; that's a pretty hairy piece of code
to navigate, and adding more cluster manager-specific code will
probably not make that better.

That being said, I don't see any of those as blockers for an initial
version. So adding my +1.
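The per-application authentication Marcelo asks about (apps proving identity to the Resource Staging Server without a single cluster-wide secret) is conventionally done with a challenge-response handshake keyed by a per-app shared secret. The sketch below illustrates only that generic pattern; it is not the RSS design, and all names in it are hypothetical:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;

public class PerAppAuthSketch {
    // HMAC-SHA256 over the server's challenge, keyed by the application's secret.
    static String respond(byte[] appSecret, String challenge) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(appSecret, "HmacSHA256"));
        return Base64.getEncoder()
                .encodeToString(mac.doFinal(challenge.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // A secret generated per application, rather than one cluster-wide
        // secret that every app must know (the concern raised above).
        SecureRandom rng = new SecureRandom();
        byte[] appSecret = new byte[32];
        rng.nextBytes(appSecret);

        // Server sends a fresh random challenge; the client proves it knows the
        // secret without ever putting the secret itself on the wire.
        byte[] nonce = new byte[16];
        rng.nextBytes(nonce);
        String challenge = Base64.getEncoder().encodeToString(nonce);
        String clientProof = respond(appSecret, challenge);

        // Server recomputes the expected proof and compares.
        boolean ok = respond(appSecret, challenge).equals(clientProof);
        System.out.println("authenticated=" + ok);   // authenticated=true
    }
}
```

Because the proof is bound to a fresh nonce, a captured response cannot be replayed, and because each app has its own secret, one compromised app does not expose the others.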


On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
 wrote:
> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend. We're
> ~6 months in, have had 5 releases.
>
> - 2 Spark versions maintained (2.1 and 2.2)
> - Extensive integration testing and refactoring efforts to maintain code
>   quality
> - Developer- and user-facing documentation
> - 10+ consistent code contributors from different organizations involved in
>   actively maintaining and using the project, with several more members
>   involved in testing and providing feedback
> - Several talks delivered by the community on Spark-on-Kubernetes, generating
>   lots of feedback from users
>
> In addition to these, we've seen efforts spawn off such as:
>
> - HDFS on Kubernetes with Locality and Performance Experiments
> - Kerberized access to HDFS from Spark running on Kubernetes
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
>
> Full Design Doc: link
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
>
> Background and Motivation
>
> Containerization and cluster management technologies are constantly evolving
> in the cluster computing world. Apache Spark currently implements support
> for Apache Hadoop YARN and Apache Mesos, in addition to providing its own
> standalone cluster manager. In 2014, Google announced development of
> Kubernetes which has its own unique feature set and differentiates itself
> from YARN and Mesos. Since its debut, it has seen contributions from over
> 1300 contributors with over 5 commits. Kubernetes has cemented itself as
> a core player in the cluster computing world, and cloud-computing providers
> such as Google Container Engine, Google Compute Engine, Amazon Web Services,
> and Microsoft Azure support running Kubernetes clusters.
>
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone 

Re: SPIP: Spark on Kubernetes

2017-08-17 Thread michael mccune

+1 (non-binding)

peace o/




Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Chris Fregly
@reynold:

Databricks runs their proprietary product on Kubernetes. How about
contributing some of that work back to the open source community?

—

Chris Fregly
Founder and Research Engineer @ PipelineAI 
Founder @ Advanced Spark and TensorFlow Meetup 

San Francisco - Chicago - Washington DC - London

> On Aug 17, 2017, at 10:55 AM, Reynold Xin  wrote:
> 
> +1 on adding Kubernetes support in Spark (as a separate module similar to how 
> YARN is done)
> 
> I talk with a lot of developers and teams that operate cloud services, and 
> k8s in the last year has definitely become one of the key projects, if not 
> the one with the strongest momentum in this space. I'm not 100% sure we can 
> make it into 2.3 but IMO based on the activities in the forked repo and 
> claims that certain deployments are already running in production, this could 
> already be a solid project and will have everlasting positive impact.
> 
> 
> 



Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Mridul Muralidharan
While I definitely support the idea of Apache Spark being able to
leverage Kubernetes, IMO it is better for the long-term evolution of Spark
to expose an appropriate SPI such that this support need not necessarily
live within the Apache Spark code base.
That would allow multiple backends to evolve, decoupled from Spark core.
In this case, it would have made maintaining the apache-spark-on-k8s repo
easier, just as it would allow for supporting other backends, both open
source (Nomad, for example) and proprietary.

In retrospect, directly integrating YARN support into Spark, while
mirroring the Mesos support at that time, was probably an incorrect design
choice on my part.


Regards,
Mridul
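The SPI Mridul describes typically keys backend selection off the master URL, so out-of-tree cluster managers can plug in without touching Spark core. Spark does already ship a related `ExternalClusterManager` trait discovered via `java.util.ServiceLoader`; the toy sketch below only illustrates the general shape of such an SPI, with hypothetical names rather than Spark's actual interface:

```java
import java.util.List;

// Hypothetical SPI: each backend declares which master URLs it can handle.
interface ClusterManagerBackend {
    boolean canCreate(String masterUrl);
    String name();
}

class KubernetesBackend implements ClusterManagerBackend {
    public boolean canCreate(String masterUrl) { return masterUrl.startsWith("k8s://"); }
    public String name() { return "kubernetes"; }
}

class NomadBackend implements ClusterManagerBackend {
    public boolean canCreate(String masterUrl) { return masterUrl.startsWith("nomad://"); }
    public String name() { return "nomad"; }
}

public class ClusterManagerSpiSketch {
    // In a real SPI the list would come from
    // ServiceLoader.load(ClusterManagerBackend.class), so out-of-tree backends
    // register themselves via a META-INF/services entry in their own jar.
    static ClusterManagerBackend select(List<ClusterManagerBackend> backends, String masterUrl) {
        return backends.stream()
                .filter(b -> b.canCreate(masterUrl))
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("no backend for " + masterUrl));
    }

    public static void main(String[] args) {
        List<ClusterManagerBackend> backends =
                List.of(new KubernetesBackend(), new NomadBackend());
        System.out.println(select(backends, "k8s://https://apiserver:6443").name()); // kubernetes
    }
}
```

With this shape, a backend like the apache-spark-on-k8s fork could live in its own repository and still be picked up at runtime from the classpath.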

On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
 wrote:
> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend. We're
> ~6 months in, have had 5 releases.
>
> - 2 Spark versions maintained (2.1 and 2.2)
> - Extensive integration testing and refactoring efforts to maintain code
>   quality
> - Developer- and user-facing documentation
> - 10+ consistent code contributors from different organizations involved in
>   actively maintaining and using the project, with several more members
>   involved in testing and providing feedback
> - Several talks delivered by the community on Spark-on-Kubernetes, generating
>   lots of feedback from users
>
> In addition to these, we've seen efforts spawn off such as:
>
> - HDFS on Kubernetes with Locality and Performance Experiments
> - Kerberized access to HDFS from Spark running on Kubernetes
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
>
> Full Design Doc: link
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
>
> Background and Motivation
>
> Containerization and cluster management technologies are constantly evolving
> in the cluster computing world. Apache Spark currently implements support
> for Apache Hadoop YARN and Apache Mesos, in addition to providing its own
> standalone cluster manager. In 2014, Google announced development of
> Kubernetes which has its own unique feature set and differentiates itself
> from YARN and Mesos. Since its debut, it has seen contributions from over
> 1300 contributors with over 5 commits. Kubernetes has cemented itself as
> a core player in the cluster computing world, and cloud-computing providers
> such as Google Container Engine, Google Compute Engine, Amazon Web Services,
> and Microsoft Azure support running Kubernetes clusters.
>
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes, there are still major advantages and
> significant interest in having native execution support. For example, this
> integration provides better support for multi-tenancy and dynamic resource
> allocation. It also allows users to run applications of different Spark
> versions of their choices in the same cluster.
>
>
> The feature is being developed in a separate fork in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use
> cases, we have seen cluster setup that uses 1000+ cores. We are also seeing
> growing interests on this project from more and more organizations.
>
>
> While it is easy to bootstrap the project in a forked repository, it is hard
> to maintain it in the long run because of the tricky process of rebasing
> onto the upstream and lack of awareness in the large Spark community. It
> would be beneficial to both the Spark and Kubernetes community seeing this
> feature being merged upstream. On one hand, it gives Spark users the option
> of running their Spark workloads along with other workloads 

Fwd: SPIP: Spark on Kubernetes

2017-08-17 Thread Timothy Chen
-- Forwarded message --
From: Timothy Chen 
Date: Thu, Aug 17, 2017 at 2:48 PM
Subject: Re: SPIP: Spark on Kubernetes
To: Marcelo Vanzin 


Hi Marcelo,

Agree with your points. I had the same thought about the Resource
Staging Server, and would like to share it with Spark on Mesos (once
this is, or can be, merged).

On your last point, I would love to see a more concerted effort to
abstract out the cluster manager more cleanly in Spark after this
SPIP.

Tim

On Thu, Aug 17, 2017 at 2:40 PM, Marcelo Vanzin  wrote:

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
Sorry, let's remove the VOTE tag; I just wanted to bring this up for
discussion.

I'll restart the voting process after we have enough discussion on the JIRA
ticket or here in this email thread.



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Russell Spitzer
-1. I don't think there has really been any discussion of this API change
yet, or at least it hasn't occurred on the JIRA ticket.



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Reynold Xin
Yeah, I don't think it's a good idea to upload a doc and then immediately
call for a vote. People need time to digest ...




Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Mark Hamstra
Points 2, 3 and 4 of the Project Plan in that document (i.e. "port existing
data sources using internal APIs to use the proposed public Data Source V2
API") have my full support. Really, I'd like to see that dog-fooding effort
completed and the lessons learned from it fully digested before we remove any
unstable annotations from the new API. It's okay to get a proposal out
there so that we can talk about it and start implementing and using it
internally, followed by external use under the unstable annotations, but I
don't want to see a premature vote on a final form of a new public API.
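The "unstable annotations" mentioned here follow a simple, well-known pattern (Spark's own variant lives in its InterfaceStability annotations): a marker annotation on evolving API types that tooling can flag at build time. A minimal self-contained sketch of the pattern, with a hypothetical annotation name rather than Spark's actual one:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker: consumers are warned that anything carrying this
// annotation may change incompatibly in any release until it is removed.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface Unstable {}

// An evolving public API surface, marked unstable while it is dog-fooded.
@Unstable
interface DataSourceV2Interface {
}

public class StabilityAnnotationSketch {
    public static void main(String[] args) {
        // Build tooling or a lint rule can reflect on this to warn users
        // depending on not-yet-final APIs.
        boolean unstable = DataSourceV2Interface.class.isAnnotationPresent(Unstable.class);
        System.out.println("unstable=" + unstable);  // unstable=true
    }
}
```

Removing the annotation later is then the explicit, reviewable act of declaring the API final, which is exactly the step Mark argues should wait until the dog-fooding is done.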



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread 蒋星博
+1 (non-binding)

On Thu, Aug 17, 2017 at 9:05 PM, Wenchen Fan wrote:



Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Reynold Xin
+1 on adding Kubernetes support in Spark (as a separate module similar to
how YARN is done)

I talk with a lot of developers and teams that operate cloud services, and
k8s in the last year has definitely become one of the key projects, if not
the one with the strongest momentum in this space. I'm not 100% sure we can
make it into 2.3 but IMO based on the activities in the forked repo and
claims that certain deployments are already running in production, this
could already be a solid project and will have everlasting positive impact.


