Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread Wenchen Fan
> Ideally also getting sort orders _after_ getting filters.

Yea we should have a deterministic order when applying various push downs,
and I think filter should definitely go before sort. This is one of the
details we can discuss during PR review :)
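
For illustration, a minimal sketch of what a fixed push-down order could look
like on the planner side. The trait and method names here are hypothetical, not
the actual proposal; the point is only that all filters arrive in one call,
before any sort order is offered:

    import org.apache.spark.sql.sources.Filter
    import org.apache.spark.sql.catalyst.expressions.SortOrder

    // Hypothetical capability mix-ins, for illustration only.
    trait SupportsFilterPushDown {
      // all candidate filters arrive in one call; the source returns those it cannot handle
      def pushFilters(filters: Array[Filter]): Array[Filter]
    }

    trait SupportsSortPushDown {
      // offered only after filters have been pushed
      def pushSortOrders(orders: Seq[SortOrder]): Boolean
    }

    // Sketch of a planner-side helper that always pushes filters first, then sort.
    def applyPushDowns(reader: AnyRef,
                       filters: Array[Filter],
                       orders: Seq[SortOrder]): Array[Filter] = {
      // 1. filters
      val residual = reader match {
        case r: SupportsFilterPushDown => r.pushFilters(filters)
        case _ => filters
      }
      // 2. desired sort order
      reader match {
        case r: SupportsSortPushDown => r.pushSortOrders(orders)
        case _ => // Spark sorts after the scan
      }
      residual // Spark still evaluates these on top of the scan
    }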

On Fri, Sep 1, 2017 at 9:19 AM, James Baker  wrote:

> I think that makes sense. I didn't understand backcompat was the primary
> driver. I actually don't care right now about aggregations on the
> datasource I'm integrating with - I just care about receiving all the
> filters (and ideally also the desired sort order) at the same time. I am
> mostly fine with anything else; but getting filters at the same time is
> important for me, and doesn't seem overly contentious? (e.g. it's
> compatible with datasources v1). Ideally also getting sort orders _after_
> getting filters.
>
> That said, an unstable api that gets me the query plan would be
> appreciated by plenty I'm sure :) (and would make my implementation more
> straightforward - the state management is painful atm).
>
> James
>
> On Wed, 30 Aug 2017 at 14:56 Reynold Xin  wrote:
>
>> Sure that's good to do (and as discussed earlier a good compromise might
>> be to expose an interface for the source to decide which part of the
>> logical plan they want to accept).
>>
>> To me everything is about cost vs benefit.
>>
>> In my mind, the biggest issue with the existing data source API is
>> backward and forward compatibility. All the data sources written for Spark
>> 1.x broke in Spark 2.x. And that's one of the biggest values v2 can bring.
>> To me it's far more important to have data sources implemented in 2017 to
>> be able to work in 2027, in Spark 10.x.
>>
>> You are basically arguing for creating a new API that is capable of doing
>> arbitrary expression, aggregation, and join pushdowns (you only mentioned
>> aggregation so far, but I've talked to enough database people that I know
>> once Spark gives them aggregation pushdown, they will come back for join
>> pushdown). We can do that using unstable APIs, and creating stable APIs
>> would be extremely difficult (still doable, just would take a long time to
>> design and implement). As mentioned earlier, it basically involves creating
>> a stable representation for all of the logical plan, which is a lot of work. I
>> think we should still work towards that (for other reasons as well), but
>> I'd consider that out of scope for the current one. Otherwise we probably
>> wouldn't release something for the next 2 or 3 years.
>>
>>
>>
>>
>>
>> On Wed, Aug 30, 2017 at 11:50 PM, James Baker 
>> wrote:
>>
>>> I guess I was more suggesting that by coding up the powerful mode as the
>>> API, it becomes easy for someone to layer an easy mode beneath it to enable
>>> simpler datasources to be integrated (and that simple mode should be the
>>> out of scope thing).
>>>
>>> Taking a small step back here, one of the places where I think I'm
>>> missing some context is in understanding the target consumers of these
>>> interfaces. I've done some amount (though likely not enough) of research
>>> about the places where people have had issues of API surface in the past -
>>> the concrete tickets I've seen have been based on Cassandra integration
>>> where you want to indicate clustering, and SAP HANA where they want to push
>>> down more complicated queries through Spark. This proposal supports the
>>> former, but the amount of change required to support clustering in the
>>> current API is not obviously high - whilst the current proposal for V2
>>> seems to make it very difficult to add support for pushing down plenty of
>>> aggregations in the future (I've found the question of how to add GROUP BY
>>> to be pretty tricky to answer for the current proposal).
>>>
>>> Googling around for implementations of the current PrunedFilteredScan, I
>>> basically find a lot of databases, which seems reasonable - SAP HANA,
>>> ElasticSearch, Solr, MongoDB, Apache Phoenix, etc. I've talked to people
>>> who've used (some of) these connectors and the sticking point has generally
>>> been that Spark needs to load a lot of data out in order to solve
>>> aggregations that can be very efficiently pushed down into the datasources.
>>>
>>> So, with this proposal it appears that we're optimising towards making
>>> it easy to write one-off datasource integrations, with some amount of
>>> pluggability for people who want to do more complicated things (the most
>>> interesting being bucketing integration). However, my guess is that this
>>> isn't what the current major integrations suffer from; they suffer mostly
>>> from restrictions in what they can push down (which broadly speaking are
>>> not going to go away).
>>>
>>> So the place where I'm confused is that the current integrations can be
>>> made incrementally better as a consequence of this, but the backing data
>>> systems have the features which enable a step change which this API 

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread James Baker
I think that makes sense. I didn't understand backcompat was the primary 
driver. I actually don't care right now about aggregations on the datasource 
I'm integrating with - I just care about receiving all the filters (and ideally 
also the desired sort order) at the same time. I am mostly fine with anything 
else; but getting filters at the same time is important for me, and doesn't 
seem overly contentious? (e.g. it's compatible with datasources v1). Ideally 
also getting sort orders _after_ getting filters.

That said, an unstable api that gets me the query plan would be appreciated by 
plenty I'm sure :) (and would make my implementation more straightforward - the 
state management is painful atm).

James

On Wed, 30 Aug 2017 at 14:56 Reynold Xin 
> wrote:
Sure that's good to do (and as discussed earlier a good compromise might be to 
expose an interface for the source to decide which part of the logical plan 
they want to accept).

To me everything is about cost vs benefit.

In my mind, the biggest issue with the existing data source API is backward and 
forward compatibility. All the data sources written for Spark 1.x broke in 
Spark 2.x. And that's one of the biggest values v2 can bring. To me it's far 
more important to have data sources implemented in 2017 to be able to work in 
2027, in Spark 10.x.

You are basically arguing for creating a new API that is capable of doing 
arbitrary expression, aggregation, and join pushdowns (you only mentioned 
aggregation so far, but I've talked to enough database people that I know once 
Spark gives them aggregation pushdown, they will come back for join pushdown). 
We can do that using unstable APIs, and creating stable APIs would be extremely 
difficult (still doable, just would take a long time to design and implement). 
As mentioned earlier, it basically involves creating a stable representation 
for all of the logical plan, which is a lot of work. I think we should still work 
towards that (for other reasons as well), but I'd consider that out of scope 
for the current one. Otherwise we probably wouldn't release something for the 
next 2 or 3 years.
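
As a very rough illustration, an unstable plan push-down hook could be as small
as the sketch below. The trait name and shape are hypothetical; note that it
leaks catalyst's internal LogicalPlan type, which is exactly why it could only
ever be offered as an unstable interface:

    import org.apache.spark.annotation.InterfaceStability
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // Hypothetical: the source consumes whatever part of the plan it understands
    // (filters, aggregates, joins, ...) and returns the remainder for Spark to run.
    @InterfaceStability.Unstable
    trait SupportsCatalystPlanPushDown {
      def pushPlan(plan: LogicalPlan): LogicalPlan
    }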





On Wed, Aug 30, 2017 at 11:50 PM, James Baker 
> wrote:
I guess I was more suggesting that by coding up the powerful mode as the API, 
it becomes easy for someone to layer an easy mode beneath it to enable simpler 
datasources to be integrated (and that simple mode should be the out of scope 
thing).

Taking a small step back here, one of the places where I think I'm missing some 
context is in understanding the target consumers of these interfaces. I've done 
some amount (though likely not enough) of research about the places where 
people have had issues of API surface in the past - the concrete tickets I've 
seen have been based on Cassandra integration where you want to indicate 
clustering, and SAP HANA where they want to push down more complicated queries 
through Spark. This proposal supports the former, but the amount of change 
required to support clustering in the current API is not obviously high - 
whilst the current proposal for V2 seems to make it very difficult to add 
support for pushing down plenty of aggregations in the future (I've found the 
question of how to add GROUP BY to be pretty tricky to answer for the current 
proposal).

Googling around for implementations of the current PrunedFilteredScan, I 
basically find a lot of databases, which seems reasonable - SAP HANA, 
ElasticSearch, Solr, MongoDB, Apache Phoenix, etc. I've talked to people who've 
used (some of) these connectors and the sticking point has generally been that 
Spark needs to load a lot of data out in order to solve aggregations that can 
be very efficiently pushed down into the datasources.
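
For reference, the shape those connectors implement today is roughly the V1
PrunedFilteredScan contract sketched below (the class name and body are
illustrative; the buildScan signature is the one in org.apache.spark.sql.sources).
Pruned columns and simple filters reach the source, but nothing about aggregates,
so a GROUP BY still pulls the raw rows back into Spark:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    // Illustrative V1 relation: it is told which columns to read and which simple
    // filters it may push, and returns raw rows for Spark to finish the query.
    class ExampleRelation(val sqlContext: SQLContext, val schema: StructType)
        extends BaseRelation with PrunedFilteredScan {

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // translate `filters` into the remote system's query language and scan;
        // anything unsupported is simply re-evaluated by Spark on top
        ???
      }
    }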

So, with this proposal it appears that we're optimising towards making it easy 
to write one-off datasource integrations, with some amount of pluggability for 
people who want to do more complicated things (the most interesting being 
bucketing integration). However, my guess is that this isn't what the current 
major integrations suffer from; they suffer mostly from restrictions in what 
they can push down (which broadly speaking are not going to go away).

So the place where I'm confused is that the current integrations can be made 
incrementally better as a consequence of this, but the backing data systems 
have the features which enable a step change which this API makes harder to 
achieve in the future. Who are the group of users who benefit the most as a 
consequence of this change, like, who is the target consumer here? My personal 
slant is that it's more important to improve support for other datastores than 
it is to lower the barrier of entry - this is why I've been pushing here.

James

On Wed, 30 Aug 2017 at 09:37 Ryan Blue 
> wrote:

-1 (non-binding)

Sometimes it takes a VOTE 

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-31 Thread Xiao Li
Congratulations!

Xiao

2017-08-31 9:38 GMT-07:00 Imran Rashid :

> Congrats Jerry!
>
> On Mon, Aug 28, 2017 at 8:28 PM, Matei Zaharia 
> wrote:
>
>> Hi everyone,
>>
>> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai
>> has been contributing to many areas of the project for a long time, so it’s
>> great to see him join. Join me in thanking and congratulating him!
>>
>> Matei
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread Wenchen Fan
Hi Ryan,

I think for a SPIP, we should not worry too much about details, as we can
discuss them during PR review after the vote passes.

I think we should focus more on the overall design, like James did. The
interface mix-in vs. plan push down discussion was great; I hope we can get a
consensus on this topic soon. The current proposal is: we keep the
interface mix-in framework, and add an unstable plan push down trait.
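
To make the distinction concrete, here is an illustrative (not actual) sketch of
the mix-in style from a connector's point of view: capabilities are optional
traits, a simple source implements only what it supports, and a plan push-down
hook would just be one more trait that can be marked unstable without touching
the rest. All names below are hypothetical:

    import org.apache.spark.sql.sources.Filter
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    // Hypothetical reader-side traits, named loosely after the ideas under discussion.
    trait DataReader {
      def readSchema(): StructType
    }
    trait SupportsPushDownRequiredColumns {
      def pruneColumns(required: StructType): Unit
    }
    trait SupportsPushDownFilters {
      def pushFilters(filters: Array[Filter]): Array[Filter]
    }

    // A simple source opts into column pruning only; filters, sort, or an unstable
    // plan push-down trait could be mixed in later without breaking this code.
    class SimpleReader extends DataReader with SupportsPushDownRequiredColumns {
      private var projection = new StructType().add("id", LongType).add("name", StringType)
      override def readSchema(): StructType = projection
      override def pruneColumns(required: StructType): Unit = { projection = required }
    }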

For details like interface names, sort push down vs sort propagate, etc., I
think they should not block the vote, as they can be updated/improved
within the current interface mix-in framework.

About separating read/write proposals, we should definitely send individual
PRs for read/write when developing data source v2. I'm also OK with voting
on the read side first. The write side is way simpler than the read side, so I
think it's more important to get agreement on the read side first.

BTW, I do appreciate your feedback/comments on the prototype; let's keep
the discussion there. In the meantime, let's have more discussion on the
overall framework, and drive this project together.

Wenchen



On Thu, Aug 31, 2017 at 6:22 AM, Ryan Blue  wrote:

> Maybe I'm missing something, but the high-level proposal consists of:
> Goals, Non-Goals, and Proposed API. What is there to discuss other than the
> details of the API that's being proposed? I think the goals make sense, but
> goals alone aren't enough to approve a SPIP.
>
> On Wed, Aug 30, 2017 at 2:46 PM, Reynold Xin  wrote:
>
>> So we seem to be getting into a cycle of discussing more about the
>> details of APIs than the high level proposal. The details of APIs are
>> important to debate, but those belong more in code reviews.
>>
>> One other important thing is that we should avoid API design by
>> committee. While it is extremely useful to get feedback and understand the use
>> cases, we cannot do API design by incorporating verbatim the union of
>> everybody's feedback. API design is largely a tradeoff game. The most
>> expressive API would also be harder to use, or sacrifice backward/forward
>> compatibility. It is as important to decide what to exclude as what to
>> include.
>>
>> Unlike the v1 API, the way Wenchen's high level V2 framework is proposed
>> makes it very easy to add new features (e.g. clustering properties) in the
>> future without breaking any APIs. I'd rather we ship something useful
>> that might not be the most comprehensive set, than debating about every
>> single feature we should add and then creating something super complicated
>> that has unclear value.
>>
>>
>>
>> On Wed, Aug 30, 2017 at 6:37 PM, Ryan Blue  wrote:
>>
>>> -1 (non-binding)
>>>
>>> Sometimes it takes a VOTE thread to get people to actually read and
>>> comment, so thanks for starting this one… but there’s still discussion
>>> happening on the prototype API, which the proposal hasn't been updated to
>>> reflect. I'd like to see the proposal shaped by the ongoing discussion so that we
>>> have a better, more concrete plan. I think that's going to produce a better SPIP.
>>>
>>> The second reason for -1 is that I think the read- and write-side
>>> proposals should be separated. The PR
>>>  currently has “write path”
>>> listed as a TODO item and most of the discussion I’ve seen is on the read
>>> side. I think it would be better to separate the read and write APIs so we
>>> can focus on them individually.
>>>
>>> An example of why we should focus on the write path separately is that
>>> the proposal says this:
>>>
>>> Ideally partitioning/bucketing concept should not be exposed in the Data
>>> Source API V2, because they are just techniques for data skipping and
>>> pre-partitioning. However, these 2 concepts are already widely used in
>>> Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION.
>>> To be consistent, we need to add partitioning/bucketing to Data Source V2 .
>>> . .
>>>
>>> Essentially, some of these APIs mix DDL and DML operations. I'd like to
>>> consider ways to fix that problem instead of carrying the problem forward
>>> to Data Source V2. We can solve this by adding a high-level API for DDL and
>>> a better write/insert API that works well with it. Clearly, that discussion
>>> is independent of the read path, which is why I think separating the two
>>> proposals would be a win.
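
For concreteness, the kind of call being described here, using the existing
public DataFrameWriter API (the table and column names are just examples):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ddl-dml-example").getOrCreate()
    val df = spark.range(10).selectExpr("id", "current_date() AS date")

    // A single call both defines the table and its partition layout (a DDL concern)
    // and writes the rows into it (DML), which is the mixing described above.
    df.write.partitionBy("date").mode("overwrite").saveAsTable("events")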
>>>
>>> rb
>>> ​
>>>
>>> On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin 
>>> wrote:
>>>
 That might be good to do, but seems orthogonal to this effort
 itself. It would be a completely different interface.

 On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan 
 wrote:

> OK I agree with it, how about we add a new interface to push down the
> query plan, based on the current framework? We can mark the
> query-plan-push-down interface as unstable, to save the effort of 
> designing
> a stable 

Re: Updates on migration guides

2017-08-31 Thread Felix Cheung
+1

I think we do migration guide changes for ML and R in separate JIRA/PR/commits, but 
we definitely should have it updated before the release.


From: linguin@gmail.com 
Sent: Wednesday, August 30, 2017 8:27:17 AM
To: Dongjoon Hyun
Cc: Xiao Li; u...@spark.apache.org
Subject: Re: Updates on migration guides

+1

2017/08/31 0:02, Dongjoon Hyun 
> wrote:

+1

On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li 
> wrote:
Hi, Devs,

Many questions from the open source community are actually caused by the 
behavior changes we made in each release. So far, the migration guides (e.g., 
https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide)
have not been properly updated. In the last few releases, multiple behavior 
changes were not documented in the migration guides or even the release notes. I 
propose doing the documentation updates in the same PRs that introduce the behavior 
changes. If the contributors can't make it, the committers who merge the PRs 
need to do it instead. We can also create a dedicated page for the migration guides 
of all the components. Hopefully, this can assist the migration efforts.

Thanks,

Xiao Li



Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-31 Thread Imran Rashid
Congrats Jerry!

On Mon, Aug 28, 2017 at 8:28 PM, Matei Zaharia 
wrote:

> Hi everyone,
>
> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai
> has been contributing to many areas of the project for a long time, so it’s
> great to see him join. Join me in thanking and congratulating him!
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: SPIP: Spark on Kubernetes

2017-08-31 Thread Anirudh Ramanathan
The proposal is in the process of being updated to include the details on the
testing that we have, which Imran pointed out.
Please expect an update on SPARK-18278.

Mridul had a couple of points as well, about exposing an SPI and we've been
exploring that, to ascertain the effort involved.
That effort is separate, fairly long-term and we should have a working
group of representatives from all cluster managers to make progress on it.
A proposal regarding this will be in SPARK-19700.

This vote has passed.
So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no
-1 votes.

Thanks all!

+1 votes (binding):
Reynold Xin
Matei Zaharia
Marcelo Vanzin
Mark Hamstra

+1 votes (non-binding):
Anirudh Ramanathan
Erik Erlandson
Ilan Filonenko
Sean Suchter
Kimoon Kim
Timothy Chen
Will Benton
Holden Karau
Seshu Adunuthula
Daniel Imberman
Shubham Chopra
Jiri Kremser
Yinan Li
Andrew Ash
李书明
Gary Lucas
Ismael Mejia
Jean-Baptiste Onofré
Alexander Bezzubov
duyanghao
elmiko
Sudarshan Kadambi
Varun Katta
Matt Cheah
Edward Zhang
Vaquar Khan





On Wed, Aug 30, 2017 at 10:42 PM, Reynold Xin  wrote:

> This has passed, hasn't it?
>
> On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan 
> wrote:
>
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend.
>> We're ~6 months in, and have had 5 releases.
>>
>>- 2 Spark versions maintained (2.1 and 2.2)
>>- Extensive integration testing and refactoring efforts to maintain
>>  code quality
>>- Developer- and user-facing documentation
>>- 10+ consistent code contributors from different organizations involved
>>  in actively maintaining and using the project, with several more members
>>  involved in testing and providing feedback.
>>- The community has delivered several talks on Spark-on-Kubernetes,
>>  generating lots of feedback from users.
>>- In addition to these, we've seen efforts spawn off such as:
>>  - HDFS on Kubernetes with Locality and Performance Experiments
>>  - Kerberized access to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>- +1: Yeah, let's go forward and implement the SPIP.
>>- +0: Don't really care.
>>- -1: I don't think this is a good idea because of the following
>>technical reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>> Full Design Doc: link
>> 
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah,
>>
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes  which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors with over 5 commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes and their computational Spark
>> applications. Although there is existing support for 

Re: Moving Scala 2.12 forward one step

2017-08-31 Thread Sean Owen
I don't think there's a target. The changes aren't all that hard (see the
SPARK-14220 umbrella) but there are some changes that are hard or
impossible without changing key APIs, as far as we can see. That would
suggest 3.0.

One motivation I have here for getting it as far as possible otherwise is
so people could, if they wanted, create a 2.12 build themselves without
much work even if it were not supported upstream. This particular change is
a lot of the miscellaneous stuff you'd have to fix to get to that point.

On Thu, Aug 31, 2017 at 11:04 AM Saisai Shao  wrote:

> Hi Sean,
>
> Do we have a planned target version for Scala 2.12 support? Several other
> projects, like Zeppelin and Livy, which rely on the Spark REPL, also require
> changes to support Scala 2.12.
>
> Thanks
> Jerry
>
> On Thu, Aug 31, 2017 at 5:55 PM, Sean Owen  wrote:
>
>> No, this doesn't let Spark build and run on 2.12. It makes changes that
>> will be required though, the ones that are really no loss to the current
>> 2.11 build.
>>
>> On Thu, Aug 31, 2017, 10:48 Denis Bolshakov 
>> wrote:
>>
>>> Hello,
>>>
>>> Sounds amazing. Is there any improvements in benchmarks?
>>>
>>>
>>> On 31 August 2017 at 12:25, Sean Owen  wrote:
>>>
 Calling attention to the question of Scala 2.12 again for a moment. I'd
 like to make a modest step towards support. Have a look again, if you
 would, at SPARK-14280:

 https://github.com/apache/spark/pull/18645

 This is a lot of the change for 2.12 that doesn't break 2.11, and
 really doesn't add any complexity. It's mostly dependency updates and
 clarifying some code. Other items like dealing with Kafka 0.8 support, the
 2.12 REPL, etc, are not here.

 So, this still doesn't result in a working 2.12 build but it's most of
 the miscellany that will be required.

 I'd like to merge it but wanted to flag it for feedback as it's not
 trivial.

>>>
>>>
>>>
>>> --
>>> //with Best Regards
>>> --Denis Bolshakov
>>> e-mail: bolshakov.de...@gmail.com
>>>
>>
>


Re: Moving Scala 2.12 forward one step

2017-08-31 Thread Saisai Shao
Hi Sean,

Do we have a planned target version for Scala 2.12 support? Several other
projects, like Zeppelin and Livy, which rely on the Spark REPL, also require
changes to support Scala 2.12.

Thanks
Jerry

On Thu, Aug 31, 2017 at 5:55 PM, Sean Owen  wrote:

> No, this doesn't let Spark build and run on 2.12. It makes changes that
> will be required though, the ones that are really no loss to the current
> 2.11 build.
>
> On Thu, Aug 31, 2017, 10:48 Denis Bolshakov 
> wrote:
>
>> Hello,
>>
>> Sounds amazing. Is there any improvements in benchmarks?
>>
>>
>> On 31 August 2017 at 12:25, Sean Owen  wrote:
>>
>>> Calling attention to the question of Scala 2.12 again for a moment. I'd
>>> like to make a modest step towards support. Have a look again, if you
>>> would, at SPARK-14280:
>>>
>>> https://github.com/apache/spark/pull/18645
>>>
>>> This is a lot of the change for 2.12 that doesn't break 2.11, and really
>>> doesn't add any complexity. It's mostly dependency updates and clarifying
>>> some code. Other items like dealing with Kafka 0.8 support, the 2.12 REPL,
>>> etc, are not here.
>>>
>>> So, this still doesn't result in a working 2.12 build but it's most of
>>> the miscellany that will be required.
>>>
>>> I'd like to merge it but wanted to flag it for feedback as it's not
>>> trivial.
>>>
>>
>>
>>
>> --
>> //with Best Regards
>> --Denis Bolshakov
>> e-mail: bolshakov.de...@gmail.com
>>
>


Re: Moving Scala 2.12 forward one step

2017-08-31 Thread Sean Owen
No, this doesn't let Spark build and run on 2.12. It makes changes that
will be required though, the ones that are really no loss to the current
2.11 build.

On Thu, Aug 31, 2017, 10:48 Denis Bolshakov 
wrote:

> Hello,
>
> Sounds amazing. Is there any improvements in benchmarks?
>
>
> On 31 August 2017 at 12:25, Sean Owen  wrote:
>
>> Calling attention to the question of Scala 2.12 again for a moment. I'd
>> like to make a modest step towards support. Have a look again, if you
>> would, at SPARK-14280:
>>
>> https://github.com/apache/spark/pull/18645
>>
>> This is a lot of the change for 2.12 that doesn't break 2.11, and really
>> doesn't add any complexity. It's mostly dependency updates and clarifying
>> some code. Other items like dealing with Kafka 0.8 support, the 2.12 REPL,
>> etc, are not here.
>>
>> So, this still doesn't result in a working 2.12 build but it's most of
>> the miscellany that will be required.
>>
>> I'd like to merge it but wanted to flag it for feedback as it's not
>> trivial.
>>
>
>
>
> --
> //with Best Regards
> --Denis Bolshakov
> e-mail: bolshakov.de...@gmail.com
>


Re: Moving Scala 2.12 forward one step

2017-08-31 Thread Denis Bolshakov
Hello,

Sounds amazing. Is there any improvements in benchmarks?


On 31 August 2017 at 12:25, Sean Owen  wrote:

> Calling attention to the question of Scala 2.12 again for a moment. I'd like
> to make a modest step towards support. Have a look again, if you would, at
> SPARK-14280:
>
> https://github.com/apache/spark/pull/18645
>
> This is a lot of the change for 2.12 that doesn't break 2.11, and really
> doesn't add any complexity. It's mostly dependency updates and clarifying
> some code. Other items like dealing with Kafka 0.8 support, the 2.12 REPL,
> etc, are not here.
>
> So, this still doesn't result in a working 2.12 build but it's most of the
> miscellany that will be required.
>
> I'd like to merge it but wanted to flag it for feedback as it's not
> trivial.
>



-- 
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.de...@gmail.com


Moving Scala 2.12 forward one step

2017-08-31 Thread Sean Owen
Calling attention to the question of Scala 2.12 again for a moment. I'd like
to make a modest step towards support. Have a look again, if you would, at
SPARK-14280:

https://github.com/apache/spark/pull/18645

This is a lot of the change for 2.12 that doesn't break 2.11, and really
doesn't add any complexity. It's mostly dependency updates and clarifying
some code. Other items like dealing with Kafka 0.8 support, the 2.12 REPL,
etc, are not here.

So, this still doesn't result in a working 2.12 build but it's most of the
miscellany that will be required.

I'd like to merge it but wanted to flag it for feedback as it's not trivial.