Fwd: [jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging

2017-08-30 Thread Jacek Laskowski
Hi,

I think that's the code change (by Marcelo Vanzin) that altered how
logging works; as of now it no longer seems to load
conf/log4j.properties by default.

Can anyone explain how it's supposed to work in 2.3? I could not
figure it out from the code, and conf/log4j.properties is not picked up
(though it was as recently as two days ago) :(

I'm using the master at
https://github.com/apache/spark/commit/fba9cc8466dccdcd1f6f372ea7962e7ae9e09be1.
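In case it helps others hitting the same thing, a quick sanity check --
assuming plain log4j 1.x behaviour rather than anything specific to this
commit -- is to force the properties file explicitly and see whether it
takes effect:

  // Launch with the config forced explicitly (standard log4j 1.x system property):
  //   ./bin/spark-shell --driver-java-options "-Dlog4j.configuration=file:conf/log4j.properties"
  // Then check what log4j actually ended up with:
  import org.apache.log4j.LogManager
  println(System.getProperty("log4j.configuration"))  // null => default classpath lookup
  println(LogManager.getRootLogger.getLevel)          // the level really in effect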

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski



-- Forwarded message --
From: Jacek Laskowski (JIRA) 
Date: Wed, Aug 30, 2017 at 9:03 AM
Subject: [jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
To: iss...@spark.apache.org



[ 
https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146780#comment-16146780
]

Jacek Laskowski commented on SPARK-21728:
-

I think the change is user-visible and therefore deserves to be
included in the release notes for 2.3 (I recall there is a component or
label for marking changes like that in a special way). /cc [~sowen]
[~hyukjin.kwon]

> Allow SparkSubmit to use logging
> 
>
> Key: SPARK-21728
> URL: https://issues.apache.org/jira/browse/SPARK-21728
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, code in {{SparkSubmit}} cannot call classes or methods that 
> initialize the Spark {{Logging}} framework. That is because at that time 
> {{SparkSubmit}} doesn't yet know which application will run, and logging is 
> initialized differently for certain special applications (notably, the 
> shells).
> It would be better if either {{SparkSubmit}} did logging initialization 
> earlier based on the application to be run, or did it in a way that could be 
> overridden later when the app initializes.
> Without this, there are currently a few parts of {{SparkSubmit}} that 
> duplicate code from other parts of Spark just to avoid logging. For example:
> * 
> [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860]
>  replicates code from Utils.scala
> * 
> [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54]
>  replicates code from Utils.scala and installs its own shutdown hook
> * a few parts of the code could use {{SparkConf}} but can't right now because 
> of the logging issue.
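
A minimal sketch of the "initialize early, allow override later" direction
described above. The names are invented for illustration; this is not the
actual SPARK-21728 patch:

  import org.apache.log4j.{Level, LogManager}

  // Illustrative only. SparkSubmit installs a default log4j setup up front;
  // special applications (e.g. the shells) may re-initialize it later, so
  // helpers like downloadFiles/createTempDir can use Logging freely.
  object SubmitLogging {
    @volatile private var initialized = false

    def initializeIfNeeded(isShell: Boolean): Unit = synchronized {
      if (!initialized) {
        // Shells want a quieter console than batch applications.
        LogManager.getRootLogger.setLevel(if (isShell) Level.WARN else Level.INFO)
        initialized = true
      }
    }

    // Called by the application once it knows its own logging preferences.
    def reinitialize(configure: () => Unit): Unit = synchronized {
      configure()
      initialized = true
    }
  }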






Apache Spark Streaming / Spark SQL Job logs

2017-08-30 Thread Chetan Khatri
Hey Spark Dev,

Can anyone suggest sample Spark Streaming / Spark SQL job logs to
download? I want to play with log analytics.

Thanks


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Wenchen Fan
OK, I agree. How about we add a new interface to push down the query
plan, based on the current framework? We can mark the query-plan-push-down
interface as unstable, to save the effort of designing a stable
representation of the query plan and maintaining forward compatibility.
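
A rough sketch of what the two ideas in this thread could look like -- the
unstable plan-push-down hook above, and the cost-report interface from the
earlier reply quoted below. All names are invented for illustration; this is
not the proposed V2 API:

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.sources.Filter

  // Unstable by design: the source receives the plan fragment rooted at its
  // relation and returns whatever part it could NOT handle; Spark runs the rest.
  trait SupportsQueryPlanPushDown {
    def pushPlan(plan: LogicalPlan): LogicalPlan
  }

  // Cost-report alternative: Spark enumerates candidate push-down orders and
  // asks the source for an estimate, then keeps the cheapest combination.
  trait SupportsCostReport {
    def estimatedCost(pushedFilters: Seq[Filter], sortPushedFirst: Boolean): Double
  }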

On Wed, Aug 30, 2017 at 10:53 AM, James Baker  wrote:

> I'll just focus on the one-by-one thing for now - it's the thing that
> blocks me the most.
>
> I think the place where we're most confused here is on the cost of
> determining whether I can push down a filter. For me, in order to work out
> whether I can push down a filter or satisfy a sort, I might have to read
> plenty of data. That said, it's worth me doing this because I can use this
> information to avoid reading >>that much data.
>
> If you give me all the orderings, I will have to read that data many times
> (we stream it to avoid keeping it in memory).
>
> There's also a thing where our typical use cases have many filters (20+ is
> common). So, it's likely not going to work to pass us all the combinations.
> That said, if I can tell you a cost, I know what optimal looks like, why
> can't I just pick that myself?
>
> The current design is friendly to simple datasources, but does not have
> the potential to support this.
>
> So the main problem we have with datasources v1 is that it's essentially
> impossible to leverage a bunch of Spark features - I don't get to use
> bucketing or row batches or all the nice things that I really want to use
> to get decent performance. Provided I can leverage these in a moderately
> supported way which won't break in any given commit, I'll be pretty happy
> with anything that lets me opt out of the restrictions.
>
> My suggestion here is that if you make a mode which works well for
> complicated use cases, you end up being able to write simple mode in terms
> of it very easily. So we could actually provide two APIs, one that lets
> people who have more interesting datasources leverage the cool Spark
> features, and one that lets people who just want to implement basic
> features do that - I'd try to include some kind of layering here. I could
> probably sketch out something here if that'd be useful?
>
> James
>
> On Tue, 29 Aug 2017 at 18:59 Wenchen Fan  wrote:
>
>> Hi James,
>>
>> Thanks for your feedback! I think your concerns are all valid, but we
>> need to make a tradeoff here.
>>
>> > Explicitly here, what I'm looking for is a convenient mechanism to
>> accept a fully specified set of arguments
>>
>> The problem with this approach is: 1) if we wanna add more arguments in
>> the future, it's really hard to do without changing the existing interface.
>> 2) if a user wants to implement a very simple data source, he has to look
>> at all the arguments and understand them, which may be a burden for him.
>> I don't have a solution to solve these 2 problems, comments are welcome.
>>
>>
>> > There are loads of cases like this - you can imagine someone being
>> able to push down a sort before a filter is applied, but not afterwards.
>> However, maybe the filter is so selective that it's better to push down the
>> filter and not handle the sort. I don't get to make this decision, Spark
>> does (but doesn't have good enough information to do it properly, whilst I
>> do). I want to be able to choose the parts I push down given knowledge of
>> my datasource - as defined the APIs don't let me do that, they're strictly
>> more restrictive than the V1 APIs in this way.
>>
>> This is true, the current framework applies push downs one by one,
>> incrementally. If a data source wanna go back to accept a sort push down
>> after it accepts a filter push down, it's impossible with the current data
>> source V2.
>> Fortunately, we have a solution for this problem. At Spark side, actually
>> we do have a fully specified set of arguments waiting to be pushed down,
>> but Spark doesn't know which is the best order to push them into data
>> source. Spark can try every combination and ask the data source to report a
>> cost, then Spark can pick the best combination with the lowest cost. This
>> can also be implemented as a cost report interface, so that advanced data
>> source can implement it for optimal performance, and simple data source
>> doesn't need to care about it and keep simple.
>>
>>
>> The current design is very friendly to simple data source, and has the
>> potential to support complex data source, I prefer the current design over
>> the plan push down one. What do you think?
>>
>>
>> On Wed, Aug 30, 2017 at 5:53 AM, James Baker  wrote:
>>
>>> Yeah, for sure.
>>>
>>> With the stable representation - agree that in the general case this is
>>> pretty intractable, it restricts the modifications that you can do in the
>>> future too much. That said, it shouldn't be as hard if you restrict
>>> yourself to the parts of the plan which are supported by the datasources V2
>>> API (which after all, need to be translateable properly into the future 

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
That might be good to do, but seems like orthogonal to this effort itself.
It would be a completely different interface.

On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan  wrote:

> OK I agree with it, how about we add a new interface to push down the
> query plan, based on the current framework? We can mark the
> query-plan-push-down interface as unstable, to save the effort of designing
> a stable representation of query plan and maintaining forward compatibility.
>
> On Wed, Aug 30, 2017 at 10:53 AM, James Baker  wrote:
>
>> I'll just focus on the one-by-one thing for now - it's the thing that
>> blocks me the most.
>>
>> I think the place where we're most confused here is on the cost of
>> determining whether I can push down a filter. For me, in order to work out
>> whether I can push down a filter or satisfy a sort, I might have to read
>> plenty of data. That said, it's worth me doing this because I can use this
>> information to avoid reading >>that much data.
>>
>> If you give me all the orderings, I will have to read that data many
>> times (we stream it to avoid keeping it in memory).
>>
>> There's also a thing where our typical use cases have many filters (20+
>> is common). So, it's likely not going to work to pass us all the
>> combinations. That said, if I can tell you a cost, I know what optimal
>> looks like, why can't I just pick that myself?
>>
>> The current design is friendly to simple datasources, but does not have
>> the potential to support this.
>>
>> So the main problem we have with datasources v1 is that it's essentially
>> impossible to leverage a bunch of Spark features - I don't get to use
>> bucketing or row batches or all the nice things that I really want to use
>> to get decent performance. Provided I can leverage these in a moderately
>> supported way which won't break in any given commit, I'll be pretty happy
>> with anything that lets me opt out of the restrictions.
>>
>> My suggestion here is that if you make a mode which works well for
>> complicated use cases, you end up being able to write simple mode in terms
>> of it very easily. So we could actually provide two APIs, one that lets
>> people who have more interesting datasources leverage the cool Spark
>> features, and one that lets people who just want to implement basic
>> features do that - I'd try to include some kind of layering here. I could
>> probably sketch out something here if that'd be useful?
>>
>> James
>>
>> On Tue, 29 Aug 2017 at 18:59 Wenchen Fan  wrote:
>>
>>> Hi James,
>>>
>>> Thanks for your feedback! I think your concerns are all valid, but we
>>> need to make a tradeoff here.
>>>
>>> > Explicitly here, what I'm looking for is a convenient mechanism to
>>> accept a fully specified set of arguments
>>>
>>> The problem with this approach is: 1) if we wanna add more arguments in
>>> the future, it's really hard to do without changing the existing interface.
>>> 2) if a user wants to implement a very simple data source, he has to look
>>> at all the arguments and understand them, which may be a burden for him.
>>> I don't have a solution to solve these 2 problems, comments are welcome.
>>>
>>>
>>> > There are loads of cases like this - you can imagine someone being
>>> able to push down a sort before a filter is applied, but not afterwards.
>>> However, maybe the filter is so selective that it's better to push down the
>>> filter and not handle the sort. I don't get to make this decision, Spark
>>> does (but doesn't have good enough information to do it properly, whilst I
>>> do). I want to be able to choose the parts I push down given knowledge of
>>> my datasource - as defined the APIs don't let me do that, they're strictly
>>> more restrictive than the V1 APIs in this way.
>>>
>>> This is true, the current framework applies push downs one by one,
>>> incrementally. If a data source wanna go back to accept a sort push down
>>> after it accepts a filter push down, it's impossible with the current data
>>> source V2.
>>> Fortunately, we have a solution for this problem. At Spark side,
>>> actually we do have a fully specified set of arguments waiting to be
>>> pushed down, but Spark doesn't know which is the best order to push them
>>> into data source. Spark can try every combination and ask the data source
>>> to report a cost, then Spark can pick the best combination with the lowest
>>> cost. This can also be implemented as a cost report interface, so that
>>> advanced data source can implement it for optimal performance, and simple
>>> data source doesn't need to care about it and keep simple.
>>>
>>>
>>> The current design is very friendly to simple data source, and has the
>>> potential to support complex data source, I prefer the current design over
>>> the plan push down one. What do you think?
>>>
>>>
>>> On Wed, Aug 30, 2017 at 5:53 AM, James Baker 
>>> wrote:
>>>
 Yeah, for sure.

 With the stable representation - agree that in the general case this is
 pretty intractable, it 

Updates on migration guides

2017-08-30 Thread Xiao Li
Hi, Devs,

Many questions from the open source community are actually caused by the
behavior changes we made in each release. So far, the migration guides
(e.g.,
https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide)
have not been properly updated. In the last few releases, multiple
behavior changes were not documented in the migration guides or even the
release notes. I propose doing the documentation updates in the same PRs
that introduce the behavior changes. If the contributors can't do it, the
committers who merge the PRs should do it instead. We could also create a
dedicated page for the migration guides of all the components. Hopefully,
this can assist the migration efforts.

Thanks,

Xiao Li


Re: Updates on migration guides

2017-08-30 Thread Dongjoon Hyun
+1

On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li  wrote:

> Hi, Devs,
>
> Many questions from the open source community are actually caused by the
> behavior changes we made in each release. So far, the migration guides
> (e.g., https://spark.apache.org/docs/latest/sql-programming-guide.
> html#migration-guide) were not being properly updated. In the last few
> releases, multiple behavior changes are not documented in migration guides
> and even release notes. I propose to do the document updates in the same
> PRs that introduce the behavior changes. If the contributors can't make it,
> the committers who merge the PRs need to do it instead. We also can create
> a dedicated page for migration guides of all the components. Hopefully,
> this can assist the migration efforts.
>
> Thanks,
>
> Xiao Li
>


Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for
each release. I think we generally catch all breaking changes and most major
behavior changes.

On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun  wrote:

> +1
>
> On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li  wrote:
>
>> Hi, Devs,
>>
>> Many questions from the open source community are actually caused by the
>> behavior changes we made in each release. So far, the migration guides
>> (e.g.,
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide)
>> were not being properly updated. In the last few releases, multiple
>> behavior changes are not documented in migration guides and even release
>> notes. I propose to do the document updates in the same PRs that introduce
>> the behavior changes. If the contributors can't make it, the committers who
>> merge the PRs need to do it instead. We also can create a dedicated page
>> for migration guides of all the components. Hopefully, this can assist the
>> migration efforts.
>>
>> Thanks,
>>
>> Xiao Li
>>
>
>


Re: Updates on migration guides

2017-08-30 Thread linguin . m . s
+1

On 2017/08/31 at 0:02, Dongjoon Hyun wrote:

> +1
> 
>> On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li  wrote:
>> Hi, Devs,
>> 
>> Many questions from the open source community are actually caused by the 
>> behavior changes we made in each release. So far, the migration guides 
>> (e.g., 
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide)
>>  were not being properly updated. In the last few releases, multiple 
>> behavior changes are not documented in migration guides and even release 
>> notes. I propose to do the document updates in the same PRs that introduce 
>> the behavior changes. If the contributors can't make it, the committers who 
>> merge the PRs need to do it instead. We also can create a dedicated page for 
>> migration guides of all the components. Hopefully, this can assist the 
>> migration efforts. 
>> 
>> Thanks,
>> 
>> Xiao Li
> 


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Ryan Blue
-1 (non-binding)

Sometimes it takes a VOTE thread to get people to actually read and
comment, so thanks for starting this one… but there’s still discussion
happening on the prototype API, which hasn’t been updated. I’d like to
see the proposal shaped by the ongoing discussion so that we have a better,
more concrete plan. I think that’s going to produce a better SPIP.

The second reason for -1 is that I think the read- and write-side proposals
should be separated. The PR 
currently has “write path” listed as a TODO item and most of the discussion
I’ve seen is on the read side. I think it would be better to separate the
read and write APIs so we can focus on them individually.

An example of why we should focus on the write path separately is that the
proposal says this:

Ideally partitioning/bucketing concept should not be exposed in the Data
Source API V2, because they are just techniques for data skipping and
pre-partitioning. However, these 2 concepts are already widely used in
Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION.
To be consistent, we need to add partitioning/bucketing to Data Source V2 .
. .

Essentially, some of the APIs mix DDL and DML operations. I’d like to consider
ways to fix that problem instead of carrying the problem forward to Data
Source V2. We can solve this by adding a high-level API for DDL and a
better write/insert API that works well with it. Clearly, that discussion
is independent of the read path, which is why I think separating the two
proposals would be a win.
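
To make the DDL/DML overlap concrete, here is a small example using standard
Spark 2.x APIs (illustrative only; it assumes a Hive-enabled session for the
ALTER TABLE part):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("ddl-dml-overlap").enableHiveSupport().getOrCreate()
  val df = spark.range(10).selectExpr("id AS user_id", "'2017-08-30' AS dt")

  // Partitioning and bucketing specified on the write (DML) side...
  df.write.partitionBy("dt").bucketBy(8, "user_id").saveAsTable("events")

  // ...while the same concept surfaces again through DDL, managed separately:
  spark.sql("ALTER TABLE events ADD PARTITION (dt = '2017-08-31')")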

rb
​

On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin  wrote:

> That might be good to do, but seems like orthogonal to this effort itself.
> It would be a completely different interface.
>
> On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan  wrote:
>
>> OK I agree with it, how about we add a new interface to push down the
>> query plan, based on the current framework? We can mark the
>> query-plan-push-down interface as unstable, to save the effort of designing
>> a stable representation of query plan and maintaining forward compatibility.
>>
>> On Wed, Aug 30, 2017 at 10:53 AM, James Baker 
>> wrote:
>>
>>> I'll just focus on the one-by-one thing for now - it's the thing that
>>> blocks me the most.
>>>
>>> I think the place where we're most confused here is on the cost of
>>> determining whether I can push down a filter. For me, in order to work out
>>> whether I can push down a filter or satisfy a sort, I might have to read
>>> plenty of data. That said, it's worth me doing this because I can use this
>>> information to avoid reading >>that much data.
>>>
>>> If you give me all the orderings, I will have to read that data many
>>> times (we stream it to avoid keeping it in memory).
>>>
>>> There's also a thing where our typical use cases have many filters (20+
>>> is common). So, it's likely not going to work to pass us all the
>>> combinations. That said, if I can tell you a cost, I know what optimal
>>> looks like, why can't I just pick that myself?
>>>
>>> The current design is friendly to simple datasources, but does not have
>>> the potential to support this.
>>>
>>> So the main problem we have with datasources v1 is that it's essentially
>>> impossible to leverage a bunch of Spark features - I don't get to use
>>> bucketing or row batches or all the nice things that I really want to use
>>> to get decent performance. Provided I can leverage these in a moderately
>>> supported way which won't break in any given commit, I'll be pretty happy
>>> with anything that lets me opt out of the restrictions.
>>>
>>> My suggestion here is that if you make a mode which works well for
>>> complicated use cases, you end up being able to write simple mode in terms
>>> of it very easily. So we could actually provide two APIs, one that lets
>>> people who have more interesting datasources leverage the cool Spark
>>> features, and one that lets people who just want to implement basic
>>> features do that - I'd try to include some kind of layering here. I could
>>> probably sketch out something here if that'd be useful?
>>>
>>> James
>>>
>>> On Tue, 29 Aug 2017 at 18:59 Wenchen Fan  wrote:
>>>
 Hi James,

 Thanks for your feedback! I think your concerns are all valid, but we
 need to make a tradeoff here.

 > Explicitly here, what I'm looking for is a convenient mechanism to
 accept a fully specified set of arguments

 The problem with this approach is: 1) if we wanna add more arguments in
 the future, it's really hard to do without changing the existing interface.
 2) if a user wants to implement a very simple data source, he has to look
 at all the arguments and understand them, which may be a burden for him.
 I don't have a solution to solve these 2 problems, comments are welcome.


 > There are loads of cases like this - you can imagine someone being
 able to push down a sort before a filter 

Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-30 Thread Marcelo Vanzin
I'm still seeing some odd behavior.

I just deleted my repo's branch for
https://github.com/apache/spark/pull/19013 and the script seems to
have done some update to the bug, since I got a bunch of e-mails.

On Mon, Aug 28, 2017 at 2:34 PM, Josh Rosen  wrote:
> This should be fixed now. The problem was that debug code had been pushed
> while investigating the JIRA linkage failure but was not removed and this
> problem went unnoticed because linking was failing well before the debug
> code was hit. Once the JIRA connectivity issues were resolved, the
> problematic code was running and causing the linking operation to fail
> mid-way through, triggering a finally block which undid the JIRA assignment.
>
> I've rolled back the bad code and enabled additional monitoring in
> StackDriver to raise an alert if we see new linking failures.
>
> On Mon, Aug 28, 2017 at 12:02 PM Marcelo Vanzin  wrote:
>>
>> It seems a little wonky, though. Feels like it's updating JIRA every
>> time you comment on a PR. Or maybe it's still working through the
>> backlog...
>>
>> On Mon, Aug 28, 2017 at 9:57 AM, Reynold Xin  wrote:
>> > The process for doing that was down before, and might've come back up
>> > and
>> > are going through the huge backlog.
>> >
>> >
>> > On Mon, Aug 28, 2017 at 6:56 PM, Sean Owen  wrote:
>> >>
>> >> Like whatever reassigns JIRAs after a PR is closed?
>> >>
>> >> It seems to be going crazy, or maybe there are many running. Not sure
>> >> who
>> >> owns that, but can he/she take a look?
>> >>
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>



-- 
Marcelo




Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
So we seem to be getting into a cycle of discussing more about the details
of APIs than the high level proposal. The details of APIs are important to
debate, but those belong more in code reviews.

One other important thing is that we should avoid API design by committee.
While it is extremely useful to get feedback and understand the use cases, we
cannot do API design by incorporating verbatim the union of everybody's
feedback. API design is largely a tradeoff game. The most expressive API
would also be harder to use, or sacrifice backward/forward compatibility.
It is as important to decide what to exclude as what to include.

Unlike the v1 API, the way Wenchen's high level V2 framework is proposed
makes it very easy to add new features (e.g. clustering properties) in the
future without breaking any APIs. I'd rather we ship something useful
that might not be the most comprehensive set than debate every
single feature we should add and end up creating something super complicated
that has unclear value.
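
A tiny sketch of why the mix-in style evolves without breaking anything
(invented names, not the actual V2 interfaces): new capabilities arrive as
optional traits, so existing sources keep compiling and simply don't
advertise them.

  import org.apache.spark.sql.types.StructType

  trait ReaderLike {
    def readSchema(): StructType
  }

  // Added in a later release; purely additive, old implementations are untouched.
  trait SupportsReportClustering {
    def clusteredColumns(): Array[String]
  }

  // A source written against the original API keeps working unchanged:
  class SimpleReader extends ReaderLike {
    override def readSchema(): StructType = new StructType().add("id", "long")
  }

  // A newer source opts in to the extra capability:
  class ClusteredReader extends ReaderLike with SupportsReportClustering {
    override def readSchema(): StructType =
      new StructType().add("id", "long").add("region", "string")
    override def clusteredColumns(): Array[String] = Array("region")
  }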



On Wed, Aug 30, 2017 at 6:37 PM, Ryan Blue  wrote:

> -1 (non-binding)
>
> Sometimes it takes a VOTE thread to get people to actually read and
> comment, so thanks for starting this one… but there’s still discussion
> happening on the prototype API, which it hasn’t been updated. I’d like to
> see the proposal shaped by the ongoing discussion so that we have a better,
> more concrete plan. I think that’s going to produces a better SPIP.
>
> The second reason for -1 is that I think the read- and write-side
> proposals should be separated. The PR
>  currently has “write path”
> listed as a TODO item and most of the discussion I’ve seen is on the read
> side. I think it would be better to separate the read and write APIs so we
> can focus on them individually.
>
> An example of why we should focus on the write path separately is that the
> proposal says this:
>
> Ideally partitioning/bucketing concept should not be exposed in the Data
> Source API V2, because they are just techniques for data skipping and
> pre-partitioning. However, these 2 concepts are already widely used in
> Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION.
> To be consistent, we need to add partitioning/bucketing to Data Source V2 .
> . .
>
> Essentially, the some APIs mix DDL and DML operations. I’d like to
> consider ways to fix that problem instead of carrying the problem forward
> to Data Source V2. We can solve this by adding a high-level API for DDL and
> a better write/insert API that works well with it. Clearly, that discussion
> is independent of the read path, which is why I think separating the two
> proposals would be a win.
>
> rb
> ​
>
> On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin  wrote:
>
>> That might be good to do, but seems like orthogonal to this effort
>> itself. It would be a completely different interface.
>>
>> On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan  wrote:
>>
>>> OK I agree with it, how about we add a new interface to push down the
>>> query plan, based on the current framework? We can mark the
>>> query-plan-push-down interface as unstable, to save the effort of designing
>>> a stable representation of query plan and maintaining forward compatibility.
>>>
>>> On Wed, Aug 30, 2017 at 10:53 AM, James Baker 
>>> wrote:
>>>
 I'll just focus on the one-by-one thing for now - it's the thing that
 blocks me the most.

 I think the place where we're most confused here is on the cost of
 determining whether I can push down a filter. For me, in order to work out
 whether I can push down a filter or satisfy a sort, I might have to read
 plenty of data. That said, it's worth me doing this because I can use this
 information to avoid reading >>that much data.

 If you give me all the orderings, I will have to read that data many
 times (we stream it to avoid keeping it in memory).

 There's also a thing where our typical use cases have many filters (20+
 is common). So, it's likely not going to work to pass us all the
 combinations. That said, if I can tell you a cost, I know what optimal
 looks like, why can't I just pick that myself?

 The current design is friendly to simple datasources, but does not have
 the potential to support this.

 So the main problem we have with datasources v1 is that it's
 essentially impossible to leverage a bunch of Spark features - I don't get
 to use bucketing or row batches or all the nice things that I really want
 to use to get decent performance. Provided I can leverage these in a
 moderately supported way which won't break in any given commit, I'll be
 pretty happy with anything that lets me opt out of the restrictions.

 My suggestion here is that if you make a mode which works well for
 complicated use cases, you end up being able to write simple mode in terms
 of it very easily. So we

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread James Baker
I guess I was more suggesting that by coding up the powerful mode as the API, 
it becomes easy for someone to layer an easy mode beneath it to enable simpler 
datasources to be integrated (and that simple mode should be the out of scope 
thing).

Taking a small step back here, one of the places where I think I'm missing some 
context is in understanding the target consumers of these interfaces. I've done 
some amount (though likely not enough) of research about the places where 
people have had issues of API surface in the past - the concrete tickets I've 
seen have been based on Cassandra integration where you want to indicate 
clustering, and SAP HANA where they want to push down more complicated queries 
through Spark. This proposal supports the former, but the amount of change 
required to support clustering in the current API is not obviously high - 
whilst the current proposal for V2 seems to make it very difficult to add 
support for pushing down plenty of aggregations in the future (I've found the 
question of how to add GROUP BY to be pretty tricky to answer for the current 
proposal).

Googling around for implementations of the current PrunedFilteredScan, I 
basically find a lot of databases, which seems reasonable - SAP HANA, 
ElasticSearch, Solr, MongoDB, Apache Phoenix, etc. I've talked to people who've 
used (some of) these connectors and the sticking point has generally been that 
Spark needs to load a lot of data out in order to solve aggregations that can 
be very efficiently pushed down into the datasources.
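
For reference, the V1 interface those connectors implement looks like the
trait below (org.apache.spark.sql.sources, as of Spark 2.x), and the gap is
visible in its shape: there is nowhere to hand the source a GROUP BY. The
second trait is purely hypothetical, just to show what an aggregate
push-down hook would have to carry:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.sources.Filter

  // Existing V1 API: column pruning plus filter push-down, nothing more.
  trait PrunedFilteredScan {
    def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
  }

  // Hypothetical sketch only -- not part of any proposal here.
  trait AggregatedScan {
    def buildAggregatedScan(groupingColumns: Array[String],
                            aggregates: Array[String],   // e.g. "sum(amount)"
                            filters: Array[Filter]): RDD[Row]
  }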

So, with this proposal it appears that we're optimising towards making it easy 
to write one-off datasource integrations, with some amount of pluggability for 
people who want to do more complicated things (the most interesting being 
bucketing integration). However, my guess is that this isn't what the current 
major integrations suffer from; they suffer mostly from restrictions in what 
they can push down (which broadly speaking are not going to go away).

So the place where I'm confused is that the current integrations can be made 
incrementally better as a consequence of this, but the backing data systems 
have the features which enable a step change which this API makes harder to 
achieve in the future. Who are the group of users who benefit the most as a 
consequence of this change, like, who is the target consumer here? My personal 
slant is that it's more important to improve support for other datastores than 
it is to lower the barrier of entry - this is why I've been pushing here.

James

On Wed, 30 Aug 2017 at 09:37 Ryan Blue 
mailto:rb...@netflix.com>> wrote:

-1 (non-binding)

Sometimes it takes a VOTE thread to get people to actually read and comment, so 
thanks for starting this one… but there’s still discussion happening on the 
prototype API, which it hasn’t been updated. I’d like to see the proposal 
shaped by the ongoing discussion so that we have a better, more concrete plan. 
I think that’s going to produces a better SPIP.

The second reason for -1 is that I think the read- and write-side proposals 
should be separated. The PR 
currently has “write path” listed as a TODO item and most of the discussion 
I’ve seen is on the read side. I think it would be better to separate the read 
and write APIs so we can focus on them individually.

An example of why we should focus on the write path separately is that the 
proposal says this:

Ideally partitioning/bucketing concept should not be exposed in the Data Source 
API V2, because they are just techniques for data skipping and 
pre-partitioning. However, these 2 concepts are already widely used in Spark, 
e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION. To be 
consistent, we need to add partitioning/bucketing to Data Source V2 . . .

Essentially, the some APIs mix DDL and DML operations. I’d like to consider 
ways to fix that problem instead of carrying the problem forward to Data Source 
V2. We can solve this by adding a high-level API for DDL and a better 
write/insert API that works well with it. Clearly, that discussion is 
independent of the read path, which is why I think separating the two proposals 
would be a win.

rb

​

On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin 
mailto:r...@databricks.com>> wrote:
That might be good to do, but seems like orthogonal to this effort itself. It 
would be a completely different interface.

On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
OK I agree with it, how about we add a new interface to push down the query 
plan, based on the current framework? We can mark the query-plan-push-down 
interface as unstable, to save the effort of designing a stable representation 
of query plan and maintaining forward compatibility.

On Wed, Aug 30, 2017 at 10:53 AM, James Baker 
mailto:j.ba...@outlook.com>> wrote:
I'll just focus on the one-by-one thing for now - it's the thing that blocks me 
the

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
Sure that's good to do (and as discussed earlier a good compromise might be
to expose an interface for the source to decide which part of the logical
plan they want to accept).

To me everything is about cost vs benefit.

In my mind, the biggest issue with the existing data source API is backward
and forward compatibility. All the data sources written for Spark 1.x broke
in Spark 2.x, and that's one of the biggest values v2 can bring. To me it's
far more important for data sources implemented in 2017 to be able to
work in 2027, in Spark 10.x.

You are basically arguing for creating a new API that is capable of doing
arbitrary expression, aggregation, and join pushdowns (you only mentioned
aggregation so far, but I've talked to enough database people that I know
once Spark gives them aggregation pushdown, they will come back for join
pushdown). We can do that using unstable APIs, but creating stable APIs
would be extremely difficult (still doable, it would just take a long time to
design and implement). As mentioned earlier, it basically involves creating
a stable representation for all of the logical plan, which is a lot of work. I
think we should still work towards that (for other reasons as well), but
I'd consider it out of scope for the current effort. Otherwise we probably
wouldn't release anything for the next 2 or 3 years.





On Wed, Aug 30, 2017 at 11:50 PM, James Baker  wrote:

> I guess I was more suggesting that by coding up the powerful mode as the
> API, it becomes easy for someone to layer an easy mode beneath it to enable
> simpler datasources to be integrated (and that simple mode should be the
> out of scope thing).
>
> Taking a small step back here, one of the places where I think I'm missing
> some context is in understanding the target consumers of these interfaces.
> I've done some amount (though likely not enough) of research about the
> places where people have had issues of API surface in the past - the
> concrete tickets I've seen have been based on Cassandra integration where
> you want to indicate clustering, and SAP HANA where they want to push down
> more complicated queries through Spark. This proposal supports the former,
> but the amount of change required to support clustering in the current API
> is not obviously high - whilst the current proposal for V2 seems to make it
> very difficult to add support for pushing down plenty of aggregations in
> the future (I've found the question of how to add GROUP BY to be pretty
> tricky to answer for the current proposal).
>
> Googling around for implementations of the current PrunedFilteredScan, I
> basically find a lot of databases, which seems reasonable - SAP HANA,
> ElasticSearch, Solr, MongoDB, Apache Phoenix, etc. I've talked to people
> who've used (some of) these connectors and the sticking point has generally
> been that Spark needs to load a lot of data out in order to solve
> aggregations that can be very efficiently pushed down into the datasources.
>
> So, with this proposal it appears that we're optimising towards making it
> easy to write one-off datasource integrations, with some amount of
> pluggability for people who want to do more complicated things (the most
> interesting being bucketing integration). However, my guess is that this
> isn't what the current major integrations suffer from; they suffer mostly
> from restrictions in what they can push down (which broadly speaking are
> not going to go away).
>
> So the place where I'm confused is that the current integrations can be
> made incrementally better as a consequence of this, but the backing data
> systems have the features which enable a step change which this API makes
> harder to achieve in the future. Who are the group of users who benefit the
> most as a consequence of this change, like, who is the target consumer
> here? My personal slant is that it's more important to improve support for
> other datastores than it is to lower the barrier of entry - this is why
> I've been pushing here.
>
> James
>
> On Wed, 30 Aug 2017 at 09:37 Ryan Blue  wrote:
>
>> -1 (non-binding)
>>
>> Sometimes it takes a VOTE thread to get people to actually read and
>> comment, so thanks for starting this one… but there’s still discussion
>> happening on the prototype API, which it hasn’t been updated. I’d like to
>> see the proposal shaped by the ongoing discussion so that we have a better,
>> more concrete plan. I think that’s going to produces a better SPIP.
>>
>> The second reason for -1 is that I think the read- and write-side
>> proposals should be separated. The PR
>>  currently has “write path”
>> listed as a TODO item and most of the discussion I’ve seen is on the read
>> side. I think it would be better to separate the read and write APIs so we
>> can focus on them individually.
>>
>> An example of why we should focus on the write path separately is that
>> the proposal says this:
>>
>> Ideally partitioning/bucketing concept sho

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Ryan Blue
Maybe I'm missing something, but the high-level proposal consists of:
Goals, Non-Goals, and Proposed API. What is there to discuss other than the
details of the API that's being proposed? I think the goals make sense, but
goals alone aren't enough to approve a SPIP.

On Wed, Aug 30, 2017 at 2:46 PM, Reynold Xin  wrote:

> So we seem to be getting into a cycle of discussing more about the details
> of APIs than the high level proposal. The details of APIs are important to
> debate, but those belong more in code reviews.
>
> One other important thing is that we should avoid API design by committee.
> While it is extremely useful to get feedback, understand the use cases, we
> cannot do API design by incorporating verbatim the union of everybody's
> feedback. API design is largely a tradeoff game. The most expressive API
> would also be harder to use, or sacrifice backward/forward compatibility.
> It is as important to decide what to exclude as what to include.
>
> Unlike the v1 API, the way Wenchen's high level V2 framework is proposed
> makes it very easy to add new features (e.g. clustering properties) in the
> future without breaking any APIs. I'd rather us shipping something useful
> that might not be the most comprehensive set, than debating about every
> single feature we should add and then creating something super complicated
> that has unclear value.
>
>
>
> On Wed, Aug 30, 2017 at 6:37 PM, Ryan Blue  wrote:
>
>> -1 (non-binding)
>>
>> Sometimes it takes a VOTE thread to get people to actually read and
>> comment, so thanks for starting this one… but there’s still discussion
>> happening on the prototype API, which it hasn’t been updated. I’d like to
>> see the proposal shaped by the ongoing discussion so that we have a better,
>> more concrete plan. I think that’s going to produces a better SPIP.
>>
>> The second reason for -1 is that I think the read- and write-side
>> proposals should be separated. The PR
>>  currently has “write path”
>> listed as a TODO item and most of the discussion I’ve seen is on the read
>> side. I think it would be better to separate the read and write APIs so we
>> can focus on them individually.
>>
>> An example of why we should focus on the write path separately is that
>> the proposal says this:
>>
>> Ideally partitioning/bucketing concept should not be exposed in the Data
>> Source API V2, because they are just techniques for data skipping and
>> pre-partitioning. However, these 2 concepts are already widely used in
>> Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION.
>> To be consistent, we need to add partitioning/bucketing to Data Source V2 .
>> . .
>>
>> Essentially, the some APIs mix DDL and DML operations. I’d like to
>> consider ways to fix that problem instead of carrying the problem forward
>> to Data Source V2. We can solve this by adding a high-level API for DDL and
>> a better write/insert API that works well with it. Clearly, that discussion
>> is independent of the read path, which is why I think separating the two
>> proposals would be a win.
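
To make the mixing described above concrete, here is what it looks like with today's public writer API. This is just a minimal spark-shell sketch of existing DataFrameWriter usage; the table and column names are made up for illustration.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("ddl-in-dml").getOrCreate()
    import spark.implicits._

    // A toy DataFrame standing in for real data.
    val df = Seq((1L, "2017-08-30"), (2L, "2017-08-30")).toDF("user_id", "event_date")

    // Table layout (a DDL concern) is expressed through the write (DML) call:
    df.write
      .partitionBy("event_date")   // physical layout of the table
      .bucketBy(16, "user_id")     // physical layout of the table
      .sortBy("user_id")
      .saveAsTable("events")       // the actual write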
>>
>> rb
>> ​
>>
>> On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin  wrote:
>>
>>> That might be good to do, but seems orthogonal to this effort
>>> itself. It would be a completely different interface.
>>>
>>> On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan  wrote:
>>>
 OK I agree with it, how about we add a new interface to push down the
 query plan, based on the current framework? We can mark the
 query-plan-push-down interface as unstable, to save the effort of designing
 a stable representation of query plan and maintaining forward 
 compatibility.

 On Wed, Aug 30, 2017 at 10:53 AM, James Baker 
 wrote:

> I'll just focus on the one-by-one thing for now - it's the thing that
> blocks me the most.
>
> I think the place where we're most confused here is on the cost of
> determining whether I can push down a filter. For me, in order to work out
> whether I can push down a filter or satisfy a sort, I might have to read
> plenty of data. That said, it's worth me doing this because I can use this
> information to avoid reading that much data.
>
> If you give me all the orderings, I will have to read that data many
> times (we stream it to avoid keeping it in memory).
>
> There's also a thing where our typical use cases have many filters
> (20+ is common). So, it's likely not going to work to pass us all the
> combinations. That said, if I can tell you a cost, I know what optimal
> looks like, why can't I just pick that myself?
>
> The current design is friendly to simple datasources, but does not
> have the potential to support this.
>
> So the main problem we have with datasources v1 is that it's
> essentially impossible to leverage a bunch of Spark features - I don't get
> to use bucketing or row batches or all the nic
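
A minimal sketch of the "offer everything at once and let the source decide" contract argued for above. The trait and method names are hypothetical and only meant to make the shape of the idea concrete; only the imported Filter type is a real Spark class.

    import org.apache.spark.sql.sources.Filter

    // Hypothetical contract: Spark hands over the complete candidate sets in
    // single calls instead of probing combinations one by one; the source,
    // which knows its own layout and costs, picks what it will satisfy.
    trait HandlesWholePushDown {

      // Offered every candidate filter at once; returns the subset it will
      // NOT handle, which Spark must still apply after the scan.
      def pushFilters(filters: Array[Filter]): Array[Filter]

      // Offered the requested output ordering as (columnName, ascending)
      // pairs; returns true only if the scan will already produce that order.
      def pushSortOrder(ordering: Seq[(String, Boolean)]): Boolean
    }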

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-30 Thread Joseph Bradley
Congrats!

On Aug 29, 2017 9:55 AM, "Felix Cheung"  wrote:

> Congrats!
>
> --
> *From:* Wenchen Fan 
> *Sent:* Tuesday, August 29, 2017 9:21:38 AM
> *To:* Kevin Yu
> *Cc:* Meisam Fathi; dev
> *Subject:* Re: Welcoming Saisai (Jerry) Shao as a committer
>
> Congratulations, Saisai!
>
> On 29 Aug 2017, at 10:38 PM, Kevin Yu  wrote:
>
> Congratulations, Jerry!
>
> On Tue, Aug 29, 2017 at 6:35 AM, Meisam Fathi 
> wrote:
>
>> Congratulations, Jerry!
>>
>> Thanks,
>> Meisam
>>
>> On Tue, Aug 29, 2017 at 1:13 AM Wang, Carson 
>> wrote:
>>
>>> Congratulations, Saisai!
>>>
>>>
>>> -Original Message-
>>> From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
>>> Sent: Tuesday, August 29, 2017 9:29 AM
>>> To: dev 
>>> Cc: Saisai Shao 
>>> Subject: Welcoming Saisai (Jerry) Shao as a committer
>>>
>>> Hi everyone,
>>>
>>> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai
>>> has been contributing to many areas of the project for a long time, so it’s
>>> great to see him join. Join me in thanking and congratulating him!
>>>
>>> Matei
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>


Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-30 Thread Josh Rosen
I think that's because https://issues.apache.org/jira/browse/SPARK-21728 was
re-opened in JIRA and had a new PR associated with it, so the bot did the
temporary issue re-assignment in order to be able to transition the issue
status from "reopened" to "in progress".

On Wed, Aug 30, 2017 at 1:18 PM Marcelo Vanzin  wrote:

> I'm still seeing some odd behavior.
>
> I just deleted my repo's branch for
> https://github.com/apache/spark/pull/19013 and the script seems to
> have done some update to the bug, since I got a bunch of e-mails.
>
> On Mon, Aug 28, 2017 at 2:34 PM, Josh Rosen 
> wrote:
> > This should be fixed now. The problem was that debug code had been pushed
> > while investigating the JIRA linkage failure but was not removed and this
> > problem went unnoticed because linking was failing well before the debug
> > code was hit. Once the JIRA connectivity issues were resolved, the
> > problematic code was running and causing the linking operation to fail
> > mid-way through, triggering a finally block which undid the JIRA
> assignment.
> >
> > I've rolled back the bad code and enabled additional monitoring in
> > StackDriver to raise an alert if we see new linking failures.
> >
> > On Mon, Aug 28, 2017 at 12:02 PM Marcelo Vanzin 
> wrote:
> >>
> >> It seems a little wonky, though. Feels like it's updating JIRA every
> >> time you comment on a PR. Or maybe it's still working through the
> >> backlog...
> >>
> >> On Mon, Aug 28, 2017 at 9:57 AM, Reynold Xin 
> wrote:
> >> > The process for doing that was down before, and might've come back up
> >> > and
> >> > be going through the huge backlog.
> >> >
> >> >
> >> > On Mon, Aug 28, 2017 at 6:56 PM, Sean Owen 
> wrote:
> >> >>
> >> >> Like whatever reassigns JIRAs after a PR is closed?
> >> >>
> >> >> It seems to be going crazy, or maybe there are many running. Not sure
> >> >> who
> >> >> owns that, but can he/she take a look?
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Marcelo
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
>
>
>
> --
> Marcelo
>


Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-30 Thread Marcelo Vanzin
I opened the new PR after I sent my e-mail below, so the script indeed
was doing something weird.

On Wed, Aug 30, 2017 at 6:42 PM, Josh Rosen  wrote:
> I think that's because https://issues.apache.org/jira/browse/SPARK-21728 was
> re-opened in JIRA and had a new PR associated with it, so the bot did the
> temporary issue re-assignment in order to be able to transition the issue
> status from "reopened" to "in progress".
>
> On Wed, Aug 30, 2017 at 1:18 PM Marcelo Vanzin  wrote:
>>
>> I'm still seeing some odd behavior.
>>
>> I just deleted my repo's branch for
>> https://github.com/apache/spark/pull/19013 and the script seems to
>> have done some update to the bug, since I got a bunch of e-mails.
>>
>> On Mon, Aug 28, 2017 at 2:34 PM, Josh Rosen 
>> wrote:
>> > This should be fixed now. The problem was that debug code had been
>> > pushed
>> > while investigating the JIRA linkage failure but was not removed and
>> > this
>> > problem went unnoticed because linking was failing well before the debug
>> > code was hit. Once the JIRA connectivity issues were resolved, the
>> > problematic code was running and causing the linking operation to fail
>> > mid-way through, triggering a finally block which undid the JIRA
>> > assignment.
>> >
>> > I've rolled back the bad code and enabled additional monitoring in
>> > StackDriver to raise an alert if we see new linking failures.
>> >
>> > On Mon, Aug 28, 2017 at 12:02 PM Marcelo Vanzin 
>> > wrote:
>> >>
>> >> It seems a little wonky, though. Feels like it's updating JIRA every
>> >> time you comment on a PR. Or maybe it's still working through the
>> >> backlog...
>> >>
>> >> On Mon, Aug 28, 2017 at 9:57 AM, Reynold Xin 
>> >> wrote:
>> >> > The process for doing that was down before, and might've come back up
>> >> > and
>> >> > be going through the huge backlog.
>> >> >
>> >> >
>> >> > On Mon, Aug 28, 2017 at 6:56 PM, Sean Owen 
>> >> > wrote:
>> >> >>
>> >> >> Like whatever reassigns JIRAs after a PR is closed?
>> >> >>
>> >> >> It seems to be going crazy, or maybe there are many running. Not
>> >> >> sure
>> >> >> who
>> >> >> owns that, but can he/she take a look?
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Spark on Kubernetes

2017-08-30 Thread vaquar khan
+1 (non-binding)

Regards,
Vaquar khan

On Mon, Aug 28, 2017 at 5:09 PM, Erik Erlandson  wrote:

>
> In addition to the engineering & software aspects of the native Kubernetes
> community project, we have also worked at building out the community, with
> the goal of providing the foundation for sustaining engineering on the
> Kubernetes scheduler back-end.  That said, I agree 100% with your point
> that adding committers with kube-specific experience is good strategy for
> increasing review bandwidth to help service PRs from this community.
>
> On Mon, Aug 28, 2017 at 2:16 PM, Mark Hamstra 
> wrote:
>
>>> In my opinion, the fact that there are nearly no changes to spark-core,
>>> and most of our changes are additive should go to prove that this adds
>>> little complexity to the workflow of the committers.
>>
>>
>> Actually (and somewhat perversely), the otherwise praiseworthy isolation
>> of the Kubernetes code does mean that it adds complexity to the workflow of
>> the existing Spark committers. I'll reiterate Imran's concerns: The
>> existing Spark committers familiar with Spark's scheduler code have
>> adequate knowledge of the Standalone and Yarn implementations, and still
>> not sufficient coverage of Mesos. Adding k8s code to Spark would mean that
>> the progression of that code would start seeing the issues that the Mesos
>> code in Spark currently sees: Reviews and commits tend to languish because
>> we don't have currently active committers with sufficient knowledge and
>> cycles to deal with the Mesos PRs. Some of this is because the PMC needs to
>> get back to addressing the issue of adding new Spark committers who do have
>> the needed Mesos skills, but that isn't as simple as we'd like because
>> ideally a Spark committer has demonstrated skills across a significant
>> portion of the Spark code, not just tightly focused on one area (such as
>> Mesos or k8s integration.) In short, adding Kubernetes support directly
>> into Spark isn't likely (at least in the short-term) to be entirely
>> positive for the spark-on-k8s project, since merging of PRs to the
>> spark-on-k8s project is very likely to be quite slow at least until such time as we
>> have k8s-focused Spark committers. If this project does end up getting
>> pulled into the Spark codebase, then the PMC will need to start looking at
>> bringing in one or more new committers who meet our requirements for such a
>> role and responsibility, and who also have k8s skills. The success and pace
>> of development of the spark-on-k8s will depend in large measure on the
>> PMC's ability to find such new committers.
>>
>> All that said, I'm +1 if the those currently responsible for the
>> spark-on-k8s project still want to bring the code into Spark.
>>
>>
>> On Mon, Aug 21, 2017 at 11:48 AM, Anirudh Ramanathan <
>> ramanath...@google.com.invalid> wrote:
>>
>>> Thank you for your comments Imran.
>>>
>>> Regarding integration tests,
>>>
>>> What you inferred from the documentation is correct -
>>> Integration tests do not require any prior setup or a Kubernetes cluster
>>> to run. Minikube is a single binary that brings up a one-node cluster and
>>> exposes the full Kubernetes API. It is actively maintained and kept up to
>>> date with the rest of the project. These local integration tests on Jenkins
>>> (like the ones with spark-on-yarn) should allow the committers to
>>> merge changes with a high degree of confidence.
>>> I will update the proposal to include more information about the extent
>>> and kinds of testing we do.
>>>
>>> As for (b), people on this thread and the set of contributors on our
>>> fork are a fairly wide community of contributors and committers who would
>>> be involved in the maintenance long-term. It was one of the reasons behind
>>> developing separately as a fork. In my opinion, the fact that there are
>>> nearly no changes to spark-core, and most of our changes are additive
>>> should go to prove that this adds little complexity to the workflow of the
>>> committers.
>>>
>>> Separating out the cluster managers (into an as yet undecided new home)
>>> appears far more disruptive and a high risk change for the short term.
>>> However, when there is enough community support behind that effort, tracked
>>> in SPARK-19700; and if
>>> that is realized in the future, it wouldn't be difficult to switch over
>>> Kubernetes, YARN and Mesos to using the pluggable API. Currently, in my
>>> opinion, with the integration tests, active users, and a community of
>>> maintainers, Spark-on-Kubernetes would add minimal overhead and benefit a
>>> large (and growing) class of users.
>>>
>>> Lastly, the RSS is indeed separate and a value-add that we would love to
>>> share with other cluster managers as well.
>>>
>>> On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid 
>>> wrote:
>>>
 Overall this looks like a good proposal.  I do have some concerns which
 I'd like to discuss -- please understand I'm t

Re: Time window on Processing Time

2017-08-30 Thread madhu phatak
Hi,
That's great. Thanks a lot.

On Wed, Aug 30, 2017 at 10:44 AM, Tathagata Das  wrote:

> Yes, it can be! There is a sql function called current_timestamp() which
> is self-explanatory. So I believe you should be able to do something like
>
> import org.apache.spark.sql.functions._
>
> ds.withColumn("processingTime", current_timestamp())
>   .groupBy(window(col("processingTime"), "1 minute"))
>   .count()
>
>
> On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak 
> wrote:
>
>> Hi,
>> As I am playing with structured streaming, I observed that window
>> function always requires a time column in input data.So that means it's
>> event time.
>>
>> Is it possible to old spark streaming style window function based on
>> processing time. I don't see any documentation on the same.
>>
>> --
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/
>>
>
>


-- 
Regards,
Madhukara Phatak
http://datamantra.io/
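
For anyone landing on this thread later, here is a self-contained sketch of the suggestion above, assuming Spark 2.2+. The built-in "rate" source and the local master are used purely for illustration; any streaming source works.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, current_timestamp, window}

    object ProcessingTimeWindowExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]")
          .appName("processing-time-window")
          .getOrCreate()

        // Any streaming source works; the built-in "rate" source is handy for a demo.
        val stream = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "10")
          .load()

        // Stamp each row with the wall-clock time at which it is processed,
        // then window on that column instead of an event-time column.
        val counts = stream
          .withColumn("processingTime", current_timestamp())
          .groupBy(window(col("processingTime"), "1 minute"))
          .count()

        counts.writeStream
          .outputMode("complete")
          .format("console")
          .option("truncate", "false")
          .start()
          .awaitTermination()
      }
    }

Since the window key is assigned from the wall clock at processing time, the results reflect arrival time rather than event time.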


Re: SPIP: Spark on Kubernetes

2017-08-30 Thread Reynold Xin
This has passed, hasn't it?

On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan 
wrote:

> The Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend.
> We're ~6 months in, and have had 5 releases.
>
>- 2 Spark versions maintained (2.1, and 2.2)
>- Extensive integration testing and refactoring efforts to maintain
>  code quality
>- Developer and user-facing documentation
>- 10+ consistent code contributors from different organizations involved
>  in actively maintaining and using the project, with several more members
>  involved in testing and providing feedback.
>- The community has delivered several talks on Spark-on-Kubernetes
>  generating lots of feedback from users.
>- In addition to these, we've seen efforts spawn off such as:
>  - HDFS on Kubernetes with Locality and Performance Experiments
>  - Kerberized access to HDFS from Spark running on Kubernetes
>
> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>
>- +1: Yeah, let's go forward and implement the SPIP.
>- +0: Don't really care.
>- -1: I don't think this is a good idea because of the following
>technical reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
> Full Design Doc: link
> 
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
> Background and Motivation
>
> Containerization and cluster management technologies are constantly
> evolving in the cluster computing world. Apache Spark currently implements
> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
> its own standalone cluster manager. In 2014, Google announced development
> of Kubernetes, which has its own unique feature
> set and differentiates itself from YARN and Mesos. Since its debut, it has
> seen contributions from over 1300 contributors with over 5 commits.
> Kubernetes has cemented itself as a core player in the cluster computing
> world, and cloud-computing providers such as Google Container Engine,
> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
> running Kubernetes clusters.
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes,
> there are still major advantages and significant interest in having native
> execution support. For example, this integration provides better support
> for multi-tenancy and dynamic resource allocation. It also allows users to
> run applications of different Spark versions of their choices in the same
> cluster.
>
> The feature is being developed in a separate fork
>  in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use
> cases, we have seen a cluster setup that uses 1000+ cores. We are also seeing
> growing interest in this project from more and more organizations.
>
> While it is easy to bootstrap the project in a forked repository, it is
> hard to maintain it in the long run because of the tricky process of
> rebasing onto the upstream and lack of awaren