Hi all, I've submitted a PR for a basic data source v2, i.e., one that only contains features we already have in data source v1. We can discuss API details like naming in that PR: https://github.com/apache/spark/pull/19136
In the meantime, let's keep this vote open and collect more feedback. Thanks

On Fri, Sep 1, 2017 at 5:56 PM, Reynold Xin <r...@databricks.com> wrote:

Why does ordering matter here for sort vs filter? The source should be able to handle it in whatever way it wants (which is almost always filter beneath sort, I'd imagine).

The only ordering that'd matter in the current set of pushdowns is limit - it should always mean the root of the pushed-down tree.

On Fri, Sep 1, 2017 at 3:22 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Ideally also getting sort orders _after_ getting filters.

Yea, we should have a deterministic order when applying the various push downs, and I think filter should definitely go before sort. This is one of the details we can discuss during PR review :)

On Fri, Sep 1, 2017 at 9:19 AM, James Baker <j.ba...@outlook.com> wrote:

I think that makes sense. I didn't understand that backcompat was the primary driver. I actually don't care right now about aggregations on the datasource I'm integrating with - I just care about receiving all the filters (and ideally also the desired sort order) at the same time. I am mostly fine with anything else; but getting the filters at the same time is important for me, and doesn't seem overly contentious? (e.g. it's compatible with datasources v1). Ideally also getting sort orders _after_ getting filters.

That said, an unstable API that gets me the query plan would be appreciated by plenty, I'm sure :) (and would make my implementation more straightforward - the state management is painful atm).

James

On Wed, 30 Aug 2017 at 14:56 Reynold Xin <r...@databricks.com> wrote:

Sure, that's good to do (and as discussed earlier a good compromise might be to expose an interface for the source to decide which part of the logical plan it wants to accept).

To me everything is about cost vs benefit.

In my mind, the biggest issue with the existing data source API is backward and forward compatibility. All the data sources written for Spark 1.x broke in Spark 2.x, and that's one of the biggest values v2 can bring. To me it's far more important to have data sources implemented in 2017 be able to work in 2027, in Spark 10.x.

You are basically arguing for creating a new API that is capable of doing arbitrary expression, aggregation, and join pushdowns (you only mentioned aggregation so far, but I've talked to enough database people that I know once Spark gives them aggregation pushdown, they will come back for join pushdown). We can do that using unstable APIs, and creating stable APIs would be extremely difficult (still doable, it would just take a long time to design and implement). As mentioned earlier, it basically involves creating a stable representation for all of the logical plan, which is a lot of work. I think we should still work towards that (for other reasons as well), but I'd consider it out of scope for the current one. Otherwise we'd probably not release something for the next 2 or 3 years.
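For concreteness, the kind of unstable, plan-based hook being discussed might look roughly like the sketch below (the interface and method names are invented for illustration and are not part of the proposal):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;

    // Hypothetical sketch only. A data source implementing this mix-in is
    // handed the fragment of the (internal, hence unstable) Catalyst logical
    // plan sitting on top of its scan, and returns whatever part of it it
    // could not handle; Spark evaluates the returned remainder itself.
    public interface SupportsPlanPushDown {
      LogicalPlan pushPlan(LogicalPlan planAboveScan);
    }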
On Wed, Aug 30, 2017 at 11:50 PM, James Baker <j.ba...@outlook.com> wrote:

I guess I was more suggesting that by coding up the powerful mode as the API, it becomes easy for someone to layer an easy mode beneath it to enable simpler datasources to be integrated (and that simple mode should be the out-of-scope thing).

Taking a small step back here, one of the places where I think I'm missing some context is in understanding the target consumers of these interfaces. I've done some amount (though likely not enough) of research about the places where people have had issues with API surface in the past - the concrete tickets I've seen have been based on Cassandra integration, where you want to indicate clustering, and SAP HANA, where they want to push down more complicated queries through Spark. This proposal supports the former, but the amount of change required to support clustering in the current API is not obviously high - whilst the current proposal for V2 seems to make it very difficult to add support for pushing down plenty of aggregations in the future (I've found the question of how to add GROUP BY pretty tricky to answer for the current proposal).

Googling around for implementations of the current PrunedFilteredScan, I basically find a lot of databases, which seems reasonable - SAP HANA, ElasticSearch, Solr, MongoDB, Apache Phoenix, etc. I've talked to people who've used (some of) these connectors, and the sticking point has generally been that Spark needs to load a lot of data out in order to compute aggregations that could be pushed down into the datasources very efficiently.

So, with this proposal it appears that we're optimising towards making it easy to write one-off datasource integrations, with some amount of pluggability for people who want to do more complicated things (the most interesting being bucketing integration). However, my guess is that this isn't what the current major integrations suffer from; they suffer mostly from restrictions in what they can push down (which broadly speaking are not going to go away).

So the place where I'm confused is that the current integrations can be made incrementally better as a consequence of this, but the backing data systems have the features which enable a step change, which this API makes harder to achieve in the future. Who are the group of users who benefit the most as a consequence of this change - who is the target consumer here? My personal slant is that it's more important to improve support for other datastores than it is to lower the barrier of entry - this is why I've been pushing here.

James

On Wed, 30 Aug 2017 at 09:37 Ryan Blue <rb...@netflix.com> wrote:

-1 (non-binding)

Sometimes it takes a VOTE thread to get people to actually read and comment, so thanks for starting this one… but there's still discussion happening on the prototype API, which hasn't been updated yet. I'd like to see the proposal shaped by the ongoing discussion so that we have a better, more concrete plan. I think that's going to produce a better SPIP.
The second reason for -1 is that I think the read- and write-side proposals should be separated. The PR <https://github.com/cloud-fan/spark/pull/10> currently has "write path" listed as a TODO item, and most of the discussion I've seen is on the read side. I think it would be better to separate the read and write APIs so we can focus on them individually.

An example of why we should focus on the write path separately is that the proposal says this:

> Ideally partitioning/bucketing concept should not be exposed in the Data Source API V2, because they are just techniques for data skipping and pre-partitioning. However, these 2 concepts are already widely used in Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION. To be consistent, we need to add partitioning/bucketing to Data Source V2 . . .

Essentially, some of the APIs mix DDL and DML operations. I'd like to consider ways to fix that problem instead of carrying the problem forward to Data Source V2. We can solve this by adding a high-level API for DDL and a better write/insert API that works well with it. Clearly, that discussion is independent of the read path, which is why I think separating the two proposals would be a win.

rb

On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin <r...@databricks.com> wrote:

That might be good to do, but it seems orthogonal to this effort itself. It would be a completely different interface.

On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan <cloud0...@gmail.com> wrote:

OK, I agree with that. How about we add a new interface to push down the query plan, based on the current framework? We can mark the query-plan-push-down interface as unstable, to save the effort of designing a stable representation of the query plan and maintaining forward compatibility.

On Wed, Aug 30, 2017 at 10:53 AM, James Baker <j.ba...@outlook.com> wrote:

I'll just focus on the one-by-one thing for now - it's the thing that blocks me the most.

I think the place where we're most confused here is on the cost of determining whether I can push down a filter. For me, in order to work out whether I can push down a filter or satisfy a sort, I might have to read plenty of data. That said, it's worth me doing this because I can use this information to avoid reading that much data.

If you give me all the orderings, I will have to read that data many times (we stream it to avoid keeping it in memory).

There's also the issue that our typical use cases have many filters (20+ is common). So, it's likely not going to work to pass us all the combinations. That said, if I can tell you a cost, I know what optimal looks like - so why can't I just pick that myself?

The current design is friendly to simple datasources, but does not have the potential to support this.
So the main problem we have with datasources v1 is that it's essentially impossible to leverage a bunch of Spark features - I don't get to use bucketing or row batches or all the nice things that I really want to use to get decent performance. Provided I can leverage these in a moderately supported way which won't break in any given commit, I'll be pretty happy with anything that lets me opt out of the restrictions.

My suggestion here is that if you make a mode which works well for complicated use cases, you end up being able to write the simple mode in terms of it very easily. So we could actually provide two APIs: one that lets people who have more interesting datasources leverage the cool Spark features, and one that lets people who just want to implement basic features do that - I'd try to include some kind of layering here. I could probably sketch out something here if that'd be useful?

James

On Tue, 29 Aug 2017 at 18:59 Wenchen Fan <cloud0...@gmail.com> wrote:

Hi James,

Thanks for your feedback! I think your concerns are all valid, but we need to make a tradeoff here.

> Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments

The problems with this approach are: 1) if we want to add more arguments in the future, it's really hard to do without changing the existing interface; 2) if a user wants to implement a very simple data source, he has to look at all the arguments and understand them, which may be a burden for him. I don't have a solution to these 2 problems; comments are welcome.

> There are loads of cases like this - you can imagine someone being able to push down a sort before a filter is applied, but not afterwards. However, maybe the filter is so selective that it's better to push down the filter and not handle the sort. I don't get to make this decision, Spark does (but doesn't have good enough information to do it properly, whilst I do). I want to be able to choose the parts I push down given knowledge of my datasource - as defined the APIs don't let me do that, they're strictly more restrictive than the V1 APIs in this way.

This is true: the current framework applies push downs one by one, incrementally. If a data source wants to go back and accept a sort push down after it accepts a filter push down, that's impossible with the current data source V2. Fortunately, we have a solution for this problem. On the Spark side, we actually do have a fully specified set of arguments waiting to be pushed down, but Spark doesn't know which is the best order to push them into the data source. Spark can try every combination and ask the data source to report a cost, then Spark can pick the combination with the lowest cost.
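A rough sketch of what such a cost report could look like (all interface and method names below are invented for illustration and are not from the actual proposal):

    import java.util.List;
    import org.apache.spark.sql.sources.Filter;

    // Hypothetical sketch of the cost-report idea described above. Spark would
    // call reportCost(...) once per candidate combination of pushdowns, then
    // commit to the cheapest combination via push(...).
    public interface SupportsCostReport {
      // Estimated cost of scanning with this combination applied (lower is
      // better); the data source may use any internally consistent unit.
      double reportCost(List<Filter> filters, List<String> sortColumns, int limit);

      // Called once, with the combination Spark selected.
      void push(List<Filter> filters, List<String> sortColumns, int limit);
    }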
This could also be exposed as an explicit cost report interface, so that an advanced data source can implement it for optimal performance, while a simple data source doesn't need to care about it and can stay simple.

The current design is very friendly to simple data sources, and has the potential to support complex data sources, so I prefer the current design over the plan push down one. What do you think?

On Wed, Aug 30, 2017 at 5:53 AM, James Baker <j.ba...@outlook.com> wrote:

Yeah, for sure.

With the stable representation - agreed that in the general case this is pretty intractable; it restricts the modifications that you can do in the future too much. That said, it shouldn't be as hard if you restrict yourself to the parts of the plan which are supported by the datasources V2 API (which, after all, need to be translatable properly into the future to support the mixins proposed). This should have a pretty small scope in comparison. As long as the user can bail out of nodes they don't understand, they should be ok, right?

That said, what would also be fine for us is a place to plug into an unstable query plan.

Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments (of which I can choose to ignore some), and return the information as to which of them I'm ignoring. Taking a query plan of sorts is a way of doing this which IMO is intuitive to the user. It also provides a convenient location to plug in things like stats. Not at all married to the idea of using a query plan here; it just seemed convenient.

Regarding the users who just want to be able to pump data into Spark, my understanding is that replacing isolated nodes in a query plan is easy. That said, our goal here is to be able to push down as much as possible into the underlying datastore.

To your second question:

The issue is that if you build up pushdowns incrementally and not all at once, you end up having to reject pushdowns and filters that you actually can do, which unnecessarily increases overheads.

For example, the dataset

a b c
1 2 3
1 3 3
1 3 4
2 1 1
2 0 1

can efficiently push down sort(b, c) if I have already applied the filter a = 1, but otherwise will force a sort in Spark. On the PR I detail a case where I can push down two equality filters iff I am given them at the same time, whilst not being able to push down either one at a time.

There are loads of cases like this - you can imagine someone being able to push down a sort before a filter is applied, but not afterwards.
However, maybe the filter is so selective that it's better to push down the filter and not handle the sort. I don't get to make this decision, Spark does (but doesn't have good enough information to do it properly, whilst I do). I want to be able to choose the parts I push down given knowledge of my datasource - as defined, the APIs don't let me do that; they're strictly more restrictive than the V1 APIs in this way.

The pattern of not considering things that can be done in bulk bites us in other ways. The retrieval methods end up being trickier to implement than is necessary, because frequently a single operation provides the result of many of the getters, but the state is mutable, so you end up with odd caches.

For example, the work I need to do to answer unhandledFilters in V1 is roughly the same as the work I need to do in buildScan, so I want to cache it. This means that I end up with code that looks like:

    public final class CachingFoo implements Foo {
        private final Foo delegate;

        private List<Filter> currentFilters = emptyList();
        private Supplier<Bar> barSupplier = newSupplier(currentFilters);

        public CachingFoo(Foo delegate) {
            this.delegate = delegate;
        }

        private Supplier<Bar> newSupplier(List<Filter> filters) {
            return Suppliers.memoize(() -> delegate.computeBar(filters));
        }

        @Override
        public Bar computeBar(List<Filter> filters) {
            if (!filters.equals(currentFilters)) {
                currentFilters = filters;
                barSupplier = newSupplier(filters);
            }

            return barSupplier.get();
        }
    }

which caches the result required by unhandledFilters on the expectation that Spark will call buildScan afterwards and get to use the result.

This kind of cache becomes more prominent, but harder to deal with, in the new APIs. As one example here, the state I will need in order to compute accurate column stats internally will likely be a subset of the work required in order to get the read tasks, tell you if I can handle filters, etc., so I'll want to cache them for reuse. However, the cached information needs to be appropriately invalidated when I add a new filter or sort order or limit, and this makes implementing the APIs harder and more error-prone.

One thing that'd be great is a defined contract for the order in which Spark calls the methods on your datasource (ideally this contract could be implied by the way the Java class structure works, but otherwise I can just throw).

James

On Tue, 29 Aug 2017 at 02:56 Reynold Xin <r...@databricks.com> wrote:

James,

Thanks for the comment.
I think you just pointed out a trade-off between expressiveness and API simplicity, compatibility, and evolvability. For maximum expressiveness, we'd want the ability to expose full query plans, and let the data source decide which part of the query plan can be pushed down.

The downsides to that (full query plan push down) are:

1. It is extremely difficult to design a stable representation for the logical / physical plan. It is doable, but we'd be the first to do it. I'm not aware of any mainstream database having been able to do that in the past. The design of that API itself, to make sure we have a good story for backward and forward compatibility, would probably take months if not years. It might still be good to do, or to offer an experimental trait without compatibility guarantees that uses the current Catalyst internal logical plan.

2. Most data source developers simply want a way to offer some data, without any pushdown. Having to understand query plans is a burden rather than a gift.

Re: your point about the proposed v2 being worse than v1 for your use case.

Can you say more? You used the argument that in v2 there is more support for broader pushdown and as a result it is harder to implement. That's how it is supposed to be. If a data source simply implements one of the traits, it'd be logically identical to v1. I don't see why it would be worse or better, other than that v2 provides much stronger forward compatibility guarantees than v1.

On Tue, Aug 29, 2017 at 4:54 AM, James Baker <j.ba...@outlook.com> wrote:

Copying from the code review comments I just submitted on the draft API (https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):

Context here is that I've spent some time implementing a Spark datasource and have had some issues with the current API which are made worse in V2.

The general conclusion I've come to here is that this is very hard to actually implement (in a similar but more aggressive way than DataSource V1, because of the extra methods and dimensions we get in V2).

In DataSources V1 PrunedFilteredScan, the issue is that you are passed in the filters with the buildScan method, and then passed them in again with the unhandledFilters method.

However, the filters that you can't handle might be data dependent, which the current API does not handle well. Suppose I can handle filter A some of the time, and filter B some of the time. If I'm passed in both, then either A and B are unhandled, or A, or B, or neither.
The work I have to do to work this out is essentially the same as the work I have to do while actually generating my RDD (essentially I have to generate my partitions), so I end up doing some weird caching work.

This V2 API proposal has the same issues, but perhaps more so. In PrunedFilteredScan, there is essentially one degree of freedom for pruning (filters), so you just have to implement caching between unhandledFilters and buildScan. However, here we have many degrees of freedom: sorts, individual filters, clustering, sampling, maybe aggregations eventually - and these operations are not all commutative, and computing my support one-by-one can easily end up being more expensive than computing it all in one go.

For some trivial examples:

- After filtering, I might be sorted, whilst before filtering I might not be.

- Filtering with certain filters might affect my ability to push down others.

- Filtering with aggregations (as mooted) might not be possible to push down.

And with the API as currently mooted, I need to be able to go back and change my results because they might change later.

Really what would be good here is to pass all of the filters and sorts etc. all at once, and then I return the parts I can't handle.

I'd prefer in general that this be implemented by passing some kind of query plan to the datasource which enables this kind of replacement. Explicitly, I don't want to give the whole query plan - that sounds painful - I would prefer we push down only the parts of the query plan we deem to be stable. With the mix-in approach, I don't think we can guarantee the properties we want without a two-phase thing - I'd really love to be able to just define a straightforward union type which is our supported pushdown stuff, and then the user can transform and return it.

I think this ends up being a more elegant API for consumers, and also far more intuitive.
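A rough sketch of the kind of shape being described here (all names below are invented for illustration and are not from the proposal or the PR):

    import java.util.List;
    import java.util.OptionalInt;
    import org.apache.spark.sql.sources.Filter;

    // Hypothetical sketch of a "union type of supported pushdowns": Spark
    // hands the data source everything at once, and the data source returns
    // the subset it could NOT handle for Spark to evaluate itself.
    public interface SupportsBulkPushdown {

      // The full bundle of pushdowns Spark supports for this scan.
      final class Pushdowns {
        public final List<Filter> filters;
        public final List<String> sortOrder;  // column names, highest priority first
        public final OptionalInt limit;

        public Pushdowns(List<Filter> filters, List<String> sortOrder, OptionalInt limit) {
          this.filters = filters;
          this.sortOrder = sortOrder;
          this.limit = limit;
        }
      }

      // Inspect the whole bundle, keep what you can handle, return the rest.
      Pushdowns pushAll(Pushdowns requested);
    }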
James

On Mon, 28 Aug 2017 at 18:00 蒋星博 <jiangxb1...@gmail.com> wrote:

+1 (Non-binding)

Xiao Li <gatorsm...@gmail.com> wrote on Monday, August 28, 2017, at 5:38 PM:

+1

2017-08-28 12:45 GMT-07:00 Cody Koeninger <c...@koeninger.org>:

Just wanted to point out that because the jira isn't labeled SPIP, it won't have shown up linked from

http://spark.apache.org/improvement-proposals.html

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi all,

It has been almost 2 weeks since I proposed the data source V2 for discussion, and we have already got some feedback on the JIRA ticket and the prototype PR, so I'd like to call for a vote.

The full document of the Data Source API V2 is:
https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit

Note that this vote should focus on the high-level design/framework, not specific APIs, as we can always change/improve specific APIs during development.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.

Thanks!

--
Ryan Blue
Software Engineer
Netflix