Hi all, I've submitted a PR for a basic data source v2, i.e., one that only contains features we already have in data source v1. We can discuss API details like naming in that PR: https://github.com/apache/spark/pull/19136
In the meantime, let's keep this vote open and collect more feedback. Thanks

On Fri, Sep 1, 2017 at 5:56 PM, Reynold Xin <r...@databricks.com> wrote:

Why does ordering matter here for sort vs filter? The source should be able to handle it in whatever way it wants (which is almost always filter beneath sort, I'd imagine).

The only ordering that'd matter in the current set of pushdowns is limit - it should always mean the root of the pushed-down tree.

On Fri, Sep 1, 2017 at 3:22 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Ideally also getting sort orders _after_ getting filters.

Yea, we should have a deterministic order when applying the various push downs, and I think filter should definitely go before sort. This is one of the details we can discuss during PR review :)

On Fri, Sep 1, 2017 at 9:19 AM, James Baker <j.ba...@outlook.com> wrote:

I think that makes sense. I didn't understand that backcompat was the primary driver. I actually don't care right now about aggregations on the datasource I'm integrating with - I just care about receiving all the filters (and ideally also the desired sort order) at the same time. I am mostly fine with anything else; but getting the filters at the same time is important for me, and doesn't seem overly contentious? (e.g. it's compatible with datasources v1). Ideally also getting sort orders _after_ getting filters.

That said, an unstable API that gets me the query plan would be appreciated by plenty, I'm sure :) (and would make my implementation more straightforward - the state management is painful atm).

James

On Wed, 30 Aug 2017 at 14:56 Reynold Xin <r...@databricks.com> wrote:

Sure, that's good to do (and as discussed earlier a good compromise might be to expose an interface for the source to decide which part of the logical plan it wants to accept).

To me everything is about cost vs benefit.

In my mind, the biggest issue with the existing data source API is backward and forward compatibility. All the data sources written for Spark 1.x broke in Spark 2.x, and that's one of the biggest values v2 can bring. To me it's far more important to have data sources implemented in 2017 be able to work in 2027, in Spark 10.x.

You are basically arguing for creating a new API that is capable of doing arbitrary expression, aggregation, and join pushdowns (you only mentioned aggregation so far, but I've talked to enough database people that I know once Spark gives them aggregation pushdown, they will come back for join pushdown). We can do that using unstable APIs, and creating stable APIs would be extremely difficult (still doable, it would just take a long time to design and implement). As mentioned earlier, it basically involves creating a stable representation for all of the logical plan, which is a lot of work. I think we should still work towards that (for other reasons as well), but I'd consider it out of scope for the current one. Otherwise we'd probably not release something for the next 2 or 3 years.
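For concreteness, the kind of unstable, plan-based hook being discussed might look roughly like the sketch below (the interface and method names are invented for illustration and are not part of the proposal):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;

    // Hypothetical sketch only. A data source implementing this mix-in is
    // handed the fragment of the (internal, hence unstable) Catalyst logical
    // plan sitting on top of its scan, and returns whatever part of it it
    // could not handle; Spark evaluates the returned remainder itself.
    public interface SupportsPlanPushDown {
      LogicalPlan pushPlan(LogicalPlan planAboveScan);
    }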
On Wed, Aug 30, 2017 at 11:50 PM, James Baker <j.ba...@outlook.com> wrote:

I guess I was more suggesting that by coding up the powerful mode as the API, it becomes easy for someone to layer an easy mode beneath it to enable simpler datasources to be integrated (and that simple mode should be the out-of-scope thing).

Taking a small step back here, one of the places where I think I'm missing some context is in understanding the target consumers of these interfaces. I've done some amount (though likely not enough) of research about the places where people have had issues with API surface in the past - the concrete tickets I've seen have been based on Cassandra integration, where you want to indicate clustering, and SAP HANA, where they want to push down more complicated queries through Spark. This proposal supports the former, but the amount of change required to support clustering in the current API is not obviously high - whilst the current proposal for V2 seems to make it very difficult to add support for pushing down plenty of aggregations in the future (I've found the question of how to add GROUP BY pretty tricky to answer for the current proposal).

Googling around for implementations of the current PrunedFilteredScan, I basically find a lot of databases, which seems reasonable - SAP HANA, ElasticSearch, Solr, MongoDB, Apache Phoenix, etc. I've talked to people who've used (some of) these connectors, and the sticking point has generally been that Spark needs to load a lot of data out in order to compute aggregations that could be pushed down into the datasources very efficiently.

So, with this proposal it appears that we're optimising towards making it easy to write one-off datasource integrations, with some amount of pluggability for people who want to do more complicated things (the most interesting being bucketing integration). However, my guess is that this isn't what the current major integrations suffer from; they suffer mostly from restrictions in what they can push down (which broadly speaking are not going to go away).

So the place where I'm confused is that the current integrations can be made incrementally better as a consequence of this, but the backing data systems have the features which enable a step change, which this API makes harder to achieve in the future. Who are the group of users who benefit the most as a consequence of this change - who is the target consumer here? My personal slant is that it's more important to improve support for other datastores than it is to lower the barrier of entry - this is why I've been pushing here.

James

On Wed, 30 Aug 2017 at 09:37 Ryan Blue <rb...@netflix.com> wrote:

-1 (non-binding)

Sometimes it takes a VOTE thread to get people to actually read and comment, so thanks for starting this one… but there's still discussion happening on the prototype API, which hasn't been updated yet. I'd like to see the proposal shaped by the ongoing discussion so that we have a better, more concrete plan. I think that's going to produce a better SPIP.
The second reason for -1 is that I think the read- and write-side proposals should be separated. The PR <https://github.com/cloud-fan/spark/pull/10> currently has "write path" listed as a TODO item, and most of the discussion I've seen is on the read side. I think it would be better to separate the read and write APIs so we can focus on them individually.

An example of why we should focus on the write path separately is that the proposal says this:

> Ideally partitioning/bucketing concept should not be exposed in the Data Source API V2, because they are just techniques for data skipping and pre-partitioning. However, these 2 concepts are already widely used in Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION. To be consistent, we need to add partitioning/bucketing to Data Source V2 . . .

Essentially, some of the APIs mix DDL and DML operations. I'd like to consider ways to fix that problem instead of carrying the problem forward to Data Source V2. We can solve this by adding a high-level API for DDL and a better write/insert API that works well with it. Clearly, that discussion is independent of the read path, which is why I think separating the two proposals would be a win.

rb

On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin <r...@databricks.com> wrote:

That might be good to do, but it seems orthogonal to this effort itself. It would be a completely different interface.

On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan <cloud0...@gmail.com> wrote:

OK, I agree with that. How about we add a new interface to push down the query plan, based on the current framework? We can mark the query-plan-push-down interface as unstable, to save the effort of designing a stable representation of the query plan and maintaining forward compatibility.

On Wed, Aug 30, 2017 at 10:53 AM, James Baker <j.ba...@outlook.com> wrote:

I'll just focus on the one-by-one thing for now - it's the thing that blocks me the most.

I think the place where we're most confused here is on the cost of determining whether I can push down a filter. For me, in order to work out whether I can push down a filter or satisfy a sort, I might have to read plenty of data. That said, it's worth me doing this because I can use this information to avoid reading that much data.

If you give me all the orderings, I will have to read that data many times (we stream it to avoid keeping it in memory).

There's also the issue that our typical use cases have many filters (20+ is common). So, it's likely not going to work to pass us all the combinations. That said, if I can tell you a cost, I know what optimal looks like - so why can't I just pick that myself?

The current design is friendly to simple datasources, but does not have the potential to support this.
So the main problem we have with datasources v1 is that it's essentially impossible to leverage a bunch of Spark features - I don't get to use bucketing or row batches or all the nice things that I really want to use to get decent performance. Provided I can leverage these in a moderately supported way which won't break in any given commit, I'll be pretty happy with anything that lets me opt out of the restrictions.

My suggestion here is that if you make a mode which works well for complicated use cases, you end up being able to write the simple mode in terms of it very easily. So we could actually provide two APIs: one that lets people who have more interesting datasources leverage the cool Spark features, and one that lets people who just want to implement basic features do that - I'd try to include some kind of layering here. I could probably sketch out something here if that'd be useful?

James

On Tue, 29 Aug 2017 at 18:59 Wenchen Fan <cloud0...@gmail.com> wrote:

Hi James,

Thanks for your feedback! I think your concerns are all valid, but we need to make a tradeoff here.

> Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments

The problems with this approach are: 1) if we want to add more arguments in the future, it's really hard to do without changing the existing interface; 2) if a user wants to implement a very simple data source, he has to look at all the arguments and understand them, which may be a burden for him. I don't have a solution to these 2 problems; comments are welcome.

> There are loads of cases like this - you can imagine someone being able to push down a sort before a filter is applied, but not afterwards. However, maybe the filter is so selective that it's better to push down the filter and not handle the sort. I don't get to make this decision, Spark does (but doesn't have good enough information to do it properly, whilst I do). I want to be able to choose the parts I push down given knowledge of my datasource - as defined the APIs don't let me do that, they're strictly more restrictive than the V1 APIs in this way.

This is true: the current framework applies push downs one by one, incrementally. If a data source wants to go back and accept a sort push down after it accepts a filter push down, that's impossible with the current data source V2. Fortunately, we have a solution for this problem. On the Spark side, we actually do have a fully specified set of arguments waiting to be pushed down, but Spark doesn't know which is the best order to push them into the data source. Spark can try every combination and ask the data source to report a cost, then Spark can pick the combination with the lowest cost.
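A rough sketch of what such a cost report could look like (all interface and method names below are invented for illustration and are not from the actual proposal):

    import java.util.List;
    import org.apache.spark.sql.sources.Filter;

    // Hypothetical sketch of the cost-report idea described above. Spark would
    // call reportCost(...) once per candidate combination of pushdowns, then
    // commit to the cheapest combination via push(...).
    public interface SupportsCostReport {
      // Estimated cost of scanning with this combination applied (lower is
      // better); the data source may use any internally consistent unit.
      double reportCost(List<Filter> filters, List<String> sortColumns, int limit);

      // Called once, with the combination Spark selected.
      void push(List<Filter> filters, List<String> sortColumns, int limit);
    }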
This could also be exposed as an explicit cost report interface, so that an advanced data source can implement it for optimal performance, while a simple data source doesn't need to care about it and can stay simple.

The current design is very friendly to simple data sources, and has the potential to support complex data sources, so I prefer the current design over the plan push down one. What do you think?

On Wed, Aug 30, 2017 at 5:53 AM, James Baker <j.ba...@outlook.com> wrote:

Yeah, for sure.

With the stable representation - agreed that in the general case this is pretty intractable; it restricts the modifications that you can do in the future too much. That said, it shouldn't be as hard if you restrict yourself to the parts of the plan which are supported by the datasources V2 API (which, after all, need to be translatable properly into the future to support the mixins proposed). This should have a pretty small scope in comparison. As long as the user can bail out of nodes they don't understand, they should be ok, right?

That said, what would also be fine for us is a place to plug into an unstable query plan.

Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments (of which I can choose to ignore some), and return the information as to which of them I'm ignoring. Taking a query plan of sorts is a way of doing this which IMO is intuitive to the user. It also provides a convenient location to plug in things like stats. Not at all married to the idea of using a query plan here; it just seemed convenient.

Regarding the users who just want to be able to pump data into Spark, my understanding is that replacing isolated nodes in a query plan is easy. That said, our goal here is to be able to push down as much as possible into the underlying datastore.

To your second question:

The issue is that if you build up pushdowns incrementally and not all at once, you end up having to reject pushdowns and filters that you actually can do, which unnecessarily increases overheads.

For example, the dataset

a b c
1 2 3
1 3 3
1 3 4
2 1 1
2 0 1

can efficiently push down sort(b, c) if I have already applied the filter a = 1, but otherwise will force a sort in Spark. On the PR I detail a case where I can push down two equality filters iff I am given them at the same time, whilst not being able to push down either one at a time.

There are loads of cases like this - you can imagine someone being able to push down a sort before a filter is applied, but not afterwards.
However, maybe the filter is so selective that it's better to push down the filter and not handle the sort. I don't get to make this decision, Spark does (but doesn't have good enough information to do it properly, whilst I do). I want to be able to choose the parts I push down given knowledge of my datasource - as defined, the APIs don't let me do that; they're strictly more restrictive than the V1 APIs in this way.

The pattern of not considering things that can be done in bulk bites us in other ways. The retrieval methods end up being trickier to implement than is necessary, because frequently a single operation provides the result of many of the getters, but the state is mutable, so you end up with odd caches.

For example, the work I need to do to answer unhandledFilters in V1 is roughly the same as the work I need to do in buildScan, so I want to cache it. This means that I end up with code that looks like:

    public final class CachingFoo implements Foo {
        private final Foo delegate;

        private List<Filter> currentFilters = emptyList();
        private Supplier<Bar> barSupplier = newSupplier(currentFilters);

        public CachingFoo(Foo delegate) {
            this.delegate = delegate;
        }

        private Supplier<Bar> newSupplier(List<Filter> filters) {
            return Suppliers.memoize(() -> delegate.computeBar(filters));
        }

        @Override
        public Bar computeBar(List<Filter> filters) {
            if (!filters.equals(currentFilters)) {
                currentFilters = filters;
                barSupplier = newSupplier(filters);
            }

            return barSupplier.get();
        }
    }

which caches the result required by unhandledFilters on the expectation that Spark will call buildScan afterwards and get to use the result.

This kind of cache becomes more prominent, but harder to deal with, in the new APIs. As one example here, the state I will need in order to compute accurate column stats internally will likely be a subset of the work required in order to get the read tasks, tell you if I can handle filters, etc., so I'll want to cache them for reuse. However, the cached information needs to be appropriately invalidated when I add a new filter or sort order or limit, and this makes implementing the APIs harder and more error-prone.

One thing that'd be great is a defined contract for the order in which Spark calls the methods on your datasource (ideally this contract could be implied by the way the Java class structure works, but otherwise I can just throw).

James

On Tue, 29 Aug 2017 at 02:56 Reynold Xin <r...@databricks.com> wrote:

James,

Thanks for the comment.
I think you just pointed out a trade-off between expressiveness and API simplicity, compatibility, and evolvability. For maximum expressiveness, we'd want the ability to expose full query plans, and let the data source decide which part of the query plan can be pushed down.

The downsides to that (full query plan push down) are:

1. It is extremely difficult to design a stable representation for the logical / physical plan. It is doable, but we'd be the first to do it. I'm not aware of any mainstream database having been able to do that in the past. The design of that API itself, to make sure we have a good story for backward and forward compatibility, would probably take months if not years. It might still be good to do, or to offer an experimental trait without compatibility guarantees that uses the current Catalyst internal logical plan.

2. Most data source developers simply want a way to offer some data, without any pushdown. Having to understand query plans is a burden rather than a gift.

Re: your point about the proposed v2 being worse than v1 for your use case.

Can you say more? You used the argument that in v2 there is more support for broader pushdown and as a result it is harder to implement. That's how it is supposed to be. If a data source simply implements one of the traits, it'd be logically identical to v1. I don't see why it would be worse or better, other than that v2 provides much stronger forward compatibility guarantees than v1.

On Tue, Aug 29, 2017 at 4:54 AM, James Baker <j.ba...@outlook.com> wrote:

Copying from the code review comments I just submitted on the draft API (https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):

Context here is that I've spent some time implementing a Spark datasource and have had some issues with the current API which are made worse in V2.

The general conclusion I've come to here is that this is very hard to actually implement (in a similar but more aggressive way than DataSource V1, because of the extra methods and dimensions we get in V2).

In DataSources V1 PrunedFilteredScan, the issue is that you are passed in the filters with the buildScan method, and then passed them in again with the unhandledFilters method.

However, the filters that you can't handle might be data dependent, which the current API does not handle well. Suppose I can handle filter A some of the time, and filter B some of the time. If I'm passed in both, then either A and B are unhandled, or A, or B, or neither.
The work I have to do to work this out is essentially the same as the work I have to do while actually generating my RDD (essentially I have to generate my partitions), so I end up doing some weird caching work.

This V2 API proposal has the same issues, but perhaps more so. In PrunedFilteredScan, there is essentially one degree of freedom for pruning (filters), so you just have to implement caching between unhandledFilters and buildScan. However, here we have many degrees of freedom: sorts, individual filters, clustering, sampling, maybe aggregations eventually - and these operations are not all commutative, and computing my support one-by-one can easily end up being more expensive than computing it all in one go.

For some trivial examples:

- After filtering, I might be sorted, whilst before filtering I might not be.

- Filtering with certain filters might affect my ability to push down others.

- Filtering with aggregations (as mooted) might not be possible to push down.

And with the API as currently mooted, I need to be able to go back and change my results because they might change later.

Really what would be good here is to pass all of the filters and sorts etc. all at once, and then I return the parts I can't handle.

I'd prefer in general that this be implemented by passing some kind of query plan to the datasource which enables this kind of replacement. Explicitly, I don't want to give the whole query plan - that sounds painful - I would prefer we push down only the parts of the query plan we deem to be stable. With the mix-in approach, I don't think we can guarantee the properties we want without a two-phase thing - I'd really love to be able to just define a straightforward union type which is our supported pushdown stuff, and then the user can transform and return it.

I think this ends up being a more elegant API for consumers, and also far more intuitive.
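A rough sketch of the kind of shape being described here (all names below are invented for illustration and are not from the proposal or the PR):

    import java.util.List;
    import java.util.OptionalInt;
    import org.apache.spark.sql.sources.Filter;

    // Hypothetical sketch of a "union type of supported pushdowns": Spark
    // hands the data source everything at once, and the data source returns
    // the subset it could NOT handle for Spark to evaluate itself.
    public interface SupportsBulkPushdown {

      // The full bundle of pushdowns Spark supports for this scan.
      final class Pushdowns {
        public final List<Filter> filters;
        public final List<String> sortOrder;  // column names, highest priority first
        public final OptionalInt limit;

        public Pushdowns(List<Filter> filters, List<String> sortOrder, OptionalInt limit) {
          this.filters = filters;
          this.sortOrder = sortOrder;
          this.limit = limit;
        }
      }

      // Inspect the whole bundle, keep what you can handle, return the rest.
      Pushdowns pushAll(Pushdowns requested);
    }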
James

On Mon, 28 Aug 2017 at 18:00 蒋星博 <jiangxb1...@gmail.com> wrote:

+1 (Non-binding)

Xiao Li <gatorsm...@gmail.com> wrote on Monday, August 28, 2017, at 5:38 PM:

+1

2017-08-28 12:45 GMT-07:00 Cody Koeninger <c...@koeninger.org>:

Just wanted to point out that because the jira isn't labeled SPIP, it won't have shown up linked from

http://spark.apache.org/improvement-proposals.html

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi all,

It has been almost 2 weeks since I proposed the data source V2 for discussion, and we have already got some feedback on the JIRA ticket and the prototype PR, so I'd like to call for a vote.

The full document of the Data Source API V2 is:
https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit

Note that this vote should focus on the high-level design/framework, not specific APIs, as we can always change/improve specific APIs during development.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.

Thanks!

--
Ryan Blue
Software Engineer
Netflix