Copying from the code review comments I just submitted on the draft API (https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):
Context here is that I've spent some time implementing a Spark datasource and have had some issues with the current API, which are made worse in V2. The general conclusion I've come to is that this is very hard to actually implement (in a similar but more aggressive way than DataSource V1, because of the extra methods and dimensions we get in V2).

In DataSource V1's PrunedFilteredScan, the issue is that you are passed the filters in the buildScan method, and then passed them again in the unhandledFilters method. However, which filters you can't handle might be data dependent, and the current API does not handle that well. Suppose I can handle filter A some of the time, and filter B some of the time. If I'm passed both, then either A and B are both unhandled, or only A, or only B, or neither. The work I have to do to figure this out is essentially the same work I have to do while actually generating my RDD (essentially I have to generate my partitions), so I end up doing some weird caching work; see the sketch below.

This V2 API proposal has the same issues, but perhaps more so. In PrunedFilteredScan there is essentially one degree of freedom for pruning (filters), so you just have to implement caching between unhandledFilters and buildScan. Here, however, we have many degrees of freedom: sorts, individual filters, clustering, sampling, maybe aggregations eventually. These operations are not all commutative, and computing my support one by one can easily end up being more expensive than computing it all in one go. Some trivial examples:

- After filtering I might be sorted, whilst before filtering I might not be.
- Pushing down certain filters might affect my ability to push down others.
- Filtering combined with aggregations (as mooted) might not be pushable at all.

And with the API as currently mooted, I need to be able to go back and change my results, because they might change later.

Really, what would be good here is to be passed all of the filters, sorts, etc. at once, and then return the parts I can't handle. I'd prefer in general that this be implemented by passing some kind of query plan to the datasource, which enables this kind of replacement. I explicitly don't want to be given the whole query plan (that sounds painful); I'd prefer we push down only the parts of the query plan we deem to be stable. With the mix-in approach, I don't think we can guarantee the properties we want without a two-phase negotiation. I'd really love to be able to just define a straightforward union type of the supported pushdown operations, which the user can transform and return. I think this ends up being a more elegant API for consumers, and also far more intuitive. A sketch of what I mean follows below.
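To make the V1 problem concrete, here is a rough sketch of the caching workaround I ended up with. Only BaseRelation, PrunedFilteredScan and Filter are real Spark interfaces; MyRelation, PushdownPlan and its compute helper are hypothetical stand-ins for my datasource's partition planning:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in for my datasource's partition planning result.
trait PushdownPlan {
  def handled: Filter => Boolean
  def toRDD(requiredColumns: Array[String]): RDD[Row]
}

object PushdownPlan {
  // Enumerates partitions to decide, per filter, whether we can evaluate
  // it. Datasource specific and expensive, which is the whole problem.
  def compute(filters: Array[Filter]): PushdownPlan = ???
}

class MyRelation(
    override val sqlContext: SQLContext,
    override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  // Memoised plan keyed by the filter set, because unhandledFilters and
  // buildScan are separate calls that need the same expensive answer.
  private var cached: Option[(Set[Filter], PushdownPlan)] = None

  private def plan(filters: Array[Filter]): PushdownPlan = synchronized {
    cached match {
      case Some((key, p)) if key == filters.toSet => p
      case _ =>
        val p = PushdownPlan.compute(filters)
        cached = Some((filters.toSet, p))
        p
    }
  }

  // Which of these filters can I evaluate? Data dependent, so must plan.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(plan(filters).handled)

  // Called separately with (roughly) the same filters; reuse the plan.
  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] =
    plan(filters).toRDD(requiredColumns)
}
```

The point is that plan() is the expensive, data-dependent step, and the API forces me to answer the same question twice, hence the memoisation keyed on the filter set.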
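And here is a rough sketch of the alternative shape I'm proposing. None of these names exist in Spark or in the V2 proposal; this is just the union-type idea made concrete: one type over the supported pushdowns, one call that sees everything at once, and the residual comes back:

```scala
import org.apache.spark.sql.sources.Filter

// Hypothetical union type over the pushdowns Spark is willing to offer.
sealed trait PushDownOp
case class PushFilter(filter: Filter) extends PushDownOp
case class PushSort(sortColumns: Seq[String]) extends PushDownOp
case class PushSample(fraction: Double) extends PushDownOp
case class PushClustering(columns: Seq[String]) extends PushDownOp

trait SupportsHolisticPushDown {
  // Spark offers every candidate operation in one call; the source keeps
  // what it can handle (considering interactions between the ops) and
  // returns the residual operations Spark must still apply on top.
  def pushDown(ops: Seq[PushDownOp]): Seq[PushDownOp]
}
```

Because the source sees the whole set in one call, it can account for interactions (e.g. a filter that preserves a pushed sort) without any out-of-band caching, and Spark keeps a trivial correctness guarantee: whatever comes back, it applies itself.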
James

On Mon, 28 Aug 2017 at 18:00, 蒋星博 <jiangxb1...@gmail.com> wrote:

+1 (Non-binding)

Xiao Li <gatorsm...@gmail.com> wrote on Mon, 28 Aug 2017 at 17:38:

+1

2017-08-28 12:45 GMT-07:00 Cody Koeninger <c...@koeninger.org>:

Just wanted to point out that because the JIRA isn't labeled SPIP, it won't have shown up linked from http://spark.apache.org/improvement-proposals.html

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi all,
>
> It has been almost 2 weeks since I proposed Data Source V2 for
> discussion, and we have already got some feedback on the JIRA ticket and
> the prototype PR, so I'd like to call for a vote.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> Note that this vote should focus on the high-level design/framework, not
> specific APIs, as we can always change/improve specific APIs during
> development.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!