Copying from the code review comments I just submitted on the draft API 
(https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):

Context here is that I've spent some time implementing a Spark datasource and 
have had some issues with the current API which are made worse in V2.

The general conclusion I’ve come to here is that this is very hard to actually 
implement (for the same reasons as DataSource V1, but more so, because of 
the extra methods and dimensions we get in V2).

In DataSource V1's PrunedFilteredScan, the issue is that the same filters are 
passed to you twice: once via the buildScan method, and once via the 
unhandledFilters method.

However, the filters that you can’t handle might be data dependent, which the 
current API does not handle well. Suppose I can handle filter A some of the 
time, and filter B some of the time. If I’m passed both, then the unhandled 
set could be both A and B, just A, just B, or neither. The work I have to do 
to figure this out is essentially the same work I have to do when actually 
generating my RDD (I have to generate my partitions), so I end up caching the 
result of one call for use by the other, which is awkward.
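
To make the caching problem concrete, here is a minimal sketch. It does not 
use the real Spark classes: Filter, EqualTo, GreaterThan, Plan and MyRelation 
are simplified stand-ins invented for illustration, and the rule that only 
equality filters on a "date" column can be handled is an assumption. The 
point is only that both entry points need the same expensive planning step, 
so the relation memoises it.

```scala
import scala.collection.mutable

object V1CachingSketch {
  // Simplified stand-ins for Spark's filter classes (not the real API).
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Any) extends Filter

  // Hypothetical planning result: the partitions to read, plus the filters
  // that could not be handled for this particular data layout.
  final case class Plan(partitions: Seq[String], unhandled: Seq[Filter])

  class MyRelation {
    // Counts how often the expensive planning step actually runs.
    var planningRuns = 0
    private val cache = mutable.Map.empty[Seq[Filter], Plan]

    // Assumption for illustration: only equality on "date" can be answered
    // from partition metadata; everything else is left to the engine.
    private def plan(filters: Seq[Filter]): Plan =
      cache.getOrElseUpdate(filters, {
        planningRuns += 1
        val (handled, unhandled) = filters.partition {
          case EqualTo("date", _) => true
          case _                  => false
        }
        val partitions = handled.collect { case EqualTo(_, v) => s"date=$v" }
        Plan(partitions, unhandled)
      })

    // Both entry points receive the same filters and need the same plan.
    def unhandledFilters(filters: Seq[Filter]): Seq[Filter] =
      plan(filters).unhandled

    def buildScan(filters: Seq[Filter]): Seq[String] =
      plan(filters).partitions
  }
}
```

Without the cache, the planning step would run once per call, which is the 
duplicated work described above.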

This V2 API proposal has the same issues, but perhaps more so. In 
PrunedFilteredScan, there is essentially one degree of freedom for pruning 
(filters), so you just have to implement caching between unhandledFilters and 
buildScan. Here, however, we have many degrees of freedom: sorts, individual 
filters, clustering, sampling, maybe aggregations eventually. These 
operations are not all commutative, and computing my support for them one by 
one can easily end up being more expensive than computing it all in one go.

For some trivial examples:

- After filtering, I might be sorted, whilst before filtering I might not be.

- Filtering with certain filters might affect my ability to push down others.

- Filtering with aggregations (as mooted) might not be possible to push down.

And with the API as currently mooted, I need to be able to go back and revise 
answers I have already given, because a later call might invalidate them.
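
The first example above can be sketched as follows. This is not the proposed 
V2 interface: IncrementalSource and its methods are hypothetical, and the 
storage layout (rows sorted by "ts" only within a single "date" partition) is 
an assumption chosen to show the order dependence.

```scala
object OrderDependenceSketch {
  // Hypothetical source that is asked about one operation at a time.
  // Assumed layout: rows are sorted by "ts" inside each "date" partition,
  // so a sort on "ts" is only free once a single date has been selected.
  class IncrementalSource {
    private var date: Option[String] = None
    private var sortRequested = false

    // Returns true if the source handles the filter.
    def pushDateEquals(d: String): Boolean = {
      date = Some(d)
      true
    }

    // Returns whether the sort can be handled given what is known so far.
    def pushSortByTs(): Boolean = {
      sortRequested = true
      date.isDefined
    }

    // The answer once every operation has been seen.
    def finalSortHandled: Boolean = sortRequested && date.isDefined
  }
}
```

If the sort is offered before the filter, pushSortByTs() answers false, yet 
finalSortHandled is true once the filter arrives, so the earlier answer has 
to be revised.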

Really what would be good here is to be passed all of the filters, sorts, 
etc. at once, and then return the parts I can’t handle.

I’d prefer in general that this be implemented by passing some kind of query 
plan to the datasource which enables this kind of replacement. I explicitly 
don’t want to give it the whole query plan - that sounds painful - and would 
prefer we push down only the parts of the query plan we deem to be stable. 
With the mix-in approach, I don’t think we can guarantee the properties we 
want without a two-phase thing - I’d really love to be able to just define a 
straightforward union type covering our supported pushdown operations, which 
the user can then transform and return.
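
As a rough sketch of what such a union type could look like (every name here 
is hypothetical and none of it comes from the draft API), the engine could 
hand the source the full list of operations at once and get back what was 
accepted and what remains. The layout assumption is again that data is 
partitioned by "date" and sorted by "ts" within a partition.

```scala
object PushdownSketch {
  // Hypothetical closed set of operations the engine is willing to push down.
  sealed trait Pushdown
  final case class Filter(column: String, value: String) extends Pushdown
  final case class Sort(column: String) extends Pushdown
  final case class Sample(fraction: Double) extends Pushdown

  // What the source hands back: the operations it takes on, and the rest,
  // which the engine still has to apply itself.
  final case class Negotiated(accepted: Seq[Pushdown], remaining: Seq[Pushdown])

  // The source sees everything in one call, so its decisions can depend on
  // each other without caching or revising earlier answers.
  def negotiate(requested: Seq[Pushdown]): Negotiated = {
    val singleDate = requested.exists {
      case Filter("date", _) => true
      case _                 => false
    }
    val (accepted, remaining) = requested.partition {
      case Filter("date", _) => true
      case Sort("ts")        => singleDate // free only within one partition
      case _                 => false
    }
    Negotiated(accepted, remaining)
  }
}
```

For example, negotiate(Seq(Sort("ts"), Filter("date", "2017-08-28"), 
Sample(0.1))) accepts the sort and the filter regardless of the order in 
which they are listed, and returns only the sample as remaining.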

I think this ends up being a more elegant API for consumers, and also far more 
intuitive.

James

On Mon, 28 Aug 2017 at 18:00 蒋星博 
<jiangxb1...@gmail.com<mailto:jiangxb1...@gmail.com>> wrote:
+1 (Non-binding)

On Mon, 28 Aug 2017 at 5:38 PM, Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
+1

2017-08-28 12:45 GMT-07:00 Cody Koeninger 
<c...@koeninger.org<mailto:c...@koeninger.org>>:
Just wanted to point out that because the jira isn't labeled SPIP, it
won't have shown up linked from

http://spark.apache.org/improvement-proposals.html

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan 
<cloud0...@gmail.com<mailto:cloud0...@gmail.com>> wrote:
> Hi all,
>
> It has been almost 2 weeks since I proposed the data source V2 for
> discussion, and we have already received some feedback on the JIRA ticket
> and the prototype PR, so I'd like to call for a vote.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> Note that this vote should focus on the high-level design/framework, not
> specific APIs, as we can always change/improve specific APIs during
> development.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!


