[jira] [Commented] (SPARK-15689) Data source API v2

Russell Spitzer (JIRA) Wed, 01 Nov 2017 15:53:35 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234899#comment-16234899
 ]


Russell Spitzer commented on SPARK-15689:
-----------------------------------------

I think knowing whether a count was being performed at the time of the 
pushdown would solve this, so aggregate pushdown is probably the cleanest 
solution. 

We have a similar problem to Spark's: we can handle the pushdown efficiently if 
we know it's a count that can be handled completely by our filters, but not if 
it isn't. Unfortunately, we can see whether we can satisfy all the filters when 
"unhandled filters" is called, but at that point we don't know whether the plan 
requires any additional columns. So we would be fine if we just got the 
required "output" at that time: if we satisfy all the predicates and we know 
the output is empty, we can handle the count pushdown. Having a specific 
Aggregate pushdown is probably cleaner, though.
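
To make the condition above concrete, here is a minimal sketch in Python. It 
is not the Spark data source API: Filter, HypotheticalSource, 
unhandled_filters, and can_push_down_count are hypothetical names, and the 
rule that only equality filters on indexed columns can be handled is an 
assumption made for the example. It only shows that the decision needs both 
the filters and the required output columns at the same time.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Filter:
    """Hypothetical pushed-down predicate, e.g. column = value."""
    column: str
    op: str
    value: object


class HypotheticalSource:
    """Toy data source; none of these names come from the Spark API."""

    def __init__(self, indexed_columns: Sequence[str]):
        # Assumption: the source can fully evaluate equality filters
        # on these columns and nothing else.
        self.indexed_columns = set(indexed_columns)

    def unhandled_filters(self, filters: Sequence[Filter]) -> List[Filter]:
        """Return the filters the source cannot evaluate by itself."""
        return [
            f for f in filters
            if not (f.op == "=" and f.column in self.indexed_columns)
        ]

    def can_push_down_count(
        self, filters: Sequence[Filter], required_columns: Sequence[str]
    ) -> bool:
        """A count can be pushed down only if every filter is handled
        by the source and the plan needs no output columns."""
        all_handled = not self.unhandled_filters(filters)
        output_is_empty = len(required_columns) == 0
        return all_handled and output_is_empty
```

With only the filters available, the source can compute the first half of 
the condition but not the second, which is the gap described above.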

> Data source API v2
> ------------------
>
>                 Key: SPARK-15689
>                 URL: https://issues.apache.org/jira/browse/SPARK-15689
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Reynold Xin
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: SPIP, releasenotes
>         Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface that depends on 
> DataFrame/SQLContext, which makes data source API compatibility dependent on 
> the upper-level API. The current data source API is also row-oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.



