[jira] [Commented] (SPARK-15689) Data source API v2

Russell Spitzer (JIRA) Thu, 17 Aug 2017 06:04:07 -0700

    [ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130355#comment-16130355 ]


Russell Spitzer commented on SPARK-15689:
-----------------------------------------

Thanks [~cloud_fan] for posting the design doc. It was a great read, and I like 
a lot of the direction this is going in. It would be helpful, though, if we 
could have access to the doc as a Google Doc or some other editable, 
commentable form to encourage discussion.

I left some comments on the prototype, but one thing I think would be a great 
addition is a joinInterface. I ended up writing one of these specifically for 
Cassandra and had to do a lot of plumbing to get it to fit into the rest of 
the Catalyst ecosystem, so I think this would be a great time to plan ahead 
for it in the Spark design.

The join interface would look a lot like a combination of the read and write 
APIs: given a row input and a set of expressions, the relation should return 
rows that match those expressions, OR fall back to being a plain read relation 
if none of the expressions can be satisfied by the join (leaving all the 
expressions to be evaluated in Spark).
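
To make the shape of that contract concrete, here is a rough, plain-Python 
sketch. It is not the Spark or Cassandra connector API; every name in it 
(EqualTo, KeyedSource, try_join, execute_join) is hypothetical and exists only 
for illustration. It assumes a toy source that can serve lookups on a single 
key column: if one of the join expressions targets that column, the source 
returns matching row pairs plus the expressions it could not handle; otherwise 
it returns nothing and the engine falls back to a plain scan and evaluates all 
expressions itself.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class EqualTo:
    """Hypothetical join expression: input_row[left_col] == source_row[right_col]."""
    left_col: str
    right_col: str

    def matches(self, left: Row, right: Row) -> bool:
        return left[self.left_col] == right[self.right_col]


class KeyedSource:
    """Toy in-memory source that can serve lookups on one key column only."""

    def __init__(self, rows: List[Row], key_col: str) -> None:
        self.rows = rows
        self.key_col = key_col
        self.index: Dict[object, List[Row]] = {}
        for row in rows:
            self.index.setdefault(row[key_col], []).append(row)

    def scan(self) -> Iterator[Row]:
        """Plain read path: return every row."""
        return iter(self.rows)

    def try_join(
        self, input_rows: Iterable[Row], exprs: List[EqualTo]
    ) -> Optional[Tuple[Iterator[Tuple[Row, Row]], List[EqualTo]]]:
        """Return (matched pairs, residual expressions), or None when no
        expression can be satisfied by a key lookup."""
        pushed = next((e for e in exprs if e.right_col == self.key_col), None)
        if pushed is None:
            return None
        residual = [e for e in exprs if e is not pushed]
        pairs = (
            (left, right)
            for left in input_rows
            for right in self.index.get(left[pushed.left_col], [])
        )
        return pairs, residual


def execute_join(
    source: KeyedSource, input_rows: Iterable[Row], exprs: List[EqualTo]
) -> List[Tuple[Row, Row]]:
    """Engine side: use the source's join if offered, else scan and filter."""
    input_rows = list(input_rows)
    result = source.try_join(input_rows, exprs)
    if result is None:
        # Fallback: the source acts as a plain read; the engine pairs up
        # rows and evaluates every expression itself.
        source_rows = list(source.scan())
        pairs = ((l, r) for l in input_rows for r in source_rows)
        residual = list(exprs)
    else:
        pairs, residual = result
    return [(l, r) for l, r in pairs if all(e.matches(l, r) for e in residual)]


users = KeyedSource([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], key_col="id")
orders = [{"user_id": 2, "buyer": "b"}, {"user_id": 3, "buyer": "c"}]
pushed_down = execute_join(users, orders, [EqualTo("user_id", "id")])
fallback = execute_join(users, orders, [EqualTo("buyer", "name")])
```

In this toy data both calls produce the same joined pair; the difference is 
only in who does the work. The first goes through the key lookup in the source, 
the second goes through the scan-and-filter fallback on the engine side.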

> Data source API v2
> ------------------
>
>                 Key: SPARK-15689
>                 URL: https://issues.apache.org/jira/browse/SPARK-15689
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: releasenotes
>         Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress on creating v2 of the data source API. The new 
> API should:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface that depends on 
> DataFrame/SQLContext, which makes data source API compatibility depend on 
> the upper-level API. The current data source API is also row-oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.



