[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15689:
--------------------------------
    Description: 
This ticket tracks progress on creating v2 of the data source API. The new API 
should:

1. Have a small surface so it is easy to freeze and maintain compatibility for 
a long time. Ideally, this API should survive architectural rewrites and 
user-facing API revamps of Spark.

2. Have a well-defined column batch interface for high performance. Convenience 
methods should exist to convert row-oriented formats into column batches for 
data source developers.

3. Still support filter push down, similar to the existing API.

4. Support sampling.
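
As a rough illustration of goals 1, 3, and 4, the sketch below shows what a 
narrow source interface with filter push down and sampling might look like. It 
is plain Python for brevity, not Spark code: `GreaterThan`, `InMemorySource`, 
`push_filters`, `push_sample`, and `scan` are hypothetical names that appear 
neither in this ticket nor in Spark.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class GreaterThan:
    """Hypothetical filter: keep rows where `column` > `value`."""
    column: str
    value: float


class InMemorySource:
    """Hypothetical source with a deliberately small surface:
    push_filters, push_sample, and scan. It depends on nothing
    from the engine's user-facing API."""

    def __init__(self, columns: Dict[str, List[float]]):
        self._columns = columns
        self._filters: List[GreaterThan] = []
        self._fraction = 1.0
        self._seed = 0

    def push_filters(self, filters: Sequence[GreaterThan]) -> List[GreaterThan]:
        """Keep the filters this source can evaluate and return the rest,
        which the engine would have to apply itself."""
        accepted = [f for f in filters if f.column in self._columns]
        self._filters.extend(accepted)
        return [f for f in filters if f.column not in self._columns]

    def push_sample(self, fraction: float, seed: int) -> None:
        """Ask the source to return roughly `fraction` of the matching rows."""
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0 and 1")
        self._fraction = fraction
        self._seed = seed

    def scan(self) -> Dict[str, List[float]]:
        """Return one column batch (column name -> values) after applying
        the pushed filters and the pushed sampling fraction."""
        num_rows = len(next(iter(self._columns.values()), []))
        rng = random.Random(self._seed)
        keep = []
        for i in range(num_rows):
            matches = all(self._columns[f.column][i] > f.value for f in self._filters)
            if matches and rng.random() < self._fraction:
                keep.append(i)
        return {name: [vals[i] for i in keep] for name, vals in self._columns.items()}
```

Returning the unhandled filters from `push_filters` is one possible design 
choice: it lets the engine know which predicates it still has to evaluate 
without the source depending on any engine classes.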


Note that both 1 and 2 address problems that the current data source API (v1) 
suffers from. The current API has a wide surface that depends on 
DataFrame/SQLContext, which ties data source API compatibility to the 
upper-level API. It is also row-oriented only and has to go through an 
expensive conversion from external data types to internal data types.
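
The convenience conversion described in goal 2 could look something like the 
following minimal sketch, which turns row-oriented records into a single 
column batch. It is plain Python, not Spark code; `ColumnBatch` and 
`rows_to_column_batch` are hypothetical names, and a real implementation would 
presumably write into typed, engine-internal vectors rather than Python lists.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass
class ColumnBatch:
    """Hypothetical column batch: one list of values per column."""
    columns: Dict[str, List[Any]]
    num_rows: int


def rows_to_column_batch(names: Sequence[str],
                         rows: Sequence[Sequence[Any]]) -> ColumnBatch:
    """Transpose row-oriented records into a column batch.

    Assumes `names` are unique and every row has one value per name.
    """
    columns: Dict[str, List[Any]] = {name: [] for name in names}
    for row in rows:
        if len(row) != len(names):
            raise ValueError("row length does not match the number of columns")
        for name, value in zip(names, row):
            columns[name].append(value)
    return ColumnBatch(columns=columns, num_rows=len(rows))
```

A row-oriented source would call such a helper once per batch of records, so 
the engine only ever sees the columnar form.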


  was:
This ticket tracks progress on creating v2 of the data source API. The new API 
should:

1. Have a small surface so it is easy to freeze and maintain compatibility for 
a long time. Ideally, this API should survive architectural rewrites and 
user-facing API revamps of Spark.

2. Have a well-defined column batch interface for high performance. Convenience 
methods should exist to convert row-oriented formats into column batches for 
data source developers.

3. Still support filter push down, similar to the existing API.


Note that both 1 and 2 address problems that the current data source API (v1) 
suffers from. The current API has a wide surface that depends on 
DataFrame/SQLContext, which ties data source API compatibility to the 
upper-level API. It is also row-oriented only and has to go through an 
expensive conversion from external data types to internal data types.



> Data source API v2
> ------------------
>
>                 Key: SPARK-15689
>                 URL: https://issues.apache.org/jira/browse/SPARK-15689
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> This ticket tracks progress on creating v2 of the data source API. The new 
> API should:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Support sampling.
> Note that both 1 and 2 address problems that the current data source API (v1) 
> suffers from. The current API has a wide surface that depends on 
> DataFrame/SQLContext, which ties data source API compatibility to the 
> upper-level API. It is also row-oriented only and has to go through an 
> expensive conversion from external data types to internal data types.



