[jira] [Comment Edited] (SPARK-24882) data source v2 API improvement

Ryan Blue (JIRA) Tue, 31 Jul 2018 14:03:28 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564362#comment-16564362
 ]


Ryan Blue edited comment on SPARK-24882 at 7/31/18 9:02 PM:
------------------------------------------------------------

{quote}the problem is then we need to make `CatalogSupport` a must-have for 
data sources instead of an optional plugin
{quote}
Data sources are read and write implementations. Catalog support should be a 
layer above the read/write implementation, used to provide CTAS and other 
table-level support.

If you're interested in the anonymous table use case from the email discussion, 
I posted a suggestion there to add an {{anonymousTable}} function to 
{{DataSourceV2}}. That allows a source instantiated directly through v1-style 
reflection to provide a {{Table}} based on an options map. Then that table 
would implement {{ReadSupport}} and {{WriteSupport}} as I've suggested in this 
thread. That would preserve the ability to instantiate a source directly and 
use it, and would center around a {{Table}} that implements the read and write 
traits.
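
As a rough sketch of the {{anonymousTable}} idea, assuming simplified stand-in interfaces: every type, method name, and signature below is illustrative only and is not the actual Spark API. {{PathTable}}, {{PathSource}}, and the {{load}} helper are hypothetical names made up for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the traits discussed in this thread.
// These are NOT the real Spark interfaces.
interface ReadSupport { String describeScan(); }
interface WriteSupport { String describeWrite(); }
interface Table { Map<String, String> options(); }

interface DataSourceV2 {
    // Assumed shape of the suggested method: build a Table from an options map.
    Table anonymousTable(Map<String, String> options);
}

// The table, not the source, implements the read and write traits.
final class PathTable implements Table, ReadSupport, WriteSupport {
    private final Map<String, String> options;

    PathTable(Map<String, String> options) {
        // Defensive copy so the table stays immutable.
        this.options = Collections.unmodifiableMap(new HashMap<>(options));
    }

    @Override public Map<String, String> options() { return options; }
    @Override public String describeScan() { return "scan " + options.get("path"); }
    @Override public String describeWrite() { return "write " + options.get("path"); }
}

// A source with a no-arg constructor, so it can be created reflectively.
final class PathSource implements DataSourceV2 {
    public PathSource() { }

    @Override public Table anonymousTable(Map<String, String> options) {
        return new PathTable(options);
    }
}

final class AnonymousTableSketch {
    // v1-style instantiation: look the class up by name, then ask it for a Table.
    static Table load(String className, Map<String, String> options) {
        try {
            DataSourceV2 source = (DataSourceV2) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
            return source.anonymousTable(options);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("path", "/tmp/events");
        Table table = load(PathSource.class.getName(), options);
        System.out.println(((ReadSupport) table).describeScan());
        System.out.println(((WriteSupport) table).describeWrite());
    }
}
```

The caller only ever handles a {{Table}}, so a catalog layer could later return the same kind of object without changing the read and write path.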

An alternative to the {{anonymousTable}} method is what I did in the WIP pull 
request for CTAS. In that PR, I created two ways to work with {{DataSourceV2}}: 
through the existing {{DataSourceV2Relation}} and through a new 
{{TableV2Relation}}. The first is for {{DataSourceV2}} instances that implement 
the read and write traits, while the latter is for {{Table}} objects that 
implement them. Either way works, though it would be cleaner to just use 
{{Table}}.
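
To illustrate the two-relation alternative, here is a minimal sketch under heavy assumptions: the class bodies below are invented for illustration and do not reproduce the code in the WIP PR or in Spark. Only the read trait is shown; the write trait would follow the same pattern. {{Relation}}, {{LegacySource}}, and {{EventsTable}} are hypothetical names.

```java
// Hypothetical stand-ins; NOT the real Spark interfaces or plan nodes.
interface ReadSupport { String describeScan(); }
interface DataSourceV2 { }
interface Table { }

// Common view the planner would use, whichever object implements the trait.
interface Relation { ReadSupport reader(); }

// Path 1: the DataSourceV2 instance itself implements the read trait.
final class DataSourceV2Relation implements Relation {
    private final ReadSupport reader;

    <S extends DataSourceV2 & ReadSupport> DataSourceV2Relation(S source) {
        this.reader = source;
    }

    @Override public ReadSupport reader() { return reader; }
}

// Path 2: a Table object implements the read trait.
final class TableV2Relation implements Relation {
    private final ReadSupport reader;

    <T extends Table & ReadSupport> TableV2Relation(T table) {
        this.reader = table;
    }

    @Override public ReadSupport reader() { return reader; }
}

// Illustrative implementations for each path.
final class LegacySource implements DataSourceV2, ReadSupport {
    @Override public String describeScan() { return "scan via source"; }
}

final class EventsTable implements Table, ReadSupport {
    @Override public String describeScan() { return "scan via table"; }
}

final class TwoRelationsSketch {
    public static void main(String[] args) {
        Relation fromSource = new DataSourceV2Relation(new LegacySource());
        Relation fromTable = new TableV2Relation(new EventsTable());
        System.out.println(fromSource.reader().describeScan());
        System.out.println(fromTable.reader().describeScan());
    }
}
```

Both relations expose the same reader, which is why either approach works. Keeping only the {{Table}} path would remove one of the two nearly identical classes.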

Thanks for the builder update! Immutability is the most important part, but I'd 
still prefer a builder interface with default methods instead of the mix-in 
traits.
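
To make the preference concrete, here is a minimal sketch of a builder interface with default methods, assuming hypothetical names ({{Scan}}, {{ScanBuilder}}, {{ColumnPruningScanBuilder}}) that are not part of the current proposal or of Spark. Optional capabilities are default methods that do nothing, so a source overrides only what it supports instead of mixing in one trait per capability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable result of the builder; not a real Spark class.
final class Scan {
    private final List<String> columns;
    private final List<String> pushedFilters;

    Scan(List<String> columns, List<String> pushedFilters) {
        this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
        this.pushedFilters = Collections.unmodifiableList(new ArrayList<>(pushedFilters));
    }

    List<String> columns() { return columns; }
    List<String> pushedFilters() { return pushedFilters; }
}

// Hypothetical builder: optional capabilities default to "not supported".
interface ScanBuilder {
    // Default: ignore the request and keep all columns.
    default ScanBuilder pruneColumns(List<String> requiredColumns) { return this; }

    // Default: push nothing down; the engine would evaluate the filters itself.
    default ScanBuilder pushFilters(List<String> filters) { return this; }

    Scan build();
}

// A source that supports column pruning but not filter pushdown.
final class ColumnPruningScanBuilder implements ScanBuilder {
    private List<String> columns;

    ColumnPruningScanBuilder(List<String> allColumns) {
        this.columns = new ArrayList<>(allColumns);
    }

    @Override public ScanBuilder pruneColumns(List<String> requiredColumns) {
        // Keep only the requested columns, preserving the table's column order.
        List<String> kept = new ArrayList<>(columns);
        kept.retainAll(requiredColumns);
        this.columns = kept;
        return this;
    }

    @Override public Scan build() {
        // All mutable state lives in the builder; the built Scan is immutable.
        return new Scan(columns, Collections.emptyList());
    }
}

final class BuilderSketch {
    public static void main(String[] args) {
        Scan scan = new ColumnPruningScanBuilder(Arrays.asList("id", "name", "ts"))
            .pruneColumns(Arrays.asList("id", "ts"))
            .pushFilters(Arrays.asList("id > 5")) // falls through to the no-op default
            .build();
        System.out.println(scan.columns());
        System.out.println(scan.pushedFilters());
    }
}
```

Callers never need an {{instanceof}} check to discover a capability, which is the main difference from the mix-in approach.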


> data source v2 API improvement
> ------------------------------
>
>                 Key: SPARK-24882
>                 URL: https://issues.apache.org/jira/browse/SPARK-24882
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems that we want to address before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 API, isolate the stateful part of the API, and think of better 
> names for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

