[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

Wenchen Fan (JIRA) Fri, 27 Jul 2018 08:44:25 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559918#comment-16559918
 ]


Wenchen Fan commented on SPARK-24882:
-------------------------------------

Hi [~rdblue], I like your naming changes and will update the doc with them. I 
also like your idea of merging `DataSourceReader` and `ReadSupport` to reduce 
the number of interfaces. I think we can also apply it to the write API.

About the builder pattern: I agree it's good to make the API immutable, but I'd 
say it's hard to do so. I've spent a lot of time thinking about the builder 
pattern and have had no luck.

One problem is that Spark needs feedback from the data source each time an 
operator is pushed. For example, when Spark pushes a Filter to a data source, 
it needs to know whether all the filters were pushed before it can push the 
next operator. Spark can't blindly push all operators to the data source one 
by one; it has to ask the data source whether it can accept the next operator 
before pushing it. And that is no longer a builder pattern.
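
To make the feedback problem concrete, here is a minimal sketch in Python. It is 
a toy model, not Spark's actual interfaces: `ToySource`, `push_filters` and 
`plan_pushdown` are hypothetical names, filters are plain strings, and using a 
limit as "the next operator" is an assumption chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToySource:
    """Hypothetical data source that can evaluate filters on some columns only."""
    supported_columns: Tuple[str, ...]
    pushed: List[str] = field(default_factory=list)

    def push_filters(self, filters: List[str]) -> List[str]:
        """Accept the filters this source can handle and return the rest.

        The returned list is the feedback the planner needs: if it is
        non-empty, some filtering still has to happen after the scan.
        """
        rejected = []
        for f in filters:
            column = f.split()[0]  # toy parsing: "id > 5" -> "id"
            if column in self.supported_columns:
                self.pushed.append(f)
            else:
                rejected.append(f)
        return rejected


def plan_pushdown(
    source: ToySource, filters: List[str], limit: int
) -> Tuple[List[str], Optional[int]]:
    """Push filters, then push the limit only if every filter was accepted."""
    residual = source.push_filters(filters)
    # Assumption for this sketch: the limit sits above the filters, so it may
    # only move into the source when no filter is left to run after the scan.
    pushed_limit = limit if not residual else None
    return residual, pushed_limit


source = ToySource(supported_columns=("id",))
residual, pushed_limit = plan_pushdown(source, ["id > 5", "name = 'a'"], 10)
print(residual, pushed_limit)
```

The planner's second step depends on the answer to the first, which is why a 
plain chain of builder calls that returns nothing to the caller does not fit.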

> separate responsibilities of the data source v2 read API
> --------------------------------------------------------
>
>                 Key: SPARK-24882
>                 URL: https://issues.apache.org/jira/browse/SPARK-24882
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
