[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

Wenchen Fan (JIRA) Mon, 30 Jul 2018 01:58:10 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561653#comment-16561653
 ]


Wenchen Fan commented on SPARK-24882:
-------------------------------------

[~rdblue] After experimenting, I decided not to merge `ReadSupport` and 
`DataSourceReader`. `DataSourceReader` serves as the initialization phase of 
the scan, while `ReadSupport` is created via reflection and should be very 
lightweight. For example, the Kafka data source initializes the Kafka 
connection when creating the `DataSourceReader`, using the Kafka URL given in 
the options. I also tried moving the initialization to `ScanConfig`, but then 
I hit another problem with life cycles: the `DataSourceReader` life cycle is 
tied to the entire streaming query, whereas the `ScanConfig` life cycle is 
tied to an epoch. So per-streaming-query initialization needs to happen in 
`DataSourceReader`.
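
To make the life cycle argument concrete, here is a minimal sketch. The 
classes are simplified, hypothetical stand-ins that only borrow the names 
`ReadSupport`, `DataSourceReader` and `ScanConfig`; they are not the real Spark 
interfaces, and the counter merely stands in for opening a Kafka connection.

```java
import java.util.Map;

// Hypothetical, simplified stand-ins; NOT the real Spark interfaces.
public class LifecycleSketch {
    // Counts how often the "expensive" initialization runs.
    static int connectionsOpened = 0;

    // Created via reflection, so it stays cheap: no I/O in construction.
    static class ReadSupport {
        DataSourceReader createReader(Map<String, String> options) {
            return new DataSourceReader(options.get("url"));
        }
    }

    // One instance per streaming query: expensive setup lives here.
    static class DataSourceReader {
        final String url;

        DataSourceReader(String url) {
            this.url = url;
            connectionsOpened++; // stands in for opening a Kafka connection
        }

        // Called once per epoch; must not repeat the expensive setup.
        ScanConfig newScanConfig(long epoch) {
            return new ScanConfig(epoch);
        }
    }

    // One instance per epoch: holds only per-epoch state.
    static class ScanConfig {
        final long epoch;

        ScanConfig(long epoch) {
            this.epoch = epoch;
        }
    }

    // Simulates one streaming query that runs for the given number of epochs
    // and returns how many connections were opened in total.
    static int run(int epochs) {
        connectionsOpened = 0;
        ReadSupport support = new ReadSupport();
        DataSourceReader reader =
            support.createReader(Map.of("url", "host:9092")); // made-up URL
        for (long e = 0; e < epochs; e++) {
            reader.newScanConfig(e);
        }
        return connectionsOpened;
    }

    public static void main(String[] args) {
        System.out.println("connections opened: " + run(3));
    }
}
```

If the connection were opened in `ScanConfig` instead, this sketch would open 
one connection per epoch rather than one per query.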

For the builder pattern, it's important that Spark gets immediate pushdown 
feedback from the data source, instead of inspecting only the final ScanConfig. 
Suppose Spark tries to push down a Filter and then a Limit. If the data source 
supports neither of them, Spark needs to add back the Filter and Limit 
operators, but it can't recover the operator order from the final ScanConfig. 
Things get more complicated if we want to push down 
`Filter(Limit(Filter(Scan)))`.
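
A rough sketch of what immediate feedback could look like follows. The 
`ScanConfigBuilder` interface, its `pushFilter`/`pushLimit` methods, and the 
string encoding of operators are all assumptions made up for illustration, not 
the proposed API. The planner here also stops pushing after the first 
rejection, which is a conservative choice of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the names and signatures are NOT the real Spark API.
public class PushdownSketch {
    // Each push call reports acceptance right away.
    interface ScanConfigBuilder {
        boolean pushFilter(String predicate);
        boolean pushLimit(int n);
    }

    // Example source that accepts filters but rejects limits.
    static class FilterOnlyBuilder implements ScanConfigBuilder {
        final List<String> pushed = new ArrayList<>();

        public boolean pushFilter(String predicate) {
            pushed.add(predicate);
            return true;
        }

        public boolean pushLimit(int n) {
            return false;
        }
    }

    // ops are listed innermost first, encoded as "filter:<predicate>" or
    // "limit:<n>". Returns the operators Spark must keep, in original order.
    static List<String> plan(ScanConfigBuilder builder, List<String> ops) {
        List<String> remaining = new ArrayList<>();
        boolean blocked = false; // set once any operator is rejected
        for (String op : ops) {
            boolean accepted = false;
            if (!blocked) {
                if (op.startsWith("limit:")) {
                    int n = Integer.parseInt(op.substring("limit:".length()));
                    accepted = builder.pushLimit(n);
                } else {
                    accepted = builder.pushFilter(op.substring("filter:".length()));
                }
            }
            if (!accepted) {
                // Operators above a rejected one are not pushed past it.
                blocked = true;
                remaining.add(op);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        FilterOnlyBuilder builder = new FilterOnlyBuilder();
        // Encodes Filter(Limit(Filter(Scan))), innermost operator first.
        List<String> ops = List.of("filter:a > 1", "limit:10", "filter:b < 5");
        System.out.println("kept in Spark: " + plan(builder, ops));
        System.out.println("pushed down:   " + builder.pushed);
    }
}
```

Because the planner learns the outcome of each push as it happens, it knows 
both which operators remain and the order in which to re-apply them, without 
having to reverse-engineer that from the final ScanConfig.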



> separate responsibilities of the data source v2 read API
> --------------------------------------------------------
>
>                 Key: SPARK-24882
>                 URL: https://issues.apache.org/jira/browse/SPARK-24882
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems that we want to address before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
