[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

Ryan Blue (JIRA) Sun, 29 Jul 2018 08:52:37 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561145#comment-16561145
 ]


Ryan Blue commented on SPARK-24882:
-----------------------------------

[~cloud_fan], thanks for making those changes. I'll have a look at the updated 
doc.

For scan configuration, I think this builder pattern would work. The builder's 
super-class would be provided by Spark, so the push-down methods can always be 
called. Similarly, the ScanConfig interface would be provided with default 
implementations, so Spark can always get the scan configuration. When a source 
supports push-down, it would override {{pushPredicates}} and report the 
predicates that were pushed through the ScanConfig ({{pushedPredicates}}). Then 
Spark can remove those pushed predicates.

If the source doesn't support push-down, then it needs to implement nothing at 
all: the default {{pushPredicates}} implementation on the builder is a no-op, 
and the default {{pushedPredicates}} implementation returns {{new 
Expression[0]}} to indicate that nothing was pushed. The feedback that Spark 
needs comes from the final ScanConfig, so there's no need to do {{instanceof}} 
checks for interfaces. Spark's code always makes the push-down calls, and the 
source implementation can simply ignore them.
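
A minimal Java sketch of the idea, to make the shape concrete. This is only an 
illustration under assumptions, not the API from the design doc: the 
{{Expression}} stand-in, the {{build}} method, the {{IdOnlyBuilder}} source, and 
the {{Planner.remainingFilters}} helper are all hypothetical names. Only 
{{ScanConfig}}, {{pushPredicates}}, and {{pushedPredicates}} come from the 
comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Spark's Expression: a predicate kept as plain text.
final class Expression {
    final String text;
    Expression(String text) { this.text = text; }
}

// Assumed shape of the Spark-provided interface: by default nothing was pushed.
interface ScanConfig {
    default Expression[] pushedPredicates() { return new Expression[0]; }
}

// Assumed shape of the Spark-provided builder super-class: pushing is a no-op
// by default, and build() (hypothetical) returns an all-defaults ScanConfig.
class ScanConfigBuilder {
    public void pushPredicates(Expression[] predicates) { /* ignored by default */ }
    public ScanConfig build() { return new ScanConfig() {}; }
}

// Hypothetical source that supports push-down, but only for predicates on "id".
class IdOnlyBuilder extends ScanConfigBuilder {
    private final List<Expression> accepted = new ArrayList<>();

    @Override
    public void pushPredicates(Expression[] predicates) {
        for (Expression p : predicates) {
            if (p.text.startsWith("id ")) accepted.add(p);
        }
    }

    @Override
    public ScanConfig build() {
        Expression[] pushed = accepted.toArray(new Expression[0]);
        return new ScanConfig() {
            @Override
            public Expression[] pushedPredicates() { return pushed; }
        };
    }
}

// Hypothetical Spark-side helper: always makes the push-down call, then reads
// the feedback from the final ScanConfig. No instanceof checks are involved.
final class Planner {
    static List<Expression> remainingFilters(ScanConfigBuilder builder,
                                             Expression[] filters) {
        builder.pushPredicates(filters);
        ScanConfig config = builder.build();
        List<Expression> remaining = new ArrayList<>(Arrays.asList(filters));
        // Drop whatever the source reported as pushed (same object instances).
        remaining.removeAll(Arrays.asList(config.pushedPredicates()));
        return remaining;
    }
}
```

With the plain {{ScanConfigBuilder}}, every filter stays with Spark. With 
{{IdOnlyBuilder}}, only the filters that the source did not accept remain. The 
planner code is identical in both cases.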

> separate responsibilities of the data source v2 read API
> --------------------------------------------------------
>
>                 Key: SPARK-24882
>                 URL: https://issues.apache.org/jira/browse/SPARK-24882
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
