[jira] [Updated] (SPARK-43155) DataSourceV2 is hard to be implemented without following V1

PEIYUAN SUN (Jira) Sun, 16 Apr 2023 06:57:24 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-43155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


PEIYUAN SUN updated SPARK-43155:
--------------------------------
    Description: 
h1. Description

The current DataSourceV2 interface has become considerably more complicated 
than it was in Spark 2.x. To implement a data source under DataSourceV2, a user 
needs to learn not only the V2 APIs and interfaces but also DataSourceV1, since 
V1 is the fallback version.
h2. Interface Gaps

There is no easy way, and there are no clear examples, showing how to implement 
both for a new data source. For example, the built-in sources in the standard 
Spark repo, such as orc, parquet, and json, have a FileFormat implementation for 
V1, but these are not feasible to follow because the SPI is hard-coded as 
`DefaultSource` rather than dynamically loaded from a user-provided class 
outside the Spark repo. The different data sources also do not strictly follow 
the same pattern in V1, and they are not well decoupled from the customized 
logic within them.
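
To make the `DefaultSource` convention concrete, here is a minimal sketch of a 
name-based lookup of the kind described above: try the provider string as a 
class name, then retry with `.DefaultSource` appended. It uses only the JDK and 
is not Spark's code. The class `ProviderLookupSketch` and the methods `tryLoad` 
and `resolve` are hypothetical names chosen for illustration, and the real 
lookup in Spark may involve additional steps.

```java
import java.util.Optional;

// Illustrative only: a simplified name-based provider lookup.
// ProviderLookupSketch, tryLoad and resolve are hypothetical names, not Spark APIs.
public class ProviderLookupSketch {

    /** Load a class by name, or return empty if it cannot be found or linked. */
    public static Optional<Class<?>> tryLoad(String className) {
        try {
            return Optional.of(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }

    /**
     * Resolve a provider string: first as a fully qualified class name,
     * then as a package that is expected to contain a class named DefaultSource.
     */
    public static Optional<Class<?>> resolve(String provider) {
        Optional<Class<?>> direct = tryLoad(provider);
        if (direct.isPresent()) {
            return direct;
        }
        // Fixed naming convention: the implementer must use this exact class name.
        return tryLoad(provider + ".DefaultSource");
    }

    public static void main(String[] args) {
        // A class that exists on every JVM resolves directly.
        System.out.println(resolve("java.lang.String").isPresent());
        // A made-up package has neither the class nor a DefaultSource inside it.
        System.out.println(resolve("com.example.missing").isPresent());
    }
}
```

Under this kind of scheme, a source outside the Spark repo is found only if its 
entry class matches the expected name, which is the coupling this section 
complains about.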

 
h2. Loss of simple layer over different kinds of dataSource

With the original V1, a user could easily implement a new wrapper on top of 
orc/parquet with the Relation interface. Here again, DataSourceV2 is too 
low-level and hard to use for this case.

 
h3. No explicit guidance

The functional interfaces are not well organized, which forces the reader to 
spend a lot of time studying the commit history and existing patterns.



> DataSourceV2 is hard to be implemented without following V1
> -----------------------------------------------------------
>
>                 Key: SPARK-43155
>                 URL: https://issues.apache.org/jira/browse/SPARK-43155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: PEIYUAN SUN
>            Priority: Major
>              Labels: features
>             Fix For: 3.5.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h



