[ https://issues.apache.org/jira/browse/SPARK-43155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
PEIYUAN SUN updated SPARK-43155:
--------------------------------
    Description:
h1. Description

The current DataSourceV2 interface has become far more complicated than it was in the Spark 2.x versions. To implement a data source under DataSourceV2, a user needs to learn not only the V2 APIs and interfaces but also DataSourceV1, since V1 serves as the fallback version.

h2. Interface Gaps

There is no easy way, and there are no clear examples, showing how to implement both V1 and V2 for a new data source. For example, the built-in sources in the standard Spark repo (orc, parquet, json) have a FileFormat implementation for V1, but they cannot be followed as templates because the SPI is hard-coded as `DefaultSource` instead of being loaded dynamically from a user-provided class outside the Spark repo. The different data sources also do not strictly follow the same pattern in V1, and they are not well decoupled from the customized logic inside them.

h2. Loss of a simple layer over different kinds of data source

With the original V1, a user can easily implement a new wrapper on top of orc/parquet using the Relation interface. DataSourceV2 is again too low-level here and hard to use for this case.

h3. No explicit guidance

The interfaces for each piece of functionality are not well organized, which forces the reader to spend a lot of time studying the commit history and the existing patterns.

    was:
h1. Description

The current DataSourceV2 interface has become far more complicated than it was in the Spark 2.x versions. To implement a data source under DataSourceV2, a user needs to learn not only the V2 APIs and interfaces but also DataSourceV1, since V1 serves as the fallback version.

h2. Interface Gaps

There is no easy way, and there are no clear examples, showing how to implement both V1 and V2 for a new data source. For example, the built-in sources in the standard Spark repo (orc, parquet, json) have a FileFormat implementation for V1, but they cannot be followed as templates because the SPI is hard-coded as `DefaultSource` instead of being loaded dynamically from a user-provided class outside the Spark repo.

h2. Loss of a simple layer over different kinds of data source

With the original V1, a user can easily implement a new wrapper on top of orc/parquet using the Relation interface. DataSourceV2 is again too low-level here and hard to use for this case.

h3. No explicit guidance

The interfaces for each piece of functionality are not well organized, which forces the reader to spend a lot of time studying the commit history and the existing patterns.

> DataSourceV2 is hard to be implemented without following V1
> -----------------------------------------------------------
>
>                 Key: SPARK-43155
>                 URL: https://issues.apache.org/jira/browse/SPARK-43155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: PEIYUAN SUN
>            Priority: Major
>              Labels: features
>             Fix For: 3.5.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h