[jira] [Commented] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289577#comment-17289577 ] Rafael commented on SPARK-25390:

Sorry for the late response. I was able to migrate my project to Spark 3.0.0. Here are some hints on what I did: https://gist.github.com/rafaelkyrdan/2bea8385aadd71be5bf67cddeec59581

> Data source V2 API refactoring
> ------------------------------
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
> Priority: Major
> Fix For: 3.0.0
>
> Currently it's not very clear how we should abstract the data source v2 API. The abstraction should be unified between batch and streaming, or similar but with a well-defined difference between batch and streaming. The abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
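The details of Rafael's migration are in the gist linked above. As a rough, hedged sketch of what a minimal read-only source looks like under the Spark 3.0 connector API's `catalog -> table -> scan` layering (class names and the single `value` column here are illustrative, not taken from the gist):

```scala
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point: in Spark 3 the provider hands out a Table, not a reader.
class DefaultSource extends TableProvider {
  private val schema0 = new StructType().add("value", "long")

  override def inferSchema(options: CaseInsensitiveStringMap): StructType = schema0

  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new GeneratingTable(schema)
}

// The table declares its capabilities and hands out a ScanBuilder.
class GeneratingTable(schema0: StructType) extends Table with SupportsRead {
  override def name(): String = "generating_table"
  override def schema(): StructType = schema0
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder {
      override def build(): Scan = new GeneratingScan(schema0)
    }
}

// The Scan describes the read; its Batch side plans partitions and readers.
class GeneratingScan(schema0: StructType) extends Scan with Batch {
  override def readSchema(): StructType = schema0
  override def toBatch: Batch = this
  override def planInputPartitions(): Array[InputPartition] =
    Array(RangePartition(0L, 5L), RangePartition(5L, 10L))
  override def createReaderFactory(): PartitionReaderFactory =
    new GeneratingReaderFactory
}

// An InputPartition is now only a serializable description of a split.
case class RangePartition(start: Long, end: Long) extends InputPartition

class GeneratingReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val p = partition.asInstanceOf[RangePartition]
    new PartitionReader[InternalRow] {
      private var current = p.start - 1
      override def next(): Boolean = { current += 1; current < p.end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
  }
}
```

Such a source would then be loadable with `spark.read.format("<package of DefaultSource>").load()`, assuming the class is on the classpath.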
[jira] [Commented] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212675#comment-17212675 ] Jan Berkel commented on SPARK-25390:

I'm in a similar situation. [~Kyrdan] asked on the mailing list as directed, but nobody replied. It's strange that such a central API is completely undocumented. The new iteration of the data source API doesn't look remotely like v2; it might as well have been called v3. If it's not possible to provide documentation, at least put some notes/warnings in the migration guide or changelog indicating that Spark 3's data source API has changed completely. And, as far as I can tell at the moment, it doesn't seem to be possible to implement the new DataSource V2 using plain Java classes.
[jira] [Commented] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178729#comment-17178729 ] Hyukjin Kwon commented on SPARK-25390:

[~Kyrdan], please ask questions via https://spark.apache.org/community.html instead; that's the appropriate channel for them.
[jira] [Commented] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052 ] Rafael commented on SPARK-25390:

Hey guys, I'm trying to migrate a package that uses *import org.apache.spark.sql.sources.v2._* to Spark 3.0.0 and haven't found a good guide, so may I ask my questions here? Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what to use instead of ReadSupport and DataSourceReader. If instead of ReadSupport we now have to use Scan, then what happened to the createReader method?

{code:java}
class DefaultSource extends ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new GeneratingReader()
}
{code}

2. Here, instead of

{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsReportPartitioning}
{code}

I should use

{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, SupportsReportPartitioning}
{code}

right?

{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
    val partitions = new util.ArrayList[InputPartition[InternalRow]]()
    ...
    partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}

3. I haven't found what to use instead of

{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader
{code}

The interface that looks like their replacement has a totally different contract:

{code:java}
import org.apache.spark.sql.connector.read.InputPartition
{code}

{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends InputPartitionReader[InternalRow] {
  override def next(): Boolean = ...
  override def get(): InternalRow = ...
  override def close(): Unit = ...
}
{code}
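For what it's worth, a rough mapping of the old reader-side interfaces in question 3 onto the Spark 3.0 `org.apache.spark.sql.connector.read` ones looks like the sketch below. This is hedged guidance only, not an official migration path, and the `Generating*` names simply mirror the hypothetical classes in the questions above:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}

// Old: InputPartition[InternalRow] both described the split and created its reader.
// New: InputPartition is only a serializable description of the split;
//      reader creation moved to a separate PartitionReaderFactory,
//      which is returned from Batch.createReaderFactory().
case class GeneratingInputPartition(start: Long, end: Long) extends InputPartition

class GeneratingReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val p = partition.asInstanceOf[GeneratingInputPartition]
    // The old InputPartitionReader[InternalRow] becomes PartitionReader[InternalRow];
    // the next()/get()/close() contract itself is unchanged.
    new PartitionReader[InternalRow] {
      private var current = p.start - 1
      override def next(): Boolean = { current += 1; current < p.end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
  }
}
```

On the other questions: the rough equivalents appear to be ReadSupport/DataSourceReader → TableProvider plus a Table with SupportsRead, where createReader's role is split across getTable, newScanBuilder, and Scan.toBatch; DataSourceOptions → CaseInsensitiveStringMap; and the connector.read.partitioning imports guessed in question 2 do exist in 3.0.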
[jira] [Commented] (SPARK-25390) data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875893#comment-16875893 ] Wenchen Fan commented on SPARK-25390:

Yes, we should have a user guide for data source v2 in Spark 3.0. I've created a blocker ticket for it: https://issues.apache.org/jira/browse/SPARK-28219. Also cc [~rdblue]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25390) data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872684#comment-16872684 ] Lars Francke commented on SPARK-25390:

Is there any kind of end-user documentation on how to use these APIs to develop custom sources? Looking on the Spark homepage, one only finds [https://spark.apache.org/docs/2.2.0/streaming-custom-receivers.html]. It'd be useful to have a version of this for the new APIs.