[jira] [Commented] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289577#comment-17289577 ] Rafael commented on SPARK-25390:

Sorry for the late response. I was able to migrate my project to Spark 3.0.0. Here are some hints on what I did: https://gist.github.com/rafaelkyrdan/2bea8385aadd71be5bf67cddeec59581

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
> Priority: Major
> Fix For: 3.0.0
>
> Currently it's not very clear how we should abstract the data source v2 API. The
> abstraction should be unified between batch and streaming, or similar but
> have a well-defined difference between batch and streaming. The
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
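For readers hitting the same migration questions: in the Spark 3.0 connector API the old ReadSupport/DataSourceReader entry point is replaced by the chain TableProvider -> Table -> ScanBuilder -> Scan. Below is a minimal sketch of the entry-point side; the interface names are from org.apache.spark.sql.connector in Spark 3.0, while the Generating* names are hypothetical placeholders (not taken from the gist above).

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// The source's entry point: replaces the old ReadSupport mix-in.
class DefaultSource extends TableProvider {
  // Spark calls this when the user does not supply a schema.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???

  override def getTable(schema: StructType,
                        partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new GeneratingTable(schema)
}

// The Table declares its capabilities; SupportsRead adds newScanBuilder,
// which is roughly where the old createReader logic moves to.
class GeneratingTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "generating"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder {
      override def build(): Scan = ??? // return your Scan implementation here
    }
}
```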
[jira] [Commented] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178583#comment-17178583 ] Rafael commented on SPARK-27708:

[~rdblue] [~jlaskowski] Hey guys, I'm trying to migrate my package, which uses the V2 DataSources, to Spark 3, and any docs/guides would be very useful to me and to everybody else using the V2 DataSources. I added my plan here; could you share your knowledge? https://issues.apache.org/jira/browse/SPARK-25390?focusedCommentId=17178052&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17178052

> Add documentation for v2 data sources
> -
>
> Key: SPARK-27708
> URL: https://issues.apache.org/jira/browse/SPARK-27708
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Ryan Blue
> Priority: Major
> Labels: documentation
>
> Before the 3.0 release, the new v2 data sources should be documented. This
> includes:
> * How to plug in catalog implementations
> * Catalog plugin configuration
> * Multi-part identifier behavior
> * Partition transforms
> * Table properties that are used to pass table info (e.g. "provider")
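On the "catalog plugin configuration" point above: catalogs in Spark 3.x are registered through `spark.sql.catalog.<name>` properties, where the value is the implementation class and suffixed keys become options passed to the plugin. The catalog name, class, and option below are hypothetical illustrations of the convention, not values from this ticket:

```
spark.sql.catalog.mycat=com.example.MyCatalogImpl
spark.sql.catalog.mycat.url=jdbc:postgresql://db-host/warehouse
```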
[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052 ] Rafael edited comment on SPARK-25390 at 8/15/20, 6:27 PM:

Hey guys [~cloud_fan] [~b...@cloudera.com], I'm trying to migrate a package that uses *import org.apache.spark.sql.sources.v2._* to *Spark 3.0.0* and haven't found a good guide, so may I ask my questions here? Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what to use instead of ReadSupport and DataSourceReader. If Scan now replaces ReadSupport, what happened to the createReader method?

{code:java}
class DefaultSource extends ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new GeneratingReader()
}
{code}

2. Here, instead of

{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsReportPartitioning}
{code}

I should use

{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, SupportsReportPartitioning}
{code}

right?

{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
    val partitions = new util.ArrayList[InputPartition[InternalRow]]()
    ...
    partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}

3. I haven't found what to use instead of

{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader
{code}

If the new InputPartition interface is the one from the connector.read package, it has a completely different contract.

{code:java}
import org.apache.spark.sql.connector.read.InputPartition
{code}

{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends InputPartitionReader[InternalRow] {
  override def next(): Boolean = ...
  override def get(): InternalRow = ...
  override def close(): Unit = ...
}
{code}

And will the usage remain the same?

{code:java}
val df = sparkSession.read
  .format("sources.v2.generating")
  .option(OPT_PARTITIONS, numberOfRows)
  .option(OPT_DESCRIPTOR, descriptorJson)
  .option(OPT_SOURCE_TYPE, sourceConnection)
  .load()
{code}
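To put question 3 in current terms: in the Spark 3.0 connector API the old InputPartition[InternalRow]/InputPartitionReader pair is split into InputPartition (a plain serializable partition descriptor) plus PartitionReaderFactory and PartitionReader. The sketch below uses real Spark 3.0 interfaces from org.apache.spark.sql.connector.read; the Generating* names are hypothetical placeholders.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan}
import org.apache.spark.sql.types.StructType

// Scan describes the read; Batch plans it. One class can play both roles.
class GeneratingScan(schema: StructType) extends Scan with Batch {
  override def readSchema(): StructType = schema
  override def toBatch: Batch = this

  // Replaces DataSourceReader.planInputPartitions: partitions are now plain
  // serializable descriptors, not reader factories themselves.
  override def planInputPartitions(): Array[InputPartition] =
    Array(GeneratingInputPartition(0))

  override def createReaderFactory(): PartitionReaderFactory =
    new GeneratingReaderFactory
}

// Carries only the data an executor needs to read its slice.
case class GeneratingInputPartition(index: Int) extends InputPartition

// Replaces the old InputPartition.createPartitionReader.
class GeneratingReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new GeneratingPartitionReader(partition.asInstanceOf[GeneratingInputPartition])
}

// Same next/get/close contract as the old InputPartitionReader[InternalRow].
class GeneratingPartitionReader(p: GeneratingInputPartition) extends PartitionReader[InternalRow] {
  override def next(): Boolean = ???
  override def get(): InternalRow = ???
  override def close(): Unit = ()
}
```

The read-side usage (sparkSession.read.format(...).options(...).load()) is unchanged in Spark 3.0.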
[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178035#comment-17178035 ] Rafael commented on SPARK-26132:

Thank you [~srowen], yes, it works:

{code:java}
val f: Iterator[Row] => Unit = (iterator: Iterator[Row]) => {}
dataFrame.foreachPartition(f)
{code}

> Remove support for Scala 2.11 in Spark 3.0.0
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
> Issue Type: Improvement
> Components: Build, Spark Core
> Affects Versions: 3.0.0
> Reporter: Sean R. Owen
> Assignee: Sean R. Owen
> Priority: Major
> Labels: release-notes
> Fix For: 3.0.0
>
> Per some discussion on the mailing list, we are _considering_ formally not
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.
[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178019#comment-17178019 ] Rafael commented on SPARK-26132:

[~srowen] The release notes for Spark 3.0.0 mention your ticket:
{quote}Due to the upgrade of Scala 2.12, {{DataStreamWriter.foreachBatch}} is not source compatible for Scala program. You need to update your Scala source code to disambiguate between Scala function and Java lambda. (SPARK-26132)
{quote}
So maybe you know how we should now use *foreachPartition* in Scala code:

{code:java}
dataFrame.foreachPartition(partition => {
  partition
    .grouped(Config.BATCH_SIZE)
    .foreach(batch => {
      ...
    })
})
{code}

Right now any method call on the partition, such as grouped or foreach, fails with the compile error *value grouped is not a member of Object*.
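The fix confirmed in the follow-up comment can be sketched like this (assuming Spark's Dataset API; the method name writeInBatches and the batchSize parameter, standing in for Config.BATCH_SIZE, are hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Ascribing the handler an explicit Iterator[Row] => Unit type forces the
// Scala compiler to pick the Scala overload of foreachPartition instead of
// the Java ForeachPartitionFunction one, so the element type is no longer
// inferred as Object.
def writeInBatches(dataFrame: DataFrame, batchSize: Int): Unit = {
  val handler: Iterator[Row] => Unit = partition =>
    partition.grouped(batchSize).foreach { batch =>
      // process one batch of rows here
    }
  dataFrame.foreachPartition(handler)
}
```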
[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107376#comment-17107376 ] Rafael edited comment on SPARK-20427 at 5/18/20, 7:27 AM: -- Hey guys, I encountered an issue related to decimal precision. The code now expects Decimal columns to have precision and scale in the JDBC metadata. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in Oracle DB it is valid to have a Decimal column without this metadata. When I query the metadata for such a column I get DATA_PRECISION = Null and DATA_SCALE = Null. Then when I run `spark-sql` I get this error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a workaround so that spark-sql can handle such cases? UPDATE: Solved with a custom schema. was (Author: kyrdan): Hey guys, I encountered an issue related to decimal precision. The code now expects Decimal columns to have precision and scale in the JDBC metadata. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in Oracle DB it is valid to have a Decimal column without this metadata. When I query the metadata for such a column I get DATA_PRECISION = Null and DATA_SCALE = Null. 
Then when I run `spark-sql` I get this error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a workaround so that spark-sql can handle such cases? > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.0 > > > In Oracle there exists the data type NUMBER. When defining a field in a table of type > NUMBER the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a field Spark can create numbers with precision exceeding > 38. In our case it has created fields with precision 44, > calculated as the sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was that a data frame was read from a table on one schema but > could not be inserted in the identical table on another schema.
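The "custom schema" fix mentioned in the UPDATE can be sketched with the JDBC reader's `customSchema` option, which overrides the type Spark would otherwise infer from the driver metadata. This is a sketch under assumptions: the connection URL, table name `MY_TABLE`, and column name `AMOUNT` are placeholders, not details from the original report.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: force an explicit decimal type for an Oracle NUMBER column whose
// DATA_PRECISION / DATA_SCALE are null in the metadata, so Spark never has
// to infer a precision that exceeds its 38-digit maximum.
val spark = SparkSession.builder().appName("oracle-decimal-workaround").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//host:1521/service") // placeholder
  .option("dbtable", "MY_TABLE")                          // placeholder
  .option("user", "USERNAME")
  .option("password", "PASSWORD")
  // customSchema only overrides the listed columns; the rest keep inferred types.
  .option("customSchema", "AMOUNT DECIMAL(38, 10)")       // placeholder column
  .load()
```

The precision/scale chosen here (38, 10) is only illustrative; it should match the range of values actually stored in the unconstrained NUMBER column.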
[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107376#comment-17107376 ] Rafael commented on SPARK-20427: Hey guys, I encountered an issue related to decimal precision. The code now expects Decimal columns to have precision and scale in the JDBC metadata. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in Oracle DB it is valid to have a Decimal column without this metadata. When I query the metadata for such a column I get DATA_PRECISION = Null and DATA_SCALE = Null. Then when I run `spark-sql` I get this error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a workaround so that spark-sql can handle such cases? > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.0 > > > In Oracle there exists the data type NUMBER. When defining a field in a table of type > NUMBER the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a field Spark can create numbers with precision exceeding > 38. 
In our case it has created fields with precision 44, > calculated as the sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was that a data frame was read from a table on one schema but > could not be inserted in the identical table on another schema.
[jira] [Comment Edited] (SPARK-30100) Decimal Precision Inferred from JDBC via Spark
[ https://issues.apache.org/jira/browse/SPARK-30100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107365#comment-17107365 ] Rafael edited comment on SPARK-30100 at 5/14/20, 2:48 PM: -- Hey guys, I encountered an issue related to decimal precision. The code now expects Decimal columns to have precision and scale in the JDBC metadata. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in Oracle DB it is valid to have a Decimal column without this metadata. When I query the metadata for such a column I get DATA_PRECISION = Null and DATA_SCALE = Null. Then when I run `spark-sql` I get this error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a workaround so that spark-sql can handle such cases? was (Author: kyrdan): Hey guys, I encountered an issue related to decimal precision. The code now expects Decimal columns to have precision and scale in the JDBC metadata. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in Oracle DB it is valid to have a Decimal column without this metadata. When I query the metadata for such a column I get DATA_PRECISION = Null and DATA_SCALE = Null. 
Then when I run `spark-sql` I get this error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a workaround so that spark-sql can handle such cases? > Decimal Precision Inferred from JDBC via Spark > -- > > Key: SPARK-30100 > URL: https://issues.apache.org/jira/browse/SPARK-30100 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Joby Joje >Priority: Major > > When trying to load data from JDBC(Oracle) into Spark, there seems to be > precision loss in the decimal field, as per my understanding Spark supports > *DECIMAL(38,18)*. The field from the Oracle is DECIMAL(38,14), whereas Spark > rounds off the last four digits making it a precision of DECIMAL(38,10). This > is happening to a few fields in the dataframe where the column is fetched using > a CASE statement whereas in the same query another field populates the right > schema. > Tried to pass the > {code:java} > spark.sql.decimalOperations.allowPrecisionLoss=false{code} > conf in spark-submit but didn't get the desired results. > {code:java} > jdbcDF = spark.read \ > .format("jdbc") \ > .option("url", "ORACLE") \ > .option("dbtable", "QUERY") \ > .option("user", "USERNAME") \ > .option("password", "PASSWORD") \ > .load(){code} > So considering that Spark infers the schema from sample records, how > does this work here? Does it use the results of the query i.e (SELECT * FROM > TABLE_NAME JOIN ...) 
or does it take a different route to guess the schema > for itself? Can someone throw some light on this and advise how to achieve > the right decimal precision in this regard without manipulating the query? Doing > a CAST in the query does solve the issue, but I would prefer some > alternatives.
[jira] [Commented] (SPARK-30100) Decimal Precision Inferred from JDBC via Spark
[ https://issues.apache.org/jira/browse/SPARK-30100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107365#comment-17107365 ] Rafael commented on SPARK-30100: Hey guys, I encountered an issue related to decimal precision. The code now expects Decimal columns to have precision and scale in the JDBC metadata. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in Oracle DB it is valid to have a Decimal column without this metadata. When I query the metadata for such a column I get DATA_PRECISION = Null and DATA_SCALE = Null. Then when I run `spark-sql` I get this error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a workaround so that spark-sql can handle such cases? > Decimal Precision Inferred from JDBC via Spark > -- > > Key: SPARK-30100 > URL: https://issues.apache.org/jira/browse/SPARK-30100 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Joby Joje >Priority: Major > > When trying to load data from JDBC(Oracle) into Spark, there seems to be > precision loss in the decimal field, as per my understanding Spark supports > *DECIMAL(38,18)*. The field from the Oracle is DECIMAL(38,14), whereas Spark > rounds off the last four digits making it a precision of DECIMAL(38,10). 
This > is happening to a few fields in the dataframe where the column is fetched using > a CASE statement whereas in the same query another field populates the right > schema. > Tried to pass the > {code:java} > spark.sql.decimalOperations.allowPrecisionLoss=false{code} > conf in spark-submit but didn't get the desired results. > {code:java} > jdbcDF = spark.read \ > .format("jdbc") \ > .option("url", "ORACLE") \ > .option("dbtable", "QUERY") \ > .option("user", "USERNAME") \ > .option("password", "PASSWORD") \ > .load(){code} > So considering that Spark infers the schema from sample records, how > does this work here? Does it use the results of the query i.e (SELECT * FROM > TABLE_NAME JOIN ...) or does it take a different route to guess the schema > for itself? Can someone throw some light on this and advise how to achieve > the right decimal precision in this regard without manipulating the query? Doing > a CAST in the query does solve the issue, but I would prefer some > alternatives.
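As an alternative to a CAST inside the query, the JDBC reader's `customSchema` option can pin the decimal type Spark uses for a specific column so the full scale is kept. A sketch, with the caveat that the column name `AMOUNT` and the connection settings are placeholders and not details from the original report:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: pin the decimal type for a column at read time instead of CASTing
// in the query, so a DECIMAL(38,14) source column is not narrowed to (38,10).
val spark = SparkSession.builder().appName("jdbc-decimal-precision").getOrCreate()

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//host:1521/service") // placeholder
  .option("dbtable", "QUERY")                             // as in the report
  .option("user", "USERNAME")
  .option("password", "PASSWORD")
  // Keep the source column's full scale; AMOUNT is a placeholder column name.
  .option("customSchema", "AMOUNT DECIMAL(38, 14)")
  .load()

jdbcDF.printSchema() // AMOUNT should now appear as decimal(38,14)
```

Note that `spark.sql.decimalOperations.allowPrecisionLoss` governs the result types of decimal arithmetic, not the schema inferred from JDBC metadata, which is consistent with it having no effect in the report above.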
[jira] [Commented] (SPARK-17351) Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality
[ https://issues.apache.org/jira/browse/SPARK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107338#comment-17107338 ] Rafael commented on SPARK-17351: Hey guys, I know that it is a very old ticket but I encountered an issue related to these changes, so let me ask my question here. The code now expects Decimal columns to have precision and scale in the JDBC metadata. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in Oracle DB it is valid to have a Decimal column without this metadata. When I query the metadata for such a column I get DATA_PRECISION = Null and DATA_SCALE = Null. Then when I run `spark-sql` I get this error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a workaround so that spark-sql can handle such cases? > Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality > > > Key: SPARK-17351 > URL: https://issues.apache.org/jira/browse/SPARK-17351 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 2.1.0 > > > It would be useful if more of JDBCRDD's JDBC -> Spark SQL functionality was > usable from outside of JDBCRDD; this would make it easier to write test > harnesses comparing Spark output against other JDBC databases. 
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947794#comment-16947794 ] Rafael commented on SPARK-15689: Could you clarify whether item 4 from the original list was implemented: _4. Nice-to-have: support additional common operators, including limit and sampling._ I'm wondering whether the limit operation is now optimized, i.e. pushed down to the source alongside `PushedFilters:`. > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Major > Labels: SPIP, releasenotes > Fix For: 2.3.0 > > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers from. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depend on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. 
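One way to answer this kind of pushdown question empirically is to inspect the physical plan: whatever the source actually received appears in the scan node. A sketch, assuming an existing DataFrame `df` read from a data source; the exact scan-node text varies by Spark version and by source implementation.

```scala
import org.apache.spark.sql.functions.col

// Sketch: check what got pushed into the source by reading the physical plan.
// `df` is assumed to be a DataFrame loaded from a (v2) data source.
val limited = df.filter(col("id") > 10).limit(100)

// The scan node lists pushed predicates, e.g.
//   PushedFilters: [IsNotNull(id), GreaterThan(id,10)]
// If the limit does not appear in the scan node, Spark applies it after the
// scan instead of pushing it into the source.
limited.explain(true)
```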