[jira] [Commented] (SPARK-25390) Data source V2 API refactoring

2021-02-23 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289577#comment-17289577
 ] 

Rafael commented on SPARK-25390:


Sorry for the late response.
I was able to migrate my project to Spark 3.0.0.
Here are some hints on what I did:
https://gist.github.com/rafaelkyrdan/2bea8385aadd71be5bf67cddeec59581
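
For reference, here is a minimal sketch of a Spark 3.0 batch read path on the new org.apache.spark.sql.connector API (the Generating* names, the single-column schema, and the row-generation logic are illustrative placeholders, not taken from the gist):
{code:java}
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point: replaces the old DataSourceV2 + ReadSupport pair.
class DefaultSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    StructType(Seq(StructField("value", IntegerType)))

  override def getTable(schema: StructType, partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new GeneratingTable(schema)
}

// A Table declares its capabilities and hands out ScanBuilders.
class GeneratingTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "generating"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder {
      override def build(): Scan = new GeneratingScan(tableSchema)
    }
}

// Scan + Batch together take over the old DataSourceReader's job.
class GeneratingScan(tableSchema: StructType) extends Scan with Batch {
  override def readSchema(): StructType = tableSchema
  override def toBatch: Batch = this
  override def planInputPartitions(): Array[InputPartition] =
    Array[InputPartition](GeneratingInputPartition(start = 0, count = 10))
  override def createReaderFactory(): PartitionReaderFactory = new GeneratingReaderFactory
}

// InputPartition is now just a serializable description of one split.
case class GeneratingInputPartition(start: Int, count: Int) extends InputPartition

// The factory replaces the old InputPartition.createPartitionReader.
class GeneratingReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val p = partition.asInstanceOf[GeneratingInputPartition]
    new GeneratingPartitionReader(p.start, p.count)
  }
}

// Same next/get/close contract as the old InputPartitionReader.
class GeneratingPartitionReader(start: Int, count: Int) extends PartitionReader[InternalRow] {
  private var i = -1
  override def next(): Boolean = { i += 1; i < count }
  override def get(): InternalRow = InternalRow(start + i)
  override def close(): Unit = ()
}
{code}
The DataFrameReader side (spark.read.format(...).option(...).load()) stays the same; only the provider-side interfaces change.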



> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it's not very clear how we should abstract the data source v2 API. The 
> abstraction should be unified between batch and streaming, or similar but 
> have a well-defined difference between batch and streaming. And the 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27708) Add documentation for v2 data sources

2020-08-16 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178583#comment-17178583
 ] 

Rafael commented on SPARK-27708:


[~rdblue] [~jlaskowski]

Hey guys, I'm trying to migrate my package, which uses the V2 DataSources, to 
Spark 3, and any docs/guides would be very useful to me and to everybody else 
who is using the V2 DataSources.

I added my plan here; could you share your knowledge?

https://issues.apache.org/jira/browse/SPARK-25390?focusedCommentId=17178052&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17178052

> Add documentation for v2 data sources
> -
>
> Key: SPARK-27708
> URL: https://issues.apache.org/jira/browse/SPARK-27708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: documentation
>
> Before the 3.0 release, the new v2 data sources should be documented. This 
> includes:
>  * How to plug in catalog implementations
>  * Catalog plugin configuration
>  * Multi-part identifier behavior
>  * Partition transforms
>  * Table properties that are used to pass table info (e.g. "provider")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-15 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052
 ] 

Rafael edited comment on SPARK-25390 at 8/15/20, 6:27 PM:
--

Hey guys [~cloud_fan] [~b...@cloudera.com],

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._* to *Spark 3.0.0*

and haven't found a good guide, so may I ask my questions here?

Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what I should use instead of ReadSupport and 
DataSourceReader.

If we now have to use Scan instead of ReadSupport, what happened to the 
createReader method?
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

I haven't found what I should use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
If the new interface is the one from the read package, it has a totally 
different contract.
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}
And will the usage remain the same?
{code:java}
val df = sparkSession.read
  .format("sources.v2.generating")
  .option(OPT_PARTITIONS, numberOfRows)
  .option(OPT_DESCRIPTOR, descriptorJson)
  .option(OPT_SOURCE_TYPE, sourceConnection)
  .load()
{code}
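
For question 2: those imports do match Spark 3.0, where Distribution and Partitioning live under org.apache.spark.sql.connector.read.partitioning, and outputPartitioning moves onto the Scan via the SupportsReportPartitioning mix-in. A minimal sketch of just that shape (the class name and the stub Partitioning are illustrative):
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, Partitioning}
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
import org.apache.spark.sql.types.StructType

// A real scan would also mix in Batch (planInputPartitions/createReaderFactory);
// this only shows where outputPartitioning now lives.
class GeneratingScanWithPartitioning(tableSchema: StructType, parts: Int)
    extends Scan with SupportsReportPartitioning {

  override def readSchema(): StructType = tableSchema

  override def outputPartitioning(): Partitioning = new Partitioning {
    override def numPartitions(): Int = parts
    // Conservative stub: claim no particular data distribution.
    override def satisfy(distribution: Distribution): Boolean = false
  }
}
{code}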


was (Author: kyrdan):
Hey guys [~cloud_fan] [~b...@cloudera.com]

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

Here my migration plan can you highlight what interfaces should I use now

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
if a new interface is from package read then it has totally different new 
contract.
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-15 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052
 ] 

Rafael edited comment on SPARK-25390 at 8/15/20, 6:26 PM:
--

Hey guys [~cloud_fan] [~b...@cloudera.com],

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._* to *Spark 3.0.0*

and haven't found a good guide, so may I ask my questions here?

Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what I should use instead of ReadSupport and 
DataSourceReader.

If we now have to use Scan instead of ReadSupport, what happened to the 
createReader method?
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

I haven't found what I should use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
If the new interface is the one from the read package, it has a totally 
different contract.
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}
And will the usage remain the same?
{code:java}
val df = sparkSession.read
  .format("sources.v2.generating")
  .option(OPT_PARTITIONS, calculateNumberOfPartitions(numberOfRows))
  .option(OPT_DESCRIPTOR, descriptorJson)
  .option(OPT_SOURCE_TYPE, 
getSqlName(jobExecutionPlanContext.jobDescriptor.jobHeader.sourceConnection))
  .load()
{code}


was (Author: kyrdan):
Hey guys [~cloud_fan] [~b...@cloudera.com]

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

Here my migration plan can you highlight what interfaces should I use now

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
if a new interface is from package read then it has totally different new 
contract.
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052
 ] 

Rafael edited comment on SPARK-25390 at 8/14/20, 8:45 PM:
--

Hey guys [~cloud_fan] [~b...@cloudera.com],

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._* to *Spark 3.0.0*

and haven't found a good guide, so may I ask my questions here?

Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what I should use instead of ReadSupport and 
DataSourceReader.

If we now have to use Scan instead of ReadSupport, what happened to the 
createReader method?
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

I haven't found what I should use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
If the new interface is the one from the read package, it has a totally 
different contract.
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}


was (Author: kyrdan):
Hey guys [~cloud_fan] [~b...@cloudera.com]

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

Here my migration plan can you highlight what interfaces should I use now

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement

[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052
 ] 

Rafael edited comment on SPARK-25390 at 8/14/20, 8:44 PM:
--

Hey guys [~cloud_fan] [~b...@cloudera.com],

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._* to *Spark 3.0.0*

and haven't found a good guide, so may I ask my questions here?

Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what I should use instead of ReadSupport and 
DataSourceReader.

If we now have to use Scan instead of ReadSupport, what happened to the 
createReader method?
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

I haven't found what I should use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
The similarly named new interface has a totally different contract:
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}


was (Author: kyrdan):
Hey guys, 

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

Here my migration plan can you highlight what interfaces should I use now

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0

[jira] [Commented] (SPARK-25390) Data source V2 API refactoring

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052
 ] 

Rafael commented on SPARK-25390:


Hey guys,

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._* to *Spark 3.0.0*

and haven't found a good guide, so may I ask my questions here?

Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what I should use instead of ReadSupport and 
DataSourceReader.

If we now have to use Scan instead of ReadSupport, what happened to the 
createReader method?
{code:java}

class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of

 
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
 

I should use

 
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
 

right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

I haven't found what I should use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
The similarly named new interface has a totally different contract:
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it's not very clear how we should abstract the data source v2 API. The 
> abstraction should be unified between batch and streaming, or similar but 
> have a well-defined difference between batch and streaming. And the 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178052#comment-17178052
 ] 

Rafael edited comment on SPARK-25390 at 8/14/20, 8:42 PM:
--

Hey guys,

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._* to *Spark 3.0.0*

and haven't found a good guide, so may I ask my questions here?

Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what I should use instead of ReadSupport and 
DataSourceReader.

If we now have to use Scan instead of ReadSupport, what happened to the 
createReader method?
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

I haven't found what I should use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
The similarly named new interface has a totally different contract:
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}


was (Author: kyrdan):
Hey guys, 

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

 

Here my migration plan can you highlight what interfaces should I use now

 

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}

class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of

 
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
 

I should use

 
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
 

right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>   

[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178035#comment-17178035
 ] 

Rafael commented on SPARK-26132:


Thank you [~srowen], yes it works.
{code:java}
val f: Iterator[Row] => Unit = (iterator: Iterator[Row]) => {}
dataFrame.foreachPartition(f)
{code}

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178019#comment-17178019
 ] 

Rafael commented on SPARK-26132:


[~srowen]

In the release notes for Spark 3.0.0 they mentioned your ticket:
{quote}Due to the upgrade of Scala 2.12, {{DataStreamWriter.foreachBatch}} is 
not source compatible for Scala program. You need to update your Scala source 
code to disambiguate between Scala function and Java lambda. (SPARK-26132)
{quote}
 

so maybe you know how we should now use *foreachPartition* in Scala code:
{code:java}
dataFrame.foreachPartition(partition => {
  partition
    .grouped(Config.BATCH_SIZE)
    .foreach(batch => {
      ...
    })
})
{code}
Right now, calling any method on it, like grouped or foreach, causes the 
exception *value grouped is not a member of Object*.
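
For anyone hitting the same thing: with Scala 2.12, Dataset.foreachPartition has two overloads (a Scala Iterator[T] => Unit and a Java ForeachPartitionFunction[T]), so the compiler can no longer infer the parameter type of a bare lambda. A minimal sketch of the workaround, matching the fix above (batchSize and the batch handling are illustrative):
{code:java}
import org.apache.spark.sql.{DataFrame, Row}

def writeInBatches(dataFrame: DataFrame, batchSize: Int): Unit = {
  // Assigning the lambda to an explicitly typed function value forces the
  // Iterator[Row] => Unit overload, so the partition elements are Rows again.
  val processPartition: Iterator[Row] => Unit = partition =>
    partition.grouped(batchSize).foreach { batch =>
      // handle one batch of Row objects here (left empty in this sketch)
      ()
    }
  dataFrame.foreachPartition(processPartition)
}
{code}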

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2020-05-18 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107376#comment-17107376
 ] 

Rafael edited comment on SPARK-20427 at 5/18/20, 7:27 AM:
--

Hey guys,
I encountered an issue related to decimal precision.

The code now expects the JDBC metadata for a Decimal type to include precision 
and scale.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414]

 

I found out that in Oracle it is valid to define a Decimal column without this 
data. When I query the metadata for such a column, I get 
DATA_PRECISION = Null and DATA_SCALE = Null.

Then when I run `spark-sql` I get this error:
{code:java}
java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 
exceeds max precision 38
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407)
{code}
Do you have a workaround so that spark-sql can handle such cases?

 

UPDATE:

Solved with a custom schema.
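
In case it helps others, a minimal sketch of that kind of workaround (the connection details and the AMOUNT column are illustrative, and a SparkSession named spark is assumed): the JDBC customSchema option overrides the precision/scale Spark would otherwise take from the driver metadata.
{code:java}
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .option("user", "USERNAME")
  .option("password", "PASSWORD")
  // Pin the type of the NUMBER column that has no declared precision/scale,
  // so Spark stays within DecimalType's maximum precision of 38.
  .option("customSchema", "AMOUNT DECIMAL(38, 10)")
  .load()
{code}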


was (Author: kyrdan):
Hey guys, 
I encountered an issue related to precision issues.

Now the code expects the Decimal type we need to have in JDBC metadata 
precision and scale. 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414]

 

I found out that in the OracleDB it is valid to have Decimal without these 
data. When I do a query read metadata for such column I'm getting 
DATA_PRECISION = Null, and DATA_SCALE = Null.

Then when I run the `spark-sql` I'm getting such error:
{code:java}
java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 
exceeds max precision 38
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407)
{code}
Do you have a work around how spark-sql can work with such cases?

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> In Oracle there exists the data type NUMBER. When defining a field in a table of type 
> NUMBER, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it has created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame was read from a table on one schema but 
> could not be inserted into the identical table on another schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2020-05-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107376#comment-17107376
 ] 

Rafael commented on SPARK-20427:


Hey guys,
I encountered an issue related to decimal precision.

The code now expects the JDBC metadata for a Decimal type to include precision 
and scale.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414]

 

I found out that in Oracle it is valid to define a Decimal column without this 
data. When I query the metadata for such a column, I get 
DATA_PRECISION = Null and DATA_SCALE = Null.

Then when I run `spark-sql` I get this error:
{code:java}
java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 
exceeds max precision 38
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407)
{code}
Do you have a workaround so that spark-sql can handle such cases?

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> In Oracle there exists the data type NUMBER. When defining a field in a table of type 
> NUMBER, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it has created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame was read from a table on one schema but 
> could not be inserted into the identical table on another schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30100) Decimal Precision Inferred from JDBC via Spark

2020-05-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107365#comment-17107365
 ] 

Rafael edited comment on SPARK-30100 at 5/14/20, 2:48 PM:
--

Hey guys,
I encountered an issue related to decimal precision.

The code now expects the JDBC metadata for a Decimal type to include precision 
and scale.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414]

 

I found out that in Oracle it is valid to define a Decimal column without this 
data. When I query the metadata for such a column, I get 
DATA_PRECISION = Null and DATA_SCALE = Null.

Then when I run `spark-sql` I get this error:
{code:java}
java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 
exceeds max precision 38
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407)
{code}
Do you have a workaround so that spark-sql can handle such cases?


was (Author: kyrdan):
Hey guys, 
I encountered an issue related to the precision issues.

Now the code expects the for the Decimal type we need to have in JDBC metadata 
precision and scale. 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414]

 

I found out that in the OracleDB it is valid to have Decimal without these 
data. When I do a query read metadata for such column I'm getting 
DATA_PRECISION = Null, and DATA_SCALE = Null.

Then when I run the `spark-sql` I'm getting such error:
{code:java}
java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 
exceeds max precision 38
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407)
{code}
Do you have a work around how spark-sql can work with such cases?

> Decimal Precision Inferred from JDBC via Spark
> --
>
> Key: SPARK-30100
> URL: https://issues.apache.org/jira/browse/SPARK-30100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Joby Joje
>Priority: Major
>
> When trying to load data from JDBC (Oracle) into Spark, there seems to be 
> precision loss in the decimal field; as per my understanding, Spark supports 
> *DECIMAL(38,18)*. The field from Oracle is DECIMAL(38,14), whereas Spark 
> rounds off the last four digits, making it a precision of DECIMAL(38,10). This 
> is happening to a few fields in the dataframe where the column is fetched using 
> a CASE statement, whereas in the same query another field populates the right 
> schema.
> I tried to pass the
> {code:java}
> spark.sql.decimalOperations.allowPrecisionLoss=false{code}
> conf in spark-submit, though I didn't get the desired results.
> {code:java}
> jdbcDF = spark.read \ 
> .format("jdbc") \ 
> .option("url", "ORACLE") \ 
> .option("dbtable", "QUERY") \ 
> .option("user", "USERNAME") \ 
> .option("password", "PASSWORD") \ 
> .load(){code}
> So, considering that Spark infers the schema from sample records, how 
> does this work here? Does it use the results of the query, i.e. (SELECT * FROM 
> TABLE_NAME JOIN ...), or does it take a different route to guess the schema 
> for itself? Can someone throw some light on this and advise how to achieve 
> the right decimal precision in this regard without manipulating the query? 
> Doing a CAST in the query does solve the issue, but I would prefer some 
> alternatives.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30100) Decimal Precision Inferred from JDBC via Spark

2020-05-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107365#comment-17107365
 ] 

Rafael commented on SPARK-30100:


Hey guys,
I encountered an issue related to decimal precision.

The code now expects the JDBC metadata for a Decimal type to include precision 
and scale.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414]

 

I found out that in Oracle it is valid to define a Decimal column without this 
data. When I query the metadata for such a column, I get 
DATA_PRECISION = Null and DATA_SCALE = Null.

Then when I run `spark-sql` I get this error:
{code:java}
java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 
exceeds max precision 38
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407)
{code}
Do you have a workaround so that spark-sql can handle such cases?

> Decimal Precision Inferred from JDBC via Spark
> --
>
> Key: SPARK-30100
> URL: https://issues.apache.org/jira/browse/SPARK-30100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Joby Joje
>Priority: Major
>
> When trying to load data from JDBC (Oracle) into Spark, there seems to be 
> precision loss in the decimal field; as per my understanding, Spark supports 
> *DECIMAL(38,18)*. The field from Oracle is DECIMAL(38,14), whereas Spark 
> rounds off the last four digits, making it a precision of DECIMAL(38,10). This 
> is happening to a few fields in the dataframe where the column is fetched using 
> a CASE statement, whereas in the same query another field populates the right 
> schema.
> I tried to pass the
> {code:java}
> spark.sql.decimalOperations.allowPrecisionLoss=false{code}
> conf in spark-submit, though I didn't get the desired results.
> {code:java}
> jdbcDF = spark.read \ 
> .format("jdbc") \ 
> .option("url", "ORACLE") \ 
> .option("dbtable", "QUERY") \ 
> .option("user", "USERNAME") \ 
> .option("password", "PASSWORD") \ 
> .load(){code}
> So, considering that Spark infers the schema from sample records, how 
> does this work here? Does it use the results of the query, i.e. (SELECT * FROM 
> TABLE_NAME JOIN ...), or does it take a different route to guess the schema 
> for itself? Can someone throw some light on this and advise how to achieve 
> the right decimal precision in this regard without manipulating the query? 
> Doing a CAST in the query does solve the issue, but I would prefer some 
> alternatives.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17351) Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality

2020-05-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107338#comment-17107338
 ] 

Rafael commented on SPARK-17351:


Hey guys,
I know that this is a very old ticket, but I encountered an issue related to 
these changes, so let me ask my question here.

The code now expects the JDBC metadata for a Decimal type to include precision 
and scale.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414]

 

I found out that in Oracle it is valid to define a Decimal column without this 
data. When I query the metadata for such a column, I get 
DATA_PRECISION = Null and DATA_SCALE = Null.

Then when I run `spark-sql` I get this error:
{code:java}
java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 
exceeds max precision 38
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407)
{code}
Do you have a workaround so that spark-sql can handle such cases?

> Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality
> 
>
> Key: SPARK-17351
> URL: https://issues.apache.org/jira/browse/SPARK-17351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 2.1.0
>
>
> It would be useful if more of JDBCRDD's JDBC -> Spark SQL functionality was 
> usable from outside of JDBCRDD; this would make it easier to write test 
> harnesses comparing Spark output against other JDBC databases. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2019-10-09 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947794#comment-16947794
 ] 

Rafael commented on SPARK-15689:


Could you clarify whether item 4 from the original list was implemented:

_4. Nice-to-have: support additional common operators, including limit and 
sampling._

I am wondering whether the limit operation is now optimized, i.e. pushed down 
and reflected under `PushedFilters:`.

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org