[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) 
either via {{spark-shell}} or {{spark-sql}} and then read it back from 
{{spark-shell}}, it outputs {{[01]}}. However, it does not encode correctly 
when queried via {{spark-sql}}.

That is:
* Insert via spark-shell, read via spark-shell: displays correctly
* Insert via spark-shell, read via spark-sql: does not display correctly
* Insert via spark-sql, read via spark-sql: does not display correctly
* Insert via spark-sql, read via spark-shell: displays correctly
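
For reference, both example inputs denote the same single byte 0x01. A quick Scala REPL check (a sketch we added for clarity, not part of the original reproduction steps):
{code:java}
scala> BigInt("1").toByteArray
res0: Array[Byte] = Array(1)

scala> BigInt("1").toByteArray.map("%02X".format(_)).mkString // same byte as the SQL literal X'01'
res1: String = 01
{code}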
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row 
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[356] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
scala> spark.sql("select * from binary_vals_shell;").show(false)
+----+
|c1  |
+----+
|[01]|
+----+{code}
Then, using {{spark-sql}}, we (1) query what was inserted via spark-shell into the 
binary_vals_shell table, and (2) insert the same value via spark-sql into a new 
binary_vals_sql table (we use tee to redirect the output to a log file):
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
Executing the following, we only get empty output in the terminal (but a 
garbage character in the log file):
{code:java}
spark-sql> select * from binary_vals_shell; -- query what is inserted via spark-shell;

spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals_sql select X'01'; -- try to insert directly in spark-sql;
spark-sql> select * from binary_vals_sql;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows up as a garbage character. (We never 
encountered this garbage character in logs of other data types.)
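
One way to confirm this (our suggestion, not part of the original report) is to check whether the captured log actually contains the raw 0x01 byte, e.g. from the Scala REPL:
{code:java}
scala> // If spark-sql printed the binary value as a raw byte, we would expect this to return true.
scala> scala.io.Source.fromFile("sql.log").mkString.exists(_ == '\u0001')
{code}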
h3. !image-2022-10-18-12-15-05-576.png!

We then return to spark-shell again and run the following:
{code:java}
scala> spark.sql("select * from binary_vals_sql;").show(false)
+----+
|c1  |
+----+
|[01]|
+----+{code}
Although the binary value does not display correctly via spark-sql, it still 
displays correctly via spark-shell.
h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to 
behave consistently for the same data type ({{BINARY}}) & input 
({{BigInt("1").toByteArray}} / {{X'01'}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.
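
For completeness, a sketch of how the same check can be repeated with Parquet (our illustration, not the exact commands from the report; the table name binary_vals_parquet is ours, and it reuses the rdd and schema from the snippet above). The Avro run additionally requires the external spark-avro package.
{code:java}
scala> // Same data and schema as before; only the storage format changes.
scala> val dfp = spark.createDataFrame(rdd, schema)
scala> dfp.write.mode("overwrite").format("parquet").saveAsTable("binary_vals_parquet")
scala> spark.sql("select * from binary_vals_parquet;").show(false)
{code}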

  was:
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
when querying it via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row 
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
scala> spark.sql("select * from binary_vals_shell;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} to (1) query what is inserted via spark-shell to the 
binary_vals_shell table, and then (2) insert the value via spark-sql to the 
binary_vals_sql table (we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal (but a 
garbage character in the log file):
{code:java}
spark-sql> select * from binary_vals_shell; -- query what is 

[jira] [Commented] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2022-10-18 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17619715#comment-17619715
 ] 

xsys commented on SPARK-40637:
--

Thank you for the response [~ivan.sadikov].

I just updated the description with more details about how to reproduce it 
(including writing the value to the table in the first example).

Basically, when we insert the binary value either via spark-shell or spark-sql, 
spark-shell displays it correctly but spark-sql does not.

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

> Spark-shell can correctly encode BINARY type but Spark-sql cannot
> -
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
> Attachments: image-2022-10-18-12-15-05-576.png
>
>
> h3. Describe the bug
> When we store a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) 
> either via {{spark-shell}} or {{spark-sql}} and then read it back from 
> {{spark-shell}}, it outputs {{[01]}}. However, it does not encode correctly 
> when queried via {{spark-sql}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> import org.apache.spark.sql.Row 
> scala> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[356] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
> scala> spark.sql("select * from binary_vals_shell;").show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++{code}
> We can see the output using "spark.sql" in spark-shell.
> Then using {{spark-sql}} to (1) query what is inserted via spark-shell to the 
> binary_vals_shell table, and then (2) insert the value via spark-sql to the 
> binary_vals_sql table (we use tee to redirect the log to a file)
> {code:java}
> $SPARK_HOME/bin/spark-sql | tee sql.log{code}
>  Execute the following, we only get an empty output in the terminal (but a 
> garbage character in the log file):
> {code:java}
> spark-sql> select * from binary_vals_shell; -- query what is inserted via 
> spark-shell;
> spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC; 
> spark-sql> insert into binary_vals_sql select X'01'; -- try to insert 
> directly in spark-sql;
> spark-sql> select * from binary_vals_sql;
> Time taken: 0.077 seconds, Fetched 1 row(s)
> {code}
> From the log file, we find it shows as a garbage character. (We never 
> encountered this garbage character in logs of other data types)
> h3. !image-2022-10-18-12-15-05-576.png!
> We then return to spark-shell again and run the following:
> {code:java}
> scala> spark.sql("select * from binary_vals_sql;").show(false)
> ++                                                                        
>   
> |c1  |
> ++
> |[01]|
> ++{code}
> The binary value does not display correctly via spark-sql, it still displays 
> correctly via spark-shell.
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type ({{{}BINARY{}}}) & input 
> ({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.
>  
> h3. Additional context
> We also tried Avro and Parquet and encountered the same issue. We believe 
> this is format-independent.






[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
when querying it via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row 
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
scala> spark.sql("select * from binary_vals_shell;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} to (1) query what is inserted via spark-shell to the 
binary_vals_shell table, and then (2) insert the value via spark-sql to the 
binary_vals_sql table (we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal (but a 
garbage character in the log file):
{code:java}
spark-sql> select * from binary_vals_shell; -- query what is inserted via 
spark-shell;

spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC; 
spark-sql> insert into binary_vals_sql select X'01'; -- try to insert directly 
in spark-sql;
spark-sql> select * from binary_vals_sql;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!

We then return to spark-shell again and run the following:
{code:java}
scala> spark.sql("select * from binary_vals_sql;").show(false)
++                                                                          
|c1  |
++
|[01]|
++{code}
The binary value does not display correctly via spark-sql, it still displays 
correctly via spark-shell.
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
when querying it via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row 
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
scala> spark.sql("select * from binary_vals_shell;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} to (1) query what is inserted via spark-shell to the 
binary_vals_shell table, and then (2) insert the value via spark-sql to the 
binary_vals_sql table (we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal (but a 
garbage character in the log file):
{code:java}
spark-sql> select * from binary_vals_shell; -- query what is inserted via 
spark-shell;

spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC; 
spark-sql> insert into binary_vals_sql select X'01'; -- try to insert directly 
in spark-sql;
spark-sql> select * from 

[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
when querying it via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row 
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
scala> spark.sql("select * from binary_vals_shell;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} to (1) query what is inserted via spark-shell to the 
binary_vals_shell table, and then (2) insert the value via spark-sql to the 
binary_vals_sql table (we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal (but a 
garbage character in the log file):
{code:java}
spark-sql> select * from binary_vals_shell; -- query what is inserted via 
spark-shell;

spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC; 
spark-sql> insert into binary_vals_sql select X'01'; -- try to insert directly 
in spark-sql;
spark-sql> select * from binary_vals_sql;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!

We then return to spark-shell again and run the following:
{code:java}
scala> spark.sql("select * from binary_vals_sql;").show(false)
++                                                                          
|c1  |
++
|[01]|
++{code}
The binary value does not display correctly via spark-sql, it still displays 
correctly via spark-shell.
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
when querying it via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row 
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
scala> spark.sql("select * from binary_vals_shell;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} to (1) query what is inserted via spark-shell to the 
binary_vals_shell table, and then (2) insert the value via spark-sql to the 
binary_vals_sql table (we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal (but a 
garbage character in the log file):
{code:java}
spark-sql> select * from binary_vals_shell; -- query what is inserted via 
spark-shell;

spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC; 
spark-sql> insert into binary_vals_sql select X'01'; -- try to insert 

[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
when querying it via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row 
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
scala> spark.sql("select * from binary_vals_shell;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} to (1) query what is inserted via spark-shell to the 
binary_vals_shell table, and then (2) insert the value via spark-sql to the 
binary_vals_sql table (we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal (but a 
garbage character in the log file):
{code:java}
spark-sql> select * from binary_vals_shell; -- query what is inserted via 
spark-shell;

spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC; 
spark-sql> insert into binary_vals_sql select X'01'; -- try to insert directly 
in spark-sql;
spark-sql> select * from binary_vals_sql;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!

We then return to spark-shell again and run the following:
{code:java}
scala> spark.sql("select * from binary_vals_sql;").show(false)
++                                                                          
|c1  |
++
|[01]|
++{code}
The binary value does not display correctly via spark-sql, it still displays 
correctly via spark-shell.
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
when querying it via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} (to see we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> select * from binary_vals; -- check what was inserted by DataFrame

spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC; -- try to insert 
directly in spark-sql 
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
 

From the log file, we find it shows as a garbage character. (We never 
encountered this

[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
when querying it via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} (to see we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> select * from binary_vals; -- check what was inserted by DataFrame

spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC; -- try to insert 
directly in spark-sql 
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
 

From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} (to see we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> select * from binary_vals; -- check what was inserted by DataFrame

spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC; -- try to insert 
directly in spark-sql 
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
 

From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.


> Spark-shell can correctly encode BINARY type but Spark-sql cannot
> -
>
> Key: SPARK-40637
> 

[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Summary: Spark-shell can correctly encode BINARY type but Spark-sql cannot  
(was: DataFrame can correctly encode BINARY type but SparkSQL cannot)

> Spark-shell can correctly encode BINARY type but Spark-sql cannot
> -
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
> Attachments: image-2022-10-18-12-15-05-576.png
>
>
> h3. Describe the bug
> When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
> {{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
> Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
> via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[356] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> df.show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++
> scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
> scala> spark.sql("select * from binary_vals;").show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++{code}
> We can see the output using "spark.sql" in spark-shell.
> Then using {{spark-sql}} (to see we use tee to redirect the log to a file)
> {code:java}
> $SPARK_HOME/bin/spark-sql | tee sql.log{code}
>  Execute the following, we only get an empty output in the terminal:
> {code:java}
> spark-sql> select * from binary_vals; -- check what was inserted by DataFrame
> spark-sql> drop table binary_vals;
> spark-sql> create table binary_vals(c1 BINARY) stored as ORC; -- try to 
> insert directly in spark-sql 
> spark-sql> insert into binary_vals select X'01';
> spark-sql> select * from binary_vals;
> Time taken: 0.077 seconds, Fetched 1 row(s)
> {code}
>  
> From the log file, we find it shows as a garbage character. (We never 
> encountered this garbage character in logs of other data types)
> h3. !image-2022-10-18-12-15-05-576.png!
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type ({{{}BINARY{}}}) & input 
> ({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.
>  
> h3. Additional context
> We also tried Avro and Parquet and encountered the same issue. We believe 
> this is format-independent.






[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
{{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} (to see we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> select * from binary_vals; -- check what was inserted by DataFrame

spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC; -- try to insert 
directly in spark-sql 
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
 

From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} (to see we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> select * from binary_vals; -- check what was inserted by DataFrame

spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC; -- try to insert 
directly in spark-sql 
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
 

From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.


> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --
>
> Key: SPARK-40637
> URL: 

[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{spark-sql}} (to see we use tee to redirect the log to a file)
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> select * from binary_vals; -- check what was inserted by DataFrame

spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC; -- try to insert 
directly in spark-sql 
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
 

From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> select * from binary_vals; 

spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.


> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1

[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
We can see the output using "spark.sql" in spark-shell.

Then using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> select * from binary_vals; 

spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
 

Using {{{}spark-sql (we use tee to redirect the log to a file){}}}:
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.


> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
> Attachments: image-2022-10-18-12-15-05-576.png
>
>
> h3. Describe the bug
> Storing a BINARY value (e.g. 

[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
 

Using {{{}spark-sql (we use tee to redirect the log to a file){}}}:
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
 

Using {{{}spark-sql (we use tee to redirect the log to a file){}}}:
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.


> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
> Attachments: image-2022-10-18-12-15-05-576.png
>
>
> h3. Describe the bug
> Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) 
> via {{spark-shell}} outputs {{{}[01]{}}}. However, it 

[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
 

Using {{{}spark-sql (we use tee to redirect the log to a file){}}}:
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
 Execute the following, we only get an empty output in the terminal:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find it shows as a garbage character. (We never 
encountered this garbage character in logs of other data types)
h3. !image-2022-10-18-12-15-05-576.png!
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
 

Using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 Execute the following, we only get an empty output:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We tried Avro and Parquet and encountered the same issue. We believe this is 
format-independent.


> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
> Attachments: image-2022-10-18-12-15-05-576.png
>
>
> h3. Describe the bug
> Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) 
> via {{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode 
> correctly if the value is inserted into a BINARY column of a table via 
> {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> 

[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Attachment: image-2022-10-18-12-15-05-576.png

> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
> Attachments: image-2022-10-18-12-15-05-576.png
>
>
> h3. Describe the bug
> Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) 
> via {{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode 
> correctly if the value is inserted into a BINARY column of a table via 
> {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[356] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> df.show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++
> scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
> scala> spark.sql("select * from binary_vals;").show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++{code}
>  
> Using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
>  Execute the following, we only get an empty output:
> {code:java}
> spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
> spark-sql> insert into binary_vals select X'01';
> spark-sql> select * from binary_vals;
> Time taken: 0.077 seconds, Fetched 1 row(s)
> {code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type ({{{}BINARY{}}}) & input 
> ({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.
>  
> h3. Additional context
> We tried Avro and Parquet and encountered the same issue. We believe this is 
> format-independent.






[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-18 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
++
|c1  |
++
|[01]|
++{code}
 

Using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 Execute the following, we only get an empty output:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

 
h3. Additional context

We tried Avro and Parquet and encountered the same issue. We believe this is 
format-independent.

  was:
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
{code}
 

Using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 Execute the following, we only get an empty output:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.


> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) 
> via {{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode 
> correctly if the value is inserted into a BINARY column of a table via 
> {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[356] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> df.show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++
> scala> 

[jira] [Updated] (SPARK-40630) Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL

2022-10-03 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40630:
-
Description: 
h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
{{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces unexpectedly 
evaluate the invalid value to {{{}NULL{}}}, instead of throwing an exception.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
 
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 
Execute the following:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B ").toDF("time").select(to_timestamp(col("time")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[721] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", TimestampType,  
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df194: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
++
|c1  |
++
|null|
++
{code}
h3. Expected behavior

We expect both {{spark-sql}} & {{spark-shell}} interfaces to throw an exception 
for an invalid DATE/TIMESTAMP, like what they do for most of the other data 
types (e.g. invalid value {{"foo"}} for {{INT}} data type).
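
A minimal sketch of how the failure could be surfaced instead, assuming Spark 3.2's ANSI mode ({{spark.sql.ansi.enabled}}), under which an invalid cast to TIMESTAMP is expected to raise an error rather than silently produce {{NULL}}:
{code:java}
spark-sql> set spark.sql.ansi.enabled=true;
spark-sql> select cast(' 1969-12-31 23:59:59 B ' as timestamp);
-- expected to fail with a cast/parse error instead of returning NULL
{code}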

  was:
h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
{{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces unexpectedly 
evaluate the invalid value to {{{}NULL{}}}, instead of throwing an exception.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
 
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 
Execute the following:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B 
").toDF("time").select(to_timestamp(col("ti 
me")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[721] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", TimestampType,  
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df194: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
++
|c1  |
++
|null|
++
{code}
h3. Expected behavior

We expect both {{spark-sql}} & {{spark-shell}} interfaces to throw an exception 
for an invalid DATE/TIMESTAMP, like what they do for most of the other data 
types (e.g. invalid value {{"foo"}} for {{INT}} data type).


> Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL
> -
>
> Key: SPARK-40630
> URL: https://issues.apache.org/jira/browse/SPARK-40630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
> {{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
> DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces 
> unexpectedly evaluate the invalid value to {{{}NULL{}}}, instead of throwing 
> an exception.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following:
> {code:java}
> spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
> spark-sql> insert 

[jira] [Updated] (SPARK-40630) Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL

2022-10-03 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40630:
-
Description: 
h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
{{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces unexpectedly 
evaluate the invalid value to {{{}NULL{}}}, instead of throwing an exception.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
 
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 
Execute the following:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B 
").toDF("time").select(to_timestamp(col("ti 
me")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[721] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", TimestampType,  
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df194: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
++
|c1  |
++
|null|
++
{code}
h3. Expected behavior

We expect both {{spark-sql}} & {{spark-shell}} interfaces to throw an exception 
for an invalid DATE/TIMESTAMP, like what they do for most of the other data 
types (e.g. invalid value {{"foo"}} for {{INT}} data type).

  was:
h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
{{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces unexpectedly 
evaluate the invalid value to {{{}NULL{}}}, instead of throwing an exception.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
 
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B 
").toDF("time").select(to_timestamp(col("ti 
me")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[721] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", TimestampType,  
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df194: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
++
|c1  |
++
|null|
++
{code}
h3. Expected behavior

We expect both {{spark-sql}} & {{spark-shell}} interfaces to throw an exception 
for an invalid DATE/TIMESTAMP, like what they do for most of the other data 
types (e.g. invalid value {{"foo"}} for {{INT}} data type).


> Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL
> -
>
> Key: SPARK-40630
> URL: https://issues.apache.org/jira/browse/SPARK-40630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
> {{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
> DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces 
> unexpectedly evaluate the invalid value to {{{}NULL{}}}, instead of throwing 
> an exception.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following:
> {code:java}
> spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
> spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
> timestamp);
> spark-sql> select * from timestamp_vals;
> NULL{code}
>  
> Using {{{}spark-shell{}}}:
> {code:java}
> 

[jira] [Updated] (SPARK-40629) FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL

2022-10-03 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40629:
-
Description: 
h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluates to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast ( 1.0/0  as float);
spark-sql> select * from float_vals;
NULL{code}
 

Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", FloatType, true) 
)
schema: org.apache.spark.sql.types.StructType = StructType( 
StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+-+
|c1       |
+-+
|Infinity |
+-+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).
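
A minimal sketch (assuming the default, non-ANSI settings) illustrating that the {{NULL}} comes from the SQL division semantics rather than from the FLOAT column itself; the special value {{Infinity}} can still be produced in {{spark-sql}} by casting a string literal:
{code:java}
spark-sql> select cast(1.0/0 as float);       -- the division itself evaluates to NULL
NULL
spark-sql> select cast('Infinity' as float);  -- the special float value round-trips fine
Infinity
{code}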

  was:
h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast ( 1.0/0  as float);
spark-sql> select * from float_vals;
NULL{code}
 

Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue(
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", FloatType, true) 
)
schema: org.apache.spark.sql.types.StructType = StructType( 
StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+-+
|c1       |
+-+
|Infinity |
+-+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).


> FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL 
> in SparkSQL
> -
>
> Key: SPARK-40629
> URL: https://issues.apache.org/jira/browse/SPARK-40629
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
> ).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
> {{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
> is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following:
> {code:java}
> spark-sql> create table float_vals(c1 float) stored as ORC;
> spark-sql> insert into float_vals select cast ( 1.0/0  as float);
> spark-sql> select * from float_vals;
> NULL{code}
>  
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue(
> 

[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-10-03 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40624:
-
Description: 
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).
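
For context, a minimal sketch (assuming the default, non-ANSI settings): {{{}BigDecimal("1.0/0"){}}} fails because the string {{"1.0/0"}} is not a numeric literal, whereas the SQL expression {{1.0/0}} is a decimal division that evaluates to {{NULL}} before any table is involved; ANSI mode should turn it into an error:
{code:java}
spark-sql> select cast(1.0/0 as decimal(20,10));  -- the division itself yields NULL (non-ANSI mode)
NULL
spark-sql> set spark.sql.ansi.enabled=true;
spark-sql> select cast(1.0/0 as decimal(20,10));  -- under ANSI mode this should raise a divide-by-zero error instead
{code}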

 

  was:
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}scala> import 
org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
java.lang.NumberFormatException
  at java.math.BigDecimal.(BigDecimal.java:497)
  at java.math.BigDecimal.(BigDecimal.java:383)
  at java.math.BigDecimal.(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 


> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL 
> in SparkSQL
> 
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via 
> {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates 
> to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a 
> table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following: (evaluated to {{{}NULL{}}})
> {code:java}
> spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
> spark-sql> insert into decimal_vals select 1.0/0;
> spark-sql> select * from decimal_vals;
> NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following: (errors out during RDD creation)
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
> java.lang.NumberFormatException
>   at java.math.BigDecimal.(BigDecimal.java:497)
>   at 

[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-10-03 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40624:
-
Description: 
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}scala> import 
org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
java.lang.NumberFormatException
  at java.math.BigDecimal.(BigDecimal.java:497)
  at java.math.BigDecimal.(BigDecimal.java:383)
  at java.math.BigDecimal.(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 

  was:
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
java.lang.NumberFormatException
  at java.math.BigDecimal.(BigDecimal.java:497)
  at java.math.BigDecimal.(BigDecimal.java:383)
  at java.math.BigDecimal.(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 


> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL 
> in SparkSQL
> 
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via 
> {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates 
> to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a 
> table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following: (evaluated to {{{}NULL{}}})
> {code:java}
> spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
> spark-sql> insert into decimal_vals select 1.0/0;
> spark-sql> select * from decimal_vals;
> NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following: (errors out during RDD creation)
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}scala> import 
> org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
> java.lang.NumberFormatException
>   at java.math.BigDecimal.(BigDecimal.java:497)
>   at java.math.BigDecimal.(BigDecimal.java:383)
>   at java.math.BigDecimal.(BigDecimal.java:809)
>   at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
>   at 

[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-02 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40637:
-
Description: 
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
{code}
 

Using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 Execute the following, we only get an empty output:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.

  was:
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 

Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
{code}
Using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 Execute the following, we only get an empty output:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
 
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.


> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) 
> via {{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode 
> correctly if the value is inserted into a BINARY column of a table via 
> {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[356] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> df.show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++
> {code}
>  
> Using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
>  Execute the following, we only get an empty output:
> {code:java}
> spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
> spark-sql> insert into binary_vals select X'01';
> spark-sql> select * from binary_vals;
> Time taken: 0.077 seconds, Fetched 1 row(s)
> 

[jira] [Created] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

2022-10-02 Thread xsys (Jira)
xsys created SPARK-40637:


 Summary: DataFrame can correctly encode BINARY type but SparkSQL 
cannot
 Key: SPARK-40637
 URL: https://issues.apache.org/jira/browse/SPARK-40637
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray)}} / {{{}X'01'{}}}) via 
{{spark-shell}} outputs {{{}[01]{}}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 

Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
++
|c1  |
++
|[01]|
++
{code}
Using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 Execute the following, we only get an empty output:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
 
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type ({{{}BINARY{}}}) & input 
({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.






[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-10-02 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40624:
-
Description: 
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
java.lang.NumberFormatException
  at java.math.BigDecimal.(BigDecimal.java:497)
  at java.math.BigDecimal.(BigDecimal.java:383)
  at java.math.BigDecimal.(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 

  was:
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
java.lang.NumberFormatException
  at java.math.BigDecimal.(BigDecimal.java:497)
  at java.math.BigDecimal.(BigDecimal.java:383)
  at java.math.BigDecimal.(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 


> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL 
> in SparkSQL
> 
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via 
> {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates 
> to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a 
> table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following: (evaluated to {{{}NULL{}}})
> {code:java}
> spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
> spark-sql> insert into decimal_vals select 1.0/0;
> spark-sql> select * from decimal_vals;
> NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following: (errors out during RDD creation)
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
> java.lang.NumberFormatException
>   at java.math.BigDecimal.(BigDecimal.java:497)
>   at java.math.BigDecimal.(BigDecimal.java:383)
>   at java.math.BigDecimal.(BigDecimal.java:809)
>   at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
>   at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
>   ... 49 elided{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).
>  





[jira] [Updated] (SPARK-40629) FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL

2022-10-02 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40629:
-
Description: 
h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast ( 1.0/0  as float);
spark-sql> select * from float_vals;
NULL{code}
 

Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue(
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", FloatType, true) 
)
schema: org.apache.spark.sql.types.StructType = StructType( 
StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+-+
|c1       |
+-+
|Infinity |
+-+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).

  was:
h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals cast ( 1.0/0  as float);
spark-sql> select * from float_vals;
NULL{code}
 

Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue(
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", FloatType, true) 
)
schema: org.apache.spark.sql.types.StructType = StructType( 
StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+-+
|c1       |
+-+
|Infinity |
+-+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).


> FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL 
> in SparkSQL
> -
>
> Key: SPARK-40629
> URL: https://issues.apache.org/jira/browse/SPARK-40629
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
> ).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
> {{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
> is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following:
> {code:java}
> spark-sql> create table float_vals(c1 float) stored as ORC;
> spark-sql> insert into float_vals select cast ( 1.0/0  as float);
> spark-sql> select * from float_vals;
> NULL{code}
>  
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue(
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[180] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", FloatType, true) 
> )
> schema: org.apache.spark.sql.types.StructType = StructType( 
> StructField(c1,FloatType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame 

[jira] [Created] (SPARK-40630) Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL

2022-10-01 Thread xsys (Jira)
xsys created SPARK-40630:


 Summary: Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP 
as NULL
 Key: SPARK-40630
 URL: https://issues.apache.org/jira/browse/SPARK-40630
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
{{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces unexpectedly 
evaluate the invalid value to {{{}NULL{}}}, instead of throwing an exception.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 

Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
 
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B 
").toDF("time").select(to_timestamp(col("ti 
me")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[721] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", TimestampType,  
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df194: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
++
|c1  |
++
|null|
++
{code}
h3. Expected behavior

We expect both {{spark-sql}} & {{spark-shell}} interfaces to throw an exception 
for an invalid DATE/TIMESTAMP, like what they do for most of the other data 
types (e.g. invalid value {{"foo"}} for {{INT}} data type).






[jira] [Updated] (SPARK-40630) Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL

2022-10-01 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40630:
-
Description: 
h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
{{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces unexpectedly 
evaluate the invalid value to {{{}NULL{}}}, instead of throwing an exception.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
 
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B 
").toDF("time").select(to_timestamp(col("ti 
me")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[721] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", TimestampType,  
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df194: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
++
|c1  |
++
|null|
++
{code}
h3. Expected behavior

We expect both {{spark-sql}} & {{spark-shell}} interfaces to throw an exception 
for an invalid DATE/TIMESTAMP, like what they do for most of the other data 
types (e.g. invalid value {{"foo"}} for {{INT}} data type).

  was:
h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
{{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces unexpectedly 
evaluate the invalid value to {{{}NULL{}}}, instead of throwing an exception.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 

Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
 
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B 
").toDF("time").select(to_timestamp(col("ti 
me")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[721] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", TimestampType,  
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df194: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
++
|c1  |
++
|null|
++
{code}
h3. Expected behavior

We expect both {{spark-sql}} & {{spark-shell}} interfaces to throw an exception 
for an invalid DATE/TIMESTAMP, like what they do for most of the other data 
types (e.g. invalid value {{"foo"}} for {{INT}} data type).


> Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL
> -
>
> Key: SPARK-40630
> URL: https://issues.apache.org/jira/browse/SPARK-40630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
> {{{}1969-12-31 23:59:59 B{}}}) via {{{}spark-shell{}}}, or insert an invalid 
> DATE/TIMESTAMP into a table via {{{}spark-sql{}}}, both interfaces 
> unexpectedly evaluate the invalid value to {{{}NULL{}}}, instead of throwing 
> an exception.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following:
> {code:java}
> spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
> spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B "as 
> timestamp);
> spark-sql> select * from timestamp_vals;
> NULL{code}
>  
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
>  
> Execute the following:
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B 
> 

[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-10-01 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40624:
-
Description: 
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).
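For context, the two outcomes come from different stages: the {{spark-shell}} error is thrown while parsing the string {{"1.0/0"}} into a {{BigDecimal}} (it is not a numeric literal), whereas {{spark-sql}} actually performs the division and, under the default non-ANSI configuration, returns NULL for a zero divisor. A small sketch to paste into {{spark-shell}} (assumptions noted in the comments):
{code:java}
// Sketch only: assumes spark-shell provides `spark` and Spark 3.2.x defaults.

// Scala side: the failure happens before any division, because the string is not a number.
//   BigDecimal("1.0/0")    // java.lang.NumberFormatException
// Dividing a real BigDecimal by zero fails differently:
//   BigDecimal("1.0") / 0  // java.lang.ArithmeticException

// SQL side: division by a zero divisor evaluates to NULL under the default configuration.
spark.sql("SELECT CAST(1.0/0 AS DECIMAL(20,10)) AS c1").show(false)

// With ANSI mode enabled, the same expression is expected to raise a divide-by-zero error.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT CAST(1.0/0 AS DECIMAL(20,10)) AS c1").show(false)
spark.conf.set("spark.sql.ansi.enabled", "false")
{code}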

 

  was:
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals 1.0/0;
spark-sql> select * from decimal_vals;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 


> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL 
> in SparkSQL
> 
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via 
> {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates 
> to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a 
> table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following: (evaluated to {{{}NULL{}}})
> {code:java}
> spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
> spark-sql> insert into decimal_vals 1.0/0;
> spark-sql> select * from decimal_vals;
> NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following: (errors out during RDD creation)
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
> java.lang.NumberFormatException
>   at java.math.BigDecimal.(BigDecimal.java:497)
>   at java.math.BigDecimal.(BigDecimal.java:383)
>   at java.math.BigDecimal.(BigDecimal.java:809)
>   at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
>   at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
>   ... 49 elided{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-10-01 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40624:
-
Description: 
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 

  was:
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals 1.0/0;
spark-sql> select * from ws71;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 


> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL 
> in SparkSQL
> 
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via 
> {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates 
> to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a 
> table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following: (evaluated to {{{}NULL{}}})
> {code:java}
> spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
> spark-sql> insert into decimal_vals 1.0/0;
> spark-sql> select * from decimal_vals;
> 71    NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following: (errors out during RDD creation)
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
> java.lang.NumberFormatException
>   at java.math.BigDecimal.(BigDecimal.java:497)
>   at java.math.BigDecimal.(BigDecimal.java:383)
>   at java.math.BigDecimal.(BigDecimal.java:809)
>   at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
>   at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
>   ... 49 elided{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (SPARK-40629) FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL

2022-10-01 Thread xsys (Jira)
xsys created SPARK-40629:


 Summary: FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN 
in DataFrame but NULL in SparkSQL
 Key: SPARK-40629
 URL: https://issues.apache.org/jira/browse/SPARK-40629
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluates to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 

Execute the following:

 
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast(1.0/0 as float);
spark-sql> select * from float_vals;
NULL{code}
 

Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", FloatType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+--------+
|c1      |
+--------+
|Infinity|
+--------+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).
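The discrepancy matches the different division semantics on each path: the JVM follows IEEE 754, so a non-zero double divided by zero is Infinity, while Spark SQL's division returns NULL for a zero divisor under the default non-ANSI configuration, and the outer cast then simply propagates that NULL. A small sketch to paste into {{spark-shell}} (assumes Spark 3.2.x defaults):
{code:java}
// Sketch only: assumes spark-shell provides `spark` and Spark 3.2.x defaults.

// JVM / DataFrame side: IEEE 754 division, so the stored value is Infinity.
val jvmSide = (1.0 / 0).floatValue()   // Float.PositiveInfinity

// SQL side: the divide expression yields NULL for a zero divisor by default,
// so the CAST is applied to NULL and the column value stays NULL.
spark.sql("SELECT CAST(1.0 / 0 AS FLOAT) AS c1").show(false)
{code}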



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-40629) FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL

2022-10-01 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40629:
-
Description: 
h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluates to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast(1.0/0 as float);
spark-sql> select * from float_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", FloatType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+--------+
|c1      |
+--------+
|Infinity|
+--------+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).

  was:
h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 

Execute the following:

 
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals cast ( 1.0/0  as float);
spark-sql> select * from float_vals;
NULL{code}
 

Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", FloatType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+--------+
|c1      |
+--------+
|Infinity|
+--------+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).


> FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL 
> in SparkSQL
> -
>
> Key: SPARK-40629
> URL: https://issues.apache.org/jira/browse/SPARK-40629
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
> ).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
> {{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
> is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following:
> {code:java}
> spark-sql> create table float_vals(c1 float) stored as ORC;
> spark-sql> insert into float_vals cast ( 1.0/0  as float);
> spark-sql> select * from float_vals;
> NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue(
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[180] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", FloatType, true) 
> )
> schema: org.apache.spark.sql.types.StructType = StructType( 
> StructField(c1,FloatType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: 

[jira] [Updated] (SPARK-40629) FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL

2022-10-01 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40629:
-
Description: 
h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluates to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast(1.0/0 as float);
spark-sql> select * from float_vals;
NULL{code}
 

Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", FloatType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+--------+
|c1      |
+--------+
|Infinity|
+--------+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).

  was:
h3. Describe the bug

Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
{{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals cast ( 1.0/0  as float);
spark-sql> select * from float_vals;
NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[180] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", FloatType, true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+--------+
|c1      |
+--------+
|Infinity|
+--------+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination & configuration 
({{{}FLOAT/DOUBLE{}}} and {{{}1.0/0{}}}).


> FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL 
> in SparkSQL
> -
>
> Key: SPARK-40629
> URL: https://issues.apache.org/jira/browse/SPARK-40629
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing a FLOAT/DOUBLE value with division by 0 (e.g. {{{}( 1.0/0 
> ).floatValue(){}}}) via {{spark-shell}} outputs {{{}Infinity{}}}. However, 
> {{1.0/0}} ({{{}cast ( 1.0/0 as float){}}}) evaluated to {{NULL}} if the value 
> is inserted into a FLOAT/DOUBLE column of a table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following:
> {code:java}
> spark-sql> create table float_vals(c1 float) stored as ORC;
> spark-sql> insert into float_vals cast ( 1.0/0  as float);
> spark-sql> select * from float_vals;
> NULL{code}
>  
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue(
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[180] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", FloatType, true) 
> )
> schema: org.apache.spark.sql.types.StructType = StructType( 
> StructField(c1,FloatType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: float]
> 

[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-09-30 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40624:
-
Description: 
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from ws71;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 

  was:
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluated to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals 1.0/0;
spark-sql> select * from ws71;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 


> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL 
> in SparkSQL
> 
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via 
> {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates 
> to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a 
> table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following: (evaluated to {{{}NULL{}}})
> {code:java}
> spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
> spark-sql> insert into decimal_vals 1.0/0;
> spark-sql> select * from ws71;
> 71    NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following: (errors out during RDD creation)
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
> java.lang.NumberFormatException
>   at java.math.BigDecimal.(BigDecimal.java:497)
>   at java.math.BigDecimal.(BigDecimal.java:383)
>   at java.math.BigDecimal.(BigDecimal.java:809)
>   at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
>   at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
>   ... 49 elided{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-09-30 Thread xsys (Jira)
xsys created SPARK-40624:


 Summary: A DECIMAL value with division by 0 errors in DataFrame 
but evaluates to NULL in SparkSQL
 Key: SPARK-40624
 URL: https://issues.apache.org/jira/browse/SPARK-40624
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from ws71;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types

2022-09-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40616:
-
Description: 
h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell. When we {{INSERT}} decimal values with precision higher 
than the standard double precision, precision is lost. (8.888e9 
interpreted as 88.90 instead of 88.88).

This seems to be caused by the shell's literal parsing inferring a DOUBLE type for 
the value before it is cast to {{DECIMAL}}.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.
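Until the literal handling changes, the loss can apparently be avoided by never going through a DOUBLE: either write a DECIMAL literal (the {{BD}} suffix) or cast from a string. A sketch with an illustrative value (not the one from this report), to paste into {{spark-shell}}:
{code:java}
// Sketch only: illustrative value, assumes spark-shell provides `spark`.
spark.sql("CREATE TABLE IF NOT EXISTS t_decimal(c0 DECIMAL(20,10)) USING orc")

// A BD-suffixed literal is parsed directly as DECIMAL, so no double rounding occurs.
spark.sql("INSERT INTO t_decimal VALUES (1234567890.1234567891BD)")

// Casting from a string likewise avoids an intermediate DOUBLE.
spark.sql("INSERT INTO t_decimal VALUES (CAST('1234567890.1234567891' AS DECIMAL(20,10)))")

// Both rows are expected to keep all ten fractional digits.
spark.sql("SELECT * FROM t_decimal").show(false)
{code}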

  was:
h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell. When we {{INSERT}} decimal values with precision higher 
than the standard double precision, precision is lost.

This seems to be caused by type inference at shell parsing inferring that the 
value is a double type.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.


> Loss of precision using SparkSQL shell on high-precision DECIMAL types
> --
>
> Key: SPARK-40616
> URL: https://issues.apache.org/jira/browse/SPARK-40616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to save {{DECIMAL}} values with high precision in a table using 
> the SparkSQL shell. When we {{INSERT}} decimal values with precision higher 
> than the standard double precision, precision is lost. 
> (8.888e9 interpreted as 88.90 instead of 
> 88.88).
> This seems to be caused by type inference at shell parsing inferring that the 
> value is a double type.
> h3. To reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> In the shell:
> {code:java}
> CREATE TABLE t(c0 DECIMAL(20,10));         
> INSERT INTO t VALUES (8.888e9);                             
> SELECT * FROM t;{code}
> Executing the above gives this:
> {code:java}
> spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
> 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
> created as there is no table provider specified. You can set 
> spark.sql.legacy.createHiveTableByDefault to false so that native data source 
> table will be created instead.
> Time taken: 0.118 seconds
> spark-sql> INSERT INTO t VALUES (8.888e9);
> Time taken: 0.392 seconds
> spark-sql> SELECT * FROM t;
> 88.90
> Time taken: 0.197 seconds, Fetched 1 row(s){code}
> h3. Expected behavior
> We expect the inserted value to retain the precision as determined by the 
> parameters 

[jira] [Updated] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types

2022-09-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40616:
-
Description: 
h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than 
the standard double precision, precision is lost.

This seems to be caused by type inference at shell parsing inferring that the 
value is a double type.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 

In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:

 
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.

  was:
h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell.
When we {{INSERT}} decimal values with precision higher than the standard 
double precision.

This seems to be caused by type inference at shell parsing inferring that the 
value is a double type.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 

In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:

 
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
 
 
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.

 

 


> Loss of precision using SparkSQL shell on high-precision DECIMAL types
> --
>
> Key: SPARK-40616
> URL: https://issues.apache.org/jira/browse/SPARK-40616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to save {{DECIMAL}} values with high precision in a table using 
> the SparkSQL shell.
> When we {{INSERT}} decimal values with precision higher than the standard 
> double precision.
> This seems to be caused by type inference at shell parsing inferring that the 
> value is a double type.
> h3. To reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
>  
> In the shell:
> {code:java}
> CREATE TABLE t(c0 DECIMAL(20,10));         
> INSERT INTO t VALUES (8.888e9);                             
> SELECT * FROM t;{code}
> Executing the above gives this:
>  
> {code:java}
> spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
> 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
> created as there is no table provider specified. You can set 
> spark.sql.legacy.createHiveTableByDefault to false so that native data source 
> table will be created instead.
> Time taken: 0.118 seconds
> spark-sql> INSERT INTO t VALUES (8.888e9);
> Time taken: 0.392 seconds
> spark-sql> SELECT * FROM t;
> 88.90
> Time taken: 0.197 seconds, Fetched 1 row(s){code}
> h3. Expected behavior
> We expect the inserted value to retain the precision as determined by the 
> parameters for the {{DECIMAL}} type. For example, we expect the example above 
> to return {{{}88.88{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types

2022-09-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40616:
-
Description: 
h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell. When we {{INSERT}} decimal values with precision higher 
than the standard double precision, precision is lost.

This seems to be caused by type inference at shell parsing inferring that the 
value is a double type.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.

  was:
h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell.
When we {{INSERT}} decimal values with precision higher than the standard 
double precision.

This seems to be caused by type inference at shell parsing inferring that the 
value is a double type.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.


> Loss of precision using SparkSQL shell on high-precision DECIMAL types
> --
>
> Key: SPARK-40616
> URL: https://issues.apache.org/jira/browse/SPARK-40616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to save {{DECIMAL}} values with high precision in a table using 
> the SparkSQL shell. When we {{INSERT}} decimal values with precision higher 
> than the standard double precision, precision is lost.
> This seems to be caused by type inference at shell parsing inferring that the 
> value is a double type.
> h3. To reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> In the shell:
> {code:java}
> CREATE TABLE t(c0 DECIMAL(20,10));         
> INSERT INTO t VALUES (8.888e9);                             
> SELECT * FROM t;{code}
> Executing the above gives this:
> {code:java}
> spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
> 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
> created as there is no table provider specified. You can set 
> spark.sql.legacy.createHiveTableByDefault to false so that native data source 
> table will be created instead.
> Time taken: 0.118 seconds
> spark-sql> INSERT INTO t VALUES (8.888e9);
> Time taken: 0.392 seconds
> spark-sql> SELECT * FROM t;
> 88.90
> Time taken: 0.197 seconds, Fetched 1 row(s){code}
> h3. Expected behavior
> We expect the inserted value to retain the precision as determined by the 
> parameters for the {{DECIMAL}} type. For example, we expect the example above 
> to return {{{}88.88{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types

2022-09-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40616:
-
Description: 
h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than 
the standard double precision, precision is lost.

This seems to be caused by type inference at shell parsing inferring that the 
value is a double type.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.

  was:
h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell.
When we {{INSERT}} decimal values with precision higher than the standard 
double precision.

This seems to be caused by type inference at shell parsing inferring that the 
value is a double type.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 

In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:

 
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.


> Loss of precision using SparkSQL shell on high-precision DECIMAL types
> --
>
> Key: SPARK-40616
> URL: https://issues.apache.org/jira/browse/SPARK-40616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to save {{DECIMAL}} values with high precision in a table using 
> the SparkSQL shell.
> When we {{INSERT}} decimal values with precision higher than the standard 
> double precision.
> This seems to be caused by type inference at shell parsing inferring that the 
> value is a double type.
> h3. To reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> In the shell:
> {code:java}
> CREATE TABLE t(c0 DECIMAL(20,10));         
> INSERT INTO t VALUES (8.888e9);                             
> SELECT * FROM t;{code}
> Executing the above gives this:
> {code:java}
> spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
> 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
> created as there is no table provider specified. You can set 
> spark.sql.legacy.createHiveTableByDefault to false so that native data source 
> table will be created instead.
> Time taken: 0.118 seconds
> spark-sql> INSERT INTO t VALUES (8.888e9);
> Time taken: 0.392 seconds
> spark-sql> SELECT * FROM t;
> 88.90
> Time taken: 0.197 seconds, Fetched 1 row(s){code}
> h3. Expected behavior
> We expect the inserted value to retain the precision as determined by the 
> parameters for the {{DECIMAL}} type. For example, we expect the example above 
> to return {{{}88.88{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types

2022-09-29 Thread xsys (Jira)
xsys created SPARK-40616:


 Summary: Loss of precision using SparkSQL shell on high-precision 
DECIMAL types
 Key: SPARK-40616
 URL: https://issues.apache.org/jira/browse/SPARK-40616
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

We are trying to save {{DECIMAL}} values with high precision in a table using 
the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than 
the standard double precision, precision is lost.

This seems to be caused by type inference at shell parsing inferring that the 
value is a double type.
h3. To reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 

In the shell:
{code:java}
CREATE TABLE t(c0 DECIMAL(20,10));         
INSERT INTO t VALUES (8.888e9);                             
SELECT * FROM t;{code}
Executing the above gives this:

 
{code:java}
spark-sql> CREATE TABLE t(c0 DECIMAL(20,10));
22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be 
created as there is no table provider specified. You can set 
spark.sql.legacy.createHiveTableByDefault to false so that native data source 
table will be created instead.
Time taken: 0.118 seconds
spark-sql> INSERT INTO t VALUES (8.888e9);
Time taken: 0.392 seconds
spark-sql> SELECT * FROM t;
88.90
Time taken: 0.197 seconds, Fetched 1 row(s){code}
 
 
h3. Expected behavior

We expect the inserted value to retain the precision as determined by the 
parameters for the {{DECIMAL}} type. For example, we expect the example above 
to return {{{}88.88{}}}.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL

2022-09-21 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40525:
-
Description: 
h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}
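The asymmetry matches the two code paths: on insert, {{spark-sql}} wraps the DECIMAL literal {{1.1}} in an implicit cast to INT, which truncates it to {{1}}, while the DataFrame writer validates the external {{java.lang.Double}} in the {{Row}} against the declared {{IntegerType}} and rejects it. A small sketch to paste into {{spark-shell}} (assumes Spark 3.2.x defaults):
{code:java}
// Sketch only: assumes spark-shell provides `spark` and Spark 3.2.x defaults.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Roughly what the SQL path does: a non-ANSI cast that truncates the fraction.
spark.sql("SELECT CAST(1.1 AS INT) AS c1").show(false)   // 1

// The DataFrame path only succeeds once the value actually matches the schema type.
val rdd = spark.sparkContext.parallelize(Seq(Row(1)))    // an Int, not a Double
val schema = new StructType().add(StructField("c1", IntegerType, true))
spark.createDataFrame(rdd, schema).show(false)
{code}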

  was:
h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:}}{}}}
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}


> Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame 
> but evaluates to a rounded value in SparkSQL
> --
>
> Key: SPARK-40525
> URL: https://issues.apache.org/jira/browse/SPARK-40525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
> expectedly errors out. However, it is evaluated to a rounded value {{1}} if 
> the value is inserted into the table via {{{}spark-sql{}}}.
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql {code}
> Execute the following:
> {code:java}
> spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
> 22/09/19 

[jira] [Updated] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL

2022-09-21 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40525:
-
Description: 
h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}

  was:
h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:{{{}{}}}
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:

 
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
 
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}


> Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame 
> but evaluates to a rounded value in SparkSQL
> --
>
> Key: SPARK-40525
> URL: https://issues.apache.org/jira/browse/SPARK-40525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
> expectedly errors out. However, it is evaluated to a rounded value {{1}} if 
> the value is inserted into the table via {{{}spark-sql{}}}.
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql {code}
> Execute the following:
> {code:java}
> spark-sql> create table int_floating_point_vals(c1 INT) 

[jira] [Created] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL

2022-09-21 Thread xsys (Jira)
xsys created SPARK-40525:


 Summary: Floating-point value with an INT/BYTE/SHORT/LONG type 
errors out in DataFrame but evaluates to a rounded value in SparkSQL
 Key: SPARK-40525
 URL: https://issues.apache.org/jira/browse/SPARK-40525
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:

 
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
 
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}
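As a sanity check, a sketch showing that the same DataFrame write goes through once the {{Row}} carries an {{Int}}, i.e. the external type that matches {{IntegerType}} (the table name {{int_vals_ok}} is only illustrative):
{code:java}
// Sketch: with an Int (the external type matching IntegerType) the write succeeds.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val okRdd = sc.parallelize(Seq(Row(1)))   // Int instead of Double
val okSchema = new StructType().add(StructField("c1", IntegerType, true))
val okDf = spark.createDataFrame(okRdd, okSchema)
okDf.write.mode("overwrite").format("orc").saveAsTable("int_vals_ok")   // illustrative table name
{code}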






[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:23 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought nullOnOverflow is controlled 
by {{spark.sql.ansi.enabled}}. I tried to achieve the desired behavior by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 
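A rough sketch of the workaround discussed above, assuming {{spark.sql.storeAssignmentPolicy}} can be changed at runtime in this build:
{code:java}
// Sketch of the workaround (assumes the config is runtime-settable in this build).
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("insert into decimal_extra_precision select 333.22")
spark.sql("select * from decimal_extra_precision").show()
{code}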


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:23 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought nullOnOverflow is controlled 
by \{{spark.sql.ansi.enabled.}} I tried to achieve the desired behavior by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought nullOnOverflow is controlled 
by {{spark.sql.ansi.enabled. }}I tried to achieve the desired behavior by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       scale: 

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:22 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work.

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:22 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work.

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work.}}

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:21 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work.}}

{{ For instance, after inspecting the code, I thought that nullOnOverflow}} is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. 

I believe it could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work. For instance, after inspecting 
the code, I thought that nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:21 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work.}}

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work.}}

{{ For instance, after inspecting the code, I thought that nullOnOverflow}} is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>  

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. 

I believe it could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work. For instance, after inspecting 
the code, I thought that nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}

I believe it could get non trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting 
the code, I thought that {{nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}

I believe it could get non trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting 
the code, I thought that {{nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>   

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:18 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. I believe it could get 
non-trivial for users to discover that spark.sql.storeAssignmentPolicy}} would 
work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:17 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. I believe it could get 
non-trivial for users to discover that spark.sql.storeAssignmentPolicy}} would 
work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY }}works. I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Commented] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys commented on SPARK-40439:
--

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY }}works. I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       scale: Int,
>       roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
>       nullOnOverflow: Boolean = true,
>       context: SQLQueryContext = null): Decimal = {
>     val copy = clone()
>     if (copy.changePrecision(precision, scale, roundMode)) {
>       copy
>     } else {
>       if (nullOnOverflow) {
>         null
>       } else {
>         throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
>           this, precision, scale, context)
>       }
>     }
>   }{code}
> The above function is invoked from 
> 

[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-14 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40439:
-
Description: 
h3. Describe the bug

We are trying to store a DECIMAL value {{333.22}} with more 
precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
leads to a {{NULL}} value being stored if the table is created using DataFrames 
via {{{}spark-shell{}}}. However, it leads to the following exception if the 
table is created via {{{}spark-sql{}}}:
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) 
cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}, execute the 
following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 

Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned decimal value evaluates to a {{{}NULL{}}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[0] at parallelize at :27
scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), 
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,DecimalType(20,10),true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
scala> df.show()
++
|  c1|
++
|null|
++
scala> 
df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
h3. Root Cause

The exception is being raised from 
[Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
 ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
[SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
{code:java}
  private[sql] def toPrecision(
      precision: Int,
      scale: Int,
      roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
      nullOnOverflow: Boolean = true,
      context: SQLQueryContext = null): Decimal = {
    val copy = clone()
    if (copy.changePrecision(precision, scale, roundMode)) {
      copy
    } else {
      if (nullOnOverflow) {
        null
      } else {
        throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
          this, precision, scale, context)
      }
    }
  }{code}
The above function is invoked from 
[toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]
 (in Cast.scala).  However, our attempt to insert {{333.22}} 
after setting {{spark.sql.ansi.enabled}} _to {{False}}_  failed as well (which 
may be an independent issue).
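
The following is a minimal sketch (illustrative only, not part of the original 
reproduction) that makes both behaviours visible from a single spark-shell 
session. The literal is a hypothetical stand-in chosen so that it genuinely 
overflows DECIMAL(20,10): it has 11 integer digits where the type allows at most 
10. Toggling {{spark.sql.ansi.enabled}} switches between the NULL result 
(DataFrame path) and the ArithmeticException (spark-sql path) described above:
{code:java}
// Sketch only: run inside spark-shell, where `spark` is the active SparkSession.

// Overflow handled as NULL (nullOnOverflow = true):
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(12345678901.2345678901 AS DECIMAL(20,10)) AS c1").show()
// +----+
// |  c1|
// +----+
// |null|
// +----+

// Overflow raised as an exception (nullOnOverflow = false):
spark.conf.set("spark.sql.ansi.enabled", "true")
try {
  spark.sql("SELECT CAST(12345678901.2345678901 AS DECIMAL(20,10)) AS c1").show()
} catch {
  // java.lang.ArithmeticException: ... cannot be represented as Decimal(20, 10)
  case e: Exception => println(e)
}
{code}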


[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-14 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40439:
-
Description: 
h3. Describe the bug

We are trying to store a DECIMAL value {{333.22}} with more 
precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
leads to a {{NULL}} value being stored if the table is created using DataFrames 
via {{{}spark-shell{}}}. However, it leads to the following exception if the 
table is created via {{{}spark-sql{}}}:
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) 
cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}, execute the 
following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 

Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned decimal value evaluates to a {{{}NULL{}}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[0] at parallelize at :27
scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), 
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,DecimalType(20,10),true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
scala> df.show()
++
|  c1|
++
|null|
++
scala> 
df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
h3. Root Cause

The exception is being raised from 
[Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
 ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
[SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
{code:java}
  private[sql] def toPrecision(
      precision: Int,
      scale: Int,
      roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
      nullOnOverflow: Boolean = true,
      context: SQLQueryContext = null): Decimal = {
    val copy = clone()
    if (copy.changePrecision(precision, scale, roundMode)) {
      copy
    } else {
      if (nullOnOverflow) {
        null
      } else {
        throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
          this, precision, scale, context)
      }
    }
  }{code}
The above function is invoked from 
[toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]
 (in Cast.scala).  However, our attempt to insert {{333.22}} 
after setting {{spark.sql.ansi.enabled}} _to {{False}}_  failed as well (which 
may be an independent issue).

 

 

 

 


[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-14 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40439:
-
Description: 
h3. Describe the bug

We are trying to store a DECIMAL value {{333.22}} with more 
precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
leads to a {{NULL}} value being stored if the table is created using DataFrames 
via {{{}spark-shell{}}}. However, it leads to the following exception if the 
table is created via {{{}spark-sql{}}}:
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) 
cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}, execute the 
following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 

Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned decimal value evaluates to a {{{}NULL{}}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[0] at parallelize at :27
scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), 
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,DecimalType(20,10),true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
scala> df.show()
++
|  c1|
++
|null|
++
scala> 
df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
h3. Root Cause

The exception is being raised from 
[Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
 ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
[SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
{code:java}
  private[sql] def toPrecision(
      precision: Int,
      scale: Int,
      roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
      nullOnOverflow: Boolean = true,
      context: SQLQueryContext = null): Decimal = {
    val copy = clone()
    if (copy.changePrecision(precision, scale, roundMode)) {
      copy
    } else {
      if (nullOnOverflow) {
        null
      } else {
        throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
          this, precision, scale, context)
      }
    }
  }{code}
 

The above function is invoked from 
[toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]
 (in Cast.scala).  However, our attempt to insert {{333.22}} 
after setting {{spark.sql.ansi.enabled}} _to {{False}}_  failed as well (which 
may be an independent issue).

 

 

 

 


[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-14 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40439:
-
Description: 
h3. Describe the bug

We are trying to store a DECIMAL value {{333.22}} with more 
precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
leads to a {{NULL}} value being stored if the table is created using DataFrames 
via {{{}spark-shell{}}}. However, it leads to the following exception if the 
table is created via {{{}spark-sql{}}}:
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) 
cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}, execute the 
following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 

Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned decimal value evaluates to a {{{}NULL{}}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[0] at parallelize at :27
scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), 
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,DecimalType(20,10),true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
scala> df.show()
++
|  c1|
++
|null|
++
scala> 
df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
h3. Root Cause

The exception is being raised from 
[Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
 ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
[SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
{code:java}
  private[sql] def toPrecision(
      precision: Int,
      scale: Int,
      roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
      nullOnOverflow: Boolean = true,
      context: SQLQueryContext = null): Decimal = {
    val copy = clone()
    if (copy.changePrecision(precision, scale, roundMode)) {
      copy
    } else {
      if (nullOnOverflow) {
        null
      } else {
        throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
          this, precision, scale, context)
      }
    }
  }{code}
 

The above function is invoked from 
[toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]
 (in Cast.scala).  However, our attempt to insert {{333.22}} 
after setting _spark.sql.ansi.enabled to False_ failed as well (which may be an 
independent issue).


[jira] [Created] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-14 Thread xsys (Jira)
xsys created SPARK-40439:


 Summary: DECIMAL value with more precision than what is defined in 
the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
 Key: SPARK-40439
 URL: https://issues.apache.org/jira/browse/SPARK-40439
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

We are trying to store a DECIMAL value {{333.22}} with more 
precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
leads to a {{NULL}} value being stored if the table is created using DataFrames 
via {{{}spark-shell{}}}. However, it leads to the following exception if the 
table is created via {{{}spark-sql{}}}:

 
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) 
cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}, execute the 
following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 

Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned decimal value evaluates to a {{{}NULL{}}}:

 
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[0] at parallelize at :27
scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), 
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,DecimalType(20,10),true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
scala> df.show()
++
|  c1|
++
|null|
++
scala> 
df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
 
h3. Root Cause

The exception is being raised from 
[Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
 ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
[SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
{code:java}
  private[sql] def toPrecision(
      precision: Int,
      scale: Int,
      roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
      nullOnOverflow: Boolean = true,
      context: SQLQueryContext = null): Decimal = {
    val copy = clone()
    if (copy.changePrecision(precision, scale, roundMode)) {
      copy
    } else {
      if (nullOnOverflow) {
        null
      } else {
        throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
          this, precision, scale, context)
      }
    }
  }{code}
 

The above function is invoked from 
[toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]
 (in Cast.scala).  However, our attempt to insert {{333.22}} 
after setting {{spark.sql.ansi.enabled}} to {{False}} failed as well (which may be 
an independent issue).






[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-14 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40439:
-
Description: 
h3. Describe the bug

We are trying to store a DECIMAL value {{333.22}} with more 
precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
leads to a {{NULL}} value being stored if the table is created using DataFrames 
via {{{}spark-shell{}}}. However, it leads to the following exception if the 
table is created via {{{}spark-sql{}}}:
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) 
cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}, execute the 
following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 

Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned decimal value evaluates to a {{{}NULL{}}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[0] at parallelize at :27
scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), 
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,DecimalType(20,10),true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
scala> df.show()
++
|  c1|
++
|null|
++
scala> 
df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
h3. Root Cause

The exception is being raised from 
[Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
 ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
[SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
{code:java}
  private[sql] def toPrecision(
      precision: Int,
      scale: Int,
      roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
      nullOnOverflow: Boolean = true,
      context: SQLQueryContext = null): Decimal = {
    val copy = clone()
    if (copy.changePrecision(precision, scale, roundMode)) {
      copy
    } else {
      if (nullOnOverflow) {
        null
      } else {
        throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
          this, precision, scale, context)
      }
    }
  }{code}
 

The above function is invoked from 
[toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]
 (in Cast.scala).  However, our attempt to insert {{333.22}} 
after setting {{spark.sql.ansi.enabled}} to {{False}} failed as well (which may be 
an independent issue).


[jira] [Updated] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql

2022-09-12 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40409:
-
Description: 
h3. Describe the bug

We are trying to store a BYTE {{"-128"}} to a table created via Spark 
DataFrame. The table is created with the Avro file format. We encounter no 
errors while creating the table and inserting the aforementioned BYTE value. 
However, performing a SELECT query on the table through spark-sql results in an 
{{IncompatibleSchemaException}} as shown below:
{code:java}
2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 
0.0 (TID 0)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields"$
[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>{code}
h3. Step to reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(("-128").toByte)))
val schema = new StructType().add(StructField("c1", ByteType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("byte_avro"){code}
On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro 
package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
spark-sql> select * from byte_avro;{code}
h3. Expected behavior

We expect the output of the {{SELECT}} query to be {{{}-128{}}}. Additionally, 
we expect the data type to be preserved (it is changed from BYTE/TINYINT to 
INT, hence the mismatch). We tried other formats like ORC and the outcome is 
consistent with this expectation. Here are the logs from our attempt at doing 
the same with ORC:
{code:java}
scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc")
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will 
be ignored, since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
scala> spark.sql("select * from byte_orc;")
res2: org.apache.spark.sql.DataFrame = [c1: tinyint]
scala> spark.sql("select * from byte_orc;").show(false)
++
|c1  |
++
|-128|
++
{code}
h3. Root Cause
h4. 
[AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119]
{code:java}
   (catalystType, avroType.getType) match {
      case (NullType, NULL) =>
        (getter, ordinal) => null
      case (BooleanType, BOOLEAN) =>
        (getter, ordinal) => getter.getBoolean(ordinal)
      case (ByteType, INT) =>
        (getter, ordinal) => getter.getByte(ordinal).toInt
      case (ShortType, INT) =>
        (getter, ordinal) => getter.getShort(ordinal).toInt
      case (IntegerType, INT) =>
        (getter, ordinal) => getter.getInt(ordinal){code}
h4. 
[AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130]
{code:java}
    (avroType.getType, catalystType) match {
      case (NULL, NullType) => (updater, ordinal, _) =>
        updater.setNullAt(ordinal)
      // TODO: we can avoid boxing if future version of avro provide primitive 
accessors.
      case (BOOLEAN, BooleanType) => (updater, ordinal, value) =>
        updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
      case (INT, IntegerType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, value.asInstanceOf[Int])
      case (INT, DateType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int]))
{code}
AvroSerializer converts Spark's ByteType into Avro's INT. Further, Spark's 
AvroDeserializer expects Avro's INT to map to Spark's IntegerType. The mismatch 
between user-specified ByteType & the type AvroDeserializer expects 
(IntegerType) is the root cause of this issue.
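
One way to observe this asymmetry directly is to round-trip the schema through 
spark-avro's public SchemaConverters helpers (a sketch of ours, assuming the 
spark-avro package from the reproduction is on the classpath): ByteType is 
widened to Avro INT on the write side, but Avro INT converts back to IntegerType 
rather than ByteType on the read side:
{code:java}
// Sketch only: run in the spark-shell started with
// --packages org.apache.spark:spark-avro_2.12:3.2.1
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{ByteType, StructField, StructType}

val sparkSchema = new StructType().add(StructField("c1", ByteType, true))

// Write side: ByteType has no dedicated Avro type, so it is emitted as INT.
val avroSchema = SchemaConverters.toAvroType(sparkSchema)
println(avroSchema.toString(true))

// Read side: Avro INT maps back to IntegerType, not ByteType, so the recovered
// schema no longer matches the declared TINYINT column.
println(SchemaConverters.toSqlType(avroSchema).dataType)
// StructType(StructField(c1,IntegerType,true))
{code}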


[jira] [Updated] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql

2022-09-12 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40409:
-
Description: 
h2. Describe the bug

We are trying to store a BYTE {{"-128"}} to a table created via Spark 
DataFrame. The table is created with the Avro file format. We encounter no 
errors while creating the table and inserting the aforementioned BYTE value. 
However, performing a SELECT query on the table through spark-sql results in an 
{{IncompatibleSchemaException}} as shown below:
{code:java}
2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 
0.0 (TID 0)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields"$
[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>{code}
h2. Step to reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(("-128").toByte)))
val schema = new StructType().add(StructField("c1", ByteType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("byte_avro"){code}
On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro 
package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
spark-sql> select * from byte_avro;{code}
h2. Expected behavior

We expect the output of the {{SELECT}} query to be {{{}-128{}}}. Additionally, 
we expect the data type to be preserved (it is changed from BYTE/TINYINT to 
INT, hence the mismatch). We tried other formats like ORC and the outcome is 
consistent with this expectation. Here are the logs from our attempt at doing 
the same with ORC:
{code:java}
scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc")
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will 
be ignored, since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
scala> spark.sql("select * from byte_orc;")
res2: org.apache.spark.sql.DataFrame = [c1: tinyint]
scala> spark.sql("select * from byte_orc;").show(false)
++
|c1  |
++
|-128|
++
{code}
h2. Root Cause
h4. 
[AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119]
{code:java}
   (catalystType, avroType.getType) match {
      case (NullType, NULL) =>
        (getter, ordinal) => null
      case (BooleanType, BOOLEAN) =>
        (getter, ordinal) => getter.getBoolean(ordinal)
      case (ByteType, INT) =>
        (getter, ordinal) => getter.getByte(ordinal).toInt
      case (ShortType, INT) =>
        (getter, ordinal) => getter.getShort(ordinal).toInt
      case (IntegerType, INT) =>
        (getter, ordinal) => getter.getInt(ordinal){code}
h4. 
[AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130]
{code:java}
    (avroType.getType, catalystType) match {
      case (NULL, NullType) => (updater, ordinal, _) =>
        updater.setNullAt(ordinal)
      // TODO: we can avoid boxing if future version of avro provide primitive 
accessors.
      case (BOOLEAN, BooleanType) => (updater, ordinal, value) =>
        updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
      case (INT, IntegerType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, value.asInstanceOf[Int])
      case (INT, DateType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int]))
{code}
AvroSerializer converts Spark's ByteType into Avro's INT. Further, Spark's 
AvroDeserializer expects Avro's INT to map to Spark's IntegerType. The mismatch 
between user-specified ByteType & the type AvroDeserializer expects 
(IntegerType) is the root cause of this issue.

 

 


[jira] [Updated] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql

2022-09-12 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40409:
-
Description: 
h2. Describe the bug

We are trying to store a BYTE {{"-128"}} to a table created via Spark 
DataFrame. The table is created with the Avro file format. We encounter no 
errors while creating the table and inserting the aforementioned BYTE value. 
However, performing a SELECT query on the table through spark-sql results in an 
{{IncompatibleSchemaException}} as shown below:
{code:java}
2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 
0.0 (TID 0)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields"$
[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>{code}
h3. Step to reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(("-128").toByte)))
val schema = new StructType().add(StructField("c1", ByteType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("byte_avro"){code}
On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro 
package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
spark-sql> select * from byte_avro;{code}
h3. Expected behavior

We expect the output of the {{SELECT}} query to be {{{}-128{}}}. Additionally, 
we expect the data type to be preserved (it is changed from BYTE/TINYINT to 
INT, hence the mismatch). We tried other formats like ORC and the outcome is 
consistent with this expectation. Here are the logs from our attempt at doing 
the same with ORC:
{code:java}
scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc")
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will 
be ignored, since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
scala> spark.sql("select * from byte_orc;")
res2: org.apache.spark.sql.DataFrame = [c1: tinyint]
scala> spark.sql("select * from byte_orc;").show(false)
++
|c1  |
++
|-128|
++
{code}
h3. Root Cause
h4. 
[AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119]
{code:java}
   (catalystType, avroType.getType) match {
      case (NullType, NULL) =>
        (getter, ordinal) => null
      case (BooleanType, BOOLEAN) =>
        (getter, ordinal) => getter.getBoolean(ordinal)
      case (ByteType, INT) =>
        (getter, ordinal) => getter.getByte(ordinal).toInt
      case (ShortType, INT) =>
        (getter, ordinal) => getter.getShort(ordinal).toInt
      case (IntegerType, INT) =>
        (getter, ordinal) => getter.getInt(ordinal){code}
h4. 
[AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130]
{code:java}
    (avroType.getType, catalystType) match {
      case (NULL, NullType) => (updater, ordinal, _) =>
        updater.setNullAt(ordinal)
      // TODO: we can avoid boxing if future version of avro provide primitive 
accessors.
      case (BOOLEAN, BooleanType) => (updater, ordinal, value) =>
        updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
      case (INT, IntegerType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, value.asInstanceOf[Int])
      case (INT, DateType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int]))
{code}
AvroSerializer converts Spark's ByteType into Avro's INT. Further, Spark's 
AvroDeserializer expects Avro's INT to map to Spark's IntegerType. The mismatch 
between user-specified ByteType & the type AvroDeserializer expects 
(IntegerType) is the root cause of this issue.

 

 


[jira] [Updated] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql

2022-09-12 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40409:
-
Description: 
h3. Describe the bug

We are trying to store a BYTE {{"-128"}} to a table created via Spark 
DataFrame. The table is created with the Avro file format. We encounter no 
errors while creating the table and inserting the aforementioned BYTE value. 
However, performing a SELECT query on the table through spark-sql results in an 
{{IncompatibleSchemaException}} as shown below:
{code:java}
2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 
0.0 (TID 0)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields"$
[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>{code}
h3. Step to reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(("-128").toByte)))
val schema = new StructType().add(StructField("c1", ByteType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("byte_avro"){code}
On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro 
package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
spark-sql> select * from byte_avro;{code}
h3. Expected behavior

We expect the output of the {{SELECT}} query to be {{{}-128{}}}. Additionally, 
we expect the data type to be preserved (it is changed from BYTE/TINYINT to 
INT, hence the mismatch). We tried other formats like ORC and the outcome is 
consistent with this expectation. Here are the logs from our attempt at doing 
the same with ORC:
{code:java}
scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc")
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will 
be ignored, since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
scala> spark.sql("select * from byte_orc;")
res2: org.apache.spark.sql.DataFrame = [c1: tinyint]
scala> spark.sql("select * from byte_orc;").show(false)
++
|c1  |
++
|-128|
++
{code}
h3. Root Cause
h4. 
[AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119]
{code:java}
   (catalystType, avroType.getType) match {
      case (NullType, NULL) =>
        (getter, ordinal) => null
      case (BooleanType, BOOLEAN) =>
        (getter, ordinal) => getter.getBoolean(ordinal)
      case (ByteType, INT) =>
        (getter, ordinal) => getter.getByte(ordinal).toInt
      case (ShortType, INT) =>
        (getter, ordinal) => getter.getShort(ordinal).toInt
      case (IntegerType, INT) =>
        (getter, ordinal) => getter.getInt(ordinal){code}
h4. 
[AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130]
{code:java}
    (avroType.getType, catalystType) match {
      case (NULL, NullType) => (updater, ordinal, _) =>
        updater.setNullAt(ordinal)
      // TODO: we can avoid boxing if future version of avro provide primitive 
accessors.
      case (BOOLEAN, BooleanType) => (updater, ordinal, value) =>
        updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
      case (INT, IntegerType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, value.asInstanceOf[Int])
      case (INT, DateType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int]))
{code}
AvroSerializer converts Spark's ByteType into Avro's INT. Further, Spark's 
AvroDeserializer expects Avro's INT to map to Spark's IntegerType. The mismatch 
between user-specified ByteType & the type AvroDeserializer expects 
(IntegerType) is the root cause of this issue.

 

 


[jira] [Updated] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql

2022-09-12 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40409:
-
Description: 
h3. Describe the bug

We are trying to store a BYTE {{"-128"}} to a table created via Spark 
DataFrame. The table is created with the Avro file format. We encounter no 
errors while creating the table and inserting the aforementioned BYTE value. 
However, performing a SELECT query on the table through spark-sql results in an 
{{IncompatibleSchemaException}} as shown below:
{code:java}
2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 
0.0 (TID 0)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields"$
[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>{code}
h3. Step to reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(("-128").toByte)))
val schema = new StructType().add(StructField("c1", ByteType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("byte_avro"){code}
On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro 
package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
spark-sql> select * from byte_avro;{code}
h3. Expected behavior

We expect the output of the {{SELECT}} query to be {{{}-128{}}}. Additionally, 
we expect the data type to be preserved (it is changed from BYTE/TINYINT to 
INT, hence the mismatch). We tried other formats like ORC and the outcome is 
consistent with this expectation. Here are the logs from our attempt at doing 
the same with ORC:
{code:java}
scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc")
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will 
be ignored, since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
scala> spark.sql("select * from byte_orc;")
res2: org.apache.spark.sql.DataFrame = [c1: tinyint]
scala> spark.sql("select * from byte_orc;").show(false)
++
|c1  |
++
|-128|
++
{code}
h3. Root Cause
h4. 
[AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119]
{code:java}
   (catalystType, avroType.getType) match {
      case (NullType, NULL) =>
        (getter, ordinal) => null
      case (BooleanType, BOOLEAN) =>
        (getter, ordinal) => getter.getBoolean(ordinal)
      case (ByteType, INT) =>
        (getter, ordinal) => getter.getByte(ordinal).toInt
      case (ShortType, INT) =>
        (getter, ordinal) => getter.getShort(ordinal).toInt
      case (IntegerType, INT) =>
        (getter, ordinal) => getter.getInt(ordinal){code}
h4. 
[AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130]
{code:java}
    (avroType.getType, catalystType) match {
      case (NULL, NullType) => (updater, ordinal, _) =>
        updater.setNullAt(ordinal)
      // TODO: we can avoid boxing if future version of avro provide primitive 
accessors.
      case (BOOLEAN, BooleanType) => (updater, ordinal, value) =>
        updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
      case (INT, IntegerType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, value.asInstanceOf[Int])
      case (INT, DateType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int]))
{code}
AvroSerializer converts Spark's ByteType into Avro's INT. Further, Spark's 
AvroDeserializer expects Avro's INT to map to Spark's IntegerType. The mismatch 
between user-specified ByteType & the type AvroDeserializer expects 
(IntegerType) is the root cause of this issue.

 

 


[jira] [Created] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql

2022-09-12 Thread xsys (Jira)
xsys created SPARK-40409:


 Summary: IncompatibleSchemaException when BYTE stored from 
DataFrame to Avro is read using spark-sql
 Key: SPARK-40409
 URL: https://issues.apache.org/jira/browse/SPARK-40409
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

We are trying to store a BYTE {{"-128"}} to a table created via Spark 
DataFrame. The table is created with the Avro file format. We encounter no 
errors while creating the table and inserting the aforementioned BYTE value. 
However, performing a SELECT query on the table through spark-sql results in an 
{{IncompatibleSchemaException}} as shown below:

 
{code:java}
2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 
0.0 (TID 0)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields"$
[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>{code}
h3. Step to reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(("-128").toByte)))
val schema = new StructType().add(StructField("c1", ByteType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("byte_avro"){code}
On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro 
package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
spark-sql> select * from byte_avro;{code}
h3. Expected behavior

We expect the output of the {{SELECT}} query to be {{{}-128{}}}. Additionally, 
we expect the data type to be preserved (it is changed from BYTE/TINYINT to 
INT, hence the mismatch). We tried other formats like ORC and the outcome is 
consistent with this expectation. Here are the logs from our attempt at doing 
the same with ORC:

 
{code:java}
scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc")
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will 
be ignored, since hive.security.authorization.manage
r is set to instance of HiveAuthorizerFactory.
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
scala> spark.sql("select * from byte_orc;")
res2: org.apache.spark.sql.DataFrame = [c1: tinyint]
scala> spark.sql("select * from byte_orc;").show(false)
+----+
|c1  |
+----+
|-128|
+----+
{code}
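A quick diagnostic sketch of the widening on the Avro table itself (it assumes DESCRIBE EXTENDED reports a Location row for the table; paths and exact output will vary): compare the catalog schema of byte_avro with the schema Spark infers from the underlying Avro files.
{code:java}
// Catalog view of the table: c1 should still be reported as tinyint/byte.
spark.table("byte_avro").printSchema()

// Look up the table location, then read the Avro files directly; the inferred
// schema is expected to report c1 as integer, showing the widening.
val location = spark.sql("DESCRIBE EXTENDED byte_avro")
  .filter("col_name = 'Location'")
  .select("data_type").first().getString(0)
spark.read.format("avro").load(location).printSchema()
{code}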
h3. Root Cause
h4. [AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119]
{code:java}
   (catalystType, avroType.getType) match {
      case (NullType, NULL) =>
        (getter, ordinal) => null
      case (BooleanType, BOOLEAN) =>
        (getter, ordinal) => getter.getBoolean(ordinal)
      case (ByteType, INT) =>
        (getter, ordinal) => getter.getByte(ordinal).toInt
      case (ShortType, INT) =>
        (getter, ordinal) => getter.getShort(ordinal).toInt
      case (IntegerType, INT) =>
        (getter, ordinal) => getter.getInt(ordinal){code}
h4. [AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130]

 
{code:java}
    (avroType.getType, catalystType) match {
      case (NULL, NullType) => (updater, ordinal, _) =>
        updater.setNullAt(ordinal)
      // TODO: we can avoid boxing if future version of avro provide primitive accessors.
      case (BOOLEAN, BooleanType) => (updater, ordinal, value) =>
        updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
      case (INT, IntegerType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, value.asInstanceOf[Int])
      case (INT, DateType) => (updater, ordinal, value) =>
        updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int]))
{code}
 

 

AvroSerializer converts Spark's ByteType into Avro's INT. Further, Spark's 
AvroDeserializer expects Avro's INT to map to Spark's IntegerType. The mismatch 
between user-specified ByteType & the type AvroDeserializer expects 
(IntegerType) is the root cause of this issue.

[jira] [Updated] (SPARK-39158) A valid DECIMAL inserted by DataFrame cannot be read in HiveQL

2022-05-11 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-39158:
-
Description: 
h2. Describe the bug

We are trying to save a table containing a `{{{}DecimalType{}}}` column 
constructed through a Spark DataFrame with the `Avro` data format. We also want 
to be able to query this table both from this Spark instance as well as from 
the Hive instance that Spark is using directly. Say that `{{{}DecimalType(6, 
3){}}}` is part of the schema.

When we `INSERT` some valid value (e.g. {{{}BigDecimal("333.222"){}}}) in 
DataFrame, and `SELECT` from the table in HiveQL, we expect it to give back the 
inserted value. However, we instead get an `AvroTypeException`.
h2. To Reproduce

On Spark 3.2.1 (commit `4f25b3f712`), using `spark-shell` with the Avro package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(BigDecimal("333.222"
val schema = new StructType().add(StructField("c1", DecimalType(6,3), true))
val df = spark.createDataFrame(rdd, schema)
df.show(false) // result in error despite correctly showing output in the end
df.write.mode("overwrite").format("avro").saveAsTable("ws") {code}
`df.show(false)` will result in the following error before printing out the 
expected output `333.222`:
{code:java}
java.lang.AssertionError: assertion failed:                                     
  Decimal$DecimalIsFractional
     while compiling: 
        during phase: globalPhase=terminal, enteringPhase=jvm
     library version: version 2.12.15
    compiler version: version 2.12.15
  reconstructed args: -classpath 
/Users/xsystem/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar:/Users/xsystem/.ivy2/jars/org.tukaani_xz-1.8.jar:/Users/xsystem/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar
 -Yrepl-class-based -Yrepl-outdir 
/private/var/folders/01/bm1ky3qj3sq7gb5f345nxlcmgn/T/spark-ed7aba34-997a-4950-9ea4-52c61c222660/repl-bd6bbf2b-5647-4306-a5d3-50cdc30fcbc0
  last tree to typer: TypeTree(class Byte)
       tree position: line 6 of 
            tree tpe: Byte
              symbol: (final abstract) class Byte in package scala
   symbol definition: final abstract class Byte extends  (a ClassSymbol)
      symbol package: scala
       symbol owners: class Byte
           call site: constructor $eval in object $eval in package $line19
== Source file context for tree position ==
     3 
     4 object $eval {
     5   lazy val $result = 
$line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
     6   lazy val $print: _root_.java.lang.String =  {
     7     $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
     8       
     9 "" 
        at 
scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
        at 
scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
        at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
        at 
scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
        at 
scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
        at 
scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
        at 
scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
        at 
scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802)
        at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799)
        at 
scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805)
        at 
scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$run$1(UnPickler.scala:96)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.run(UnPickler.scala:88)
        at 

[jira] [Updated] (SPARK-39158) A valid DECIMAL inserted by DataFrame cannot be read in HiveQL

2022-05-11 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-39158:
-
Description: 
h2. Describe the bug

We are trying to save a table containing a `{{{}DecimalType{}}}` column 
constructed through a Spark DataFrame with the `Avro` data format. We also want 
to be able to query this table both from this Spark instance as well as from 
the Hive instance that Spark is using directly. Say that `{{{}DecimalType(6, 
3){}}}` is part of the schema.

When we `INSERT` some valid value (e.g. {{{}BigDecimal("333.222"){}}}) in 
DataFrame, and `SELECT` from the table in HiveQL, we expect it to give back the 
inserted value. However, we instead get an `AvroTypeException`.
h2. To Reproduce

On Spark 3.2.1 (commit `4f25b3f712`), using `spark-shell` with the Avro package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(BigDecimal("333.222"
val schema = new StructType().add(StructField("c1", DecimalType(6,3), true))
val df = spark.createDataFrame(rdd, schema)
df.show(false) // result in error despite correctly showing output in the end
df.write.mode("overwrite").format("avro").saveAsTable("ws") {code}
`df.show(false)` will result in the following error before printing out the 
expected output `333.222`:
{code:java}
java.lang.AssertionError: assertion failed:                                     
  Decimal$DecimalIsFractional
     while compiling: 
        during phase: globalPhase=terminal, enteringPhase=jvm
     library version: version 2.12.15
    compiler version: version 2.12.15
  reconstructed args: -classpath 
/Users/xsystem/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar:/Users/xsystem/.ivy2/jars/org.tukaani_xz-1.8.jar:/Users/xsystem/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar
 -Yrepl-class-based -Yrepl-outdir 
/private/var/folders/01/bm1ky3qj3sq7gb5f345nxlcmgn/T/spark-ed7aba34-997a-4950-9ea4-52c61c222660/repl-bd6bbf2b-5647-4306-a5d3-50cdc30fcbc0
  last tree to typer: TypeTree(class Byte)
       tree position: line 6 of 
            tree tpe: Byte
              symbol: (final abstract) class Byte in package scala
   symbol definition: final abstract class Byte extends  (a ClassSymbol)
      symbol package: scala
       symbol owners: class Byte
           call site: constructor $eval in object $eval in package $line19
== Source file context for tree position ==
     3 
     4 object $eval {
     5   lazy val $result = 
$line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
     6   lazy val $print: _root_.java.lang.String =  {
     7     $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
     8       
     9 "" 
        at 
scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
        at 
scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
        at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
        at 
scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
        at 
scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
        at 
scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
        at 
scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
        at 
scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802)
        at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799)
        at 
scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805)
        at 
scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$run$1(UnPickler.scala:96)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.run(UnPickler.scala:88)
        at 

[jira] [Updated] (SPARK-39158) A valid DECIMAL inserted by DataFrame cannot be read in HiveQL

2022-05-11 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-39158:
-
Description: 
h2. Describe the bug

We are trying to save a table containing a `DecimalType` column constructed 
through a Spark DataFrame with the `Avro` data format. We also want to be able 
to query this table both from this Spark instance as well as from the Hive 
instance that Spark is using directly. Say that `DecimalType(6, 3)` is part of 
the schema.

When we `INSERT` some valid value (e.g. `BigDecimal("333.222")`) in DataFrame, 
and `SELECT` from the table in HiveQL, we expect it to give back the inserted 
value. However, we instead get an `AvroTypeException`.
h2. To Reproduce

On Spark 3.2.1 (commit `4f25b3f712`), using `spark-shell` with the Avro package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(BigDecimal("333.222"
val schema = new StructType().add(StructField("c1", DecimalType(6,3), true))
val df = spark.createDataFrame(rdd, schema)
df.show(false) // result in error despite correctly showing output in the end
df.write.mode("overwrite").format("avro").saveAsTable("ws") {code}
`df.show(false)` will result in the following error before printing out the 
expected output `333.222`:
{code:java}
java.lang.AssertionError: assertion failed:                                     
  Decimal$DecimalIsFractional
     while compiling: 
        during phase: globalPhase=terminal, enteringPhase=jvm
     library version: version 2.12.15
    compiler version: version 2.12.15
  reconstructed args: -classpath 
/Users/xsystem/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar:/Users/xsystem/.ivy2/jars/org.tukaani_xz-1.8.jar:/Users/xsystem/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar
 -Yrepl-class-based -Yrepl-outdir 
/private/var/folders/01/bm1ky3qj3sq7gb5f345nxlcmgn/T/spark-ed7aba34-997a-4950-9ea4-52c61c222660/repl-bd6bbf2b-5647-4306-a5d3-50cdc30fcbc0
  last tree to typer: TypeTree(class Byte)
       tree position: line 6 of 
            tree tpe: Byte
              symbol: (final abstract) class Byte in package scala
   symbol definition: final abstract class Byte extends  (a ClassSymbol)
      symbol package: scala
       symbol owners: class Byte
           call site: constructor $eval in object $eval in package $line19
== Source file context for tree position ==
     3 
     4 object $eval {
     5   lazy val $result = 
$line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
     6   lazy val $print: _root_.java.lang.String =  {
     7     $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
     8       
     9 "" 
        at 
scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
        at 
scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
        at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
        at 
scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
        at 
scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
        at 
scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
        at 
scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
        at 
scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802)
        at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799)
        at 
scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805)
        at 
scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$run$1(UnPickler.scala:96)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.run(UnPickler.scala:88)
        at 
scala.reflect.internal.pickling.UnPickler.unpickle(UnPickler.scala:47)
        at 

[jira] [Created] (SPARK-39158) A valid DECIMAL inserted by DataFrame cannot be read in HiveQL

2022-05-11 Thread xsys (Jira)
xsys created SPARK-39158:


 Summary: A valid DECIMAL inserted by DataFrame cannot be read in 
HiveQL
 Key: SPARK-39158
 URL: https://issues.apache.org/jira/browse/SPARK-39158
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xsys


h2. Describe the bug

We are trying to save a table containing a `DecimalType` column constructed 
through a Spark DataFrame with the `Avro` data format. We also want to be able 
to query this table both from this Spark instance as well as from the Hive 
instance that Spark is using directly. Say that `DecimalType(6, 3)` is part of 
the schema.

When we `INSERT` some valid value (e.g. `BigDecimal("333.222")`) in DataFrame, 
and `SELECT` from the table in HiveQL, we expect it to give back the inserted 
value. However, we instead get an `AvroTypeException`.
h2. To Reproduce

On Spark 3.2.1 (commit `4f25b3f712`), using `spark-shell` with the Avro package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(BigDecimal("333.222"
val schema = new StructType().add(StructField("c1", DecimalType(6,3), true))
val df = spark.createDataFrame(rdd, schema)
df.show(false) // result in error despite correctly showing output in the end
df.write.mode("overwrite").format("avro").saveAsTable("ws") {code}
`df.show(false)` will result in the following error before printing out the 
expected output `333.222`:


{code:java}
java.lang.AssertionError: assertion failed:                                     
  Decimal$DecimalIsFractional
     while compiling: 
        during phase: globalPhase=terminal, enteringPhase=jvm
     library version: version 2.12.15
    compiler version: version 2.12.15
  reconstructed args: -classpath 
/Users/xsystem/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar:/Users/xsystem/.ivy2/jars/org.tukaani_xz-1.8.jar:/Users/xsystem/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar
 -Yrepl-class-based -Yrepl-outdir 
/private/var/folders/01/bm1ky3qj3sq7gb5f345nxlcmgn/T/spark-ed7aba34-997a-4950-9ea4-52c61c222660/repl-bd6bbf2b-5647-4306-a5d3-50cdc30fcbc0
  last tree to typer: TypeTree(class Byte)
       tree position: line 6 of 
            tree tpe: Byte
              symbol: (final abstract) class Byte in package scala
   symbol definition: final abstract class Byte extends  (a ClassSymbol)
      symbol package: scala
       symbol owners: class Byte
           call site: constructor $eval in object $eval in package $line19
== Source file context for tree position ==
     3 
     4 object $eval {
     5   lazy val $result = 
$line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
     6   lazy val $print: _root_.java.lang.String =  {
     7     $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
     8       
     9 "" 
        at 
scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
        at 
scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
        at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
        at 
scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
        at 
scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
        at 
scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
        at 
scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
        at 
scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802)
        at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799)
        at 
scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805)
        at 
scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188)
        at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357)
        at 

[jira] [Comment Edited] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT

2022-05-03 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531398#comment-17531398
 ] 

xsys edited comment on SPARK-39075 at 5/3/22 8:34 PM:
--

Thanks for the response, Erik.

I understand the concern. OTOH, in principle it is inconsistent and confusing 
that one can write a piece of data but cannot read it back via Spark/Avro. It’s 
almost equivalent to data loss.

Moreover, DataFrames enforce explicit type checks, so one can only write 
SHORT/BYTE-typed data into a SHORT/BYTE column. In this context, it is safe to 
downcast. And it does not make sense that Avro’s lack of SHORT/BYTE type 
support breaks DataFrame operations.

The concern is valid in a context where the source of the serialized data is 
unknown, since downcasting is then potentially unsafe.

One way to systematically address the issue is to determine whether Spark is 
the source of the serialized data, and permit the cast in that case. Because 
the SELECT API is used, the data is retrieved from a table through Hive or 
another supported Spark store, and not from a standalone Avro file. We could 
then potentially leverage the Spark-specific metadata stored with the Hive 
table and provide this context to the deserializer.

Or we can change the Spark schema type from SHORT/BYTE to INT, like what 
SparkSQL does in the 
[HiveExternalCatalog|https://github.com/apache/spark/blob/4df8512b11dc9cc3a179fd5ccedf91af1f3fc6ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L821].
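A minimal user-side sketch of that widening idea (widenSmallInts is a hypothetical helper, not an existing Spark API): cast BYTE/SHORT columns to INT before writing to Avro, so the schema that is written matches what the deserializer can read back.
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ByteType, ShortType, IntegerType}

// Hypothetical helper: widen every BYTE/SHORT column to INT before an Avro write.
def widenSmallInts(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case ByteType | ShortType =>
        acc.withColumn(field.name, col(field.name).cast(IntegerType))
      case _ => acc
    }
  }

// Usage with the repro above:
// widenSmallInts(df).write.mode("overwrite").format("avro").saveAsTable("t0")
{code}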


was (Author: JIRAUSER288838):
Thanks for the response, Erik.

I understand the concern. OTOH, in principle it is inconsistent and confusing 
that one can write a piece of data but cannot read it back via Spark/Avro. It’s 
almost equivalent to a data loss.

Moreover, DataFrame enforces explicit type checks so one can only write 
SHORT/BYTE-typed data into a SHORT/BYTE column. In this context, it is safe to 
downcast. And, it does not make sense that Avro’s lack of SHORT/BYTE type 
support breaks DataFrame operation.

The concern is valid under the context that the source of the serialized data 
is unknown, so potentially downcasting is unsafe.

 

One way to systematically address the issue is to determine whether Spark is 
the source of the serialized data, and permitting the cast in this context. 
Because the SELECT API is used, the data is retrieved from a table through Hive 
or another supported Spark store, and not from a standalone Avro file. We could 
then potentially leverage Spark-specific metadata stored with the Hive table 
and provide this context to the deserializer.

Or we can change the Spark schema type from SHORT/BYTE to INT, like what 
SparkSQL does in the 
[HiveExternalCatalog|https://github.com/apache/spark/blob/4df8512b11dc9cc3a179fd5ccedf91af1f3fc6ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L821].

> IncompatibleSchemaException when selecting data from table stored from a 
> DataFrame in Avro format with BYTE/SHORT
> -
>
> Key: SPARK-39075
> URL: https://issues.apache.org/jira/browse/SPARK-39075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to save a table constructed through a DataFrame with the 
> {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as 
> part of the schema.
> When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from 
> the table, we expect it to give back the inserted value. However, we instead 
> get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}.
> This appears to be caused by a missing case statement handling the {{(INT, 
> ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer 
> newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the 
> Avro package:
> {code:java}
> ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
> Execute the following:
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val schema = new StructType().add(StructField("c1", ShortType, true))
> val rdd = sc.parallelize(Seq(Row("-128".toShort)))
> val df = spark.createDataFrame(rdd, schema)
> df.write.mode("overwrite").format("avro").saveAsTable("t0")
> spark.sql("select * from t0;").show(false){code}
> Resulting error:
> {code:java}
> 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) 
> 

[jira] [Commented] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT

2022-05-03 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531398#comment-17531398
 ] 

xsys commented on SPARK-39075:
--

Thanks for the response, Erik.

I understand the concern. OTOH, in principle it is inconsistent and confusing 
that one can write a piece of data but cannot read it back via Spark/Avro. It’s 
almost equivalent to data loss.

Moreover, DataFrames enforce explicit type checks, so one can only write 
SHORT/BYTE-typed data into a SHORT/BYTE column. In this context, it is safe to 
downcast. And it does not make sense that Avro’s lack of SHORT/BYTE type 
support breaks DataFrame operations.

The concern is valid in a context where the source of the serialized data is 
unknown, since downcasting is then potentially unsafe.

One way to systematically address the issue is to determine whether Spark is 
the source of the serialized data, and permit the cast in that case. Because 
the SELECT API is used, the data is retrieved from a table through Hive or 
another supported Spark store, and not from a standalone Avro file. We could 
then potentially leverage the Spark-specific metadata stored with the Hive 
table and provide this context to the deserializer.

Or we can change the Spark schema type from SHORT/BYTE to INT, like what 
SparkSQL does in the 
[HiveExternalCatalog|https://github.com/apache/spark/blob/4df8512b11dc9cc3a179fd5ccedf91af1f3fc6ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L821].

> IncompatibleSchemaException when selecting data from table stored from a 
> DataFrame in Avro format with BYTE/SHORT
> -
>
> Key: SPARK-39075
> URL: https://issues.apache.org/jira/browse/SPARK-39075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to save a table constructed through a DataFrame with the 
> {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as 
> part of the schema.
> When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from 
> the table, we expect it to give back the inserted value. However, we instead 
> get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}.
> This appears to be caused by a missing case statement handling the {{(INT, 
> ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer 
> newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the 
> Avro package:
> {code:java}
> ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
> Execute the following:
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val schema = new StructType().add(StructField("c1", ShortType, true))
> val rdd = sc.parallelize(Seq(Row("-128".toShort)))
> val df = spark.createDataFrame(rdd, schema)
> df.write.mode("overwrite").format("avro").saveAsTable("t0")
> spark.sql("select * from t0;").show(false){code}
> Resulting error:
> {code:java}
> 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) 
> org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro 
> type {"type":"record","name":"topLevelRecord","fields":[
> {"name":"c1","type":["int","null"]}
> ]} to SQL type STRUCT<`c1`: SMALLINT>. 
> at 
> org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
>  
> at 
> org.apache.spark.sql.avro.AvroDeserializer.(AvroDeserializer.scala:74) 
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anon$1.(AvroFileFormat.scala:143)
>  
> at 
> org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
>  
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) 
> at 
> 

[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT

2022-04-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-39075:
-
Description: 
h3. Describe the bug

We are trying to save a table constructed through a DataFrame with the {{Avro}} 
data format. The table contains {{ByteType}} or {{ShortType}} as part of the 
schema.

When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from 
the table, we expect it to give back the inserted value. However, we instead 
get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}.

This appears to be caused by a missing case statement handling the {{(INT, 
ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer 
newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = new StructType().add(StructField("c1", ShortType, true))
val rdd = sc.parallelize(Seq(Row("-128".toShort)))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("avro").saveAsTable("t0")
spark.sql("select * from t0;").show(false){code}
Resulting error:
{code:java}
22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) 
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields":[
{"name":"c1","type":["int","null"]}
]} to SQL type STRUCT<`c1`: SMALLINT>. 
at 
org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
 
at org.apache.spark.sql.avro.AvroDeserializer.(AvroDeserializer.scala:74) 
at 
org.apache.spark.sql.avro.AvroFileFormat$$anon$1.(AvroFileFormat.scala:143)
 
at 
org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
 
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
 
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
 
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) 
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
 
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
 
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
at org.apache.spark.scheduler.Task.run(Task.scala:131) 
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
 
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot 
convert Avro field 'c1' to SQL field 'c1' because schema is incompatible 
(avroType = "int", sqlType = SMALLINT) 
at 
org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321)
at 
org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356)
 
at 
org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84)
... 26 more
{code}
h3. Expected behavior & Possible Solution

We expect the output to successfully select {{{}-128{}}}. We tried other 
formats like Parquet and the outcome is consistent with this expectation.

In the [{{AvroSerializer 

[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT

2022-04-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-39075:
-
Description: 
h3. Describe the bug

We are trying to save a table constructed through a DataFrame with the {{Avro}} 
data format. The table contains {{ByteType}} or {{ShortType}} as part of the 
schema.

When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from 
the table, we expect it to give back the inserted value. However, we instead 
get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}.

This appears to be caused by a missing case statement handling the {{(INT, 
ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer 
newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = new StructType().add(StructField("c1", ShortType, true))
val rdd = sc.parallelize(Seq(Row("-128".toShort)))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("avro").saveAsTable("t0")
spark.sql("select * from t0;").show(false){code}
Resulting error:
{code:java}
22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) 
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields":[
{"name":"c1","type":["int","null"]}
]} to SQL type STRUCT<`c1`: SMALLINT>. 
at 
org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
 
at org.apache.spark.sql.avro.AvroDeserializer.(AvroDeserializer.scala:74) 
at 
org.apache.spark.sql.avro.AvroFileFormat$$anon$1.(AvroFileFormat.scala:143)
 
at 
org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
 
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
 
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
 
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) 
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
 
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
 
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
at org.apache.spark.scheduler.Task.run(Task.scala:131) 
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
 
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot 
convert Avro field 'c1' to SQL field 'c1' because schema is incompatible 
(avroType = "int", sqlType = SMALLINT) 
at 
org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321)
at 
org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356)
 
at 
org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84)
... 26 more
{code}
 
h3. Expected behavior & Possible Solution

We expect the output to successfully select {{{}-128{}}}. We tried other 
formats like Parquet and the outcome is consistent with this expectation.

In the [{{AvroSerializer 

[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT

2022-04-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-39075:
-
Description: 
h3. Describe the bug

We are trying to save a table constructed through a DataFrame with the {{Avro}} 
data format. The table contains {{ByteType}} or {{ShortType}} as part of the 
schema.

When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from 
the table, we expect it to give back the inserted value. However, we instead 
get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}.

This appears to be caused by a missing case statement handling the {{(INT, 
ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer 
newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1
Execute the following:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = new StructType().add(StructField("c1", ShortType, true))
val rdd = sc.parallelize(Seq(Row("-128".toShort)))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("avro").saveAsTable("t0")
spark.sql("select * from t0;").show(false)\{{}}
Resulting error:
22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) 
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields":[

{"name":"c1","type":["int","null"]}

]} to SQL type STRUCT<`c1`: SMALLINT>. 
at 
org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
 
at org.apache.spark.sql.avro.AvroDeserializer.(AvroDeserializer.scala:74) 
at 
org.apache.spark.sql.avro.AvroFileFormat$$anon$1.(AvroFileFormat.scala:143)
 
at 
org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
 
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
 
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
 
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) 
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
 
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
 
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
at org.apache.spark.scheduler.Task.run(Task.scala:131) 
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
 
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot 
convert Avro field 'c1' to SQL field 'c1' because schema is incompatible 
(avroType = "int", sqlType = SMALLINT) 
at 
org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321)
at 
org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356)
 
at 
org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84)
... 26 more
h3. Expected behavior & Possible Solution

We expect the output to successfully select {{{}-128{}}}. We tried other 
formats like Parquet and the outcome is consistent with this expectation.

In the [{{AvroSerializer 
newConverter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114]{{{},
 

[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT

2022-04-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-39075:
-
Description: 
h3. Describe the bug

We are trying to save a table constructed through a DataFrame with the {{Avro}} 
data format. The table contains {{ByteType}} or {{ShortType}} as part of the 
schema.

When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from 
the table, we expect it to give back the inserted value. However, we instead 
get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}.

This appears to be caused by a missing case statement handling the {{(INT, 
ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer 
newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1
Execute the following:
import org.apache.spark.sql.\{Row, SparkSession}
import org.apache.spark.sql.types._
val schema = new StructType().add(StructField("c1", ShortType, true))
val rdd = sc.parallelize(Seq(Row("-128".toShort)))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("avro").saveAsTable("t0")
spark.sql("select * from t0;").show(false)\{{}}
Resulting error:
22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) 
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type 
{"type":"record","name":"topLevelRecord","fields":[

{"name":"c1","type":["int","null"]}

]} to SQL type STRUCT<`c1`: SMALLINT>. 
at 
org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
 
at org.apache.spark.sql.avro.AvroDeserializer.(AvroDeserializer.scala:74) 
at 
org.apache.spark.sql.avro.AvroFileFormat$$anon$1.(AvroFileFormat.scala:143)
 
at 
org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
 
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
 
at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
 
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) 
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
 
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
 
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
at org.apache.spark.scheduler.Task.run(Task.scala:131) 
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
 
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot 
convert Avro field 'c1' to SQL field 'c1' because schema is incompatible 
(avroType = "int", sqlType = SMALLINT) 
at 
org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321)
at 
org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356)
 
at 
org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84)
... 26 more
h3. Expected behavior & Possible Solution

We expect the output to successfully select {{{}-128{}}}. We tried other 
formats like Parquet and the outcome is consistent with this expectation.

In the [{{AvroSerializer 

[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT

2022-04-29 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-39075:
-
Description: 
h3. Describe the bug

We are trying to save a table constructed through a DataFrame with the {{Avro}} 
data format. The table contains {{ByteType}} or {{ShortType}} as part of the 
schema.

When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from 
the table, we expect it to give back the inserted value. However, we instead 
get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}.

This appears to be caused by a missing case statement handling the {{(INT, 
ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer 
newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1
Execute the following:
import org.apache.spark.sql.\{Row, SparkSession}
import org.apache.spark.sql.types._
val schema = new StructType().add(StructField("c1", ShortType, true))
val rdd = sc.parallelize(Seq(Row("-128".toShort)))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("avro").saveAsTable("t0")
spark.sql("select * from t0;").show(false)\{{}}
Resulting error:
{code:java}
22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: SMALLINT>.
at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:74)
at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.<init>(AvroFileFormat.scala:143)
at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro field 'c1' to SQL field 'c1' because schema is incompatible (avroType = "int", sqlType = SMALLINT)
at org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321)
at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356)
at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84)
... 26 more{code}
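As a side check in the same spark-shell session (this should succeed even though the read fails, since resolving the table only touches metadata), the table schema shows the SMALLINT column the deserializer is asked to fill from the Avro {{"int"}} field:
{code:java}
// Metadata-only check; expected to print c1 as short (i.e. SMALLINT).
spark.table("t0").printSchema()
{code}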
h3. Expected behavior & Possible Solution

We expect the output to successfully select {{{}-128{}}}. We tried other 
formats like Parquet and the outcome is consistent with this expectation.
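For reference, a minimal sketch of the same round trip with Parquet, run in the same spark-shell session (the table name {{t0_parquet}} is ours), which, as noted above, returns the inserted value as expected:
{code:java}
// Same round trip as the Avro reproduction, but with Parquet.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = new StructType().add(StructField("c1", ShortType, true))
val rdd = sc.parallelize(Seq(Row("-128".toShort)))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("parquet").saveAsTable("t0_parquet")
spark.sql("select * from t0_parquet;").show(false) // expected: -128
{code}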

In the [{{AvroSerializer 

[jira] [Created] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT

2022-04-29 Thread xsys (Jira)
xsys created SPARK-39075:


 Summary: IncompatibleSchemaException when selecting data from 
table stored from a DataFrame in Avro format with BYTE/SHORT
 Key: SPARK-39075
 URL: https://issues.apache.org/jira/browse/SPARK-39075
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

We are trying to save a table constructed through a DataFrame with the {{Avro}} 
data format. The table contains {{ByteType}} or {{ShortType}} as part of the 
schema.

When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from 
the table, we expect it to give back the inserted value. However, we instead 
get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}.

This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
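One way to observe the write/read asymmetry is to read the written Avro files directly, bypassing the table's SMALLINT schema. A sketch, assuming the default {{spark-warehouse/t0}} location for the managed table {{t0}} created in the reproduction below: the embedded Avro schema declares {{c1}} as {{"int"}}, so the direct read should infer an integer column and succeed, while reading through the table fails in {{AvroDeserializer}}.
{code:java}
// Sketch only: direct read of the table's Avro files (default warehouse path assumed).
val raw = spark.read.format("avro").load("spark-warehouse/t0")
raw.printSchema() // expected to show c1 as integer, matching avroType = "int" in the error
raw.show(false)
{code}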
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro 
package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val schema = new StructType().add(StructField("c1", ShortType, true))
val rdd = sc.parallelize(Seq(Row("-128".toShort)))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("avro").saveAsTable("t0")
spark.sql("select * from t0;").show(false){code}
Resulting error:
22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: SMALLINT>.
at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:74)
at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.<init>(AvroFileFormat.scala:143)
at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)