[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot
[ https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40637:
-
Description:
h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) via either {{spark-shell}} or {{spark-sql}} and then read it back from {{spark-shell}}, it outputs {{[01]}}. However, the value does not display correctly when queried via {{spark-sql}}. That is:

Insert via spark-shell, read via spark-shell: displays correctly
Insert via spark-shell, read via spark-sql: does not display correctly
Insert via spark-sql, read via spark-sql: does not display correctly
Insert via spark-sql, read via spark-shell: displays correctly

h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[356] at parallelize at :28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
scala> spark.sql("select * from binary_vals_shell;").show(false)
+----+
|c1  |
+----+
|[01]|
+----+{code}
Then we use {{spark-sql}} to (1) query what was inserted via spark-shell into the binary_vals_shell table, and (2) insert the value via spark-sql into the binary_vals_sql table (we use {{tee}} to redirect the output to a log file):
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
Executing the following, we only get empty output in the terminal (but a garbage character in the log file):
{code:java}
spark-sql>
select * from binary_vals_shell; -- query what was inserted via spark-shell
spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals_sql select X'01'; -- try to insert directly in spark-sql
spark-sql> select * from binary_vals_sql;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find that the value shows up as a garbage character. (We never encountered this garbage character in the logs of other data types.)

!image-2022-10-18-12-15-05-576.png!

We then return to spark-shell and run the following:
{code:java}
scala> spark.sql("select * from binary_vals_sql;").show(false)
+----+
|c1  |
+----+
|[01]|
+----+{code}
Even though the binary value does not display correctly via spark-sql, it still displays correctly via spark-shell.

h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type ({{BINARY}}) & input ({{BigInt("1").toByteArray}} / {{X'01'}}) combination.

h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this is format-independent.
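A plausible reading of the discrepancy above (an assumption about the two CLIs' formatting, not a confirmed root cause): {{show(false)}} hex-encodes byte arrays, while the spark-sql CLI emits the raw bytes, and the raw byte 0x01 is an unprintable control character. A minimal Python sketch of the two renderings:

```python
# Hypothetical contrast between the two renderings of the value from the
# report: BigInt("1").toByteArray, i.e. the single byte 0x01.
value = bytes([0x01])

# spark-shell / DataFrame.show() style: hex-encode each byte -> visible "[01]"
shown = "[" + "".join(f"{b:02x}" for b in value) + "]"

# spark-sql CLI style (as reported): the raw byte reaches the terminal as the
# unprintable control character U+0001, which looks like empty output.
raw = value.decode("latin-1")

print(shown)        # [01]
print(repr(raw))    # '\x01'
```

This would explain why the terminal appears empty while {{tee}} still captures a one-byte "garbage character" in the log.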
[jira] [Commented] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot
[ https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17619715#comment-17619715 ] xsys commented on SPARK-40637:
--
Thank you for the response, [~ivan.sadikov]. I just updated the description with more details about how to reproduce it (including writing the value to the table in the first example). Basically, when we insert the binary value via either spark-shell or spark-sql, spark-shell displays it correctly but spark-sql does not. We also tried Avro and Parquet and encountered the same issue. We believe this is format-independent.

> Spark-shell can correctly encode BINARY type but Spark-sql cannot
> -
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: xsys
> Priority: Major
> Attachments: image-2022-10-18-12-15-05-576.png
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
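As the comment notes, the two input forms used across the interfaces denote the same single byte, so the inconsistency cannot come from the inputs themselves. A quick illustration (Python stdlib used purely as a stand-in for the Scala and SQL expressions):

```python
# BigInt("1").toByteArray in spark-shell and the literal X'01' in spark-sql
# both denote the one-byte array 0x01; stdlib analogues of each:
from_scala = (1).to_bytes(1, byteorder="big")  # stands in for BigInt("1").toByteArray
from_sql = bytes.fromhex("01")                 # stands in for X'01'

print(from_scala == from_sql)  # True
print(list(from_scala))        # [1]
```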
[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot
[ https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40637:
-
Summary: Spark-shell can correctly encode BINARY type but Spark-sql cannot (was: DataFrame can correctly encode BINARY type but SparkSQL cannot)
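The "garbage character" captured through {{tee}} in the report can be confirmed at the byte level; a minimal shell sketch (POSIX {{printf}} and {{od}}; {{sql.log}} is the filename from the report):

```shell
# Write the raw byte 0x01 the way the spark-sql CLI reportedly emits it,
# capture it with tee as in the report, and hex-dump the log to make it visible.
printf '\001' | tee sql.log > /dev/null
od -An -tx1 sql.log   # the lone byte shows up as: 01
```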
[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot
[ https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40637: - Description:

h3. Describe the bug

When we store a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) via {{spark-shell}} or {{spark-sql}} and then read it back via {{spark-shell}}, it displays correctly as {{[01]}}. However, the value does not display correctly when queried via {{spark-sql}}.

h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types._

scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[356] at parallelize at <console>:28

scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,BinaryType,true))

scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]

scala> df.show(false)
+----+
|c1  |
+----+
|[01]|
+----+

scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")

scala> spark.sql("select * from binary_vals;").show(false)
+----+
|c1  |
+----+
|[01]|
+----+{code}
The value displays correctly via {{spark.sql}} in {{spark-shell}}.

Then using {{spark-sql}} (we use {{tee}} to redirect the output to a log file):
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
Executing the following, we only get an empty output in the terminal (but a garbage character in the log file):
{code:java}
spark-sql> select * from binary_vals; -- check what was inserted via spark-shell
spark-sql> drop table binary_vals;
spark-sql> create table binary_vals(c1 BINARY) stored as ORC; -- try to insert directly in spark-sql
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
From the log file, we find the value shows up as a garbage character. (We never encountered such a garbage character in the logs of other data types.)

!image-2022-10-18-12-15-05-576.png!

h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type ({{BINARY}}) & input ({{BigInt("1").toByteArray}} / {{X'01'}}) combination.

h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe it is format-independent.
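The {{[01]}} seen in {{spark-shell}} is each byte of the array rendered as two hex digits, whereas the {{spark-sql}} CLI prints the raw bytes to stdout, so the single {{0x01}} control byte appears as an invisible/garbage character in the terminal and log. A minimal plain-Scala sketch of that hex rendering (independent of Spark; shown only to illustrate the encoding, not taken from the report):

```scala
// BigInt("1").toByteArray yields Array(0x01): a single byte.
// spark-shell's df.show renders such an array as [01] (two hex digits
// per byte); printing the raw byte instead emits the unprintable
// control character 0x01, which is what the spark-sql log captures.
object HexDemo {
  def toHex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString

  def main(args: Array[String]): Unit = {
    val bytes = BigInt("1").toByteArray
    println(s"[${toHex(bytes)}]") // prints [01], matching spark-shell's display
  }
}
```

A possible workaround in {{spark-sql}} (our suggestion, not part of the original report) is to select {{hex(c1)}} instead of the raw column, which yields a printable representation.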
[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot
[ https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40637: - Attachment: image-2022-10-18-12-15-05-576.png

> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --------------------------------------------------------------
>
>                 Key: SPARK-40637
>                 URL: https://issues.apache.org/jira/browse/SPARK-40637
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>         Attachments: image-2022-10-18-12-15-05-576.png

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40630) Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL
[ https://issues.apache.org/jira/browse/SPARK-40630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40630:

Description:

h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP value (e.g. {{1969-12-31 23:59:59 B}}) via {{spark-shell}}, or insert an invalid DATE/TIMESTAMP into a table via {{spark-sql}}, both interfaces unexpectedly evaluate the invalid value to {{NULL}} instead of throwing an exception.

h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B " as timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
Using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions._
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B ").toDF("time").select(to_timestamp(col("time")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[721] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", TimestampType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
+----+
|c1  |
+----+
|null|
+----+
{code}
h3. Expected behavior

We expect both the {{spark-sql}} and {{spark-shell}} interfaces to throw an exception for an invalid DATE/TIMESTAMP, as they do for most other data types (e.g. the invalid value {{"foo"}} for the {{INT}} data type).

> Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL
>
> Key: SPARK-40630
> URL: https://issues.apache.org/jira/browse/SPARK-40630
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, SQL
> Affects Versions: 3.2.1
> Reporter: xsys
> Priority: Major
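The {{NULL}} above appears to reflect Spark's lenient (non-ANSI) cast behavior; a strict JVM-level parser rejects the same literal outright. A minimal plain-Java sketch of that contrast (not Spark code; the class and helper names here are our own illustration):

```java
import java.sql.Timestamp;

// Hypothetical helper: does the strict JDBC-escape timestamp parser accept s?
public class TimestampParse {
    static boolean parses(String s) {
        try {
            // Timestamp.valueOf requires exactly yyyy-[m]m-[d]d hh:mm:ss[.f...]
            Timestamp.valueOf(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("1969-12-31 23:59:59"));     // true: well-formed literal
        System.out.println(parses(" 1969-12-31 23:59:59 B ")); // false: the literal from this issue
    }
}
```

Spark's cast with {{spark.sql.ansi.enabled=false}} (the 3.2.x default) instead swallows the parse failure and yields {{NULL}}, which is the behavior both repros observe.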
[jira] [Updated] (SPARK-40629) FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40629:

Description:

h3. Describe the bug

Storing a FLOAT/DOUBLE value produced by division by 0 (e.g. {{( 1.0/0 ).floatValue()}}) via {{spark-shell}} outputs {{Infinity}}. However, the same value ({{cast ( 1.0/0 as float)}}) evaluates to {{NULL}} if it is inserted into a FLOAT/DOUBLE column of a table via {{spark-sql}}.

h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast ( 1.0/0 as float);
spark-sql> select * from float_vals;
NULL{code}
Using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(( 1.0/0 ).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[180] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", FloatType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,FloatType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]
scala> df.show(false)
+--------+
|c1      |
+--------+
|Infinity|
+--------+
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} and {{spark-shell}}) to behave consistently for the same data type, input, and configuration ({{FLOAT/DOUBLE}} and {{1.0/0}}).

> FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL
>
> Key: SPARK-40629
> URL: https://issues.apache.org/jira/browse/SPARK-40629
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, SQL
> Affects Versions: 3.2.1
> Reporter: xsys
> Priority: Major
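For context on why the DataFrame path shows {{Infinity}}: JVM floating-point arithmetic follows IEEE 754, where dividing a float by zero is well-defined and never throws. A small plain-Java sketch of those semantics (our own illustration, not Spark code):

```java
// IEEE-754 float division on the JVM: nonzero/0 gives +/-Infinity, 0/0 gives NaN.
// No exception is thrown for any of these, unlike integer division by zero.
public class FloatDivByZero {
    static float div(float a, float b) {
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println(div(1.0f, 0.0f));  // Infinity
        System.out.println(div(-1.0f, 0.0f)); // -Infinity
        System.out.println(div(0.0f, 0.0f));  // NaN
    }
}
```

These three values are exactly the {{Infinity/-Infinity/NaN}} family named in the issue title; the {{spark-sql}} path replaces all of them with {{NULL}} on insert.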
[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40624:

Description:

h3. Describe the bug

Storing an invalid value (e.g. {{BigDecimal("1.0/0")}}) via {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a table via {{spark-sql}}.

h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following (evaluates to {{NULL}}):
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following (errors out during RDD creation):
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} and {{spark-shell}}) to behave consistently for the same data type and input combination ({{BigDecimal}}/{{DECIMAL(20,10)}} and {{1.0/0}}).

> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Affects Versions: 3.2.1
> Reporter: xsys
> Priority: Major
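The {{spark-shell}} failure can be reproduced at the JVM level without Spark: {{scala.math.BigDecimal}} delegates to {{java.math.BigDecimal}}, whose string constructor accepts only numeric literals, so {{"1.0/0"}} (an expression, not a number) is rejected before Spark is even involved. A minimal Java sketch (the helper name is ours):

```java
import java.math.BigDecimal;

// "1.0/0" is an arithmetic expression, not a numeric literal,
// so BigDecimal's string parser rejects it outright.
public class DecimalParse {
    static boolean parses(String s) {
        try {
            new BigDecimal(s);
            return true;
        } catch (NumberFormatException e) {
            // Same exception class as in the spark-shell stack trace above.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("1.0"));   // true
        System.out.println(parses("1.0/0")); // false
    }
}
```

On the {{spark-sql}} side, by contrast, {{1.0/0}} is evaluated as SQL division first, and the divide-by-zero result becomes {{NULL}} under the non-ANSI default, which is why the two interfaces diverge.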
[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot
[ https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40637:

Description:

h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) via {{spark-shell}} outputs {{[01]}}. However, the value does not display correctly if it is inserted into a BINARY column of a table via {{spark-sql}}.

h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[356] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
+----+
|c1  |
+----+
|[01]|
+----+
{code}
Using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following; the SELECT produces only an empty output:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} and {{spark-shell}}) to behave consistently for the same data type ({{BINARY}}) and input ({{BigInt("1").toByteArray}} / {{X'01'}}) combination.

> DataFrame can correctly encode BINARY type but SparkSQL cannot
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: xsys
> Priority: Major
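One plausible reading of the "empty" {{spark-sql}} output (and the garbage character seen when the session is piped through {{tee}}) is a rendering issue rather than a storage issue: the single byte {{0x01}} is the unprintable control character SOH when written raw to a terminal, whereas {{spark-shell}} hex-encodes it as {{[01]}}. A small plain-Java sketch of the two renderings (helper and class names are ours, not Spark's):

```java
import java.nio.charset.StandardCharsets;

public class BinaryDisplay {
    // Hex-encode a byte array, roughly what spark-shell's [01] rendering shows.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] value = new byte[] { 0x01 }; // the payload X'01' stores

        // Hex rendering: visible and unambiguous.
        System.out.println(toHex(value)); // prints "01"

        // Raw rendering: the same byte decodes to the control character
        // U+0001 (SOH), which shows as nothing or garbage in a terminal/log.
        String raw = new String(value, StandardCharsets.ISO_8859_1);
        System.out.println((int) raw.charAt(0)); // prints 1
    }
}
```

This is only an interpretation of the symptom, not a confirmed root cause; it is consistent, though, with the head of this thread, where reading the {{spark-sql}}-inserted value back through {{spark-shell}} displays correctly.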
[jira] [Created] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot
xsys created SPARK-40637:

Summary: DataFrame can correctly encode BINARY type but SparkSQL cannot
Key: SPARK-40637
URL: https://issues.apache.org/jira/browse/SPARK-40637
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.1
Reporter: xsys

-- This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40629) FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40629:
Description:
h3. Describe the bug
Storing a FLOAT/DOUBLE value produced by division by 0 (e.g. {{(1.0/0).floatValue()}}) via {{spark-shell}} outputs {{Infinity}}. However, the same value ({{cast(1.0/0 as float)}}) evaluates to {{NULL}} if it is inserted into a FLOAT/DOUBLE column of a table via {{spark-sql}}.
h3. To Reproduce
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast(1.0/0 as float);
spark-sql> select * from float_vals;
NULL{code}
Using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row((1.0/0).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[180] at parallelize at <console>:28

scala> val schema = new StructType().add(StructField("c1", FloatType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,FloatType,true))

scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]

scala> df.show(false)
+--------+
|c1      |
+--------+
|Infinity|
+--------+
{code}
h3. Expected behavior
We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type, input, and configuration ({{FLOAT/DOUBLE}} and {{1.0/0}}).
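For context on the DataFrame side of the discrepancy: the {{Infinity}}/{{-Infinity}}/{{NaN}} values come straight from IEEE 754 semantics on the JVM, where floating-point division by zero never throws. A minimal plain-Java sketch (not Spark code) of the values the shell path stores:

```java
public class FloatDivByZero {
    public static void main(String[] args) {
        // On the JVM, floating-point division by zero does not throw;
        // it produces the IEEE 754 special values instead.
        float posInf = (float) (1.0 / 0);  // Infinity
        float negInf = (float) (-1.0 / 0); // -Infinity
        float nan = (float) (0.0 / 0);     // NaN
        System.out.println(posInf + " " + negInf + " " + nan);
        // prints: Infinity -Infinity NaN
    }
}
```

The {{NULL}} on the spark-sql side is therefore a conversion choice in the insert path, not a property of the underlying arithmetic.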
[jira] [Created] (SPARK-40630) Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL
xsys created SPARK-40630:
Summary: Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL
Key: SPARK-40630
URL: https://issues.apache.org/jira/browse/SPARK-40630
Project: Spark
Issue Type: Bug
Components: Spark Shell, SQL
Affects Versions: 3.2.1
Reporter: xsys
h3. Describe the bug
When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. {{1969-12-31 23:59:59 B}}) via {{spark-shell}}, or insert an invalid DATE/TIMESTAMP into a table via {{spark-sql}}, both interfaces unexpectedly evaluate the invalid value to {{NULL}} instead of throwing an exception.
h3. To Reproduce
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B " as timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
Using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B ").toDF("time").select(to_timestamp(col("time")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[721] at parallelize at <console>:28

scala> val schema = new StructType().add(StructField("c1", TimestampType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,TimestampType,true))

scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: timestamp]

scala> df.show(false)
+----+
|c1  |
+----+
|null|
+----+
{code}
h3. Expected behavior
We expect both the {{spark-sql}} & {{spark-shell}} interfaces to throw an exception for an invalid DATE/TIMESTAMP, as they do for most other data types (e.g. the invalid value {{"foo"}} for the {{INT}} data type).
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
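By way of comparison, the JDK's own timestamp parsing rejects this input loudly rather than yielding null. A small plain-Java sketch (not Spark code) of the exception-on-invalid behavior the report expects; note that {{NumberFormatException}} is a subclass of {{IllegalArgumentException}}, so one catch covers both:

```java
import java.sql.Timestamp;

public class InvalidTimestamp {
    public static void main(String[] args) {
        try {
            // Timestamp.valueOf requires the strict format
            // yyyy-[m]m-[d]d hh:mm:ss[.f...]; the stray " B" makes it invalid.
            Timestamp t = Timestamp.valueOf("1969-12-31 23:59:59 B");
            System.out.println("parsed: " + t);
        } catch (IllegalArgumentException e) {
            // The JDK throws instead of silently returning null.
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```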
[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40624:
Description:
h3. Describe the bug
Storing an invalid value (e.g. {{BigDecimal("1.0/0")}}) via {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a table via {{spark-sql}}.
h3. To Reproduce
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following (evaluates to {{NULL}}):
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals select 1.0/0;
spark-sql> select * from decimal_vals;
NULL{code}
Using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following (errors out during RDD creation):
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior
We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type & input combination ({{BigDecimal}}/{{DECIMAL(20,10)}} and {{1.0/0}}).
Key: SPARK-40624
URL: https://issues.apache.org/jira/browse/SPARK-40624
Project: Spark
Issue Type: Bug
Components: Spark Shell
Affects Versions: 3.2.1
Reporter: xsys
Priority: Major
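The shell-side failure above is just {{java.math.BigDecimal}} rejecting a string that is a division expression rather than a numeric literal. A plain-Java sketch (not Spark code) reproducing the same exception:

```java
import java.math.BigDecimal;

public class DecimalParse {
    public static void main(String[] args) {
        try {
            // "1.0/0" is an expression, not a decimal literal,
            // so the BigDecimal(String) constructor throws.
            new BigDecimal("1.0/0");
            System.out.println("parsed");
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException");
            // prints: NumberFormatException
        }
        // A valid literal parses fine and keeps its scale:
        System.out.println(new BigDecimal("1.0")); // prints: 1.0
    }
}
```

So the inconsistency is between the constructor throwing eagerly on the shell path and spark-sql evaluating {{1.0/0}} to {{NULL}} on insert.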
[jira] [Created] (SPARK-40629) FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL
xsys created SPARK-40629:
Summary: FLOAT/DOUBLE division by 0 gives Infinity/-Infinity/NaN in DataFrame but NULL in SparkSQL
Key: SPARK-40629
URL: https://issues.apache.org/jira/browse/SPARK-40629
Project: Spark
Issue Type: Bug
Components: Spark Shell, SQL
Affects Versions: 3.2.1
Reporter: xsys
h3. Describe the bug
Storing a FLOAT/DOUBLE value produced by division by 0 (e.g. {{(1.0/0).floatValue()}}) via {{spark-shell}} outputs {{Infinity}}. However, the same value ({{cast(1.0/0 as float)}}) evaluates to {{NULL}} if it is inserted into a FLOAT/DOUBLE column of a table via {{spark-sql}}.
h3. To Reproduce
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table float_vals(c1 float) stored as ORC;
spark-sql> insert into float_vals select cast(1.0/0 as float);
spark-sql> select * from float_vals;
NULL{code}
Using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row((1.0/0).floatValue())))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[180] at parallelize at <console>:28

scala> val schema = new StructType().add(StructField("c1", FloatType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,FloatType,true))

scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: float]

scala> df.show(false)
+--------+
|c1      |
+--------+
|Infinity|
+--------+
{code}
h3. Expected behavior
We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type, input, and configuration ({{FLOAT/DOUBLE}} and {{1.0/0}}).
[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40624: - Description: h3. Describe the bug Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a table via {{{}spark-sql{}}}. h3. To Reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} Execute the following: (evaluated to {{{}NULL{}}}) {code:java} spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC; spark-sql> insert into decimal_vals 1.0/0; spark-sql> select * from ws71; 71 NULL{code} Using {{{}spark-shell{}}}: {code:java} $SPARK_HOME/bin/spark-shell{code} Execute the following: (errors out during RDD creation) {code:java} scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0" java.lang.NumberFormatException at java.math.BigDecimal.(BigDecimal.java:497) at java.math.BigDecimal.(BigDecimal.java:383) at java.math.BigDecimal.(BigDecimal.java:809) at scala.math.BigDecimal$.exact(BigDecimal.scala:126) at scala.math.BigDecimal$.apply(BigDecimal.scala:284) ... 49 elided{code} h3. Expected behavior We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to behave consistently for the same data type & input combination ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}). was: h3. Describe the bug Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluated to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a table via {{{}spark-sql{}}}. h3. 
To Reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} Execute the following: (evaluated to {{{}NULL{}}}) {code:java} spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC; spark-sql> insert into decimal_vals select 1.0/0; spark-sql> select * from decimal_vals; NULL{code} Using {{{}spark-shell{}}}: {code:java} $SPARK_HOME/bin/spark-shell{code} Execute the following: (errors out during RDD creation) {code:java} scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0")))) java.lang.NumberFormatException at java.math.BigDecimal.<init>(BigDecimal.java:497) at java.math.BigDecimal.<init>(BigDecimal.java:383) at java.math.BigDecimal.<init>(BigDecimal.java:809) at scala.math.BigDecimal$.exact(BigDecimal.scala:126) at scala.math.BigDecimal$.apply(BigDecimal.scala:284) ... 49 elided{code} h3. Expected behavior We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to behave consistently for the same data type & input combination ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}). > A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL > in SparkSQL > > > Key: SPARK-40624 > URL: https://issues.apache.org/jira/browse/SPARK-40624 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via > {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates > to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a > table via {{{}spark-sql{}}}. > h3. 
To Reproduce > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > $SPARK_HOME/bin/spark-sql{code} > Execute the following: (evaluated to {{{}NULL{}}}) > {code:java} > spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC; > spark-sql> insert into decimal_vals select 1.0/0; > spark-sql> select * from decimal_vals; > NULL{code} > Using {{{}spark-shell{}}}: > {code:java} > $SPARK_HOME/bin/spark-shell{code} > Execute the following: (errors out during RDD creation) > {code:java} > scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0")))) > java.lang.NumberFormatException > at java.math.BigDecimal.<init>(BigDecimal.java:497) > at java.math.BigDecimal.<init>(BigDecimal.java:383) > at java.math.BigDecimal.<init>(BigDecimal.java:809) > at scala.math.BigDecimal$.exact(BigDecimal.scala:126) > at scala.math.BigDecimal$.apply(BigDecimal.scala:284) > ... 49 elided{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}). > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL
xsys created SPARK-40624: Summary: A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL Key: SPARK-40624 URL: https://issues.apache.org/jira/browse/SPARK-40624 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 3.2.1 Reporter: xsys h3. Describe the bug Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluated to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a table via {{{}spark-sql{}}}. h3. To Reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} Execute the following: (evaluated to {{{}NULL{}}}) {code:java} spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC; spark-sql> insert into decimal_vals select 1.0/0; spark-sql> select * from decimal_vals; NULL{code} Using {{{}spark-shell{}}}: {code:java} $SPARK_HOME/bin/spark-shell{code} Execute the following: (errors out during RDD creation) {code:java} scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0")))) java.lang.NumberFormatException at java.math.BigDecimal.<init>(BigDecimal.java:497) at java.math.BigDecimal.<init>(BigDecimal.java:383) at java.math.BigDecimal.<init>(BigDecimal.java:809) at scala.math.BigDecimal$.exact(BigDecimal.scala:126) at scala.math.BigDecimal$.apply(BigDecimal.scala:284) ... 49 elided{code} h3. Expected behavior We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to behave consistently for the same data type & input combination ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
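The spark-shell failure above is ordinary {{java.math.BigDecimal}} parsing behavior, which this plain-Java sketch (class name ours) reproduces without Spark: {{BigDecimal}} accepts numeric literals only, so the string {{"1.0/0"}} throws {{NumberFormatException}}, whereas spark-sql treats {{1.0/0}} as an expression and evaluates it before insertion.

```java
import java.math.BigDecimal;

public class DecimalDivZero {
    public static void main(String[] args) {
        // The DataFrame repro hands the literal string "1.0/0" to BigDecimal,
        // which parses numbers, not expressions -- hence NumberFormatException.
        try {
            new BigDecimal("1.0/0");
        } catch (NumberFormatException e) {
            System.out.println("rejected: not a numeric literal");
        }
        // A plain numeric literal parses fine; it is the expression form
        // that only spark-sql evaluates (to NULL) before the insert.
        System.out.println(new BigDecimal("1.0")); // 1.0
    }
}
```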
[jira] [Updated] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types
[ https://issues.apache.org/jira/browse/SPARK-40616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40616: - Description: h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision, precision is lost. (8.888e9 interpreted as 88.90 instead of 88.88). This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} In the shell: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. was: h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision, precision is lost. This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. 
To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} In the shell: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. > Loss of precision using SparkSQL shell on high-precision DECIMAL types > -- > > Key: SPARK-40616 > URL: https://issues.apache.org/jira/browse/SPARK-40616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to save {{DECIMAL}} values with high precision in a table using > the SparkSQL shell. When we {{INSERT}} decimal values with precision higher > than the standard double precision, precision is lost. > (8.888e9 interpreted as 88.90 instead of > 88.88). > This seems to be caused by type inference at shell parsing inferring that the > value is a double type. > h3. 
To reproduce > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > $SPARK_HOME/bin/spark-sql{code} > In the shell: > {code:java} > CREATE TABLE t(c0 DECIMAL(20,10)); > INSERT INTO t VALUES (8.888e9); > SELECT * FROM t;{code} > Executing the above gives this: > {code:java} > spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); > 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be > created as there is no table provider specified. You can set > spark.sql.legacy.createHiveTableByDefault to false so that native data source > table will be created instead. > Time taken: 0.118 seconds > spark-sql> INSERT INTO t VALUES (8.888e9); > Time taken: 0.392 seconds > spark-sql> SELECT * FROM t; > 88.90 > Time taken: 0.197 seconds, Fetched 1 row(s){code} > h3. Expected behavior > We expect the inserted value to retain the precision as determined by the > parameters
[jira] [Updated] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types
[ https://issues.apache.org/jira/browse/SPARK-40616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40616: - Description: h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision. This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} In the shell: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. was: h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision. This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. 
To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} In the shell: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. > Loss of precision using SparkSQL shell on high-precision DECIMAL types > -- > > Key: SPARK-40616 > URL: https://issues.apache.org/jira/browse/SPARK-40616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to save {{DECIMAL}} values with high precision in a table using > the SparkSQL shell. > When we {{INSERT}} decimal values with precision higher than the standard > double precision. > This seems to be caused by type inference at shell parsing inferring that the > value is a double type. > h3. 
To reproduce > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > $SPARK_HOME/bin/spark-sql{code} > > In the shell: > {code:java} > CREATE TABLE t(c0 DECIMAL(20,10)); > INSERT INTO t VALUES (8.888e9); > SELECT * FROM t;{code} > Executing the above gives this: > > {code:java} > spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); > 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be > created as there is no table provider specified. You can set > spark.sql.legacy.createHiveTableByDefault to false so that native data source > table will be created instead. > Time taken: 0.118 seconds > spark-sql> INSERT INTO t VALUES (8.888e9); > Time taken: 0.392 seconds > spark-sql> SELECT * FROM t; > 88.90 > Time taken: 0.197 seconds, Fetched 1 row(s){code} > h3. Expected behavior > We expect the inserted value to retain the precision as determined by the > parameters for the {{DECIMAL}} type. For example, we expect the example above > to return {{{}88.88{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types
[ https://issues.apache.org/jira/browse/SPARK-40616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40616: - Description: h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision, precision is lost. This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} In the shell: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. was: h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision. This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. 
To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} In the shell: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. > Loss of precision using SparkSQL shell on high-precision DECIMAL types > -- > > Key: SPARK-40616 > URL: https://issues.apache.org/jira/browse/SPARK-40616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to save {{DECIMAL}} values with high precision in a table using > the SparkSQL shell. When we {{INSERT}} decimal values with precision higher > than the standard double precision, precision is lost. > This seems to be caused by type inference at shell parsing inferring that the > value is a double type. > h3. 
To reproduce > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > $SPARK_HOME/bin/spark-sql{code} > In the shell: > {code:java} > CREATE TABLE t(c0 DECIMAL(20,10)); > INSERT INTO t VALUES (8.888e9); > SELECT * FROM t;{code} > Executing the above gives this: > {code:java} > spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); > 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be > created as there is no table provider specified. You can set > spark.sql.legacy.createHiveTableByDefault to false so that native data source > table will be created instead. > Time taken: 0.118 seconds > spark-sql> INSERT INTO t VALUES (8.888e9); > Time taken: 0.392 seconds > spark-sql> SELECT * FROM t; > 88.90 > Time taken: 0.197 seconds, Fetched 1 row(s){code} > h3. Expected behavior > We expect the inserted value to retain the precision as determined by the > parameters for the {{DECIMAL}} type. For example, we expect the example above > to return {{{}88.88{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types
[ https://issues.apache.org/jira/browse/SPARK-40616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40616: - Description: h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision. This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} In the shell: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. was: h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision. This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. 
To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql{code} In the shell: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. > Loss of precision using SparkSQL shell on high-precision DECIMAL types > -- > > Key: SPARK-40616 > URL: https://issues.apache.org/jira/browse/SPARK-40616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to save {{DECIMAL}} values with high precision in a table using > the SparkSQL shell. > When we {{INSERT}} decimal values with precision higher than the standard > double precision. > This seems to be caused by type inference at shell parsing inferring that the > value is a double type. > h3. 
To reproduce > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > $SPARK_HOME/bin/spark-sql{code} > In the shell: > {code:java} > CREATE TABLE t(c0 DECIMAL(20,10)); > INSERT INTO t VALUES (8.888e9); > SELECT * FROM t;{code} > Executing the above gives this: > {code:java} > spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); > 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be > created as there is no table provider specified. You can set > spark.sql.legacy.createHiveTableByDefault to false so that native data source > table will be created instead. > Time taken: 0.118 seconds > spark-sql> INSERT INTO t VALUES (8.888e9); > Time taken: 0.392 seconds > spark-sql> SELECT * FROM t; > 88.90 > Time taken: 0.197 seconds, Fetched 1 row(s){code} > h3. Expected behavior > We expect the inserted value to retain the precision as determined by the > parameters for the {{DECIMAL}} type. For example, we expect the example above > to return {{{}88.88{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To
[jira] [Created] (SPARK-40616) Loss of precision using SparkSQL shell on high-precision DECIMAL types
xsys created SPARK-40616: Summary: Loss of precision using SparkSQL shell on high-precision DECIMAL types Key: SPARK-40616 URL: https://issues.apache.org/jira/browse/SPARK-40616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: xsys h3. Describe the bug We are trying to save {{DECIMAL}} values with high precision in a table using the SparkSQL shell. When we {{INSERT}} decimal values with precision higher than the standard double precision, precision is lost. This seems to be caused by type inference at shell parsing inferring that the value is a double type. h3. To reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} CREATE TABLE t(c0 DECIMAL(20,10)); INSERT INTO t VALUES (8.888e9); SELECT * FROM t;{code} Executing the above gives this: {code:java} spark-sql> CREATE TABLE t(c0 DECIMAL(20,10)); 22/09/29 11:28:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. Time taken: 0.118 seconds spark-sql> INSERT INTO t VALUES (8.888e9); Time taken: 0.392 seconds spark-sql> SELECT * FROM t; 88.90 Time taken: 0.197 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the inserted value to retain the precision as determined by the parameters for the {{DECIMAL}} type. For example, we expect the example above to return {{{}88.88{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
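The exact literals in this report appear mangled by the mail archive, but the round-trip it describes can be sketched in plain Java with a hypothetical 20-digit value (class name and literal are ours): parsing through {{double}} first, as the shell's type inference described above would, loses digits that {{BigDecimal}} retains.

```java
import java.math.BigDecimal;

public class DoubleVsDecimal {
    public static void main(String[] args) {
        // Hypothetical DECIMAL(20,10) literal with more significant digits
        // than a double's ~15-17 can represent.
        String literal = "8888888888.8888888888";

        BigDecimal exact = new BigDecimal(literal);   // keeps all 20 digits
        // Going through double first (as a shell parsing the literal as a
        // double type would) snaps to the nearest representable double.
        BigDecimal viaDouble = new BigDecimal(Double.parseDouble(literal));

        System.out.println(exact);     // 8888888888.8888888888
        System.out.println(viaDouble); // nearest double; trailing digits drift
        System.out.println(exact.compareTo(viaDouble) != 0); // true
    }
}
```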
[jira] [Updated] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40525: - Description: h3. Describe the bug Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} expectedly errors out. However, it is evaluated to a rounded value {{1}} if the value is inserted into the table via {{{}spark-sql{}}}. h3. Steps to reproduce: On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} $SPARK_HOME/bin/spark-sql {code} Execute the following: {code:java} spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC; 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. Time taken: 0.216 seconds spark-sql> insert into int_floating_point_vals select 1.1; Time taken: 1.747 seconds spark-sql> select * from int_floating_point_vals; 1 Time taken: 0.518 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to behave consistently for the same data type & input combination ({{{}INT{}}} and {{{}1.1{}}}). h4. 
Here is a simplified example in {{{}spark-shell{}}}, where insertion of the aforementioned value correctly raises an exception: On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}: {code:java} $SPARK_HOME/bin/spark-shell{code} Execute the following: {code:java} import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ val rdd = sc.parallelize(Seq(Row(1.1))) val schema = new StructType().add(StructField("c1", IntegerType, true)) val df = spark.createDataFrame(rdd, schema) df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") {code} The following exception is raised: {code:java} java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of int{code} was: h3. Describe the bug Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} expectedly errors out. However, it is evaluated to a rounded value {{1}} if the value is inserted into the table via {{{}spark-sql{}}}. h3. Steps to reproduce: On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:}}{}}} {code:java} $SPARK_HOME/bin/spark-sql {code} Execute the following: {code:java} spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC; 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. Time taken: 0.216 seconds spark-sql> insert into int_floating_point_vals select 1.1; Time taken: 1.747 seconds spark-sql> select * from int_floating_point_vals; 1 Time taken: 0.518 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to behave consistently for the same data type & input combination ({{{}INT{}}} and {{{}1.1{}}}). h4. 
Here is a simplified example in {{{}spark-shell{}}}, where insertion of the aforementioned value correctly raises an exception: On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}: {code:java} $SPARK_HOME/bin/spark-shell{code} Execute the following: {code:java} import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ val rdd = sc.parallelize(Seq(Row(1.1))) val schema = new StructType().add(StructField("c1", IntegerType, true)) val df = spark.createDataFrame(rdd, schema) df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") {code} The following exception is raised: {code:java} java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of int{code} > Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame > but evaluates to a rounded value in SparkSQL > -- > > Key: SPARK-40525 > URL: https://issues.apache.org/jira/browse/SPARK-40525 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} > expectedly errors out. However, it is evaluated to a rounded value {{1}} if > the value is inserted into the table via {{{}spark-sql{}}}. > h3. Steps to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > $SPARK_HOME/bin/spark-sql {code} > Execute the following: > {code:java} > spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC; > 22/09/19
[jira] [Updated] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40525: - Description: h3. Describe the bug Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} expectedly errors out. However, it is evaluated to a rounded value {{1}} if the value is inserted into the table via {{{}spark-sql{}}}. h3. Steps to reproduce: On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:}}{}}} {code:java} $SPARK_HOME/bin/spark-sql {code} Execute the following: {code:java} spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC; 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. Time taken: 0.216 seconds spark-sql> insert into int_floating_point_vals select 1.1; Time taken: 1.747 seconds spark-sql> select * from int_floating_point_vals; 1 Time taken: 0.518 seconds, Fetched 1 row(s){code} h3. Expected behavior We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to behave consistently for the same data type & input combination ({{{}INT{}}} and {{{}1.1{}}}). h4. 
h4. Here is a simplified example in {{spark-shell}}, where insertion of the aforementioned value correctly raises an exception.
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell
{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals")
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of int
{code}
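The divergence described above comes down to store-assignment semantics: the DataFrame writer validates the external runtime type against the declared schema, while the {{spark-sql}} INSERT path (pre-ANSI default) casts the literal, truncating the fractional part. A minimal Python sketch of the two behaviors (illustrative only; the function names are ours, not Spark's):

```python
def store_int_strict(value):
    """DataFrame-style behavior: reject a value whose runtime type
    does not match the declared schema (here, int)."""
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError(
            f"{type(value).__name__} is not a valid external type for schema of int"
        )
    return value


def store_int_legacy(value):
    """spark-sql LEGACY-style behavior: silently truncate toward zero,
    so 1.1 is stored as 1."""
    return int(value)


# The DataFrame path errors out on 1.1 ...
try:
    store_int_strict(1.1)
except TypeError as exc:
    print(exc)

# ... while the SQL path quietly stores the rounded value.
print(store_int_legacy(1.1))
```

This is only a model of the observed behavior, not Spark's actual code path; the real logic lives in Catalyst's type coercion and store-assignment rules.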
[jira] [Created] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
xsys created SPARK-40525: Summary: Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
Key: SPARK-40525
URL: https://issues.apache.org/jira/browse/SPARK-40525
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.1
Reporter: xsys
h3. Describe the bug
Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} errors out as expected. However, it is evaluated to a rounded value {{1}} if the value is inserted into the table via {{spark-sql}}.
h3. Steps to reproduce:
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql
{code}
Execute the following:
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s)
{code}
h3. Expected behavior
We expect the two Spark interfaces ({{spark-sql}} and {{spark-shell}}) to behave consistently for the same data type and input combination ({{INT}} and {{1.1}}).
h4. Here is a simplified example in {{spark-shell}}, where insertion of the aforementioned value correctly raises an exception.
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell
{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals")
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of int
{code}
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:23 PM:
---
[~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} works. However, I believe it could be non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} is the relevant setting. For instance, after inspecting the code, I thought {{nullOnOverflow}} was controlled by {{spark.sql.ansi.enabled}}, and I tried to achieve the desired behavior by altering it, but to no avail. Could we mention setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} in the error message?

> DECIMAL value with more precision than what is defined in the schema raises
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: xsys
> Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more
> precision than what is defined in the schema: {{DECIMAL(20,10)}}. This
> leads to a {{NULL}} value being stored if the table is created using
> DataFrames via {{spark-shell}}. 
However, it leads to the following
> exception if the table is created via {{spark-sql}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) cannot be represented as Decimal(20, 10){code}
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}, execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{spark-sql}} and {{spark-shell}})
> to behave consistently for the same data type and input combination
> ({{DECIMAL(20,10)}} and {{333.22}}).
> Here is a simplified example in {{spark-shell}}, where insertion of the
> aforementioned decimal value evaluates to a {{NULL}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"))))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), true))
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> +----+
> |  c1|
> +----+
> |null|
> +----+
> scala> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
> ({{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled}} in
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551]):
> {code:java}
> private[sql] def toPrecision(
>     precision: Int,
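The Root Cause section above points at the overflow branch in {{Decimal.toPrecision}}. The check itself is easy to model outside Spark. Below is a hedged sketch using Python's {{decimal}} module (not Spark's {{Decimal}} class; the 21-digit sample value is illustrative, since the exact value in the report is elided):

```python
from decimal import Decimal

def to_precision(value, precision, scale, null_on_overflow):
    """Model the choice in Spark's Decimal.toPrecision: a value needing
    more total digits than `precision` either becomes None (stored as
    NULL, the DataFrame path) or raises (the spark-sql path)."""
    # Rescale to the declared scale, then count total significant digits.
    quantized = value.quantize(Decimal(1).scaleb(-scale))
    digits = len(quantized.as_tuple().digits)
    if digits > precision:
        if null_on_overflow:
            return None
        raise ArithmeticError(
            f"{value} cannot be represented as Decimal({precision}, {scale})"
        )
    return quantized

# 11 integer digits + 10 fractional digits = 21 digits > precision 20.
overflowing = Decimal("33333333333.3333333333")
print(to_precision(overflowing, 20, 10, null_on_overflow=True))
```

As noted in the comment above, launching {{spark-sql}} with {{--conf spark.sql.storeAssignmentPolicy=LEGACY}} restores the non-raising behavior; the sketch's {{null_on_overflow}} flag stands in for the effect of that configuration, not for its actual wiring in SQLConf.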
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:23 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought nullOnOverflow is controlled by \{{spark.sql.ansi.enabled.}} I tried to achieve the desired behavior by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought nullOnOverflow is controlled by {{spark.sql.ansi.enabled. }}I tried to achieve the desired behavior by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. 
However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since 
hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, > scale:
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:22 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that nullOnOverflow is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting the code, I thought that nullOnOverflow is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. 
However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since 
hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:22 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting the code, I thought that nullOnOverflow is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy would work.}} For instance, after inspecting the code, I thought that nullOnOverflow is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. 
However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since 
hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:21 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy would work.}} {{ For instance, after inspecting the code, I thought that nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy would work. For instance, after inspecting the code, I thought that nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. 
However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since 
hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:21 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy would work.}} For instance, after inspecting the code, I thought that nullOnOverflow is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy would work.}} {{ For instance, after inspecting the code, I thought that nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. 
However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since 
hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ]

xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM:
-------------------------------------------------------

[~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} works. I believe it could be non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} was controlled by {{spark.sql.ansi.enabled}}, and I tried to achieve the desired behaviour by altering it, but to no avail. Could we mention setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} in the error message?

was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} works. I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled}} and I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message?

> DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40439
>                 URL: https://issues.apache.org/jira/browse/SPARK-40439
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more precision than what is defined in the schema: {{DECIMAL(20,10)}}. This leads to a {{NULL}} value being stored if the table is created using DataFrames via {{spark-shell}}.
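For readers hitting the same {{ArithmeticException}}, the workaround discussed in the comment above can be applied inside a {{spark-sql}} session before the insert. This is a session sketch rather than a tested recipe; the interaction between {{spark.sql.storeAssignmentPolicy}} and {{spark.sql.ansi.enabled}} varies across Spark versions:

```sql
-- Sketch: relax the store-assignment check for this session so the
-- overflowing decimal is handled as in the DataFrame path (NULL on
-- overflow) instead of raising an exception at insert time.
SET spark.sql.storeAssignmentPolicy=LEGACY;
insert into decimal_extra_precision select 333.22;
```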
[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xsys updated SPARK-40439:
-------------------------
    Description:
h3. Describe the bug
We are trying to store a DECIMAL value {{333.22}} with more precision than what is defined in the schema: {{DECIMAL(20,10)}}. This leads to a {{NULL}} value being stored if the table is created using DataFrames via {{spark-shell}}. However, it leads to the following exception if the table is created via {{spark-sql}}:
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}, execute the following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;{code}
h3. Expected behavior
We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type & input combination ({{DECIMAL(20,10)}} and {{333.22}}).
Here is a simplified example in {{spark-shell}}, where insertion of the aforementioned decimal value evaluates to a {{NULL}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,DecimalType(20,10),true))

scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]

scala> df.show()
+----+
|  c1|
+----+
|null|
+----+

scala> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
h3. Root Cause
The exception is raised from [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] ({{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled}} in [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551]):
{code:java}
private[sql] def toPrecision(
    precision: Int,
    scale: Int,
    roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
    nullOnOverflow: Boolean = true,
    context: SQLQueryContext = null): Decimal = {
  val copy = clone()
  if (copy.changePrecision(precision, scale, roundMode)) {
    copy
  } else {
    if (nullOnOverflow) {
      null
    } else {
      throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
        this, precision, scale, context)
    }
  }
}{code}
The above function is invoked from [toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756] (in Cast.scala). However, our attempt to insert {{333.22}} after setting {{spark.sql.ansi.enabled}} to {{False}} failed as well (which may be an independent issue).
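The null-versus-throw behaviour of {{toPrecision}} described in the Root Cause section can be mirrored in a few lines of standalone Java. This is our own sketch over {{java.math.BigDecimal}}, not Spark code; the {{nullOnOverflow}} parameter plays the role of the flag Spark derives from its SQL configuration:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ToPrecisionSketch {
    // Simplified model of Decimal.toPrecision: rescale the value, then
    // return it if it fits in `precision` total digits; otherwise either
    // return null (the silent DataFrame path) or throw (the spark-sql path),
    // depending on nullOnOverflow.
    public static BigDecimal toPrecision(BigDecimal v, int precision, int scale,
                                         boolean nullOnOverflow) {
        BigDecimal copy = v.setScale(scale, RoundingMode.HALF_UP);
        if (copy.precision() <= precision) {
            return copy;
        }
        if (nullOnOverflow) {
            return null; // value silently becomes NULL
        }
        throw new ArithmeticException(
            copy + " cannot be represented as Decimal(" + precision + ", " + scale + ")");
    }

    public static void main(String[] args) {
        // Overflowing value, lenient mode: prints null.
        System.out.println(toPrecision(new BigDecimal("12345678901.22"), 20, 10, true));
        // Overflowing value, strict mode: throws like the spark-sql insert.
        try {
            toPrecision(new BigDecimal("12345678901.22"), 20, 10, false);
        } catch (ArithmeticException e) {
            System.out.println("threw: " + e.getMessage());
        }
    }
}
```

The two branches correspond to the two user-visible behaviours in this ticket: the same out-of-range value yields {{NULL}} in one interface and an {{ArithmeticException}} in the other.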
[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40439: - Description: h3. Describe the bug We are trying to store a DECIMAL value {{333.22}} with more precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This leads to a {{NULL}} value being stored if the table is created using DataFrames via {{{}spark-shell{}}}. However, it leads to the following exception if the table is created via {{{}spark-sql{}}}: {code:java} Failed in [insert into decimal_extra_precision select 333.22] java.lang.ArithmeticException: Decimal(expanded,333.22,21,10}) cannot be represented as Decimal(20, 10){code} h3. Step to reproduce: On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: {code:java} create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; insert into decimal_extra_precision select 333.22;{code} Execute the following: {code:java} create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; insert into decimal_extra_precision select 333.22;{code} h3. Expected behavior We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to behave consistently for the same data type & input combination ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
Here is a simplified example in {{{}spark-shell{}}}, where insertion of the aforementioned decimal value evaluates to a {{{}NULL{}}}: {code:java} scala> import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.{Row, SparkSession} scala> import org.apache.spark.sql.types._ import org.apache.spark.sql.types._ scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22" rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at :27 scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), true)) schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,DecimalType(20,10),true)) scala> val df = spark.createDataFrame(rdd, schema) df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] scala> df.show() ++ | c1| ++ |null| ++ scala> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. scala> spark.sql("select * from decimal_extra_precision;") res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] {code} h3. 
Root Cause The exception is being raised from [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): {code:java} private[sql] def toPrecision( precision: Int, scale: Int, roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP, nullOnOverflow: Boolean = true, context: SQLQueryContext = null): Decimal = { val copy = clone() if (copy.changePrecision(precision, scale, roundMode)) { copy } else { if (nullOnOverflow) { null } else { throw QueryExecutionErrors.cannotChangeDecimalPrecisionError( this, precision, scale, context) } } }{code} The above function is invoked from [toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756] (in Cast.scala). However, our attempt to insert {{333.22}} after setting {{spark.sql.ansi.enabled}} _to {{False}}_ failed as well (which may be an independent issue). was: h3. Describe the bug We are trying to store a DECIMAL value {{333.22}} with more precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This leads to a {{NULL}} value being stored if the table is created using DataFrames via {{{}spark-shell{}}}. However, it leads to the following exception if the table is created via {{{}spark-sql{}}}: {code:java} Failed in [insert into decimal_extra_precision select 333.22] java.lang.ArithmeticException: Decimal(expanded,333.22,21,10}) cannot be represented as Decimal(20, 10){code} h3. 
Step to reproduce:
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}, execute the following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;
{code}
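The null-versus-exception dichotomy that {{nullOnOverflow}} controls in {{Decimal.toPrecision}} can be sketched outside Spark. The following Python model is only an illustration of that branching, not Spark's actual code: the fit check (round to the target scale, then count total digits against the target precision) is our assumption about the intended semantics.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_precision(value: Decimal, precision: int, scale: int,
                 null_on_overflow: bool = True):
    """Toy model of Spark's Decimal.toPrecision: round to `scale`, then
    check the total digit count against `precision`. On overflow, return
    None (non-ANSI mode stores NULL) or raise (ANSI mode)."""
    rounded = value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    if len(rounded.as_tuple().digits) <= precision:
        return rounded
    if null_on_overflow:
        return None  # what the DataFrame write path stores as NULL
    raise ArithmeticError(
        f"{value} cannot be represented as Decimal({precision}, {scale})")

# 333.22 needs only 13 digits at scale 10, so under this model it fits
# DECIMAL(20,10), which is why the spark-sql ArithmeticException is surprising:
print(to_precision(Decimal("333.22"), 20, 10))          # 333.2200000000
# A value that genuinely overflows shows the non-ANSI failure mode:
print(to_precision(Decimal("12345678901.23"), 20, 10))  # None
```

Note that both interfaces go through this one function; the observed divergence is therefore about which value of {{nullOnOverflow}} (and which precision computation) each insertion path ends up using.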
[jira] [Created] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
xsys created SPARK-40439:

Summary: DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
Key: SPARK-40439
URL: https://issues.apache.org/jira/browse/SPARK-40439
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.1
Reporter: xsys

h3. Describe the bug
We are trying to store a DECIMAL value {{333.22}} with more precision than what is defined in the schema: {{DECIMAL(20,10)}}. This leads to a {{NULL}} value being stored if the table is created using DataFrames via {{spark-shell}}. However, it leads to the following exception if the table is created via {{spark-sql}}:
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) cannot be represented as Decimal(20, 10)
{code}
h3. Step to reproduce:
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}, execute the following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;
{code}
h3. Expected behavior
We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type & input combination ({{DECIMAL(20,10)}} and {{333.22}}).
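For context on why consistency matters here: SQL engines conventionally type an exact numeric literal by counting its digits, so {{333.22}} would be inferred as DECIMAL(5,2), which is losslessly representable as DECIMAL(20,10). A small Python sketch of that convention (our illustration of the usual rule, not Spark's parser):

```python
from decimal import Decimal

def infer_decimal_type(literal: str) -> tuple[int, int]:
    """Infer (precision, scale) of an exact numeric literal: scale is the
    number of digits after the decimal point, precision the total count
    of significant digits (at least `scale`)."""
    t = Decimal(literal).as_tuple()
    scale = max(-t.exponent, 0)
    precision = max(len(t.digits), scale)
    return precision, scale

print(infer_decimal_type("333.22"))  # (5, 2): fits DECIMAL(20,10) with room to spare
```

Under this convention neither a NULL nor an exception is obviously expected for this insert, which makes the divergence between the two interfaces more notable.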
Here is a simplified example in {{spark-shell}}, where insertion of the aforementioned decimal value evaluates to a {{NULL}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,DecimalType(20,10),true))

scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]

scala> df.show()
+----+
|  c1|
+----+
|null|
+----+

scala> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
h3.
Root Cause
The exception is raised from [Decimal.toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] ({{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled}} in [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551]):
{code:java}
private[sql] def toPrecision(
    precision: Int,
    scale: Int,
    roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
    nullOnOverflow: Boolean = true,
    context: SQLQueryContext = null): Decimal = {
  val copy = clone()
  if (copy.changePrecision(precision, scale, roundMode)) {
    copy
  } else {
    if (nullOnOverflow) {
      null
    } else {
      throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
        this, precision, scale, context)
    }
  }
}
{code}
The above function is invoked from the decimal branch of [Cast|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756] (in Cast.scala). However, our attempt to insert {{333.22}} after setting {{spark.sql.ansi.enabled}} to {{false}} failed as well (which may be an independent issue).

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql
[ https://issues.apache.org/jira/browse/SPARK-40409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40409:
-
Description:
h3. Describe the bug
We are trying to store a BYTE {{"-128"}} to a table created via Spark DataFrame. The table is created with the Avro file format. We encounter no errors while creating the table and inserting the aforementioned BYTE value. However, performing a SELECT query on the table through spark-sql results in an {{IncompatibleSchemaException}} as shown below:
{code:java}
2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>
{code}
h3. Step to reproduce
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}} with the Avro package:
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1
{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(("-128").toByte)))
val schema = new StructType().add(StructField("c1", ByteType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("byte_avro")
{code}
On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}} with the Avro package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1
{code}
Execute the following:
{code:java}
spark-sql> select * from byte_avro;
{code}
h3. Expected behavior
We expect the output of the {{SELECT}} query to be {{-128}}. Additionally, we expect the data type to be preserved (it is changed from BYTE/TINYINT to INT, hence the mismatch). We tried other formats like ORC and the outcome is consistent with this expectation.
Here are the logs from our attempt at doing the same with ORC:
{code:java}
scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc")
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist

scala> spark.sql("select * from byte_orc;")
res2: org.apache.spark.sql.DataFrame = [c1: tinyint]

scala> spark.sql("select * from byte_orc;").show(false)
+----+
|c1  |
+----+
|-128|
+----+
{code}
h3. Root Cause
h4. [AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119]
{code:java}
(catalystType, avroType.getType) match {
  case (NullType, NULL) =>
    (getter, ordinal) => null
  case (BooleanType, BOOLEAN) =>
    (getter, ordinal) => getter.getBoolean(ordinal)
  case (ByteType, INT) =>
    (getter, ordinal) => getter.getByte(ordinal).toInt
  case (ShortType, INT) =>
    (getter, ordinal) => getter.getShort(ordinal).toInt
  case (IntegerType, INT) =>
    (getter, ordinal) => getter.getInt(ordinal)
{code}
h4. [AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130]
{code:java}
(avroType.getType, catalystType) match {
  case (NULL, NullType) => (updater, ordinal, _) =>
    updater.setNullAt(ordinal)

  // TODO: we can avoid boxing if future version of avro provide primitive accessors.
  case (BOOLEAN, BooleanType) => (updater, ordinal, value) =>
    updater.setBoolean(ordinal, value.asInstanceOf[Boolean])
  case (INT, IntegerType) => (updater, ordinal, value) =>
    updater.setInt(ordinal, value.asInstanceOf[Int])
  case (INT, DateType) => (updater, ordinal, value) =>
    updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int]))
{code}
AvroSerializer converts Spark's ByteType into Avro's INT. Further, Spark's AvroDeserializer expects Avro's INT to map to Spark's IntegerType. The mismatch between the user-specified ByteType & the type AvroDeserializer expects (IntegerType) is the root cause of this issue.

was:
h3. Describe the bug
We are trying to store a BYTE {{"-128"}} to a table created via Spark DataFrame. The table is created with the Avro file format. We encounter no errors while creating the table and inserting the
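The write/read asymmetry between AvroSerializer and AvroDeserializer can be made concrete by lining up the two match tables. The following Python sketch is our simplification of the excerpted mappings (type names reduced to strings; illustrative only, not Spark's code):

```python
# Write path (AvroSerializer): Catalyst type -> Avro physical type.
SERIALIZER = {
    "NullType": "NULL",
    "BooleanType": "BOOLEAN",
    "ByteType": "INT",     # TINYINT is widened to Avro INT on write...
    "ShortType": "INT",
    "IntegerType": "INT",
}

# Read path (AvroDeserializer): the (Avro type, Catalyst type) pairs it accepts.
DESERIALIZER = {
    ("NULL", "NullType"),
    ("BOOLEAN", "BooleanType"),
    ("INT", "IntegerType"),  # ...but on read, INT only maps back to IntegerType.
    ("INT", "DateType"),
}

def survives_round_trip(catalyst_type: str) -> bool:
    """Can a column of this Catalyst type be written to Avro and then read
    back against its original table schema?"""
    return (SERIALIZER[catalyst_type], catalyst_type) in DESERIALIZER

print(survives_round_trip("IntegerType"))  # True
print(survives_round_trip("ByteType"))     # False -> IncompatibleSchemaException
```

Because Avro itself has no 8-bit integer type, widening on write is unavoidable; the gap is that the read path has no (INT, ByteType) case to narrow the value back.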
[jira] [Updated] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql
[ https://issues.apache.org/jira/browse/SPARK-40409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40409: - Description: h2. Describe the bug We are trying to store a BYTE {{"-128"}} to a table created via Spark DataFrame. The table is created with the Avro file format. We encounter no errors while creating the table and inserting the aforementioned BYTE value. However, performing a SELECT query on the table through spark-sql results in an {{IncompatibleSchemaException}} as shown below: {code:java} 2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields"$ [{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>{code} h2. Step to reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro package: {code:java} ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ val rdd = sc.parallelize(Seq(Row(("-128").toByte))) val schema = new StructType().add(StructField("c1", ByteType, true)) val df = spark.createDataFrame(rdd, schema) df.show(false) df.write.mode("overwrite").format("avro").saveAsTable("byte_avro"){code} On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro package: {code:java} ./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} spark-sql> select * from byte_avro;{code} h2. Expected behavior We expect the output of the {{SELECT}} query to be {{{}-128{}}}. Additionally, we expect the data type to be preserved (it is changed from BYTE/TINYINT to INT, hence the mismatch). We tried other formats like ORC and the outcome is consistent with this expectation. 
Here are the logs from our attempt at doing the same with ORC: {code:java} scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc") 2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist 2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manage r is set to instance of HiveAuthorizerFactory. 2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist scala> spark.sql("select * from byte_orc;") res2: org.apache.spark.sql.DataFrame = [c1: tinyint] scala> spark.sql("select * from byte_orc;").show(false) ++ |c1 | ++ |-128| ++ {code} h2. Root Cause h4. [AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119] {code:java} (catalystType, avroType.getType) match { case (NullType, NULL) => (getter, ordinal) => null case (BooleanType, BOOLEAN) => (getter, ordinal) => getter.getBoolean(ordinal) case (ByteType, INT) => (getter, ordinal) => getter.getByte(ordinal).toInt case (ShortType, INT) => (getter, ordinal) => getter.getShort(ordinal).toInt case (IntegerType, INT) => (getter, ordinal) => getter.getInt(ordinal){code} h4. [AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130] {code:java} (avroType.getType, catalystType) match { case (NULL, NullType) => (updater, ordinal, _) => updater.setNullAt(ordinal) // TODO: we can avoid boxing if future version of avro provide primitive accessors. 
case (BOOLEAN, BooleanType) => (updater, ordinal, value) => updater.setBoolean(ordinal, value.asInstanceOf[Boolean]) case (INT, IntegerType) => (updater, ordinal, value) => updater.setInt(ordinal, value.asInstanceOf[Int]) case (INT, DateType) => (updater, ordinal, value) => updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int])) {code} AvroSerializer converts Spark's ByteType into Avro's INT. Further, Spark's AvroDeserializer expects Avro's INT to map to Spark's IntegerType. The mismatch between user-specified ByteType & the type AvroDeserializer expects (IntegerType) is the root cause of this issue. was: h2. Describe the bug We are trying to store a BYTE {{"-128"}} to a table created via Spark DataFrame. The table is created with the Avro file format. We encounter no errors while creating the table and inserting the
[jira] [Updated] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql
[ https://issues.apache.org/jira/browse/SPARK-40409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-40409: - Description: h2. Describe the bug We are trying to store a BYTE {{"-128"}} to a table created via Spark DataFrame. The table is created with the Avro file format. We encounter no errors while creating the table and inserting the aforementioned BYTE value. However, performing a SELECT query on the table through spark-sql results in an {{IncompatibleSchemaException}} as shown below: {code:java} 2022-09-09 21:15:03,248 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields"$ [{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: TINYINT>{code} h3. Step to reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro package: {code:java} ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ val rdd = sc.parallelize(Seq(Row(("-128").toByte))) val schema = new StructType().add(StructField("c1", ByteType, true)) val df = spark.createDataFrame(rdd, schema) df.show(false) df.write.mode("overwrite").format("avro").saveAsTable("byte_avro"){code} On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro package: {code:java} ./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} spark-sql> select * from byte_avro;{code} h3. Expected behavior We expect the output of the {{SELECT}} query to be {{{}-128{}}}. Additionally, we expect the data type to be preserved (it is changed from BYTE/TINYINT to INT, hence the mismatch). We tried other formats like ORC and the outcome is consistent with this expectation. 
Here are the logs from our attempt at doing the same with ORC: {code:java} scala> df.write.mode("overwrite").format("orc").saveAsTable("byte_orc") 2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 2022-09-09 21:38:28,880 WARN conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist 2022-09-09 21:38:34,642 WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manage r is set to instance of HiveAuthorizerFactory. 2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 2022-09-09 21:38:34,716 WARN conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist scala> spark.sql("select * from byte_orc;") res2: org.apache.spark.sql.DataFrame = [c1: tinyint] scala> spark.sql("select * from byte_orc;").show(false) ++ |c1 | ++ |-128| ++ {code} h3. Root Cause h4. [AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L114-L119] {code:java} (catalystType, avroType.getType) match { case (NullType, NULL) => (getter, ordinal) => null case (BooleanType, BOOLEAN) => (getter, ordinal) => getter.getBoolean(ordinal) case (ByteType, INT) => (getter, ordinal) => getter.getByte(ordinal).toInt case (ShortType, INT) => (getter, ordinal) => getter.getShort(ordinal).toInt case (IntegerType, INT) => (getter, ordinal) => getter.getInt(ordinal){code} h4. [AvroDeserializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L121-L130] {code:java} (avroType.getType, catalystType) match { case (NULL, NullType) => (updater, ordinal, _) => updater.setNullAt(ordinal) // TODO: we can avoid boxing if future version of avro provide primitive accessors. 
case (BOOLEAN, BooleanType) => (updater, ordinal, value) => updater.setBoolean(ordinal, value.asInstanceOf[Boolean]) case (INT, IntegerType) => (updater, ordinal, value) => updater.setInt(ordinal, value.asInstanceOf[Int]) case (INT, DateType) => (updater, ordinal, value) => updater.setInt(ordinal, dateRebaseFunc(value.asInstanceOf[Int])) {code} AvroSerializer converts Spark's ByteType into Avro's INT, but AvroDeserializer only maps Avro's INT back to Spark's IntegerType (or DateType); there is no case for (INT, ByteType). This mismatch between the user-specified ByteType and the type AvroDeserializer expects (IntegerType) is the root cause of this issue.
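For context, the limitation is not in Avro's wire format: Avro has no single-byte integer type, and its smallest integer type, {{int}}, is stored as a zigzag-mapped base-128 varint, which holds the byte value -128 without loss. The following is an illustrative Python sketch of that encoding (the function names are our own; this is not Spark or Avro library code):

```python
# Illustrative sketch of Avro's binary encoding for the "int" type:
# zigzag-map the signed value, then emit it as a base-128 varint.
# Function names are ours; this is not Spark or Avro library code.

def zigzag(n: int) -> int:
    """Map a signed 32-bit int to an unsigned int (zigzag encoding)."""
    return (n << 1) ^ (n >> 31)

def varint(n: int) -> bytes:
    """Encode an unsigned int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def avro_int(n: int) -> bytes:
    """Avro binary encoding of an "int" value."""
    return varint(zigzag(n))

# The byte value -128 fits comfortably in Avro's int encoding:
print(avro_int(-128).hex())  # -> "ff01"
```

So the failure is purely in Spark's `(avroType, catalystType)` pattern match quoted above, not in how the value was serialized.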
[jira] [Created] (SPARK-40409) IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql
xsys created SPARK-40409: Summary: IncompatibleSchemaException when BYTE stored from DataFrame to Avro is read using spark-sql Key: SPARK-40409 URL: https://issues.apache.org/jira/browse/SPARK-40409 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.2.1 Reporter: xsys
[jira] [Updated] (SPARK-39158) A valid DECIMAL inserted by DataFrame cannot be read in HiveQL
[ https://issues.apache.org/jira/browse/SPARK-39158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-39158: - Description: h2. Describe the bug We are trying to save a table containing a {{DecimalType}} column, constructed through a Spark DataFrame, in the Avro data format. We also want to be able to query this table both from this Spark instance and directly from the Hive instance that Spark is using. Say that {{DecimalType(6, 3)}} is part of the schema. When we INSERT a valid value (e.g. {{BigDecimal("333.222")}}) via DataFrame and SELECT from the table in HiveQL, we expect it to give back the inserted value. However, we instead get an {{AvroTypeException}}. h2. To Reproduce On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}} with the Avro package: {code:java} ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val rdd = sc.parallelize(Seq(Row(BigDecimal("333.222")))) val schema = new StructType().add(StructField("c1", DecimalType(6,3), true)) val df = spark.createDataFrame(rdd, schema) df.show(false) // results in an error despite eventually printing the expected output df.write.mode("overwrite").format("avro").saveAsTable("ws") {code} {{df.show(false)}} will result in the following error before printing the expected output {{333.222}}: {code:java} java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional while compiling: during phase: globalPhase=terminal, enteringPhase=jvm library version: version 2.12.15 compiler version: version 2.12.15 reconstructed args: -classpath /Users/xsystem/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar:/Users/xsystem/.ivy2/jars/org.tukaani_xz-1.8.jar:/Users/xsystem/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar -Yrepl-class-based -Yrepl-outdir 
/private/var/folders/01/bm1ky3qj3sq7gb5f345nxlcmgn/T/spark-ed7aba34-997a-4950-9ea4-52c61c222660/repl-bd6bbf2b-5647-4306-a5d3-50cdc30fcbc0 last tree to typer: TypeTree(class Byte) tree position: line 6 of tree tpe: Byte symbol: (final abstract) class Byte in package scala symbol definition: final abstract class Byte extends (a ClassSymbol) symbol package: scala symbol owners: class Byte call site: constructor $eval in object $eval in package $line19 == Source file context for tree position == 3 4 object $eval { 5 lazy val $result = $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0 6 lazy val $print: _root_.java.lang.String = { 7 $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw 8 9 "" at scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185) at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525) at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514) at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353) at scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346) at scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348) at scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487) at scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799) at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805) at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28) at scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324) at scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342) at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645) at scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413) at scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357) at scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188) at scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357) at scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$run$1(UnPickler.scala:96) at scala.reflect.internal.pickling.UnPickler$Scan.run(UnPickler.scala:88) at
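For background on what the Avro side stores for a value like the {{DecimalType(6, 3)}} column above: Avro's decimal logical type keeps precision and scale in the schema and serializes only the unscaled integer as big-endian two's-complement bytes. The following Python sketch illustrates that representation (our own illustration, not Spark or Avro library code):

```python
from decimal import Decimal

# Illustrative sketch of Avro's "decimal" logical type: the value is stored
# as the big-endian two's-complement bytes of the unscaled integer, while
# precision/scale live in the schema. Not Spark or Avro library code.

def encode_decimal(value: Decimal, scale: int) -> bytes:
    unscaled = int(value.scaleb(scale))                 # 333.222 with scale 3 -> 333222
    length = max(1, (unscaled.bit_length() + 8) // 8)   # extra headroom for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)

print(encode_decimal(Decimal("333.222"), 3).hex())  # -> "0515a6"
```

Reading the value back therefore requires the schema's scale; a reader that sees only the bytes (or a mismatched schema) cannot reconstruct the decimal, which is why schema agreement between the Spark writer and the Hive reader matters here.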
[jira] [Created] (SPARK-39158) A valid DECIMAL inserted by DataFrame cannot be read in HiveQL
xsys created SPARK-39158: Summary: A valid DECIMAL inserted by DataFrame cannot be read in HiveQL Key: SPARK-39158 URL: https://issues.apache.org/jira/browse/SPARK-39158 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: xsys h2. Describe the bug We are trying to save a table containing a `DecimalType` column constructed through a Spark DataFrame with the `Avro` data format. We also want to be able to query this table both from this Spark instance as well as from the Hive instance that Spark is using directly. Say that `DecimalType(6, 3)` is part of the schema. When we `INSERT` some valid value (e.g. `BigDecimal("333.222")`) in DataFrame, and `SELECT` from the table in HiveQL, we expect it to give back the inserted value. However, we instead get an `AvroTypeException`. h2. To Reproduce On Spark 3.2.1 (commit `4f25b3f712`), using `spark-shell` with the Avro package: {code:java} ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val rdd = sc.parallelize(Seq(Row(BigDecimal("333.222" val schema = new StructType().add(StructField("c1", DecimalType(6,3), true)) val df = spark.createDataFrame(rdd, schema) df.show(false) // result in error despite correctly showing output in the end df.write.mode("overwrite").format("avro").saveAsTable("ws") {code} `df.show(false)` will result in the following error before printing out the expected output `333.222`: {code:java} java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional while compiling: during phase: globalPhase=terminal, enteringPhase=jvm library version: version 2.12.15 compiler version: version 2.12.15 reconstructed args: -classpath /Users/xsystem/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar:/Users/xsystem/.ivy2/jars/org.tukaani_xz-1.8.jar:/Users/xsystem/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar -Yrepl-class-based -Yrepl-outdir 
/private/var/folders/01/bm1ky3qj3sq7gb5f345nxlcmgn/T/spark-ed7aba34-997a-4950-9ea4-52c61c222660/repl-bd6bbf2b-5647-4306-a5d3-50cdc30fcbc0 last tree to typer: TypeTree(class Byte) tree position: line 6 of tree tpe: Byte symbol: (final abstract) class Byte in package scala symbol definition: final abstract class Byte extends (a ClassSymbol) symbol package: scala symbol owners: class Byte call site: constructor $eval in object $eval in package $line19 == Source file context for tree position == 3 4 object $eval { 5 lazy val $result = $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0 6 lazy val $print: _root_.java.lang.String = { 7 $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw 8 9 "" at scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185) at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525) at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514) at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353) at scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346) at scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348) at scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487) at scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799) at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805) at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28) at scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324) at scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342) at 
scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645) at scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413) at scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357) at scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188) at scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357) at
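The value in the reproducer above is well within the declared type, which is worth confirming independently: a decimal fits `DecimalType(6, 3)` when, after rescaling to scale 3, its total number of significant digits does not exceed 6. A minimal plain-JVM sketch (no Spark required; `DecimalFit` and `fits` are illustrative names, not Spark's API) of that check:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Plain-JVM sketch: why 333.222 is a valid value for DecimalType(6, 3).
// A decimal fits when, after rescaling to the declared scale, its precision
// (total significant digits) does not exceed the declared precision -- so
// the failure above is a read-path bug, not an out-of-range value.
public class DecimalFit {
    static boolean fits(BigDecimal value, int precision, int scale) {
        // Rescale first, analogous to how Spark normalizes to the declared scale.
        BigDecimal scaled = value.setScale(scale, RoundingMode.HALF_UP);
        return scaled.precision() <= precision;
    }

    public static void main(String[] args) {
        BigDecimal v = new BigDecimal("333.222");
        System.out.println(fits(v, 6, 3)); // precision 6, scale 3 -> fits
        System.out.println(fits(v, 5, 3)); // would need 6 digits, only 5 allowed
    }
}
```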
[jira] [Comment Edited] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531398#comment-17531398 ] xsys edited comment on SPARK-39075 at 5/3/22 8:34 PM: -- Thanks for the response, Erik. I understand the concern. On the other hand, it is in principle inconsistent and confusing that one can write a piece of data but cannot read it back via Spark/Avro; it is effectively data loss. Moreover, the DataFrame API enforces explicit type checks, so one can only write SHORT/BYTE-typed data into a SHORT/BYTE column. In this context, the downcast is safe, and it does not make sense that Avro's lack of SHORT/BYTE type support breaks DataFrame operations. The concern is valid when the source of the serialized data is unknown, in which case downcasting is potentially unsafe. One way to systematically address the issue is to determine whether Spark is the source of the serialized data, and permit the cast in that context. Because the SELECT API is used, the data is retrieved from a table through Hive or another supported Spark store, not from a standalone Avro file. We could then leverage Spark-specific metadata stored with the Hive table and provide this context to the deserializer. Alternatively, we could change the Spark schema type from SHORT/BYTE to INT, as SparkSQL does in the [HiveExternalCatalog|https://github.com/apache/spark/blob/4df8512b11dc9cc3a179fd5ccedf91af1f3fc6ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L821]. > IncompatibleSchemaException when selecting data from table stored from a > DataFrame in Avro format with BYTE/SHORT > - > > Key: SPARK-39075 > URL: https://issues.apache.org/jira/browse/SPARK-39075 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.2.1 > Reporter: xsys > Priority: Major > > h3. Describe the bug > We are trying to save a table constructed through a DataFrame with the > {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as > part of the schema. > When we {{INSERT}} some valid values (e.g. {{-128}}) and {{SELECT}} from > the table, we expect it to give back the inserted value. However, we instead > get an {{IncompatibleSchemaException}} from the {{AvroDeserializer}}.
> This appears to be caused by a missing case statement handling the {{(INT, > ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer > newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321]. > h3. To Reproduce > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the > Avro package: > {code:java} > ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code} > Execute the following: > {code:java} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val schema = new StructType().add(StructField("c1", ShortType, true)) > val rdd = sc.parallelize(Seq(Row("-128".toShort))) > val df = spark.createDataFrame(rdd, schema) > df.write.mode("overwrite").format("avro").saveAsTable("t0") > spark.sql("select * from t0;").show(false){code} > Resulting error: > {code:java} > 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) >
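The narrowing that the comment above argues is safe can be sketched in isolation: Avro stores SHORT/BYTE columns as `int`, and because the DataFrame writer only accepts in-range values, reading them back as `short`/`byte` cannot overflow. The helper below adds a range check anyway, as a defensive deserializer would; the names are illustrative, not Spark's actual `AvroDeserializer` API.

```java
// Sketch of a checked int -> short/byte downcast, the conversion missing
// from the (INT, ShortType) / (INT, ByteType) cases discussed above.
// Illustrative names; not Spark's AvroDeserializer code.
public class IntDowncast {
    static short toShortChecked(int v) {
        if (v < Short.MIN_VALUE || v > Short.MAX_VALUE)
            throw new ArithmeticException("int " + v + " out of SHORT range");
        return (short) v;
    }

    static byte toByteChecked(int v) {
        if (v < Byte.MIN_VALUE || v > Byte.MAX_VALUE)
            throw new ArithmeticException("int " + v + " out of BYTE range");
        return (byte) v;
    }

    public static void main(String[] args) {
        System.out.println(toShortChecked(-128)); // the reproducer's value
        System.out.println(toByteChecked(-128));  // -128 is BYTE's lower bound, so it fits
    }
}
```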
[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-39075: - Description: h3. Describe the bug We are trying to save a table constructed through a DataFrame with the {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as part of the schema. When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from the table, we expect it to give back the inserted value. However, we instead get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}. This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321]. h3. To Reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro package: {code:java} ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val schema = new StructType().add(StructField("c1", ShortType, true)) val rdd = sc.parallelize(Seq(Row("-128".toShort))) val df = spark.createDataFrame(rdd, schema) df.write.mode("overwrite").format("avro").saveAsTable("t0") spark.sql("select * from t0;").show(false){code} Resulting error: {code:java} 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[ {"name":"c1","type":["int","null"]} ]} to SQL type STRUCT<`c1`: SMALLINT>. 
at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102) at org.apache.spark.sql.avro.AvroDeserializer.(AvroDeserializer.scala:74) at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.(AvroFileFormat.scala:143) at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136) at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148) at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro field 'c1' to SQL field 'c1' because schema is incompatible (avroType = "int", sqlType = SMALLINT) at org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321) at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356) at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84) ... 26 more {code} h3. Expected behavior & Possible Solution We expect the output to successfully select {{{}-128{}}}. We tried other formats like Parquet and the outcome is consistent with this expectation. In the [{{AvroSerializer
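The write/read asymmetry described above can be sketched as a pair of lookup tables: on the write path the serializer widens SHORT/BYTE to Avro `int` (Avro has no 8/16-bit primitives), but the pre-fix read path only accepted Avro `int` back as IntegerType, so `(INT, ShortType)` and `(INT, ByteType)` fell through to `IncompatibleSchemaException`. The mapping and match logic below are illustrative, not Spark's actual code.

```java
import java.util.Map;

// Sketch of the reported asymmetry: SHORT/BYTE are written as Avro "int",
// but the read path (before a fix) only matches int back to IntegerType.
public class AvroTypeMapping {
    // Catalyst type name -> Avro schema type emitted on the write path (illustrative)
    static final Map<String, String> WRITE = Map.of(
        "ByteType", "int",
        "ShortType", "int",
        "IntegerType", "int",
        "LongType", "long");

    // Pre-fix read path: only the exact (int, IntegerType) pair succeeds.
    static boolean readable(String avroType, String catalystType) {
        return avroType.equals("int") && catalystType.equals("IntegerType");
    }

    public static void main(String[] args) {
        String avro = WRITE.get("ShortType");            // written as "int"
        System.out.println(avro);
        System.out.println(readable(avro, "ShortType")); // cannot be read back: the bug
    }
}
```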
[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-39075: - Description: h3. Describe the bug We are trying to save a table constructed through a DataFrame with the {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as part of the schema. When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from the table, we expect it to give back the inserted value. However, we instead get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}. This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321]. h3. To Reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro package: ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1\{{}} Execute the following: import org.apache.spark.sql.\{Row, SparkSession} import org.apache.spark.sql.types._ val schema = new StructType().add(StructField("c1", ShortType, true)) val rdd = sc.parallelize(Seq(Row("-128".toShort))) val df = spark.createDataFrame(rdd, schema) df.write.mode("overwrite").format("avro").saveAsTable("t0") spark.sql("select * from t0;").show(false)\{{}} Resulting error: 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[ {"name":"c1","type":["int","null"]} ]} to SQL type STRUCT<`c1`: SMALLINT>. 
{code:java}
at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:74)
at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.<init>(AvroFileFormat.scala:143)
at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro field 'c1' to SQL field 'c1' because schema is incompatible (avroType = "int", sqlType = SMALLINT)
at org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321)
at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356)
at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84)
... 26 more{code}
h3. Expected behavior & Possible Solution We expect the output to successfully select {{-128}}. We tried other formats like Parquet, and the outcome is consistent with this expectation. In the [{{AvroSerializer
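As a hedged sketch of the suspected root cause (not the actual patch): Avro has no 1- or 2-byte integer primitive, so {{ByteType}}/{{ShortType}} columns are written as Avro {{INT}}, and reading them back presumably requires branches in {{AvroDeserializer.newWriter}} that narrow the {{Int}} to the declared Catalyst type. The self-contained snippet below illustrates only that narrowing; {{NarrowingDemo}} and its method names are hypothetical, not Spark APIs.

```scala
// Hypothetical sketch: the missing branches would conceptually look like
//   case (INT, ShortType) => updater.setShort(ordinal, value.asInstanceOf[Int].toShort)
//   case (INT, ByteType)  => updater.setByte(ordinal, value.asInstanceOf[Int].toByte)
// The narrowing itself is lossless for values that originated as Short/Byte:
object NarrowingDemo {
  def intToShort(v: Int): Short = v.toShort
  def intToByte(v: Int): Byte = v.toByte

  def main(args: Array[String]): Unit = {
    // -128 round-trips through Avro's INT representation without loss
    assert(intToShort(-128) == (-128: Short))
    assert(intToByte(-128) == (-128: Byte))
    println("ok")
  }
}
```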
[jira] [Created] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
xsys created SPARK-39075: Summary: IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT Key: SPARK-39075 URL: https://issues.apache.org/jira/browse/SPARK-39075 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: xsys