[jira] [Updated] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

xsys (Jira) Tue, 18 Oct 2022 10:17:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


xsys updated SPARK-40637:
-------------------------
    Description: 
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) via 
{{spark-shell}} and reading it back displays {{[01]}}. However, the value is not 
encoded correctly when it is inserted into a BINARY column of a table via 
{{spark-sql}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[356] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
+----+
|c1  |
+----+
|[01]|
+----+
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
+----+
|c1  |
+----+
|[01]|
+----+{code}
 

Using {{spark-sql}} (we use {{tee}} to copy the output to a log file):
{code:java}
$SPARK_HOME/bin/spark-sql | tee sql.log{code}
Execute the following. The query result appears as an empty line in the terminal:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;

Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
In the log file, the value shows up as a garbage character. (We never 
encountered this garbage character in the logs for other data types.)

!image-2022-10-18-12-15-05-576.png!
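
One possible reading of the two outputs is that the same byte is stored in both cases but rendered differently. The following is a minimal, Spark-free sketch under that assumption, which this report does not confirm. One helper prints the bytes as hex, similar to the {{[01]}} shown by {{df.show}}, and the other decodes them as UTF-8 text, which for the single byte 0x01 gives a non-printable control character. The names {{BinaryRendering}}, {{asHex}} and {{asText}} are hypothetical and are not Spark APIs.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helpers for illustration only; not part of Spark.
object BinaryRendering {
  // Render each byte as two uppercase hex digits inside brackets.
  def asHex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString("[", " ", "]")

  // Decode the raw bytes as UTF-8 text (assumed raw rendering).
  def asText(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    val bytes = BigInt("1").toByteArray
    println(asHex(bytes))        // [01]
    val text = asText(bytes)
    println(text.length)         // 1
    println(text.head.isControl) // true: U+0001 is a control character
  }
}
```

If this assumption holds, an apparently empty terminal line and a garbage character in the log would both be consistent with a single control character being printed.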
h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} and {{spark-shell}}) to 
behave consistently for the same combination of data type ({{BINARY}}) and 
input ({{BigInt("1").toByteArray}} / {{X'01'}}).
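
To make the "same input" claim concrete, here is a small sketch showing that the two inputs denote the same single byte. The helper {{hexToBytes}} is hypothetical. It parses only the hex digits of a literal such as {{X'01'}} and is not Spark's literal parser.

```scala
// Hypothetical helper for illustration only; not part of Spark.
object SameInput {
  // Convert a string of hex digits (e.g. "01") into the bytes it denotes.
  def hexToBytes(hex: String): Array[Byte] = {
    require(hex.length % 2 == 0, "hex string must have an even number of digits")
    hex.grouped(2).map(pair => Integer.parseInt(pair, 16).toByte).toArray
  }

  def main(args: Array[String]): Unit = {
    val fromScala = BigInt("1").toByteArray // input used in spark-shell
    val fromSql = hexToBytes("01")          // digits of the X'01' literal
    println(fromScala.sameElements(fromSql)) // true
  }
}
```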

 
h3. Additional context

We also tried Avro and Parquet and encountered the same issue. We believe this 
is format-independent.

  was:
h3. Describe the bug

Storing a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) via 
{{spark-shell}} outputs {{[01]}}. However, it does not encode correctly if 
the value is inserted into a BINARY column of a table via {{spark-sql}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[356] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,BinaryType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: binary]
scala> df.show(false)
+----+
|c1  |
+----+
|[01]|
+----+
scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals")
scala> spark.sql("select * from binary_vals;").show(false)
+----+
|c1  |
+----+
|[01]|
+----+{code}
 

Using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
 Execute the following, we only get an empty output:
{code:java}
spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
spark-sql> insert into binary_vals select X'01';
spark-sql> select * from binary_vals;
Time taken: 0.077 seconds, Fetched 1 row(s)
{code}
h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to 
behave consistently for the same data type ({{BINARY}}) & input 
({{BigInt("1").toByteArray}} / {{X'01'}}) combination.

 
h3. Additional context

We tried Avro and Parquet and encountered the same issue. We believe this is 
format-independent.


> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --------------------------------------------------------------
>
>                 Key: SPARK-40637
>                 URL: https://issues.apache.org/jira/browse/SPARK-40637
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>         Attachments: image-2022-10-18-12-15-05-576.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
