Chirag Sanghvi created SPARK-44308:
--------------------------------------

             Summary: Spark 3.0.1 functions.scala -> posexplode_outer API not flattening data
                 Key: SPARK-44308
                 URL: https://issues.apache.org/jira/browse/SPARK-44308
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 3.0.1
            Reporter: Chirag Sanghvi
The Spark 3.x API functions.scala -> posexplode_outer does not flatten an array column as expected when the table is created with "collection.delim" set to a non-default value. This worked as expected in Spark 2.4.5.

Use the DDL below to create a Hive table:

CREATE EXTERNAL TABLE `testnorm2`(
  `enquiryuid` string,
  `rulestriggered` array<string>)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'collection.delim'=';',
  'field.delim'='|',
  'line.delim'='\n',
  'serialization.format'='|')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Then populate the table with sample array values:

INSERT INTO testnorm2 SELECT 'A', array('a','b');
INSERT INTO testnorm2 SELECT 'B', array('e','f','g','h');
INSERT INTO testnorm2 SELECT 'C', array();
INSERT INTO testnorm2 SELECT 'D', array('');
INSERT INTO testnorm2 SELECT 'E', array('','');
INSERT INTO testnorm2 SELECT 'F', array('','1','2');
INSERT INTO testnorm2 SELECT 'G', array(null);
INSERT INTO testnorm2 SELECT 'H', array(null,'');
INSERT INTO testnorm2 SELECT 'I', array(null,'4','5','6');
INSERT INTO testnorm2 SELECT 'G', array("");

Open a Spark shell (in Spark 3.0.1) and run the Scala statements below:

val df = spark.sql("select * from testnorm2")

df.show() gives this output in both cases (Spark 2.4 and Spark 3.0.1):
+----------+------------+
|enquiryuid|        data|
+----------+------------+
|         I| [, 4, 5, 6]|
|         F|    [, 1, 2]|
|         B|[e, f, g, h]|
|         A|      [a, b]|
|         H|        [, ]|
|         E|        [, ]|
|         G|        null|
|         G|          []|
|         D|          []|
|         C|          []|
+----------+------------+

val explodeDF = df.select($"enquiryuid", posexplode_outer($"data"))

Running this produces different output on Spark 2.4 and Spark 3.0.1. On 2.4.x the output is:

+----------+----+----+
|enquiryuid| pos| col|
+----------+----+----+
|         I|   0|null|
|         I|   1|   4|
|         I|   2|   5|
|         I|   3|   6|
|         F|   0|    |
|         F|   1|   1|
|         F|   2|   2|
|         B|   0|   e|
|         B|   1|   f|
|         B|   2|   g|
|         B|   3|   h|
|         A|   0|   a|
|         A|   1|   b|
|         H|   0|null|
|         H|   1|    |
|         E|   0|    |
|         E|   1|    |
|         G|null|null|
|         G|null|null|
|         D|null|null|
+----------+----+----+

Whereas in 3.x the output is:

+----------+----+--------+
|enquiryuid| pos|     col|
+----------+----+--------+
|         I|   0|\N,4,5,6|
|         F|   0|    ,1,2|
|         1|   0|     a,b|
|         C|null|    null|
|         G|null|    null|
|         B|   0| e,f,g,h|
|         H|   0|     \N,|
|         G|null|    null|
|         E|   0|       ,|
|         D|null|    null|
+----------+----+--------+

The array in the second column is not flattened in Spark 3.0.1, whereas in Spark 2.4.5 it is.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)