Chirag Sanghvi created SPARK-44308:
--------------------------------------

             Summary: Spark 3.0.1 functions.scala -> posexplode_outer API not flattening data
                 Key: SPARK-44308
                 URL: https://issues.apache.org/jira/browse/SPARK-44308
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 3.0.1
            Reporter: Chirag Sanghvi
The Spark 3.x API functions.scala -> posexplode_outer does not flatten an array column as expected when the table is created with "collection.delim" set to a non-default value. This worked as expected in Spark 2.4.5.

Use the DDL below to create a Hive table:

CREATE EXTERNAL TABLE `testnorm2`(
  `enquiryuid` string,
  `rulestriggered` array<string>)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'collection.delim'=';',
  'field.delim'='|',
  'line.delim'='\n',
  'serialization.format'='|')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Then populate the table with sample array values:

INSERT INTO testnorm2 SELECT 'A', array('a','b');
INSERT INTO testnorm2 SELECT 'B', array('e','f','g','h');
INSERT INTO testnorm2 SELECT 'C', array();
INSERT INTO testnorm2 SELECT 'D', array('');
INSERT INTO testnorm2 SELECT 'E', array('','');
INSERT INTO testnorm2 SELECT 'F', array('','1','2');
INSERT INTO testnorm2 SELECT 'G', array(null);
INSERT INTO testnorm2 SELECT 'H', array(null,'');
INSERT INTO testnorm2 SELECT 'I', array(null,'4','5','6');
INSERT INTO testnorm2 SELECT 'G', array("");

Open a Spark shell (in Spark 3.0.1) and run the Scala statements below:

val df = spark.sql("select * from testnorm2")

df.show() gives this output in both cases (Spark 2.4 and Spark 3.0.1):
+----------+------------+
|enquiryuid|        data|
+----------+------------+
|         I| [, 4, 5, 6]|
|         F|    [, 1, 2]|
|         B|[e, f, g, h]|
|         A|      [a, b]|
|         H|        [, ]|
|         E|        [, ]|
|         G|        null|
|         G|          []|
|         D|          []|
|         C|          []|
+----------+------------+

val explodeDF = df.select($"enquiryuid", posexplode_outer($"data"))

Running this produces different output on Spark 2.4 and Spark 3.0.1. On 2.4.x the output is:

+----------+----+----+
|enquiryuid| pos| col|
+----------+----+----+
|         I|   0|null|
|         I|   1|   4|
|         I|   2|   5|
|         I|   3|   6|
|         F|   0|    |
|         F|   1|   1|
|         F|   2|   2|
|         B|   0|   e|
|         B|   1|   f|
|         B|   2|   g|
|         B|   3|   h|
|         A|   0|   a|
|         A|   1|   b|
|         H|   0|null|
|         H|   1|    |
|         E|   0|    |
|         E|   1|    |
|         G|null|null|
|         G|null|null|
|         D|null|null|
+----------+----+----+

Whereas in 3.x the output is:

+----------+----+--------+
|enquiryuid| pos|     col|
+----------+----+--------+
|         I|   0|\N,4,5,6|
|         F|   0|    ,1,2|
|         1|   0|     a,b|
|         C|null|    null|
|         G|null|    null|
|         B|   0| e,f,g,h|
|         H|   0|     \N,|
|         G|null|    null|
|         E|   0|       ,|
|         D|null|    null|
+----------+----+--------+

The array in the second column is not flattened in Spark 3.0.1, whereas in Spark 2.4.5 it is.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)