[ https://issues.apache.org/jira/browse/SPARK-32228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kernel Force updated SPARK-32228: --------------------------------- Summary: Partition column of hive table was capitalized and stored on HDFS (was: Partition column of hive table was capitalized while stored on HDFS) > Partition column of hive table was capitalized and stored on HDFS > ----------------------------------------------------------------- > > Key: SPARK-32228 > URL: https://issues.apache.org/jira/browse/SPARK-32228 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: Hadoop 2.7.7 > Hive 2.3.6 > Spark 3.0.0 > Reporter: Kernel Force > Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > Suppose we have a target hive table to be insert by spark with dynamic > partition feature on. > {code:sql} > CREATE TABLE DEMO_PART ( > ID VARCHAR(10), > NAME VARCHAR(10) > ) PARTITIONED BY (BATCH DATE, TEAM VARCHAR(10)) > STORED AS ORC; > {code} > And have a source data table like: > {code:sql} > 0: jdbc:hive2://HOSTNAME:10000> SELECT T.* FROM DEMO_DATA T; > +-------+---------+-------------+---------+ > | t.id | t.name | t.batch | t.team | > +-------+---------+-------------+---------+ > | 1 | mike | 2020-07-08 | A | > | 2 | john | 2020-07-07 | B | > +-------+---------+-------------+---------+ > 2 rows selected (0.177 seconds) > {code} > Then doing join operation against an exploded view and insert the result into > DEMO_PART table: > {code:sql} > sql(""" > WITH VA AS ( > SELECT ARRAY_REPEAT(1,10) A > ), > VB AS ( > SELECT EXPLODE(T.A) IDX FROM VA T > ), > VC AS ( > SELECT ROW_NUMBER() OVER(ORDER BY NULL) RN FROM VB T > ), > VD AS ( > SELECT T.RN, DATE_ADD(TO_DATE('2020-07-01','yyyy-MM-dd'),T.RN) DT FROM VC T > ), > VE AS ( > SELECT T.DT BATCH, T.RN ID, CASE WHEN T.RN > 5 THEN 'A' ELSE 'B' END TEAM > FROM VD T > ) > SELECT T.BATCH BATCH, S.ID ID, S.NAME NAME, S.TEAM TEAM FROM VE T > INNER JOIN DEMO_DATA S > ON T.TEAM = S.TEAM > """). > selectExpr(spark.table("DEMO_PART").columns:_*). > write.mode("overwrite").insertInto("DEMO_PART") > {code} > The result could NOT be read by hive beeline: > {code:sql} > 0: jdbc:hive2://HOSTNAME:10000> SELECT T.* FROM DEMO_PART T; > +-------+---------+----------+---------+ > | t.id | t.name | t.batch | t.team | > +-------+---------+----------+---------+ > +-------+---------+----------+---------+ > No rows selected (0.268 seconds) > {code} > Because the underlying data stored in HDFS was uncorrect: > {code:bash} > [user@HOSTNAME ~]$ dfs -ls /user/hive/warehouse/demo_part/ > > Found 21 items > /user/hive/warehouse/demo_part/BATCH=2020-07-02 > /user/hive/warehouse/demo_part/BATCH=2020-07-03 > /user/hive/warehouse/demo_part/BATCH=2020-07-04 > /user/hive/warehouse/demo_part/BATCH=2020-07-05 > /user/hive/warehouse/demo_part/BATCH=2020-07-06 > /user/hive/warehouse/demo_part/BATCH=2020-07-07 > /user/hive/warehouse/demo_part/BATCH=2020-07-08 > /user/hive/warehouse/demo_part/BATCH=2020-07-09 > /user/hive/warehouse/demo_part/BATCH=2020-07-10 > /user/hive/warehouse/demo_part/BATCH=2020-07-11 > /user/hive/warehouse/demo_part/_SUCCESS > /user/hive/warehouse/demo_part/batch=2020-07-02 > /user/hive/warehouse/demo_part/batch=2020-07-03 > /user/hive/warehouse/demo_part/batch=2020-07-04 > /user/hive/warehouse/demo_part/batch=2020-07-05 > /user/hive/warehouse/demo_part/batch=2020-07-06 > /user/hive/warehouse/demo_part/batch=2020-07-07 > /user/hive/warehouse/demo_part/batch=2020-07-08 > /user/hive/warehouse/demo_part/batch=2020-07-09 > /user/hive/warehouse/demo_part/batch=2020-07-10 > /user/hive/warehouse/demo_part/batch=2020-07-11 > {code} > Both "BATCH=XXXX" and "batch=XXXX" directories appeared, and the data files > was stored in "BATCH" directories but not "batch" > The result will be correct if I change the SQL statement, simply change the > column alias to lower case in the last select, like: > {code:sql} > SELECT T.BATCH batch, S.ID id, S.NAME name, S.TEAM team FROM VE T > INNER JOIN DEMO_DATA S > ON T.TEAM = S.TEAM > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org