[ https://issues.apache.org/jira/browse/SPARK-41982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-41982: ------------------------------------ Assignee: (was: Apache Spark) > When the inserted partition type is of string type, similar `dt=01` will be > converted to `dt=1` > ----------------------------------------------------------------------------------------------- > > Key: SPARK-41982 > URL: https://issues.apache.org/jira/browse/SPARK-41982 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.4.0 > Reporter: jingxiong zhong > Priority: Critical > > At present, during the process of upgrading Spark2.4 to Spark3.2, we > carefully read the migration documentwe and found a kind of situation not > involved: > {code:java} > create table if not exists test_90(a string, b string) partitioned by (dt > string); > desc formatted test_90; > // case1 > insert into table test_90 partition (dt=05) values("1","2"); > // case2 > insert into table test_90 partition (dt='05') values("1","2"); > drop table test_90;{code} > in spark2.4.3, it will generate such a path: > {code:java} > // the path > hdfs://test5/user/hive/db1/test_90/dt=05 > //result > spark-sql> select * from test_90; > 1 2 05 > 1 2 05 > Time taken: 1.316 seconds, Fetched 2 row(s) > spark-sql> show partitions test_90; > dt=05 > Time taken: 0.201 seconds, Fetched 1 row(s) > spark-sql> select * from test_90 where dt='05'; > 1 2 05 > 1 2 05 > Time taken: 0.212 seconds, Fetched 2 row(s) > spark-sql> explain insert into table test_90 partition (dt=05) > values("1","2"); > == Physical Plan == > Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, > [a, b] > +- LocalTableScan [a#116, b#117] > Time taken: 1.145 seconds, Fetched 1 row(s){code} > in spark3.2.0, it will generate two path: > {code:java} > // the path > hdfs://test5/user/hive/db1/test_90/dt=05 > hdfs://test5/user/hive/db1/test_90/dt=5 > // result > spark-sql> select * from test_90; > 1 2 05 > 1 2 5 > Time taken: 2.119 seconds, Fetched 2 row(s) > spark-sql> show partitions test_90; > dt=05 > dt=5 > Time taken: 0.161 seconds, Fetched 2 row(s) > spark-sql> select * from test_90 where dt='05'; > 1 2 05 > Time taken: 0.252 seconds, Fetched 1 row(s) > spark-sql> explain insert into table test_90 partition (dt=05) > values("1","2"); > plan > == Physical Plan == > Execute InsertIntoHiveTable `db1`.`test_90`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b] > +- LocalTableScan [a#109, b#110]{code} > This will cause problems in reading data after the user switches to spark3. > The root cause is that in the process of partition field resolution, Spark3 > has a process of strongly converting this string type, which will cause > partition `05` to lose the previous `0` > So I think we have two solutions: > one is to record the risk clearly in the migration document, and the other is > to repair this case, because we internally keep the partition of string type > as string type, regardless of whether single or double quotation marks are > added. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org