[ https://issues.apache.org/jira/browse/SPARK-30769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036686#comment-17036686 ]
Hyukjin Kwon commented on SPARK-30769:
--------------------------------------
Please avoid setting Critical+, which is reserved for committers. Are you able to show a full and self-contained reproducer?

> insertInto() with existing column as partition key causes weird partition
> results
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-30769
>                 URL: https://issues.apache.org/jira/browse/SPARK-30769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: EMR 5.29.0 with Spark 2.4.4
>            Reporter: Woong Seok Kang
>            Priority: Major
>
> {code:java}
> val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
> val writer = TableWriter.getWriter(
>   tableDF.withColumn(config.dateColumn, typedLit[String](date.toString)))
>
> if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned"))
>   writer.insertInto(tableName)
> else
>   writer.partitionBy(config.dateColumn).saveAsTable(tableName)
> {code}
> This code checks whether the table exists at the desired path (somewhere in S3 in this case). If the table already exists at the path, a new partition is inserted with insertInto(); otherwise the table is created with saveAsTable().
> If config.dateColumn does not exist in the table schema, no problem occurs (the new column is simply added). But if it already exists in the schema, Spark does not use the given column as the partition key; instead it creates hundreds of partitions. Below is a part of the Spark logs.
> (Note that the name of the partition column is date_ymd, which already exists in the source table;
> the original value is a date string like '2020-01-01'.)
> 20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://{my_path_at_s3}_partitioned_test/date_ymd=174
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://{my_path_at_s3}_partitioned_test/date_ymd=62
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://{my_path_at_s3}_partitioned_test/date_ymd=83
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://{my_path_at_s3}_partitioned_test/date_ymd=231
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://{my_path_at_s3}_partitioned_test/date_ymd=268
> 20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://{my_path_at_s3}_partitioned_test/date_ymd=33
> 20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://{my_path_at_s3}_partitioned_test/date_ymd=40
> rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
> When I use a different partition key that is not in the table schema, such as 'stamp_date', everything works fine. I'm not sure whether this is a Spark bug; I just wrote the report. (I think it is related to Hive...)
> Thanks for reading!
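For reference, the symptom in the logs (partition values like date_ymd=174 instead of date strings) is consistent with a documented behavior of DataFrameWriter.insertInto: unlike saveAsTable, insertInto ignores column names and uses position-based resolution, so if the DataFrame's column order does not match the target table's schema (with partition columns last), the partition column silently receives another column's values. A possible workaround is to reorder the DataFrame to the table's column order before inserting. The sketch below assumes the reporter's tableDF, config, date, and tableName from the snippet above, plus a SparkSession named spark; it is an illustration of the idea, not a verified fix for this ticket.

{code:java}
import org.apache.spark.sql.functions.{col, typedLit}

val withDate = tableDF.withColumn(config.dateColumn, typedLit[String](date.toString))

if (spark.catalog.tableExists(tableName)) {
  // insertInto matches columns by position, not by name, so align the
  // DataFrame's column order with the existing table's schema. This puts
  // the partition column (date_ymd) where the table expects it instead of
  // letting it pick up values from an unrelated column.
  val targetOrder = spark.table(tableName).columns
  withDate.select(targetOrder.map(col): _*).write.insertInto(tableName)
} else {
  // On first creation, saveAsTable resolves columns by name, so no
  // reordering is needed; partitionBy declares the partition column.
  withDate.write.partitionBy(config.dateColumn).saveAsTable(tableName)
}
{code}

When the column order differs, an alternative is a name-based write (e.g. saveAsTable with mode("append")), at the cost of different table-creation semantics.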
--
This message was sent by Atlassian Jira
(v8.3.4#803005)