[ https://issues.apache.org/jira/browse/SPARK-30769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036686#comment-17036686 ]
Hyukjin Kwon commented on SPARK-30769:
--------------------------------------
Please avoid setting Critical+, which is reserved for committers. Are you able to show a full and self-contained reproducer?

> insertInto() with existing column as partition key causes weird partition
> results
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-30769
>                 URL: https://issues.apache.org/jira/browse/SPARK-30769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: EMR 5.29.0 with Spark 2.4.4
>            Reporter: Woong Seok Kang
>            Priority: Major
>
> {code:java}
> val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
> val writer = TableWriter.getWriter(
>   tableDF.withColumn(config.dateColumn, typedLit[String](date.toString)))
>
> if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned"))
>   writer.insertInto(tableName)
> else
>   writer.partitionBy(config.dateColumn).saveAsTable(tableName)
> {code}
> This code checks whether the table exists at the desired path (somewhere in S3 in this case). If the table already exists at the path, a new partition is inserted with insertInto(); otherwise the table is created with saveAsTable().
> If config.dateColumn does not exist in the table schema, no problem occurs (the new column is simply added). But if it already exists in the schema, Spark does not use the given column as the partition key; instead it creates hundreds of partitions. Below is a part of the Spark logs.
> (Note that the name of the partition column is date_ymd, which already exists in the source table;
> the original value is a date string like '2020-01-01'.)
> 20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://{my_path_at_s3}_partitioned_test/date_ymd=174
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://{my_path_at_s3}_partitioned_test/date_ymd=62
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://{my_path_at_s3}_partitioned_test/date_ymd=83
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://{my_path_at_s3}_partitioned_test/date_ymd=231
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://{my_path_at_s3}_partitioned_test/date_ymd=268
> 20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://{my_path_at_s3}_partitioned_test/date_ymd=33
> 20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://{my_path_at_s3}_partitioned_test/date_ymd=40
> rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
> When I use a different partition key that is not in the table schema, such as 'stamp_date', everything works fine. I'm not sure whether this is a Spark bug; I just wrote the report. (I think it is related to Hive...)
> Thanks for reading!
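For reference, the symptom in the logs (partition values like date_ymd=174 instead of date strings) is consistent with a documented behavior of DataFrameWriter.insertInto: unlike saveAsTable, insertInto ignores column names and uses position-based resolution, so if the DataFrame's column order does not match the target table's schema (with partition columns last), the partition column silently receives another column's values. A possible workaround is to reorder the DataFrame to the table's column order before inserting. The sketch below assumes the reporter's tableDF, config, date, and tableName from the snippet above, plus a SparkSession named spark; it is an illustration of the idea, not a verified fix for this ticket.

{code:java}
import org.apache.spark.sql.functions.{col, typedLit}

val withDate = tableDF.withColumn(config.dateColumn, typedLit[String](date.toString))

if (spark.catalog.tableExists(tableName)) {
  // insertInto matches columns by position, not by name, so align the
  // DataFrame's column order with the existing table's schema. This puts
  // the partition column (date_ymd) where the table expects it instead of
  // letting it pick up values from an unrelated column.
  val targetOrder = spark.table(tableName).columns
  withDate.select(targetOrder.map(col): _*).write.insertInto(tableName)
} else {
  // On first creation, saveAsTable resolves columns by name, so no
  // reordering is needed; partitionBy declares the partition column.
  withDate.write.partitionBy(config.dateColumn).saveAsTable(tableName)
}
{code}

When the column order differs, an alternative is a name-based write (e.g. saveAsTable with mode("append")), at the cost of different table-creation semantics.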
--
This message was sent by Atlassian Jira
(v8.3.4#803005)