[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan updated SPARK-22306:
--------------------------------
    Fix Version/s: 2.3.0

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> -----------------------------------------------------------------------
>
>                 Key: SPARK-22306
>                 URL: https://issues.apache.org/jira/browse/SPARK-22306
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Hive 2.3.0 (PostgreSQL metastore, stored as Parquet)
>                      Spark 2.2.0
>            Reporter: David Malinge
>            Assignee: Wenchen Fan
>            Priority: Critical
>             Fix For: 2.2.1, 2.3.0
>
>
> I noticed some critical changes to my Hive tables and realized that they were caused by a simple SELECT in Spark SQL. Looking at the logs, I found that this SELECT was actually performing an update on the metastore ("Saving case-sensitive schema for table").
> It turns out that Spark 2.2.0 introduces a new default value for spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE.
> The problem is that this update overwrites critical metadata of the table; in particular it:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name]
> clustered by ([col1], [col2])
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. The Spark 2.3 (master) branch is unaffected thanks to SPARK-17729, so this is a regression in Spark 2.2 only. By default, Parquet Hive tables are affected, and only Hive suffers from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:            10
> Bucket Columns:         [a, b]
> Sort Columns:           [Order(col:a, order:1), Order(col:b, order:1)]
>
> scala> sql("SELECT * FROM t").show(false)
>
> hive> DESC FORMATTED t;
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
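
[Editor's note] The NEVER_INFER workaround described in the issue can be applied per session rather than cluster-wide. A minimal sketch, assuming a Spark 2.2.0 SQL session (the property name and value are taken from the issue description; setting it via --conf at submit time should be equivalent):

{code:sql}
-- Workaround for Spark 2.2.0: disable schema inference so the
-- Hive metastore metadata is never rewritten (see SPARK-20888).
SET spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER;
{code}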
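
[Editor's note] Applied to the reproduction table above, the manual repair template from the issue would look as follows (a sketch only; the column names, sort order, and bucket count are copied from the CREATE TABLE statement in the reproduction):

{code:sql}
-- Restore the bucketing and sorting metadata that the SELECT wiped out.
ALTER TABLE t
CLUSTERED BY (a, b)
SORTED BY (a, b)
INTO 10 BUCKETS;
{code}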