Bruce Robbins created SPARK-32281:
-------------------------------------

             Summary: Spark wipes out SORTED spec in metastore when DESC is used
                 Key: SPARK-32281
                 URL: https://issues.apache.org/jira/browse/SPARK-32281
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Bruce Robbins
When altering a Hive bucketed table or updating its statistics, Spark will wipe out the SORTED specification in the metastore if the specification uses DESC. For example:
{noformat}
0: jdbc:hive2://localhost:10000> -- in beeline
0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
No rows affected (0.045 seconds)
0: jdbc:hive2://localhost:10000> show create table bucketed;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed`(                           |
| `a` int,                                           |
| `b` int,                                           |
| `c` int,                                           |
| `d` int)                                           |
| CLUSTERED BY (                                     |
| c)                                                 |
| SORTED BY (                                        |
| c ASC,                                             |
| d DESC)                                            |
| INTO 10 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT                              |
| 'org.apache.hadoop.mapred.TextInputFormat'         |
| OUTPUTFORMAT                                       |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
| 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
| TBLPROPERTIES (                                    |
| 'transient_lastDdlTime'='1594488043')              |
+----------------------------------------------------+
21 rows selected (0.042 seconds)
0: jdbc:hive2://localhost:10000>

- - -

scala> // in spark
scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
res0: org.apache.spark.sql.DataFrame = []

scala>

- - -

0: jdbc:hive2://localhost:10000> -- back in beeline
0: jdbc:hive2://localhost:10000> show create table bucketed;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed`(                           |
| `a` int,                                           |
| `b` int,                                           |
| `c` int,                                           |
| `d` int)                                           |
| CLUSTERED BY (                                     |
| c)                                                 |
| INTO 10 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT                              |
| 'org.apache.hadoop.mapred.TextInputFormat'         |
| OUTPUTFORMAT                                       |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
| 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
| TBLPROPERTIES (                                    |
| 'foo'='bar',                                       |
| 'spark.sql.partitionProvider'='catalog',           |
| 'transient_lastDdlTime'='1594488098')              |
+----------------------------------------------------+
20 rows selected (0.038 seconds)
0: jdbc:hive2://localhost:10000>
{noformat}
Note that the SORTED specification disappears.

Another example, this time using insert:
{noformat}
0: jdbc:hive2://localhost:10000> -- in beeline
0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
No rows affected (0.055 seconds)
0: jdbc:hive2://localhost:10000> insert into table bucketed values (0, 1, 2, 3);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (1.689 seconds)
0: jdbc:hive2://localhost:10000> analyze table bucketed compute statistics;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (1.516 seconds)
0: jdbc:hive2://localhost:10000> show create table bucketed;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed`(                           |
| `a` int,                                           |
| `b` int,                                           |
| `c` int,                                           |
| `d` int)                                           |
| CLUSTERED BY (                                     |
| c)                                                 |
| SORTED BY (                                        |
| c ASC,                                             |
| d DESC)                                            |
| INTO 10 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT                              |
| 'org.apache.hadoop.mapred.TextInputFormat'         |
| OUTPUTFORMAT                                       |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
| 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
| TBLPROPERTIES (                                    |
| 'transient_lastDdlTime'='1594488191')              |
+----------------------------------------------------+
21 rows selected (0.078 seconds)
0: jdbc:hive2://localhost:10000>

- - -

scala> // in spark
scala> sql("set hive.enforce.sorting=false")
20/07/11 10:23:57 WARN SetCommand: 'SET hive.enforce.sorting=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.sorting) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> sql("set hive.enforce.bucketing=false")
20/07/11 10:24:01 WARN SetCommand: 'SET hive.enforce.bucketing=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.bucketing) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.range(0,1000).map { x => (x, x + 1, x + 2, x + 3) }.
     | toDF("a", "b", "c", "d").createOrReplaceTempView("df")

scala>

scala> sql("insert into bucketed select * from df")
20/07/11 10:24:15 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/07/11 10:24:16 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/07/11 10:24:16 WARN InsertIntoHiveTable: Output Hive table `default`.`bucketed` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive. Inserting data anyways since both hive.enforce.bucketing and hive.enforce.sorting are set to false.
20/07/11 10:24:19 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
res3: org.apache.spark.sql.DataFrame = []

scala>

- - -

0: jdbc:hive2://localhost:10000> -- back in beeline
0: jdbc:hive2://localhost:10000> show create table bucketed;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed`(                           |
| `a` int,                                           |
| `b` int,                                           |
| `c` int,                                           |
| `d` int)                                           |
| CLUSTERED BY (                                     |
| c)                                                 |
| INTO 10 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT                              |
| 'org.apache.hadoop.mapred.TextInputFormat'         |
| OUTPUTFORMAT                                       |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
| 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
| TBLPROPERTIES (                                    |
| 'transient_lastDdlTime'='1594488259')              |
+----------------------------------------------------+
18 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:10000>
{noformat}
Note that the SORTED specification disappears.
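A plausible mechanism for the loss (an assumption on my part, not yet confirmed against the code): Spark's catalog representation of bucketing, {{org.apache.spark.sql.catalyst.catalog.BucketSpec}}, records sort column names but no ASC/DESC direction, so a {{d DESC}} entry cannot round-trip through Spark's catalog and back to the metastore. A minimal sketch of that information loss, using stand-in case classes ({{HiveSortCol}} and {{SparkBucketSpec}} are illustrative names, not the real Hive/Spark types):

```scala
// Stand-in modeled on Hive's per-column sort order (the metastore Order struct
// encodes order 1 = ASC, 0 = DESC); illustrative only, not the real class.
case class HiveSortCol(name: String, order: Int)

// Stand-in mirroring the shape of Spark's BucketSpec: sort column *names* only,
// with no field for sort direction.
case class SparkBucketSpec(numBuckets: Int, bucketCols: Seq[String], sortCols: Seq[String])

val hiveSpec = Seq(HiveSortCol("c", 1), HiveSortCol("d", 0)) // SORTED BY (c ASC, d DESC)

// Converting into Spark's representation keeps only the names...
val sparkSpec = SparkBucketSpec(10, Seq("c"), hiveSpec.map(_.name))

// ...so writing back to the metastore cannot recover the DESC on `d`;
// the direction is simply gone.
val roundTripped = sparkSpec.sortCols.map(HiveSortCol(_, 1))
```

If that is the cause, any operation that reads the table definition through Spark's Hive catalog and writes it back (ALTER TABLE, statistics updates after INSERT) would drop or mangle the spec, which would be consistent with both reproductions above.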
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org