Bruce Robbins created SPARK-32281:
-------------------------------------

             Summary: Spark wipes out SORTED spec in metastore when when DESC 
is used
                 Key: SPARK-32281
                 URL: https://issues.apache.org/jira/browse/SPARK-32281
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Bruce Robbins


When altering a Hive bucketed table or updating its statistics, Spark will wipe 
out the SORTED specification in the metastore if the specification uses DESC.

 For example:
{noformat}
0: jdbc:hive2://localhost:10000> -- in beeline
0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d 
int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
No rows affected (0.045 seconds)
0: jdbc:hive2://localhost:10000> show create table bucketed;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed`(                           |
|   `a` int,                                         |
|   `b` int,                                         |
|   `c` int,                                         |
|   `d` int)                                         |
| CLUSTERED BY (                                     |
|   c)                                               |
| SORTED BY (                                        |
|   c ASC,                                           |
|   d DESC)                                          |
| INTO 10 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1594488043')            |
+----------------------------------------------------+
21 rows selected (0.042 seconds)
0: jdbc:hive2://localhost:10000> 
-
-
-

scala> // in spark

scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does not 
exist
20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
res0: org.apache.spark.sql.DataFrame = []

scala> 
-
-
-

0: jdbc:hive2://localhost:10000> -- back in beeline
0: jdbc:hive2://localhost:10000> show create table bucketed;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed`(                           |
|   `a` int,                                         |
|   `b` int,                                         |
|   `c` int,                                         |
|   `d` int)                                         |
| CLUSTERED BY (                                     |
|   c)                                               |
| INTO 10 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
| TBLPROPERTIES (                                    |
|   'foo'='bar',                                     |
|   'spark.sql.partitionProvider'='catalog',         |
|   'transient_lastDdlTime'='1594488098')            |
+----------------------------------------------------+
20 rows selected (0.038 seconds)
0: jdbc:hive2://localhost:10000> 
{noformat}
Note that the SORTED specification disappears.

Another example, this time using insert:
{noformat}
0: jdbc:hive2://localhost:10000> -- in beeline
0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d 
int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
No rows affected (0.055 seconds)
0: jdbc:hive2://localhost:10000> insert into table bucketed values (0, 1, 2, 3);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. spark, tez) 
or using Hive 1.X releases.
No rows affected (1.689 seconds)
0: jdbc:hive2://localhost:10000> analyze table bucketed compute statistics;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. spark, tez) 
or using Hive 1.X releases.
No rows affected (1.516 seconds)
0: jdbc:hive2://localhost:10000> show create table bucketed;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed`(                           |
|   `a` int,                                         |
|   `b` int,                                         |
|   `c` int,                                         |
|   `d` int)                                         |
| CLUSTERED BY (                                     |
|   c)                                               |
| SORTED BY (                                        |
|   c ASC,                                           |
|   d DESC)                                          |
| INTO 10 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1594488191')            |
+----------------------------------------------------+
21 rows selected (0.078 seconds)
0: jdbc:hive2://localhost:10000> 
-
-
-

scala> // in spark

scala> sql("set hive.enforce.sorting=false")
20/07/11 10:23:57 WARN SetCommand: 'SET hive.enforce.sorting=false' might not 
work, since Spark doesn't support changing the Hive config dynamically. Please 
pass the Hive-specific config by adding the prefix spark.hadoop (e.g. 
spark.hadoop.hive.enforce.sorting) when starting a Spark application. For 
details, see the link: 
https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> sql("set hive.enforce.bucketing=false")
20/07/11 10:24:01 WARN SetCommand: 'SET hive.enforce.bucketing=false' might not 
work, since Spark doesn't support changing the Hive config dynamically. Please 
pass the Hive-specific config by adding the prefix spark.hadoop (e.g. 
spark.hadoop.hive.enforce.bucketing) when starting a Spark application. For 
details, see the link: 
https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.range(0,1000).map { x => (x, x + 1, x + 2, x + 3) }.
  toDF("a", "b", "c", "d").createOrReplaceTempView("df")

     | 
scala> 
scala> sql("insert into bucketed select * from df")
20/07/11 10:24:15 WARN HiveConf: HiveConf of name hive.metastore.local does not 
exist
20/07/11 10:24:16 WARN HiveConf: HiveConf of name hive.metastore.local does not 
exist
20/07/11 10:24:16 WARN InsertIntoHiveTable: Output Hive table 
`default`.`bucketed` is bucketed but Spark currently does NOT populate bucketed 
output which is compatible with Hive. Inserting data anyways since both 
hive.enforce.bucketing and hive.enforce.sorting are set to false.
20/07/11 10:24:19 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
res3: org.apache.spark.sql.DataFrame = []

scala> 
-
-
-

0: jdbc:hive2://localhost:10000> -- back in beeline
0: jdbc:hive2://localhost:10000> show create table bucketed;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed`(                           |
|   `a` int,                                         |
|   `b` int,                                         |
|   `c` int,                                         |
|   `d` int)                                         |
| CLUSTERED BY (                                     |
|   c)                                               |
| INTO 10 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1594488259')            |
+----------------------------------------------------+
18 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:10000> 
{noformat}
Note that the SORTED specification disappears.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to