[ https://issues.apache.org/jira/browse/SPARK-46477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashish Sharma updated SPARK-46477:
----------------------------------
Description: 

Presto fails to read a partition of a Hive table that was updated by Spark SQL, with the following error:
{noformat}
Hive table (bucketdb.dept_part_buk) is bucketed but partition (location=nk) is not bucketed
{noformat}
The Spark SQL statement that causes the read failure in Presto:
{noformat}
ALTER TABLE bucketdb.dept_part_buk PARTITION (location='nk') SET LOCATION 'file:///tmp/location=nk1';
{noformat}
*Root Cause*

{{ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "loc";}} drops the bucket-column information in the Hive Metastore (HMS).

Repro Script
{code:java}
CREATE TABLE bucketdb.dept_part_buk (
  deptno INT,
  dname STRING)
PARTITIONED BY (location STRING)
CLUSTERED BY (deptno) INTO 2 BUCKETS
STORED AS textfile
LOCATION 'file:///tmp';

ALTER TABLE bucketdb.dept_part_buk ADD IF NOT EXISTS PARTITION (location='nk')
  LOCATION 'file:///tmp/location=nk';

ALTER TABLE bucketdb.dept_part_buk PARTITION (location='nk')
  SET LOCATION 'file:///tmp/location=nk1';
{code}
Investigation

HMS state before running the query. Note {{NUM_BUCKETS = 2}} in SDS and the {{deptno}} row in BUCKETING_COLS:
{noformat}
mysql> select * from SDS where SD_ID = 137;
+-------+-------+------------------------------------------+---------------+---------------------------+-----------------------+-------------+------------------------------------------------------------+----------+
| SD_ID | CD_ID | INPUT_FORMAT                             | IS_COMPRESSED | IS_STOREDASSUBDIRECTORIES | LOCATION              | NUM_BUCKETS | OUTPUT_FORMAT                                              | SERDE_ID |
+-------+-------+------------------------------------------+---------------+---------------------------+-----------------------+-------------+------------------------------------------------------------+----------+
|   137 |   106 | org.apache.hadoop.mapred.TextInputFormat | 0x00          | 0x00                      | file:/tmp/location=nk | 2           | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |      137 |
+-------+-------+------------------------------------------+---------------+---------------------------+-----------------------+-------------+------------------------------------------------------------+----------+

mysql> select * from BUCKETING_COLS where SD_ID = 137;
+-------+-----------------+-------------+
| SD_ID | BUCKET_COL_NAME | INTEGER_IDX |
+-------+-----------------+-------------+
|   137 | deptno          |           0 |
+-------+-----------------+-------------+
{noformat}
Spark SQL query:
{noformat}
ALTER TABLE bucketdb.dept_part_buk PARTITION (location='nk') SET LOCATION 'file:///Users/someuser/sparkdata/location=nk1';
{noformat}
HMS state after running the Spark SQL query. {{NUM_BUCKETS}} has been reset to 0 and the bucket-column entry is gone:
{noformat}
mysql> select * from SDS where SD_ID = 137;
+-------+-------+------------------------------------------+---------------+---------------------------+------------------------+-------------+------------------------------------------------------------+----------+
| SD_ID | CD_ID | INPUT_FORMAT                             | IS_COMPRESSED | IS_STOREDASSUBDIRECTORIES | LOCATION               | NUM_BUCKETS | OUTPUT_FORMAT                                              | SERDE_ID |
+-------+-------+------------------------------------------+---------------+---------------------------+------------------------+-------------+------------------------------------------------------------+----------+
|   137 |   106 | org.apache.hadoop.mapred.TextInputFormat | 0x00          | 0x00                      | file:/tmp/location=nk1 | 0           | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |      137 |
+-------+-------+------------------------------------------+---------------+---------------------------+------------------------+-------------+------------------------------------------------------------+----------+

mysql> select * from BUCKETING_COLS where SD_ID = 141;
Empty set (0.00 sec)
{noformat}
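The metastore-level effect can be sketched with a small model (illustrative only: the field names follow the Hive Thrift {{StorageDescriptor}} schema, and {{set_partition_location}} / {{buggy_set_partition_location}} are hypothetical helpers, not Spark's actual code path). A correct {{SET LOCATION}} copies the existing storage descriptor and changes only the location; the observed bug behaves as if the descriptor were rebuilt without the bucketing fields:

```python
from dataclasses import dataclass, replace
from typing import List

# Simplified model of a Hive Metastore StorageDescriptor.
# Field names mirror the Thrift schema (location, numBuckets, bucketCols).
@dataclass(frozen=True)
class StorageDescriptor:
    location: str
    num_buckets: int
    bucket_cols: List[str]

def set_partition_location(sd: StorageDescriptor, new_location: str) -> StorageDescriptor:
    """Correct SET LOCATION: copy every field, change only the location."""
    return replace(sd, location=new_location)

def buggy_set_partition_location(sd: StorageDescriptor, new_location: str) -> StorageDescriptor:
    """Models the observed bug: the descriptor is effectively rebuilt,
    so numBuckets falls back to 0 and bucketCols is lost."""
    return StorageDescriptor(location=new_location, num_buckets=0, bucket_cols=[])

# Partition state as seen in the HMS tables above (SD_ID = 137).
before = StorageDescriptor("file:/tmp/location=nk", num_buckets=2, bucket_cols=["deptno"])

good = set_partition_location(before, "file:/tmp/location=nk1")
bad = buggy_set_partition_location(before, "file:/tmp/location=nk1")
```

Presto's check that a partition of a bucketed table is itself bucketed fails exactly on the {{bad}} state: location updated, but {{num_buckets}} zeroed and {{bucket_cols}} empty.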
> Hive table (bucketdb.dept_part_buk) is bucketed but partition (location=nk) is not bucketed
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-46477
>                 URL: https://issues.apache.org/jira/browse/SPARK-46477
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: Ashish Sharma
>            Priority: Minor

--
This message was sent by Atlassian Jira
(v8.20.10#820010)