[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Madav updated SPARK-20697:
-----------------------------------
    Description: 
Running MSCK REPAIR TABLE to recover partitions for a partitioned+bucketed table
wipes the bucketing information from the table's storage descriptor in the
metastore.

Steps to reproduce:
1) Create a partitioned+bucketed table in Hive:

CREATE TABLE partbucket(a int)
PARTITIONED BY (b int)
CLUSTERED BY (a) INTO 10 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

2) In the Hive CLI, issue desc formatted partbucket; for the table:

# col_name              data_type               comment
a                       int

# Partition Information
# col_name              data_type               comment
b                       int

# Detailed Table Information
Database:               sparkhivebucket
Owner:                  devbld
CreateTime:             Wed May 10 10:31:07 PDT 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://localhost:8020/user/hive/warehouse/partbucket
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1494437467

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            10
Bucket Columns:         [a]
Sort Columns:           []
Storage Desc Params:
        field.delim             ,
        serialization.format    ,

3) In spark-shell, run:

scala> spark.sql("MSCK REPAIR TABLE partbucket")
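
For reference, the same loss is also visible from spark-shell itself, without
going back to the Hive CLI. A minimal sketch using only the public Catalog API;
the database and table names are the ones created above, and the expectation
about the isBucket flag is an inference from the metastore output shown below:

// Columns Spark reports for the table, read through the Hive metastore;
// the isBucket flag marks bucket columns.
spark.catalog.listColumns("sparkhivebucket", "partbucket")
  .select("name", "isPartition", "isBucket")
  .show()
// Run once before and once after the MSCK REPAIR above: column a is expected
// to report isBucket = true before the repair and false afterwards, matching
// the Num Buckets: -1 / Bucket Columns: [] seen in the Hive CLI output below.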

4) Back in the Hive CLI:

desc formatted partbucket;

# col_name              data_type               comment
a                       int

# Partition Information
# col_name              data_type               comment
b                       int

# Detailed Table Information
Database:               sparkhivebucket
Owner:                  devbld
CreateTime:             Wed May 10 10:31:07 PDT 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket
Table Type:             MANAGED_TABLE
Table Parameters:
        spark.sql.partitionProvider     catalog
        transient_lastDdlTime   1494437647

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             ,
        serialization.format    ,


Further inserts into this table can no longer be made in a bucketed fashion through Hive.
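
A possible workaround sketch (not a fix): once the partitions are recovered,
re-declare the bucket spec through Hive. Hive's ALTER TABLE ... CLUSTERED BY
only updates the metadata, so any data written while the spec was missing is
not re-bucketed. The HiveServer2 URL and user below are illustrative
assumptions; the same ALTER statement can be typed directly into the Hive CLI.

import java.sql.DriverManager

// Assumed HiveServer2 endpoint/user; requires the Hive JDBC driver on the classpath.
val conn = DriverManager.getConnection(
  "jdbc:hive2://localhost:10000/sparkhivebucket", "devbld", "")
try {
  val stmt = conn.createStatement()
  // Metadata-only change: restores Num Buckets / Bucket Columns on the storage descriptor.
  stmt.execute("ALTER TABLE partbucket CLUSTERED BY (a) INTO 10 BUCKETS")
  stmt.close()
} finally {
  conn.close()
}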


> MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20697
>                 URL: https://issues.apache.org/jira/browse/SPARK-20697
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Abhishek Madav
>


