[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20697:
--
Priority: Major  (was: Critical)

This sounds like Hive functionality though; is it even resolvable in Spark?

> MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
> --
>
> Key: SPARK-20697
> URL: https://issues.apache.org/jira/browse/SPARK-20697
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0
> Reporter: Abhishek Madav
> Priority: Major
>
> MSCK REPAIR TABLE used to recover partitions for a partitioned+bucketed table 
> does not restore the bucketing information to the storage descriptor in the 
> metastore. 
> Steps to reproduce:
> 1) Create a partitioned+bucketed table in Hive: CREATE TABLE partbucket(a int) 
> PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED 
> FIELDS TERMINATED BY ',';
> 2) In Hive-CLI, issue a desc formatted for the table.
> # col_name            data_type       comment
>
> a                     int
>
> # Partition Information
> # col_name            data_type       comment
>
> b                     int
>
> # Detailed Table Information
> Database:             sparkhivebucket
> Owner:                devbld
> CreateTime:           Wed May 10 10:31:07 PDT 2017
> LastAccessTime:       UNKNOWN
> Protect Mode:         None
> Retention:            0
> Location:             hdfs://localhost:8020/user/hive/warehouse/partbucket
> Table Type:           MANAGED_TABLE
> Table Parameters:
>   transient_lastDdlTime   1494437467
>
> # Storage Information
> SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:          org.apache.hadoop.mapred.TextInputFormat
> OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Compressed:           No
> Num Buckets:          10
> Bucket Columns:       [a]
> Sort Columns:         []
> Storage Desc Params:
>   field.delim             ,
>   serialization.format    ,
> 3) In spark-shell, 
> scala> spark.sql("MSCK REPAIR TABLE partbucket")
> 4) Back to Hive-CLI 
> desc formatted partbucket;
> # col_name            data_type       comment
>
> a                     int
>
> # Partition Information
> # col_name            data_type       comment
>
> b                     int
>
> # Detailed Table Information
> Database:             sparkhivebucket
> Owner:                devbld
> CreateTime:           Wed May 10 10:31:07 PDT 2017
> LastAccessTime:       UNKNOWN
> Protect Mode:         None
> Retention:            0
> Location:             hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket
> Table Type:           MANAGED_TABLE
> Table Parameters:
>   spark.sql.partitionProvider catalog
>   transient_lastDdlTime       1494437647
>
> # Storage Information
> SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:          org.apache.hadoop.mapred.TextInputFormat
> OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Compressed:           No
> Num Buckets:          -1
> Bucket Columns:       []
> Sort Columns:         []
> Storage Desc Params:
>   field.delim             ,
>   serialization.format    ,
> Further inserts to this table cannot be made in bucketed fashion through 
> Hive. 
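The before/after comparison in the steps above can be automated. The following is a minimal sketch in Python (not a tool used in this report) that reads two captures of `desc formatted` output and flags the case shown here, where Num Buckets goes from 10 to -1 and Bucket Columns becomes empty. The function names parse_storage_info and bucketing_lost are hypothetical, and the parser assumes the field labels appear exactly as printed by Hive-CLI above.

```python
import re


def parse_storage_info(desc_output: str) -> dict:
    """Pull 'Num Buckets' and 'Bucket Columns' out of `desc formatted` text."""
    info = {}
    for raw in desc_output.splitlines():
        # Tolerate mail-quote prefixes ("> ") and irregular spacing.
        line = raw.lstrip("> ").strip()
        m = re.match(r"Num Buckets:\s*(-?\d+)", line)
        if m:
            info["num_buckets"] = int(m.group(1))
        m = re.match(r"Bucket Columns:\s*\[(.*?)\]", line)
        if m:
            info["bucket_columns"] = [
                c.strip() for c in m.group(1).split(",") if c.strip()
            ]
    return info


def bucketing_lost(before: str, after: str) -> bool:
    """True if the table was bucketed before and the spec is gone or changed after."""
    b = parse_storage_info(before)
    a = parse_storage_info(after)
    was_bucketed = b.get("num_buckets", -1) > 0
    still_same = (
        a.get("num_buckets") == b.get("num_buckets")
        and a.get("bucket_columns") == b.get("bucket_columns")
    )
    return was_bucketed and not still_same
```

Feeding it the two captures from steps 2 and 4 would report the bucketing as lost; feeding it the same capture twice would not.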



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.

2018-03-20 Thread Abhishek Madav (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Madav updated SPARK-20697:
---
Priority: Critical  (was: Major)

> MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
> --
>
> Key: SPARK-20697
> URL: https://issues.apache.org/jira/browse/SPARK-20697
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0
> Reporter: Abhishek Madav
> Priority: Critical
>






[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.

2018-03-20 Thread Abhishek Madav (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Madav updated SPARK-20697:
---
Affects Version/s: 2.2.0, 2.2.1, 2.3.0

> MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
> --
>
> Key: SPARK-20697
> URL: https://issues.apache.org/jira/browse/SPARK-20697
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0
> Reporter: Abhishek Madav
> Priority: Major
>






[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.

2017-05-10 Thread Abhishek Madav (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Madav updated SPARK-20697:
---
Description: 
MSCK REPAIR TABLE used to recover partitions for a partitioned+bucketed table 
does not restore the bucketing information to the storage descriptor in the 
metastore. 

Steps to reproduce:
1) Create a partitioned+bucketed table in Hive: CREATE TABLE partbucket(a int) 
PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';

2) In Hive-CLI, issue a desc formatted for the table.

# col_name            data_type       comment

a                     int

# Partition Information
# col_name            data_type       comment

b                     int

# Detailed Table Information
Database:             sparkhivebucket
Owner:                devbld
CreateTime:           Wed May 10 10:31:07 PDT 2017
LastAccessTime:       UNKNOWN
Protect Mode:         None
Retention:            0
Location:             hdfs://localhost:8020/user/hive/warehouse/partbucket
Table Type:           MANAGED_TABLE
Table Parameters:
  transient_lastDdlTime   1494437467

# Storage Information
SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:          org.apache.hadoop.mapred.TextInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:           No
Num Buckets:          10
Bucket Columns:       [a]
Sort Columns:         []
Storage Desc Params:
  field.delim             ,
  serialization.format    ,

3) In spark-shell, 

scala> spark.sql("MSCK REPAIR TABLE partbucket")

4) Back to Hive-CLI 

desc formatted partbucket;

# col_name            data_type       comment

a                     int

# Partition Information
# col_name            data_type       comment

b                     int

# Detailed Table Information
Database:             sparkhivebucket
Owner:                devbld
CreateTime:           Wed May 10 10:31:07 PDT 2017
LastAccessTime:       UNKNOWN
Protect Mode:         None
Retention:            0
Location:             hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket
Table Type:           MANAGED_TABLE
Table Parameters:
  spark.sql.partitionProvider catalog
  transient_lastDdlTime       1494437647

# Storage Information
SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:          org.apache.hadoop.mapred.TextInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:           No
Num Buckets:          -1
Bucket Columns:       []
Sort Columns:         []
Storage Desc Params:
  field.delim             ,
  serialization.format    ,


Further inserts to this table cannot be made in bucketed fashion through Hive. 
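One possible way to put the lost spec back, which is not proposed or confirmed anywhere in this report, would be to re-declare the bucketing from Hive with its ALTER TABLE ... CLUSTERED BY ... INTO n BUCKETS statement, which changes only the metastore metadata and does not rewrite existing data. The sketch below, in Python, only builds that statement as a string from the values recorded before the repair; the helper name restore_bucketing_ddl is hypothetical, and whether running the result fully restores bucketed inserts for this table is an assumption that would need to be checked.

```python
from typing import Sequence


def restore_bucketing_ddl(
    table: str, bucket_columns: Sequence[str], num_buckets: int
) -> str:
    """Build a Hive ALTER TABLE statement that re-declares a bucketing spec.

    Hypothetical helper: it only formats the DDL text and does not talk to
    Hive or the metastore.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    if not bucket_columns:
        raise ValueError("at least one bucket column is required")
    cols = ", ".join(bucket_columns)
    return f"ALTER TABLE {table} CLUSTERED BY ({cols}) INTO {num_buckets} BUCKETS"


# Values taken from the 'desc formatted' output captured before the repair.
print(restore_bucketing_ddl("partbucket", ["a"], 10))
```

The statement would be issued from Hive-CLI, followed by another desc formatted to check that Num Buckets and Bucket Columns match the original values.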

  was:
MSCK REPAIR TABLE used to recover partitions for a partitioned+bucketed table 
does not restore the bucketing information to the storage descriptor in the 
metastore. 

Steps to reproduce:
1) Create a partitioned+bucketed table in Hive: CREATE TABLE partbucket(a int) 
PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';

2) In Hive-CLI, issue a desc formatted for the table.

# col_name  data_type   comment 
 
a   int 
 
# Partition Information  
# col_name  data_type   comment 
 
b   int 
 
# Detailed Table Information 
Database:   sparkhivebucket  
Owner: