[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661854#comment-14661854 ]

Yongzhi Chen commented on HIVE-10880:

Thanks [~xuefuz] for reviewing the code.

The bucket number is not respected in insert overwrite.
---
Key: HIVE-10880
URL: https://issues.apache.org/jira/browse/HIVE-10880
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Yongzhi Chen
Assignee: Yongzhi Chen
Priority: Critical
Fix For: 1.3.0, 2.0.0
Attachments: HIVE-10880.1.patch, HIVE-10880.2.patch, HIVE-10880.3.patch, HIVE-10880.4.patch

When hive.enforce.bucketing is true, the bucket number defined in the table is no longer respected in current master and 1.2.

Reproduce:
{code:sql}
CREATE TABLE IF NOT EXISTS buckettestinput( data string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput1( data string )CLUSTERED BY(data) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput2( data string )CLUSTERED BY(data) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
{code}
Then I inserted the following data into the buckettestinput table:
{noformat}
firstinsert1
firstinsert2
firstinsert3
firstinsert4
firstinsert5
firstinsert6
firstinsert7
firstinsert8
secondinsert1
secondinsert2
secondinsert3
secondinsert4
secondinsert5
secondinsert6
secondinsert7
secondinsert8
{noformat}
{code:sql}
set hive.enforce.bucketing = true;
set hive.enforce.sorting=true;
insert overwrite table buckettestoutput1 select * from buckettestinput where data like 'first%';
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
select * from buckettestoutput1 a join buckettestoutput2 b on (a.data=b.data);
{code}
{noformat}
Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number of buckets for table buckettestoutput1 is 2, whereas the number of files is 1 (state=42000,code=10141)
{noformat}
The debug information related to insert overwrite:
{noformat}
0: jdbc:hive2://localhost:1 insert overwrite table buckettestoutput1 select * from buckettestinput where data like 'first%'insert overwrite table buckettestoutput1
0: jdbc:hive2://localhost:1 ; select * from buckettestinput where data like ' first%';
INFO : Number of reduce tasks determined at compile time: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapred.reduce.tasks=<number>
INFO : Job running in-process (local Hadoop)
INFO : 2015-06-01 11:09:29,650 Stage-1 map = 86%, reduce = 100%
INFO : Ended Job = job_local107155352_0001
INFO : Loading data to table default.buckettestoutput1 from file:/user/hive/warehouse/buckettestoutput1/.hive-staging_hive_2015-06-01_11-09-28_166_3109203968904090801-1/-ext-1
INFO : Table default.buckettestoutput1 stats: [numFiles=1, numRows=4, totalSize=52, rawDataSize=48]
No rows affected (1.692 seconds)
{noformat}
Inserts that use dynamic partitions do not have the issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
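Editorial illustration of the error reported above: Error 10141 complains that the table declares 2 buckets while only 1 file exists. A minimal Python sketch of that kind of consistency check follows. It is a simplified model, not Hive's implementation: the helper names (count_data_files, check_bucket_files, demo), the file names 000000_0 and 000001_0, and the rule of skipping names that start with "." or "_" are all assumptions made for the example.

```python
import os
import tempfile


def count_data_files(table_dir):
    """Count regular files in table_dir, skipping names that start with '.' or '_'.

    Skipping such names is an assumption of this sketch (staging/hidden files).
    """
    return sum(
        1
        for name in os.listdir(table_dir)
        if not name.startswith((".", "_"))
        and os.path.isfile(os.path.join(table_dir, name))
    )


def check_bucket_files(table_name, table_dir, num_buckets):
    """Return None when the file count matches num_buckets, else a message."""
    num_files = count_data_files(table_dir)
    if num_files != num_buckets:
        return (
            "The number of buckets for table %s is %d, "
            "whereas the number of files is %d"
            % (table_name, num_buckets, num_files)
        )
    return None


def demo():
    """Check a directory with one file, then with two, against 2 buckets."""
    with tempfile.TemporaryDirectory() as table_dir:
        # One file only: models the state left by the failing insert overwrite.
        open(os.path.join(table_dir, "000000_0"), "w").close()
        with_one_file = check_bucket_files("buckettestoutput1", table_dir, 2)
        # A second file: models the state expected for a 2-bucket table.
        open(os.path.join(table_dir, "000001_0"), "w").close()
        with_two_files = check_bucket_files("buckettestoutput1", table_dir, 2)
    return with_one_file, with_two_files
```

Calling demo() returns a mismatch message for the one-file case and None once the second file exists, which mirrors why the join in the report is rejected at compile time rather than at run time.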
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655152#comment-14655152 ]

Yongzhi Chen commented on HIVE-10880:

The failure is not related (it is a known issue):
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchEmptyCommit
Error Message
Table/View 'TXNS' already exists in Schema 'APP'.

[~xuefuz], could you review the change? Thanks.

The patch fixes the following issue:
In local mode, when enforce.bucketing is true, an insert overwrite into a bucketed table or a static partition does not respect the bucket number. Because only the dynamic partition case works correctly, this fix uses the same approach as the dynamic partition scenario.
It seems that HIVE-11360 has a similar issue.
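Editorial illustration of what "respecting the bucket number" means in the comment above: every row is assigned to one of N buckets, and a correct write leaves exactly N outputs. The Python sketch below models this with a Java-style 31-multiplier string hash. That hash, and the names java_style_hash, bucket_for and write_buckets, are assumptions chosen for the example; the thread does not say which hash Hive uses, and this is not the patch's code.

```python
def java_style_hash(text):
    """31-multiplier hash over the UTF-8 bytes, kept to 32 bits (assumed hash)."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h


def bucket_for(value, num_buckets):
    """Map a value to a bucket index in [0, num_buckets)."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


def write_buckets(rows, num_buckets):
    """Group rows into exactly num_buckets lists, one per bucket.

    A list may be empty, but there is always one output per declared bucket.
    """
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row, num_buckets)].append(row)
    return buckets


# The eight rows selected by `data like 'first%'` in the reproduction.
rows = ["firstinsert%d" % i for i in range(1, 9)]
buckets = write_buckets(rows, 2)
```

The point of the sketch is the shape of the result: write_buckets always returns one output per declared bucket, whereas the log in the report shows numFiles=1 for a table declared with 2 buckets.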
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658528#comment-14658528 ]

Xuefu Zhang commented on HIVE-10880:

Okay. I will take a look shortly.
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659360#comment-14659360 ]

Xuefu Zhang commented on HIVE-10880:

+1
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653693#comment-14653693 ]

Yongzhi Chen commented on HIVE-10880:

I ran the Spark tests on my local machine and they passed. Re-attaching the patch.
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653235#comment-14653235 ]

Hive QA commented on HIVE-10880:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12748527/HIVE-10880.4.patch

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 9320 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_convert_enum_to_string
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynamic_rdd_cache
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_bigdata
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_having
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_insert_into2
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_nullgroup2
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_ppd_join5
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample5
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_timestamp_1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_timestamp_lazy
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_transform_ppr2
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_11
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_19
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchCommit_Json
org.apache.hive.jdbc.TestSSL.testSSLConnectionWithProperty
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4813/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4813/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4813/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12748527 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654681#comment-14654681 ]

Hive QA commented on HIVE-10880:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12748670/HIVE-10880.4.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 9323 tests executed
*Failed tests:*
{noformat}
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchEmptyCommit
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4826/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4826/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4826/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12748670 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652398#comment-14652398 ]

Yongzhi Chen commented on HIVE-10880:

The patch fixes the following issue:
In local mode, when enforce.bucketing is true, an insert overwrite into a bucketed table or a static partition does not respect the bucket number. Because only the dynamic partition case works correctly, this fix uses the same approach as the dynamic partition scenario.

Attaching patch 4 after rebase.
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580759#comment-14580759 ] Yongzhi Chen commented on HIVE-10880: - My build uses -Phadoop-2, the error is: {noformat} INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 3.693s [INFO] Finished at: Tue Jun 09 17:00:24 EDT 2015 [INFO] Final Memory: 26M/310M [INFO] [ERROR] Failed to execute goal on project hive-shims-common: Could not resolve dependencies for project org.apache.hive.shims:hive-shims-common:jar:1.2.0: Could not find artifact org.apache.hadoop:hadoop-core:jar:2.6.0 in datanucleus (http://www.datanucleus.org/downloads/maven2) - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :hive-shims-common {noformat} And in run time, it does call into functions in hadoop-core.jar. All my test in hadoop-2 env. The problem is that without my fix, hive after insert overwrite bucketed table and partition in local mode, the table can not used to do bucketmapjoin.sortedmerge because of missing files (always 1 vs. bucket number). The bucket number is not respected in insert overwrite. --- Key: HIVE-10880 URL: https://issues.apache.org/jira/browse/HIVE-10880 Project: Hive Issue Type: Bug Affects Versions: 1.2.0 Reporter: Yongzhi Chen Assignee: Yongzhi Chen Priority: Blocker Attachments: HIVE-10880.1.patch, HIVE-10880.2.patch, HIVE-10880.3.patch When hive.enforce.bucketing is true, the bucket number defined in the table is no longer respected in current master and 1.2. This is a regression. 
Reproduce:
{noformat}
CREATE TABLE IF NOT EXISTS buckettestinput( data string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput1( data string )CLUSTERED BY(data) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput2( data string )CLUSTERED BY(data) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Then I inserted the following data into the buckettestinput table
firstinsert1
firstinsert2
firstinsert3
firstinsert4
firstinsert5
firstinsert6
firstinsert7
firstinsert8
secondinsert1
secondinsert2
secondinsert3
secondinsert4
secondinsert5
secondinsert6
secondinsert7
secondinsert8

set hive.enforce.bucketing = true;
set hive.enforce.sorting=true;
insert overwrite table buckettestoutput1 select * from buckettestinput where data like 'first%';
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
select * from buckettestoutput1 a join buckettestoutput2 b on (a.data=b.data);

Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number of buckets for table buckettestoutput1 is 2, whereas the number of files is 1 (state=42000,code=10141)
{noformat}
The related debug information related to insert overwrite:
{noformat}
0: jdbc:hive2://localhost:1 insert overwrite table buckettestoutput1 select * from buckettestinput where data like 'first%'insert overwrite table buckettestoutput1
0: jdbc:hive2://localhost:1 ; select * from buckettestinput where data like ' first%';
INFO : Number of reduce tasks determined at compile time: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapred.reduce.tasks=<number>
INFO : Job running in-process (local Hadoop)
INFO : 2015-06-01 11:09:29,650 Stage-1 map = 86%, reduce = 100%
INFO : Ended Job = job_local107155352_0001
INFO : Loading data to table default.buckettestoutput1 from
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579119#comment-14579119 ]

Yongzhi Chen commented on HIVE-10880:

[~xuefuz], I agree with you: there is something more serious than the missing files. I think the bucketing algorithm is broken. I just tried an insert overwrite from a very big table, and all the data goes to one bucket too. It seems the hash map is no longer working. I will try to figure out why.

The bucket number is not respected in insert overwrite.
---
Key: HIVE-10880
URL: https://issues.apache.org/jira/browse/HIVE-10880
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Yongzhi Chen
Assignee: Yongzhi Chen
Priority: Blocker
Attachments: HIVE-10880.1.patch, HIVE-10880.2.patch, HIVE-10880.3.patch

When hive.enforce.bucketing is true, the bucket number defined in the table is no longer respected in current master and 1.2. This is a regression.

Reproduce:
{noformat}
CREATE TABLE IF NOT EXISTS buckettestinput( data string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput1( data string )CLUSTERED BY(data) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput2( data string )CLUSTERED BY(data) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Then I inserted the following data into the buckettestinput table
firstinsert1
firstinsert2
firstinsert3
firstinsert4
firstinsert5
firstinsert6
firstinsert7
firstinsert8
secondinsert1
secondinsert2
secondinsert3
secondinsert4
secondinsert5
secondinsert6
secondinsert7
secondinsert8

set hive.enforce.bucketing = true;
set hive.enforce.sorting=true;
insert overwrite table buckettestoutput1 select * from buckettestinput where data like 'first%';
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
select * from buckettestoutput1 a join buckettestoutput2 b on (a.data=b.data);

Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number of buckets for table buckettestoutput1 is 2, whereas the number of files is 1 (state=42000,code=10141)
{noformat}
The related debug information related to insert overwrite:
{noformat}
0: jdbc:hive2://localhost:1 insert overwrite table buckettestoutput1 select * from buckettestinput where data like 'first%'insert overwrite table buckettestoutput1
0: jdbc:hive2://localhost:1 ; select * from buckettestinput where data like ' first%';
INFO : Number of reduce tasks determined at compile time: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapred.reduce.tasks=<number>
INFO : Job running in-process (local Hadoop)
INFO : 2015-06-01 11:09:29,650 Stage-1 map = 86%, reduce = 100%
INFO : Ended Job = job_local107155352_0001
INFO : Loading data to table default.buckettestoutput1 from file:/user/hive/warehouse/buckettestoutput1/.hive-staging_hive_2015-06-01_11-09-28_166_3109203968904090801-1/-ext-1
INFO : Table default.buckettestoutput1 stats: [numFiles=1, numRows=4, totalSize=52, rawDataSize=48]
No rows affected (1.692 seconds)
{noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578871#comment-14578871 ]

Yongzhi Chen commented on HIVE-10880:

[~xuefuz], when I debugged the issue, I noticed that the right number of reducers is used. I also noticed that dynamic partition insert works fine because it adds the missing files. I think we should treat static partitions and ordinary tables the same way, so I fixed the issue by adding the missing buckets. The following is the code for the dynamic partition part:
{noformat}
taskIDToFile = removeTempOrDuplicateFiles(items, fs);
// if the table is bucketed and enforce bucketing, we should check and generate all buckets
if (dpCtx.getNumBuckets() > 0 && taskIDToFile != null) {
  // refresh the file list
  items = fs.listStatus(parts[i].getPath());
  // get the missing buckets and generate empty buckets
  String taskID1 = taskIDToFile.keySet().iterator().next();
  Path bucketPath = taskIDToFile.values().iterator().next().getPath();
  for (int j = 0; j < dpCtx.getNumBuckets(); ++j) {
    String taskID2 = replaceTaskId(taskID1, j);
    if (!taskIDToFile.containsKey(taskID2)) {
      // create empty bucket, file name should be derived from taskID2
      String path2 = replaceTaskIdFromFilename(bucketPath.toUri().getPath().toString(), j);
      result.add(path2);
    }
  }
}
{noformat}
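The idea in the comment above (find which bucket files are absent after the job and create empty ones) can be illustrated in isolation. The following is only a minimal sketch, not Hive code: the class name `MissingBucketSketch` and the method `missingBuckets` are hypothetical, and buckets are represented as plain integer indices rather than task IDs or file paths.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hive.
class MissingBucketSketch {

    // Returns the bucket indices in [0, numBuckets) that have no output file yet.
    static List<Integer> missingBuckets(Set<Integer> presentBuckets, int numBuckets) {
        List<Integer> missing = new ArrayList<>();
        for (int j = 0; j < numBuckets; j++) {
            if (!presentBuckets.contains(j)) {
                missing.add(j); // an empty file would be created for this bucket
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Mirrors the reported situation: 2 buckets declared, only 1 file written.
        Set<Integer> present = new HashSet<>();
        present.add(0);
        System.out.println("Buckets needing an empty file: " + missingBuckets(present, 2));
    }
}
```

With two declared buckets and only bucket 0 present, the sketch reports bucket 1 as the one that needs an empty file, which is the situation the error message in the issue description complains about.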
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578293#comment-14578293 ]

Xuefu Zhang commented on HIVE-10880:

[~ychena], thanks for working on this. Looking at the patch, I wasn't confident that I know the root cause of the problem and how your patch addresses it. From the problem description, I originally thought that it's a problem of setting the right number of reducers. However, your patch does not seem to go in that direction. Instead, it appears that your patch adds the missing buckets by creating empty files. I'm not sure if this fixes the root cause. In general, the rows should be relatively evenly distributed across the buckets, so missing or empty bucket files should be rare rather than normal. Could you please share your thoughts on this?
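The expectation that rows spread roughly evenly across buckets follows from assigning each row to a bucket by hashing the clustering column modulo the bucket count. The sketch below illustrates that general technique under stated assumptions: it uses Java's `String.hashCode()` as a stand-in for whatever hash Hive applies to the column, and the names `BucketAssignmentSketch`, `bucketFor` and `countPerBucket` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of hash-based bucket assignment; not Hive code.
class BucketAssignmentSketch {

    // Assumption: String.hashCode() stands in for the engine's own column hash.
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is non-negative.
    static int bucketFor(String value, int numBuckets) {
        return (value.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Counts how many rows land in each bucket.
    static int[] countPerBucket(List<String> rows, int numBuckets) {
        int[] counts = new int[numBuckets];
        for (String row : rows) {
            counts[bucketFor(row, numBuckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // The eight rows selected by "data like 'first%'" in the reproduction.
        List<String> rows = new ArrayList<>();
        for (int i = 1; i <= 8; i++) {
            rows.add("firstinsert" + i);
        }
        int[] counts = countPerBucket(rows, 2);
        System.out.println("bucket 0: " + counts[0] + ", bucket 1: " + counts[1]);
    }
}
```

Under this stand-in hash, the eight `firstinsert` rows differ only in their last digit, so they split between the two buckets rather than collapsing into one. A result with a single output file therefore points at something other than an unlucky distribution.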
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14574432#comment-14574432 ]

Yongzhi Chen commented on HIVE-10880:

The failures are not related. The following two tests have been failing with an age of more than 10:
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_2
org.apache.hive.jdbc.TestJdbcWithLocalClusterSpark.testTempTable
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_autogen_colalias failed in build 4179 (the build after this build) too.
For the Spark failure, I tested locally and all tests pass. My code change only takes effect when hive.enforce.bucketing is true, and the Spark test never sets this value, so it is not related.

--- T E S T S ---
--- T E S T S ---
Running org.apache.hive.jdbc.TestJdbcWithLocalClusterSpark
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 72.272 sec - in org.apache.hive.jdbc.TestJdbcWithLocalClusterSpark
Results :
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0

Could anyone review the code? Thanks
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572804#comment-14572804 ]

Yongzhi Chen commented on HIVE-10880:

The implementation of the method private static String replaceTaskId(String taskId, int bucketNum) does not look right. Since the code has been in the source for a while, I am not very confident about that. The attached patch 3 fixes that issue too. If the tests pass, we should use patch 3; otherwise keep patch 2.
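The comment above does not say what is wrong with replaceTaskId, and the patch itself is not quoted here. For orientation only, a helper with that signature would presumably turn a zero-padded task ID into one carrying the bucket number at the same width. The sketch below shows that presumed behaviour; it is an assumption, not Hive's implementation or the patched version, and the names `TaskIdSketch` and `withBucketNumber` are hypothetical.

```java
// Hypothetical illustration; this is not Hive's replaceTaskId.
class TaskIdSketch {

    // Builds an ID of the same width as taskId, holding bucketNum left-padded with zeros.
    // If the bucket number is already wider than taskId, it is returned unpadded.
    static String withBucketNumber(String taskId, int bucketNum) {
        String digits = Integer.toString(bucketNum);
        StringBuilder padded = new StringBuilder();
        for (int i = digits.length(); i < taskId.length(); i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println(withBucketNumber("000000", 1));
        System.out.println(withBucketNumber("000000", 12));
    }
}
```

Keeping the width fixed matters because the code quoted earlier in the thread looks up the derived ID with `taskIDToFile.containsKey(...)`, so a differently padded string would never match an existing entry.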
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573423#comment-14573423 ]

Hive QA commented on HIVE-10880:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12737564/HIVE-10880.3.patch

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 8999 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_autogen_colalias
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udf_nondeterministic
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_2
org.apache.hive.jdbc.TestJdbcWithLocalClusterSpark.testTempTable
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4178/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4178/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4178/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12737564 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572661#comment-14572661 ]

Yongzhi Chen commented on HIVE-10880:

[~xuefuz], [~szehon], [~ctang.ma], [~csun], could you review the code? Thanks
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572658#comment-14572658 ]

Yongzhi Chen commented on HIVE-10880:
-------------------------------------

None of these failures is related to the patch; their ages are all 11 or more.

The bucket number is not respected in insert overwrite.
-------------------------------------------------------

Key: HIVE-10880
URL: https://issues.apache.org/jira/browse/HIVE-10880
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Yongzhi Chen
Assignee: Yongzhi Chen
Priority: Blocker
Attachments: HIVE-10880.1.patch, HIVE-10880.2.patch

When hive.enforce.bucketing is true, the bucket number defined in the table is no longer respected in current master and 1.2. This is a regression.

Reproduce:
{noformat}
CREATE TABLE IF NOT EXISTS buckettestinput( data string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput1( data string )CLUSTERED BY(data) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput2( data string )CLUSTERED BY(data) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Then I inserted the following data into the buckettestinput table
firstinsert1
firstinsert2
firstinsert3
firstinsert4
firstinsert5
firstinsert6
firstinsert7
firstinsert8
secondinsert1
secondinsert2
secondinsert3
secondinsert4
secondinsert5
secondinsert6
secondinsert7
secondinsert8

set hive.enforce.bucketing = true;
set hive.enforce.sorting=true;
insert overwrite table buckettestoutput1 select * from buckettestinput where data like 'first%';
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
select * from buckettestoutput1 a join buckettestoutput2 b on (a.data=b.data);

Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number of buckets for table buckettestoutput1 is 2, whereas the number of files is 1 (state=42000,code=10141)
{noformat}

The debug information related to insert overwrite:
{noformat}
0: jdbc:hive2://localhost:1 insert overwrite table buckettestoutput1 select * from buckettestinput where data like 'first%'insert overwrite table buckettestoutput1
0: jdbc:hive2://localhost:1 ; select * from buckettestinput where data like ' first%';
INFO : Number of reduce tasks determined at compile time: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapred.reduce.tasks=<number>
INFO : Job running in-process (local Hadoop)
INFO : 2015-06-01 11:09:29,650 Stage-1 map = 86%, reduce = 100%
INFO : Ended Job = job_local107155352_0001
INFO : Loading data to table default.buckettestoutput1 from file:/user/hive/warehouse/buckettestoutput1/.hive-staging_hive_2015-06-01_11-09-28_166_3109203968904090801-1/-ext-1
INFO : Table default.buckettestoutput1 stats: [numFiles=1, numRows=4, totalSize=52, rawDataSize=48]
No rows affected (1.692 seconds)
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
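To confirm the mismatch reported in Error 10141, it may help to compare the declared bucket count with the files the insert actually wrote. The statements below are only a sketch, assuming the warehouse path that appears in the log above (adjust it to the local setup). The final setting is the workaround the error message itself names; it lets the query run without a bucketed map join and does not repair the table.

{code:sql}
-- Declared bucket count: look for "Num Buckets" in the output (expected: 2).
DESCRIBE FORMATTED buckettestoutput1;

-- Files actually written by the insert overwrite.
-- Path assumed from the "Loading data to table" log line; the bug leaves 1 file here.
dfs -ls file:/user/hive/warehouse/buckettestoutput1;

-- Workaround from the error message: do not fail the query when a
-- bucketed map join cannot be performed.
set hive.enforce.bucketmapjoin=false;
{code}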
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571767#comment-14571767 ]

Yongzhi Chen commented on HIVE-10880:
-------------------------------------

Attached a second patch to fix the test failures.

The bucket number is not respected in insert overwrite.
-------------------------------------------------------

Key: HIVE-10880
URL: https://issues.apache.org/jira/browse/HIVE-10880
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Yongzhi Chen
Assignee: Yongzhi Chen
Priority: Blocker
Attachments: HIVE-10880.1.patch
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569750#comment-14569750 ]

Yongzhi Chen commented on HIVE-10880:
-------------------------------------

The insert overwrite problem occurs for plain table inserts and static partition inserts; dynamic partition inserts work fine. The patch therefore makes a code change similar to what the dynamic partition path already does.

The bucket number is not respected in insert overwrite.
-------------------------------------------------------

Key: HIVE-10880
URL: https://issues.apache.org/jira/browse/HIVE-10880
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Yongzhi Chen
Assignee: Yongzhi Chen
Priority: Blocker
Attachments: HIVE-10880.1.patch
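The distinction drawn in the comment above between static and dynamic partition inserts can be sketched as follows. The partitioned table buckettestpart, its ds column, and the date value are hypothetical and used only for illustration; they do not appear in the issue. Only buckettestinput and the hive.enforce.bucketing setting come from the reproduction steps.

{code:sql}
-- Hypothetical partitioned, bucketed table (not part of the original report).
CREATE TABLE IF NOT EXISTS buckettestpart( data string )
PARTITIONED BY (ds string)
CLUSTERED BY(data) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

set hive.enforce.bucketing = true;

-- Static partition insert: the partition value is fixed in the statement.
-- Per the comment, this path (like a plain table insert) shows the bug.
insert overwrite table buckettestpart partition (ds='2015-06-01')
select data from buckettestinput where data like 'first%';

-- Dynamic partition insert: the partition value comes from the last select column.
-- Per the comment, this path works fine.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table buckettestpart partition (ds)
select data, '2015-06-01' as ds from buckettestinput where data like 'first%';
{code}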
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570078#comment-14570078 ]

Hive QA commented on HIVE-10880:
--------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12737007/HIVE-10880.1.patch

{color:red}ERROR:{color} -1 due to 20 failed/errored test(s), 8991 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete_own_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update_own_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_no_match
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1_23
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1_23
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_where_no_match
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_bucket5
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_where_no_match
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_where_no_match
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_2
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_1_23
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_sort_skew_1_23
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join_nullsafe
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4147/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4147/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4147/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 20 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12737007 - PreCommit-HIVE-TRUNK-Build

The bucket number is not respected in insert overwrite.
-------------------------------------------------------

Key: HIVE-10880
URL: https://issues.apache.org/jira/browse/HIVE-10880
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Yongzhi Chen
Assignee: Yongzhi Chen
Priority: Blocker
Attachments: HIVE-10880.1.patch
[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.
[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567438#comment-14567438 ]

Mostafa Mokhtar commented on HIVE-10880:
----------------------------------------

[~ekoifman]

The bucket number is not respected in insert overwrite.
-------------------------------------------------------

Key: HIVE-10880
URL: https://issues.apache.org/jira/browse/HIVE-10880
Project: Hive
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Yongzhi Chen
Priority: Blocker