[jira] [Commented] (HIVE-22561) Data loss on map join for bucketed, partitioned table
[ https://issues.apache.org/jira/browse/HIVE-22561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405970#comment-17405970 ]

Brahma Reddy Battula commented on HIVE-22561:
---------------------------------------------

Looks like a duplicate of HIVE-22098?

> Data loss on map join for bucketed, partitioned table
> -----------------------------------------------------
>
>                 Key: HIVE-22561
>                 URL: https://issues.apache.org/jira/browse/HIVE-22561
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Aditya Shah
>            Assignee: Aditya Shah
>            Priority: Blocker
>             Fix For: 3.1.0, 3.0.0
>
>         Attachments: HIVE-22561.1.branch-3.1.patch, HIVE-22561.branch-3.1.patch, HIVE-22561.patch, Screenshot 2019-11-28 at 8.45.17 PM.png, image-2019-11-28-20-46-25-432.png
>
>
> A map join on a column (which is involved in neither bucketing nor partitioning) causes data loss.
> Steps to reproduce:
> Env: [hive-dev-box|[https://github.com/kgyrtkirk/hive-dev-box]] hive 3.1.2.
> Create tables:
>
> {code:java}
> CREATE TABLE `testj2`(
>   `id` int,
>   `bn` string,
>   `cn` string,
>   `ad` map,
>   `mi` array)
> PARTITIONED BY (
>   `br` string)
> CLUSTERED BY (
>   bn)
> INTO 2 BUCKETS
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> TBLPROPERTIES (
>   'bucketing_version'='2');
> CREATE TABLE `testj1`(
>   `id` int,
>   `can` string,
>   `cn` string,
>   `ad` map,
>   `av` boolean,
>   `mi` array)
> PARTITIONED BY (
>   `brand` string)
> CLUSTERED BY (
>   can)
> INTO 2 BUCKETS
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> TBLPROPERTIES (
>   'bucketing_version'='2');
> {code}
> Insert some data in both:
> {code:java}
> insert into testj1 values (100, 'mes_1', 'customer_1', map('city1', 560077), false, array(5, 10), 'brand_1'),
> (101, 'mes_2', 'customer_2', map('city2', 560078), true, array(10, 20), 'brand_2'),
> (102, 'mes_3', 'customer_3', map('city3', 560079), false, array(15, 30), 'brand_3'),
> (103, 'mes_4', 'customer_4', map('city4', 560080), true, array(20, 40), 'brand_4'),
> (104, 'mes_5', 'customer_5', map('city5', 560081), false, array(25, 50), 'brand_5');
> insert into table testj2 values (100, 'tv_0', 'customer_0', map('city0', 560076),array(0, 0, 0), 'tv'),
> (101, 'tv_1', 'customer_1', map('city1', 560077),array(20, 25, 30), 'tv'),
> (102, 'tv_2', 'customer_2', map('city2', 560078),array(40, 50, 60), 'tv'),
> (103, 'tv_3', 'customer_3', map('city3', 560079),array(60, 75, 90), 'tv'),
> (104, 'tv_4', 'customer_4', map('city4', 560080),array(80, 100, 120), 'tv');
> {code}
> Do a join between them:
> {code:java}
> select t1.id, t1.can, t1.cn, t2.bn,t2.ad, t2.br FROM testj1 t1 JOIN testj2 t2
> on (t1.id = t2.id) order by t1.id;
> {code}
> Observed results:
> !image-2019-11-28-20-46-25-432.png|width=524,height=100!
> In the plan, I can see a map join. Disabling it gives the correct result.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
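The report closes by noting that the plan contains a map join and that disabling it gives the correct result. A minimal sketch of how that check could be run in a Hive session follows. It assumes the standard `hive.auto.convert.join` setting is the switch that was toggled; the report does not name the property it used. Both sample tables contain ids 100 through 104, so the inner join on `id` is expected to return five rows.

```sql
-- Assumption: the report does not say which property was changed;
-- hive.auto.convert.join is the usual switch for automatic map-join conversion.

-- 1. Inspect the plan. A "Map Join Operator" entry indicates a map join.
EXPLAIN
SELECT t1.id, t1.can, t1.cn, t2.bn, t2.ad, t2.br
FROM testj1 t1 JOIN testj2 t2 ON (t1.id = t2.id)
ORDER BY t1.id;

-- 2. Disable automatic map-join conversion for this session only.
SET hive.auto.convert.join=false;

-- 3. Rerun the query from the report and compare the rows returned
--    with the run where the map join was enabled.
SELECT t1.id, t1.can, t1.cn, t2.bn, t2.ad, t2.br
FROM testj1 t1 JOIN testj2 t2 ON (t1.id = t2.id)
ORDER BY t1.id;
```

Setting the property at session level keeps the workaround local to the affected query rather than changing cluster-wide behaviour.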
[jira] [Commented] (HIVE-22561) Data loss on map join for bucketed, partitioned table
[ https://issues.apache.org/jira/browse/HIVE-22561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996952#comment-16996952 ]

Jesus Camacho Rodriguez commented on HIVE-22561:

[~aditya-shah], I am not sure why it was not triggered... Nevertheless, the patch does not apply cleanly on branch-3.1.

> Data loss on map join for bucketed, partitioned table
> -----------------------------------------------------
>
>                 Key: HIVE-22561
>                 URL: https://issues.apache.org/jira/browse/HIVE-22561
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Aditya Shah
>            Assignee: Aditya Shah
>            Priority: Blocker
>             Fix For: 3.0.0, 3.1.0
>
>         Attachments: HIVE-22561.branch-3.1.patch, HIVE-22561.patch, Screenshot 2019-11-28 at 8.45.17 PM.png, image-2019-11-28-20-46-25-432.png
[jira] [Commented] (HIVE-22561) Data loss on map join for bucketed, partitioned table
[ https://issues.apache.org/jira/browse/HIVE-22561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994365#comment-16994365 ]

Aditya Shah commented on HIVE-22561:

[~jcamachorodriguez] it seems to me that the profile for branch-3.1 does not run even if I submit the patch with that name. Can you please check once and let me know if I'm missing something here?

Thanks,
Aditya

> Data loss on map join for bucketed, partitioned table
> -----------------------------------------------------
>
>                 Key: HIVE-22561
>                 URL: https://issues.apache.org/jira/browse/HIVE-22561
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Aditya Shah
>            Assignee: Aditya Shah
>            Priority: Blocker
>             Fix For: 3.0.0, 3.1.0
>
>         Attachments: HIVE-22561.branch-3.1.patch, HIVE-22561.patch, Screenshot 2019-11-28 at 8.45.17 PM.png, image-2019-11-28-20-46-25-432.png
[jira] [Commented] (HIVE-22561) Data loss on map join for bucketed, partitioned table
[ https://issues.apache.org/jira/browse/HIVE-22561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992113#comment-16992113 ]

Jesus Camacho Rodriguez commented on HIVE-22561:

[~aditya-shah], can you rebase the patch on branch-3 and branch-3.1? It does not apply cleanly. Thanks

> Data loss on map join for bucketed, partitioned table
> -----------------------------------------------------
>
>                 Key: HIVE-22561
>                 URL: https://issues.apache.org/jira/browse/HIVE-22561
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Aditya Shah
>            Assignee: Aditya Shah
>            Priority: Blocker
>             Fix For: 3.0.0, 3.1.0
>
>         Attachments: HIVE-22561.patch, Screenshot 2019-11-28 at 8.45.17 PM.png, image-2019-11-28-20-46-25-432.png
[jira] [Commented] (HIVE-22561) Data loss on map join for bucketed, partitioned table
[ https://issues.apache.org/jira/browse/HIVE-22561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991697#comment-16991697 ]

Hive QA commented on HIVE-22561:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12988312/HIVE-22561.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/19833/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/19833/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-19833/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2019-12-09 15:20:31.922
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-19833/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2019-12-09 15:20:31.925
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at d7a193b HIVE-22598: Fix TestCompactor.testDisableCompactionDuringReplLoad flakyness (Peter Vary reviewed by Zoltan Haindrich)
+ git clean -f -d
Removing standalone-metastore/metastore-server/src/gen/
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at d7a193b HIVE-22598: Fix TestCompactor.testDisableCompactionDuringReplLoad flakyness (Peter Vary reviewed by Zoltan Haindrich)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2019-12-09 15:20:33.143
+ rm -rf ../yetus_PreCommit-HIVE-Build-19833
+ mkdir ../yetus_PreCommit-HIVE-Build-19833
+ git gc
+ cp -R . ../yetus_PreCommit-HIVE-Build-19833
+ mkdir /data/hiveptest/logs/PreCommit-HIVE-Build-19833/yetus
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
error: a/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java: does not exist in index
error: a/ql/src/test/queries/clientpositive/bucket_map_join_tez2.q: does not exist in index
error: a/ql/src/test/results/clientpositive/llap/bucket_map_join_tez2.q.out: does not exist in index
error: a/ql/src/test/results/clientpositive/llap/limit_pushdown.q.out: does not exist in index
error: a/ql/src/test/results/clientpositive/llap/offset_limit_ppd_optimizer.q.out: does not exist in index
error: a/ql/src/test/results/clientpositive/llap/tez_smb_main.q.out: does not exist in index
error: a/ql/src/test/results/clientpositive/spark/bucket_map_join_tez2.q.out: does not exist in index
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java:110
Falling back to three-way merge...
Applied patch to 'ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java' cleanly.
error: patch failed: ql/src/test/queries/clientpositive/bucket_map_join_tez2.q:138
Falling back to three-way merge...
Applied patch to 'ql/src/test/queries/clientpositive/bucket_map_join_tez2.q' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/llap/limit_pushdown.q.out:923
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/llap/limit_pushdown.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/llap/offset_limit_ppd_optimizer.q.out:1317
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/llap/offset_limit_ppd_optimizer.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/llap/tez_smb_main.q.out:592
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/llap/tez_smb_main.q.out'
[jira] [Commented] (HIVE-22561) Data loss on map join for bucketed, partitioned table
[ https://issues.apache.org/jira/browse/HIVE-22561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984499#comment-16984499 ]

Aditya Shah commented on HIVE-22561:

[~djaiswal] [~prasanth_j] [~jcamachorodriguez] Can you please take a look at this? I tried debugging a bit. Some of the observations I made were:
# The mapjoin operator does not populate the hashtable (hybrid as well as normal) completely for each task.
# The results vary with the number of buckets. Is the hashtable distributed in some way according to buckets?

> Data loss on map join for bucketed, partitioned table
> -----------------------------------------------------
>
>                 Key: HIVE-22561
>                 URL: https://issues.apache.org/jira/browse/HIVE-22561
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Aditya Shah
>            Priority: Blocker
>         Attachments: Screenshot 2019-11-28 at 8.45.17 PM.png, image-2019-11-28-20-46-25-432.png
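The two debugging observations in this comment (a hash table that is only partly populated per task, and results that change with the bucket count) can be probed from a Hive session. The sketch below is one hedged way to do that, not the procedure used in the thread. It assumes that a bucket map join on Tez is the conversion involved, which the comment only asks about, and uses the existing `hive.convert.join.bucket.mapjoin.tez` property to switch that conversion off while leaving ordinary map joins enabled.

```sql
-- Rows per underlying bucket file; INPUT__FILE__NAME is a Hive virtual column.
-- This shows how the sample rows are spread over the 2 buckets of each table.
SELECT INPUT__FILE__NAME AS bucket_file, count(*) AS row_count
FROM testj1
GROUP BY INPUT__FILE__NAME;

SELECT INPUT__FILE__NAME AS bucket_file, count(*) AS row_count
FROM testj2
GROUP BY INPUT__FILE__NAME;

-- Baseline: join row count with the session's current settings.
-- Both sample tables hold ids 100 through 104, so 5 rows are expected.
SELECT count(*) AS joined_rows
FROM testj1 t1 JOIN testj2 t2 ON (t1.id = t2.id);

-- Assumption: the loss comes from the bucket map join variant.
-- Keep map joins on, but turn off bucket map join conversion on Tez.
SET hive.auto.convert.join=true;
SET hive.convert.join.bucket.mapjoin.tez=false;

SELECT count(*) AS joined_rows
FROM testj1 t1 JOIN testj2 t2 ON (t1.id = t2.id);
```

If the second count matches the expected five rows while the baseline does not, that would point at the bucket-aware distribution of the hash table rather than at map joins in general.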