[ https://issues.apache.org/jira/browse/HIVE-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
范宜臻 updated HIVE-23438: ----------------------- Attachment: HIVE-23438.branch-2.3.patch Status: Patch Available (was: Open) > Missing Rows When Left Outer Join In N-way HybridGraceHashJoin > -------------------------------------------------------------- > > Key: HIVE-23438 > URL: https://issues.apache.org/jira/browse/HIVE-23438 > Project: Hive > Issue Type: Bug > Components: SQL, Tez > Affects Versions: 2.3.4 > Reporter: 范宜臻 > Priority: Major > Attachments: HIVE-23438.branch-2.3.patch > > > *Run Test in Patch File* > {code:java} > mvn test -Dtest=TestMiniTezCliDriver -Dqfile=hybridgrace_hashjoin_2.q{code} > *Manual Reproduce* > *STEP 1. Create test data(q_test_init_tez.sql)* > {code:java} > //create table src1 > CREATE TABLE src1 (key STRING COMMENT 'default', value STRING COMMENT > 'default') STORED AS TEXTFILE; > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv3.txt" INTO TABLE src1; > //create table src2 > CREATE TABLE src2(key STRING COMMENT 'default', value STRING COMMENT > 'default') STORED AS TEXTFILE; > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv11.txt" OVERWRITE INTO > TABLE src2; > //create table srcpart > CREATE TABLE srcpart (key STRING COMMENT 'default', value STRING COMMENT > 'default') > PARTITIONED BY (ds STRING, hr STRING) > STORED AS TEXTFILE; > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt" > OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="11"); > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt" > OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="12"); > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt" > OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="11"); > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt" > OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="12");{code} > *STEP 2. Run query* > {code:java} > set hive.auto.convert.join=true; > set hive.auto.convert.join.noconditionaltask=true; > set hive.auto.convert.join.noconditionaltask.size=10000000; > set hive.cbo.enable=false; > set hive.mapjoin.hybridgrace.hashtable=true; > select * > from > ( > select key from src1 group by key > ) x > left join src2 z on x.key = z.key > join > ( > select key from srcpart y group by key > ) y on y.key = x.key; > {code} > *EXPECTED RESULT*** > > {code:java} > 128 NULL NULL 128 > 146 146 1val_1461 146 > 150 150 1val_1501 150 > 238 NULL NULL 238 > 369 NULL NULL 369 > 406 406 1val_4061 406 > 273 273 1val_2731 273 > 98 NULL NULL 98 > 213 213 1val_2131 213 > 255 NULL NULL 255 > 401 401 1val_4011 401 > 278 NULL NULL 278 > 66 66 11val_6611 66 > 224 NULL NULL 224 > 311 NULL NULL 311 > {code} > > *ACTUAL RESULT* > {code:java} > 128 NULL NULL 128 > 146 146 1val_1461 146 > 150 150 1val_1501 150 > 213 213 1val_2131 213 > 238 NULL NULL 238 > 273 273 1val_2731 273 > 369 NULL NULL 369 > 406 406 1val_4061 406 > 98 NULL NULL 98 > 401 401 1val_4011 401 > 66 66 11val_6611 66 > {code} > > *ROOT CAUSE* > src1 left join src2, src1 is big table and src2 is small table. Join result > between big table row and the corresponding hashtable maybe NO_MATCH state, > however, these NO_MATCH rows is needed because LEFT OUTER JOIN. > In addition, these big table rows will not spilled into matchfile related to > this hashtable on disk because only SPILL state can use `spillBigTableRow`. > Then, these big table rows will be spilled into matchfile in hashtables of > table `srcpart`(second small table) > Finally, when reProcessBigTable, big table rows in matchfile are only read > from `firstSmallTable`, some datum are missing. > > *WORKAROUND* > configure firstSmallTable in completeInitializationOp and only spill big > table row into firstSmallTable when spill matchfile. > -- This message was sent by Atlassian Jira (v8.3.4#803005)