[ https://issues.apache.org/jira/browse/TEZ-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606828#comment-17606828 ]
Jonas Shaw commented on TEZ-4402: --------------------------------- Same hive and tez version, we got same incorrect result. The bug happen on certain row count in two full join table. > failed to do FULL JOIN > ----------------------- > > Key: TEZ-4402 > URL: https://issues.apache.org/jira/browse/TEZ-4402 > Project: Apache Tez > Issue Type: Bug > Environment: EMR (hive 3.1.2 + Tez 0.10.1) > Data stored in S3 > > Reporter: Youjun Yuan > Priority: Major > > we hit a query which FULL JOINs two tables, hive produces incorrect results, > for a single value of join key, it produces two records, each record has a > valid value for one table and NULL for the other table. > The query is: > > {code:java} > SELECT d.id, u.id > FROM ( > SELECT id > FROM airflow.tableA rud > WHERE rud.dt = '2022-04-02-1row' > ) d > FULL JOIN ( > SELECT id > FROM default.tableB > WHERE dt = '2022-04-01' and device_token='blabla' > ) u > ON u.id = d.id > ; > {code} > And produces two records for id=350570497 > {code:java} > 350570497 NULL > NULL 350570497 > Time taken: 62.692 seconds, Fetched: 2 row(s) {code} > I am sure tableB has only one row where device_token='blabla' > And we tried: > 1, SET mapreduce.job.reduces=1; then it produces right result; > 2, SET hive.execution.engine=mr; then it produces right result; > 3, *JOIN* (instead of FULL JOIN) worked as expected > 4, in sub query u, change filter _device_token='blabla'_ to > {_}id=350570497{_}, it worked ok > Below is the explain output of the query: > {code:java} > Plan optimized by CBO.Vertex dependency in root stage > Reducer 3 <- Map 1 (CUSTOM_SIMPLE_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE)Stage-0 > Fetch Operator > limit:-1 > Stage-1 > Reducer 3 > File Output Operator [FS_10] > Map Join Operator [MAPJOIN_13] (rows=2 width=8) > > Conds:RS_6.KEY.reducesinkkey0=RS_7.KEY.reducesinkkey0(Outer),DynamicPartitionHashJoin:true,Output:["_col0","_col1"] > <-Map 1 [CUSTOM_SIMPLE_EDGE] > PARTITION_ONLY_SHUFFLE [RS_6] > PartitionCols:_col0 > Select Operator [SEL_2] (rows=1 width=4) > Output:["_col0"] > TableScan [TS_0] (rows=1 width=4) > > airflow@rds_users_delta,rud,Tbl:COMPLETE,Col:COMPLETE,Output:["id"] > <-Map 2 [CUSTOM_SIMPLE_EDGE] > PARTITION_ONLY_SHUFFLE [RS_7] > PartitionCols:_col0 > Select Operator [SEL_5] (rows=1 width=4) > Output:["_col0"] > Filter Operator [FIL_12] (rows=1 width=110) > predicate:(device_token = 'blabla') > TableScan [TS_3] (rows=215192362 width=109) > > default@users,users,Tbl:COMPLETE,Col:COMPLETE,Output:["id","device_token"] > {code} > I can't generate a small enough result set to reproduce the issue, I have > minimized the tableA to only 1 row, tableB has ~10m rows, but if I further > reduce the size of tableB, then the issue can't be reproduced. > > any suggestion would be highly appreciated, regarding the root cause of the > issue, how to work around it, or how to reproduce it with small enough > dataset. > -- This message was sent by Atlassian Jira (v8.20.10#820010)