[ https://issues.apache.org/jira/browse/HIVE-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adesh Kumar Rao updated HIVE-16498:
-----------------------------------
    Status: Open  (was: Patch Available)

> [Tez] ReduceRecordProcessor has no check to see if all the operators are done or not and is reading complete data
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16498
>                 URL: https://issues.apache.org/jira/browse/HIVE-16498
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Adesh Kumar Rao
>             Fix For: 1.3.0, 1.2.0
>
>         Attachments: HIVE-16498.1.patch, HIVE-16498-branch-1.patch
>
>
> ReduceRecordProcessor does not check whether the reducer (Operator) is done,
> so it keeps reading input that is no longer needed.
> This can be reproduced with a reduce-side join.
> The data for large_table is generated by the following shell script, and a
> table can be created from the resulting file `large.txt`:
> {code:java}
> for (( j=1; j <= 20; j++ ))
> do
>   for (( i=1; i <= 1000000; i++ ))
>   do
>     echo "$i,$j" >> large.txt
>   done
> done
> {code}
> {code:java}
> create external table large_table (i int, j int)
>   row format delimited fields terminated by ','
>   location "hdfs://<some-hdfs-location>";
> -- Disable MapJoin conversion so that a reduce-side join is used
> set hive.auto.convert.join=false;
> select * from large_table a join large_table b on a.j = b.j limit 100;
> {code}
> Because there is no such check, the join query above keeps reading all the
> data from the table even after the limit of 100 rows has been satisfied, and
> it does not appear to finish in a reasonable time compared to MR or even Tez
> with MapJoin enabled.
> For reference, the same query takes around 5-6 minutes on MR and 2-3 minutes
> with MapJoin on Tez.
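
The behaviour described in the report can be illustrated with a small, self-contained Java sketch. It is not the attached patch and does not use Hive's real classes: `EarlyExitSketch`, `LimitedSink`, `isDone` and `run` are hypothetical names chosen for illustration. The sketch only shows why a record loop that never asks the downstream operator whether it is done will consume the whole input, while a loop that checks can stop once a limit such as `limit 100` has been reached.

```java
import java.util.Iterator;
import java.util.stream.IntStream;

public class EarlyExitSketch {

    /** Hypothetical stand-in for an operator with a row limit (not Hive's Operator API). */
    static class LimitedSink {
        private final int limit;
        private int emitted = 0;

        LimitedSink(int limit) {
            this.limit = limit;
        }

        /** Accepts a row; rows arriving after the limit is reached are ignored. */
        void process(int row) {
            if (emitted < limit) {
                emitted++;
            }
        }

        /** True once the sink has emitted as many rows as it needs. */
        boolean isDone() {
            return emitted >= limit;
        }
    }

    /**
     * Feeds rows from the input into the sink and returns how many rows were read.
     * When checkDone is true, the loop stops as soon as the sink reports it is done.
     */
    static long run(Iterator<Integer> input, LimitedSink sink, boolean checkDone) {
        long rowsRead = 0;
        while (input.hasNext()) {
            if (checkDone && sink.isDone()) {
                break; // nothing downstream needs more rows
            }
            sink.process(input.next());
            rowsRead++;
        }
        return rowsRead;
    }

    public static void main(String[] args) {
        long withoutCheck = run(IntStream.range(0, 1_000_000).iterator(), new LimitedSink(100), false);
        long withCheck = run(IntStream.range(0, 1_000_000).iterator(), new LimitedSink(100), true);
        System.out.println("rows read without done-check: " + withoutCheck);
        System.out.println("rows read with done-check:    " + withCheck);
    }
}
```

In this sketch the loop without the check reads every input row even though the sink stopped accepting rows after the first 100, which mirrors the symptom in the report; how the real processor should detect that all of its operators are done is left to the patch itself.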