[ https://issues.apache.org/jira/browse/HIVE-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Yongqiang resolved HIVE-428.
-------------------------------

      Resolution: Duplicate
    Release Note: Duplicate of hive-195

Close this issue, duplicate of hive-195.

> Implement Map-side Hash-Join in Hive
> ------------------------------------
>
>                 Key: HIVE-428
>                 URL: https://issues.apache.org/jira/browse/HIVE-428
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>
> There are many situations in which a join performs much better if a map-side
> hash join is used. We ran a small test with a simple equi-join of two tables:
> a plain MR join without a map-side hash join takes about 50 seconds on a
> 6-node cluster (each node has 8 cores and 4 GB of memory). With the map-side
> hash join applied, it needs only about 15 seconds.
> The map-side hash join can only be used when there are small files that can
> be replicated to each mapper. The map-side hash join can be executed together
> with the map-side filter.
> For example,
> select A.a, A.c, B.b from A,B where A.a=B.d and A.a < 12 and B.b=10
> In our experiment, this statement can be translated into three different
> plans if both A and B are plain data files (with no special compression).
> Plan 1
> Map-Reduce
> Both A and B are input for the map. The shuffle data involved is very large.
> Plan 2
> 1) First filter B on B.b into a temp file B1. This is a separate map-only job.
> 2) Replicate B1 to each mapper, which filters A and joins it with B1 in the
> map. No reduce is used.
> Plan 3
> Produce a job in which each mapper filters A (so mappers are assigned with
> regard to A only), and replicate B directly to each mapper.
> Before each mapper starts filtering A, it filters B and loads the rows of B
> that pass the filter into memory. The mapper then starts and performs the
> join in memory.
> Plan 3 performs better in our experiment because it saves a separate map-only
> job. But Plan 2 is suitable for the situation where B's original file is very
> large but its filtered file is much smaller.
> This is the basic idea of the map-side hash join.
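
For illustration only, Plan 3 from the description above might be sketched in
plain Python as follows. This is not Hive code: the function names
(build_hash_table, map_side_hash_join), the dict-based row format, and the
sample rows are all assumptions made for this sketch. It filters the small
table B, loads the surviving rows into an in-memory hash table keyed on B.d,
then streams A through its own filter and probes the table, so no reduce
phase is needed.

```python
from collections import defaultdict

def build_hash_table(small_rows, key, predicate):
    """Filter the small (replicated) table and index the survivors by join key."""
    table = defaultdict(list)
    for row in small_rows:
        if predicate(row):
            table[row[key]].append(row)
    return table

def map_side_hash_join(big_rows, table, key, predicate):
    """Stream the large table, apply its filter, and probe the in-memory table."""
    for row in big_rows:
        if not predicate(row):
            continue
        # .get() avoids inserting empty entries into the defaultdict on a miss.
        for match in table.get(row[key], ()):
            yield (row["a"], row["c"], match["b"])

# Hypothetical sample data for tables A and B.
A = [
    {"a": 1, "c": "x"},
    {"a": 5, "c": "y"},
    {"a": 20, "c": "z"},
    {"a": 7, "c": "w"},
]
B = [
    {"d": 1, "b": 10},
    {"d": 5, "b": 11},
    {"d": 20, "b": 10},
    {"d": 7, "b": 10},
]

# select A.a, A.c, B.b from A,B where A.a=B.d and A.a < 12 and B.b=10
b_table = build_hash_table(B, key="d", predicate=lambda r: r["b"] == 10)
result = list(map_side_hash_join(A, b_table, key="a",
                                 predicate=lambda r: r["a"] < 12))
print(result)
```

In a real job, each mapper would build the table once from its replicated
copy of B before processing its split of A. Under Plan 2, the same probe step
would apply, with the pre-filtered temp file B1 loaded in place of B.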