[jira] Resolved: (HIVE-428) Implement Map-side Hash-Join in Hive

He Yongqiang (JIRA) Sat, 18 Apr 2009 04:39:41 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


He Yongqiang resolved HIVE-428.
-------------------------------

      Resolution: Duplicate
    Release Note: Duplicate of hive-195

Closing this issue as a duplicate of HIVE-195.

> Implement Map-side Hash-Join in Hive
> ------------------------------------
>
>                 Key: HIVE-428
>                 URL: https://issues.apache.org/jira/browse/HIVE-428
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>
> There are many situations in which a join performs much better if a map-side 
> hash join is used. In a small test with a simple equi-join of two tables, a 
> plain MR join without a map-side hash join took about 50 seconds on a 
> 6-node cluster (each node 8 cores, 4 GB memory). With the map-side hash join 
> applied, it needed only about 15 seconds.
> The map-side hash join can only be used when there are small files that can 
> be replicated to each map. The map-side hash join can be executed together 
> with the map-side filter.
> For example, 
> select A.a, A.c, B.b from A,B where A.a=B.d and A.a < 12 and B.b=10
> In our experiment, this statement can be translated into three different 
> plans if both A and B are plain data files (with no special compression).
> Plan 1
> Map-Reduce
> Both A and B are inputs to the map. The shuffle data involved is very large.
> Plan 2
> 1) First filter B on B.b into a temp file B1 -- this is a separate map-only job.
> 2) Replicate B1 to each map, which filters A and joins it with B1 in the map.
> No reduce is used.
> Plan 3
> Produce a job in which each mapper filters A (so the mappers are assigned 
> with regard to A only), and replicate B directly to each mapper.
> Before each mapper starts filtering A, it filters B and loads the rows of B 
> that pass into memory. The mapper then starts and performs the join in memory.
> Plan 3 performs better in our experiment because it saves a separate map-only 
> job. But Plan 2 is suitable when B's original file is very 
> large but its filtered file is much smaller.
> This is the basic idea of Map side hash join.
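
As an illustration of Plan 3 above, here is a minimal single-process sketch in
Python. It is not the Hive implementation. The function name
map_side_hash_join, the dict-based rows, and the use of in-memory lists in
place of replicated files are all assumptions made for the example. It filters
B and builds a hash table keyed on B.d first, then streams A through its
filter and probes the table, so no reduce step is involved.

```python
from collections import defaultdict


def map_side_hash_join(a_rows, b_rows):
    """Sketch of Plan 3 for the example query:

    select A.a, A.c, B.b from A,B where A.a=B.d and A.a < 12 and B.b=10

    a_rows and b_rows are iterables of dicts (hypothetical row format).
    """
    # Setup phase, before any row of A is processed:
    # filter the small table B and load the passing rows into a hash table.
    b_by_d = defaultdict(list)
    for b_row in b_rows:
        if b_row["b"] == 10:
            b_by_d[b_row["d"]].append(b_row)

    # Map phase: filter each row of A, then probe the in-memory table.
    for a_row in a_rows:
        if a_row["a"] < 12:
            for b_row in b_by_d.get(a_row["a"], ()):
                yield (a_row["a"], a_row["c"], b_row["b"])
```

In a real job each mapper would receive its own split of A and a full copy of
B, so the hash table built from B has to fit in the memory of one mapper.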

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
