[jira] [Updated] (HIVE-917) Bucketed Map Join

Carl Steinbach (JIRA) Thu, 21 Jun 2012 14:53:45 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Carl Steinbach updated HIVE-917:
--------------------------------

    Component/s: Query Processor
         Labels: Bucketing  (was: )
    
> Bucketed Map Join
> -----------------
>
>                 Key: HIVE-917
>                 URL: https://issues.apache.org/jira/browse/HIVE-917
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Zheng Shao
>            Assignee: He Yongqiang
>              Labels: Bucketing
>             Fix For: 0.6.0
>
>         Attachments: hive-917-2010-2-15.patch, hive-917-2010-2-16.patch, 
> hive-917-2010-2-3.patch, hive-917-2010-2-8.patch
>
>
> Hive already has support for map-join. Map-join treats the big table as job 
> input, and in each mapper, it loads all data from a small table.
> In case the big table is already bucketed on the join key, we don't have to 
> load the whole small table in each of the mappers. This will greatly 
> alleviate the memory pressure, and make map-join work with medium-sized 
> tables.
> There are 4 steps we can improve:
> S0. This is what the user can already do now: create a new bucketed table and 
> insert all data from the small table into it; submit NBUCKETS jobs, each doing 
> a map-side join of "bigtable TABLESAMPLE(BUCKET i OUT OF NBUCKETS)" with 
> "smallbucketedtable TABLESAMPLE(BUCKET i OUT OF NBUCKETS)".
> S1. Change the code so that when map-join is loading the small table, we 
> automatically drop the rows whose keys are NOT in the same bucket as the 
> big-table bucket the mapper is reading. This should alleviate the memory 
> problem, but we might still have thousands of mappers reading the whole of 
> the small table.
> S2. Let's say the user has already bucketed the small table on the join key 
> into exactly the same number of buckets (or a factor of the big table's 
> bucket count); then map-join can choose to load only the buckets that are 
> useful.
> S3. Add a new hint (e.g. /*+ MAPBUCKETJOIN(a) */), so that Hive automatically 
> does S2, without asking the user to create a temporary bucketed table for 
> the small table.
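
The bucket matching behind S1 and S2 can be sketched in a few lines of Python. This is an illustration only, not Hive's implementation: bucket_of uses CRC32 as a stand-in for whatever hash Hive applies to the join key, and all function names here are hypothetical. It relies on the fact that when the small table's bucket count divides the big table's bucket count, (h mod n_big) mod n_small equals h mod n_small, so each big-table bucket maps to exactly one small-table bucket.

```python
import zlib


def bucket_of(key, num_buckets):
    """Assign a key to a bucket. CRC32 is a stand-in, NOT Hive's hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def small_bucket_for(big_bucket, n_big, n_small):
    """S2: which small-table bucket a mapper on big_bucket must load.

    Assumes n_small is a factor of n_big, as in the issue description.
    """
    if n_big % n_small != 0:
        raise ValueError("n_small must be a factor of n_big")
    return big_bucket % n_small


def filter_small_rows(small_rows, big_bucket, n_big):
    """S1: scan the whole small table, keep only rows in the mapper's bucket."""
    return [(k, v) for k, v in small_rows if bucket_of(k, n_big) == big_bucket]


def load_small_bucket(small_rows, big_bucket, n_big, n_small):
    """S2: keep only the one small-table bucket that can contain matches."""
    target = small_bucket_for(big_bucket, n_big, n_small)
    return [(k, v) for k, v in small_rows if bucket_of(k, n_small) == target]
```

In this sketch S1 still reads every small-table row before discarding most of them, while S2 touches only one bucket; every row that S1 keeps for a given big-table bucket is also present in the bucket that S2 loads.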

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
