[ https://issues.apache.org/jira/browse/HIVE-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl Steinbach updated HIVE-917: -------------------------------- Component/s: Query Processor Labels: Bucketing (was: ) > Bucketed Map Join > ----------------- > > Key: HIVE-917 > URL: https://issues.apache.org/jira/browse/HIVE-917 > Project: Hive > Issue Type: New Feature > Components: Query Processor > Reporter: Zheng Shao > Assignee: He Yongqiang > Labels: Bucketing > Fix For: 0.6.0 > > Attachments: hive-917-2010-2-15.patch, hive-917-2010-2-16.patch, > hive-917-2010-2-3.patch, hive-917-2010-2-8.patch > > > Hive already have support for map-join. Map-join treats the big table as job > input, and in each mapper, it loads all data from a small table. > In case the big table is already bucketed on the join key, we don't have to > load the whole small table in each of the mappers. This will greatly > alleviate the memory pressure, and make map-join work with medium-sized > tables. > There are 4 steps we can improve: > S0. This is what the user can already do now: create a new bucketed table and > insert all data from the small table to it; Submit BUCKETNUM jobs, each doing > a map-side join of "bigtable TABLEPARTITION(BUCKET i OUT OF NBUCKETS)" with > "smallbucketedtable TABLEPARTITION(BUCKET i OUT OF NBUCKETS)". > S1. Change the code so that when map-join is loading the small table, we > automatically drop the rows with the keys that are NOT in the same bucket as > the big table. This should alleviate the problem on memory, but we might > still have thousands of mappers reading the whole of the small table. > S2. Let's say the user already bucketed the small table on the join key into > exactly the same number of buckets (or a factor of the buckets of the big > table), then map-join can choose to load only the buckets that are useful. > S3. Add a new hint (e.g. /*+ MAPBUCKETJOIN(a) */), so that Hive automatically > does S2, without the need of asking the user to create temporary bucketed > table for the small table. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira