[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249310#comment-13249310 ]
Ashutosh Chauhan commented on HIVE-1721: ---------------------------------------- @Alex, Reading the previous comments on jira, this is proposed to work as follows: * Create a local task and launch it on client machine, building a bloom filter on medium-sized table. (~200MB) * Create a Common Join MR job and launch it on cluster. Also, ship the bloom filter built in previous step to all the mapper nodes (via Distributed Cache). * In Mapper, look-up key of every row of large table in bloom filter. If it exists, then send that row to reducer, else filter it out. * In reducer, do the cross-product of rows of different table for a given key to get your joined output. As outlined above, it will be a win since you will be shuffling much less data from mappers to reducers. Though assumptions are cost of building bloom filter on client machine is small, there is huge difference in sizes of two tables and the join key is highly selective. One or more of these assumptions may be wrong in which case there might be a performance loss. So, there is a trade-off when to use this. I don't know if there exists a way to compute bloom filter in distributed fashion. If there is such a way, then you can do the step 1 through a MR job (instead of locally) and on a much larger table and then launch second MR job to do step 2 & 3. Again, there will be trade-offs here. > use bloom filters to improve the performance of joins > ----------------------------------------------------- > > Key: HIVE-1721 > URL: https://issues.apache.org/jira/browse/HIVE-1721 > Project: Hive > Issue Type: New Feature > Components: Query Processor > Reporter: Namit Jain > Labels: gsoc, gsoc2012, optimization > > In case of map-joins, it is likely that the big table will not find many > matching rows from the small table. > Currently, we perform a hash-map lookup for every row in the big table, which > can be pretty expensive. > It might be useful to try out a bloom-filter containing all the elements in > the small table. > Each element from the big table is first searched in the bloom filter, and > only in case of a positive match, > the small table hash table is explored. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira