[ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249310#comment-13249310
 ] 

Ashutosh Chauhan commented on HIVE-1721:
----------------------------------------

@Alex,
Reading the previous comments on jira, this is proposed to work as follows:
* Create a local task and launch it on client machine, building a bloom filter 
on medium-sized table. (~200MB)
* Create a Common Join MR job and launch it on cluster. Also, ship the bloom 
filter built in previous step to all the mapper nodes (via Distributed Cache).
* In Mapper, look-up key of every row of large table in bloom filter. If it 
exists, then send that row to reducer, else filter it out.
* In reducer, do the cross-product of rows of different table for a given key 
to get your joined output. 

As outlined above, it will be a win since you will be shuffling much less data 
from mappers to reducers. Though assumptions are cost of building bloom filter 
on client machine is small, there is huge difference in sizes of two tables and 
the join key is highly selective. One or more of these assumptions may be wrong 
in which case there might be a performance loss. So, there is a trade-off when 
to use this.

I don't know if there exists a way to compute bloom filter in distributed 
fashion. If there is such a way, then you can do the step 1 through a MR job 
(instead of locally) and on a much larger table and then launch second MR job 
to do step 2 & 3. Again, there will be trade-offs here. 

                
> use bloom filters to improve the performance of joins
> -----------------------------------------------------
>
>                 Key: HIVE-1721
>                 URL: https://issues.apache.org/jira/browse/HIVE-1721
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>              Labels: gsoc, gsoc2012, optimization
>
> In case of map-joins, it is likely that the big table will not find many 
> matching rows from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which 
> can be pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in 
> the small table.
> Each element from the big table is first searched in the bloom filter, and 
> only in case of a positive match,
> the small table hash table is explored.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to