[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247985#comment-13247985 ]
alex gemini commented on HIVE-1721:
-----------------------------------

I'm wondering how we would apply a bloom filter to the big table. We use a map-side join when the small table is < 25M. If we build a bloom filter from the small table, we may be able to increase the small table size to 200M. But in the big table's map stage, we need to read the bloom filter, write the intermediate result back to disk, and then read this intermediate result again to check it against the real small table, because we still can't hold the actual small table in memory (correct the logic if I'm wrong). We pay the cost of writing an intermediate result that is very close to the final result. In this case we can't increase the number of mappers, because that would double the I/O penalty. I guess it will only benefit a three-table join on the same join key, with one small table and two big ones. In my opinion, other DB systems can benefit from bloom filters because they can hold the intermediate result in memory for further processing (like Oracle) or print it immediately (like HBase).

> use bloom filters to improve the performance of joins
> -----------------------------------------------------
>
>                 Key: HIVE-1721
>                 URL: https://issues.apache.org/jira/browse/HIVE-1721
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>              Labels: gsoc, gsoc2012, optimization
>
> In case of map-joins, it is likely that the big table will not find many matching rows from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in the small table.
> Each element from the big table is first searched in the bloom filter, and only in case of a positive match,
> the small table hash table is explored.
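
The lookup path described in the issue (bloom filter first, hash table only on a positive match) can be sketched roughly as follows. This is a minimal illustration in Python, not Hive's implementation: the names `BloomFilter`, `map_join`, `num_bits`, and `num_hashes` are all hypothetical, and the sketch assumes the small table's hash table still fits in memory, so it does not cover the case raised in the comment where the small table is too large to hold.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (hypothetical helper): k bit positions per key in an m-bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Double hashing: derive k positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def map_join(big_rows, small_rows):
    """Inner join on the first field of each row.

    Returns (joined_rows, hash_lookups) so the number of hash table
    probes that the filter did not avoid can be inspected.
    """
    small_table = {}
    bloom = BloomFilter()
    for key, value in small_rows:
        small_table.setdefault(key, []).append(value)
        bloom.add(key)

    joined, hash_lookups = [], 0
    for key, value in big_rows:
        if not bloom.might_contain(key):
            continue  # definitely no match: skip the hash table entirely
        hash_lookups += 1  # positive (or false positive): probe the hash table
        for small_value in small_table.get(key, ()):
            joined.append((key, value, small_value))
    return joined, hash_lookups
```

For example, joining a 1000-row big table against a two-row small table with `map_join(big, small)` returns the matching rows together with a count of hash table probes; how many probes are actually avoided depends on the filter size and on how selective the join is.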