[
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247985#comment-13247985
]
alex gemini commented on HIVE-1721:
-----------------------------------
I'm wondering how we would apply a bloom filter to the big table. We use
map-side join when the small table is < 25 MB. If we build a bloom filter
from the small table, we may be able to raise the small-table limit to 200 MB,
but then in the big table's map stage we need to read the bloom filter, write
the intermediate result back to disk, and then read that intermediate result
again to check it against the real small table, because we still can't hold
the actual small table in memory (correct my logic if I'm wrong). We pay the
cost of writing an intermediate result that is very close to the final result.
In this case we can't increase the number of mappers, because that will double
the I/O penalty. I guess it will only pay off in a three-table join on the
same join key, with one small table and two big ones. In my opinion, other
database systems benefit from bloom filters because they can hold the
intermediate result in memory for further processing (like Oracle) or emit it
immediately (like HBase).
> use bloom filters to improve the performance of joins
> -----------------------------------------------------
>
> Key: HIVE-1721
> URL: https://issues.apache.org/jira/browse/HIVE-1721
> Project: Hive
> Issue Type: New Feature
> Components: Query Processor
> Reporter: Namit Jain
> Labels: gsoc, gsoc2012, optimization
>
> In case of map-joins, it is likely that the big table will not find many
> matching rows from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which
> can be pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in
> the small table.
> Each element from the big table is first searched in the bloom filter, and
> only in case of a positive match,
> the small table hash table is explored.
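
The probe order described in the issue can be sketched roughly as follows.
This is only an illustration, not Hive code: the class name
BloomMapJoinSketch, the tiny hand-rolled Bloom class, the long join keys, and
the filter size are all assumptions made for the example. It also assumes the
small table's hash table still fits in memory next to the filter, which is the
case the issue description covers, not the larger-small-table case raised in
the comment.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: bloom-filter pre-check in front of a map-join hash lookup.
public class BloomMapJoinSketch {

    /** Minimal Bloom filter over long keys (illustrative, not tuned). */
    static final class Bloom {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        Bloom(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // 64-bit mixer (splitmix64 finalizer) to spread the key bits.
        private static long mix(long x) {
            x ^= x >>> 30; x *= 0xbf58476d1ce4e5b9L;
            x ^= x >>> 27; x *= 0x94d049bb133111ebL;
            x ^= x >>> 31;
            return x;
        }

        // Double hashing: the i-th probe position is h1 + i * h2 (mod numBits).
        private int index(long key, int i) {
            long h = mix(key);
            int h1 = (int) h;
            int h2 = (int) (h >>> 32) | 1; // force odd so probes differ
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(long key) {
            for (int i = 0; i < numHashes; i++) bits.set(index(key, i));
        }

        // May return true for an absent key, never false for a present one.
        boolean mightContain(long key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) return false;
            }
            return true;
        }
    }

    /**
     * Joins big-table keys against the small table. The hash table is only
     * consulted when the filter reports a possible match. hashLookups[0]
     * counts how many hash-table lookups were actually performed.
     */
    static List<String> join(Map<Long, String> small, long[] bigKeys,
                             Bloom bloom, int[] hashLookups) {
        List<String> out = new ArrayList<>();
        for (long key : bigKeys) {
            if (!bloom.mightContain(key)) continue; // definite miss, skip lookup
            hashLookups[0]++;
            String value = small.get(key);           // resolves false positives
            if (value != null) out.add(key + ":" + value);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Long, String> small = new HashMap<>();
        small.put(1L, "a");
        small.put(2L, "b");
        Bloom bloom = new Bloom(1024, 3);
        for (long k : small.keySet()) bloom.add(k);
        int[] lookups = new int[1];
        System.out.println(join(small, new long[] {1, 7, 2, 8}, bloom, lookups));
        System.out.println("hash lookups: " + lookups[0]);
    }
}
```

Because the filter has no false negatives, the join result is the same as
with a plain hash lookup per row; the only thing that changes is how many
big-table rows reach the hash table. Whether skipping those lookups is a net
win depends on the match rate and on the cost of the extra hash computations.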