[
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525618#comment-16525618
]
Rohini Palaniswamy commented on PIG-5255:
-----------------------------------------
Please refer to my bytecode patch (https://reviews.apache.org/r/61265/diff/8#3)
to shade and use the newest version of the roaringbitmap jar in Pig so that we
don't have conflicts with the older version shipped by Tez.
> Improvements to bloom join
> --------------------------
>
> Key: PIG-5255
> URL: https://issues.apache.org/jira/browse/PIG-5255
> Project: Pig
> Issue Type: New Feature
> Reporter: Rohini Palaniswamy
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Fix For: 0.18.0
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right
> outer join with smaller dataset on the right. Replicate join only supports
> left outer join.
> 3) Write our own bloom filter implementation for Murmur3 and for Murmur3 with
> the Kirsch & Mitzenmacher optimization, which Cassandra uses
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html).
> Currently we use Hadoop's bloom filter implementation, which only has Jenkins
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitMap for
> - Speed and better compression
> - Scale
> Currently bloom join does not scale to billions of keys. Those cases need
> very large bloom filters, and the cost of broadcasting them is greater than
> the actual data size. For example: a join of 32B records (4TB of data) with
> 4 billion records, with keys being mostly unique. Let's say we construct 61
> partitioned bloom filters of 3MB each (still not a large enough bit vector
> for that many keys); together they are close to 200MB. If we broadcast 200MB
> to 30K tasks, it becomes 6TB, which is higher than the actual data size. In
> practice, the broadcast would only be downloaded once per node. Even
> considering that, in a 6K-node cluster the amount of data transfer would be
> around 1.2TB. Using RoaringBitMap should make a big difference in this case.
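
For readers unfamiliar with the Kirsch & Mitzenmacher optimization mentioned
in item 3, here is a minimal sketch of the idea. It is not the code from the
patch. The class name KmBloomSketch and its methods are hypothetical, the
mixing function is only modelled on the MurmurHash3 64-bit finalizer (a
stand-in for a full Murmur3 implementation), and java.util.BitSet is a
stand-in for whatever bitmap the real implementation uses. The point is that
all k probe positions are derived from two base hashes as h1 + i * h2, rather
than computing k independent hashes.

```java
import java.util.BitSet;

/** Illustrative Bloom filter using Kirsch-Mitzenmacher double hashing. */
class KmBloomSketch {
    private final BitSet bits;    // assumption: plain BitSet, not a compressed bitmap
    private final int numBits;    // size m of the bit vector
    private final int numHashes;  // number of probes k per key

    KmBloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** 64-bit mixer modelled on the MurmurHash3 finalizer; stand-in for Murmur3. */
    static long mix64(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** i-th probe position: (h1 + i * h2) mod m, with h1 and h2 taken from one 64-bit hash. */
    private int index(long key, int i) {
        long h = mix64(key);
        int h1 = (int) h;                // low 32 bits
        int h2 = (int) (h >>> 32) | 1;   // high 32 bits, forced odd so the stride is never zero
        long combined = (long) h1 + (long) i * h2;
        return (int) Math.floorMod(combined, (long) numBits);
    }

    void add(long key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** May return a false positive, but never a false negative. */
    boolean mightContain(long key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Since only one base hash is computed per key, the cost per lookup stays
roughly constant as k grows, which is the appeal of this scheme for large
filters.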
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)