[jira] [Commented] (PIG-5255) Improvements to bloom join

Rohini Palaniswamy (JIRA) Wed, 27 Jun 2018 14:15:24 -0700


    [ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525618#comment-16525618
 ]


Rohini Palaniswamy commented on PIG-5255:
-----------------------------------------

Please refer to my bytecode patch (https://reviews.apache.org/r/61265/diff/8#3) 
to shade and use the newest version of roaringbitmap jar in Pig so that we 
don't have conflicts with the older version shipped by Tez.



> Improvements to bloom join
> --------------------------
>
>                 Key: PIG-5255
>                 URL: https://issues.apache.org/jira/browse/PIG-5255
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>             Fix For: 0.18.0
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
> 3) Write own bloom implementation for Murmur3 and Murmur3 with Kirsch & 
> Mitzenmacher optimization which Cassandra uses 
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html). 
> Currently we use Hadoop's bloomfilter implementation which only has Jenkins 
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitMap for
>   - Speed and better compression
>   - Scale
>   Currently bloom join does not scale for billions of keys. Really need large 
> bloom filters in those cases and cost of broadcasting those is greater than 
> actual data size. For eg: Join of 32B records (4TB of data) with 4 billion 
> records with keys being mostly unique. Lets say we construct  61 partitioned 
> bloom filters of 3MB each (still not good enough bit vector size for the 
> amount of keys) it is close to 200MB. If we broadcast 200MB to 30K tasks it 
> becomes 6TB which is higher than the actual data size. In practice broadcast 
> would only download once per node. Even considering that in a 6K nodes 
> cluster the amount of data transfer would be around 1.2TB. Using 
> RoaringBitMap should make a big difference in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (PIG-5255) Improvements to bloom join

Reply via email to