[ https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-5255: ------------------------------------ Attachment: PIG-5255-4.patch > Improvements to bloom join > -------------------------- > > Key: PIG-5255 > URL: https://issues.apache.org/jira/browse/PIG-5255 > Project: Pig > Issue Type: New Feature > Reporter: Rohini Palaniswamy > Assignee: Satish Subhashrao Saley > Priority: Major > Fix For: 0.18.0 > > Attachments: PIG-5255-4.patch > > > 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom > join. When the keys are all unique, the combiner is unnecessary overhead. > 2) Mention in documentation that bloom join is also ideal in cases of right > outer join with smaller dataset on the right. Replicate join only supports > left outer join. > 3) Write own bloom implementation for Murmur3 and Murmur3 with Kirsch & > Mitzenmacher optimization which Cassandra uses > (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html). > Currently we use Hadoop's bloomfilter implementation which only has Jenkins > and Murmur2. Murmur3 is faster and offers better distribution. > 4) Move from BitSet to RoaringBitMap for > - Speed and better compression > - Scale > Currently bloom join does not scale for billions of keys. Really need large > bloom filters in those cases and cost of broadcasting those is greater than > actual data size. For eg: Join of 32B records (4TB of data) with 4 billion > records with keys being mostly unique. Lets say we construct 61 partitioned > bloom filters of 3MB each (still not good enough bit vector size for the > amount of keys) it is close to 200MB. If we broadcast 200MB to 30K tasks it > becomes 6TB which is higher than the actual data size. In practice broadcast > would only download once per node. Even considering that in a 6K nodes > cluster the amount of data transfer would be around 1.2TB. Using > RoaringBitMap should make a big difference in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)