Rohini Palaniswamy created PIG-4963:
---------------------------------------

             Summary: Add a Bloom join
                 Key: PIG-4963
                 URL: https://issues.apache.org/jira/browse/PIG-4963
             Project: Pig
          Issue Type: New Feature
            Reporter: Rohini Palaniswamy


In PIG-4925, I added an option to pass a BloomFilter as a scalar to the bloom 
function. But actually using it on big data that required a huge vector size 
turned out to be very inefficient and led to OOM.
   I had initially calculated that a vector size of 100 million would need a 
bytearray of around 12MB ((100000000 + 7) / 8 = 12500000 bytes), and that this 
would be the scalar value broadcast, so it would not take much space. The 
problem is that the 12MB was written out for every input record by 
BuildBloom$Initial, before the aggregation happens and we arrive at the final 
BloomFilter vector. With POPartialAgg, this runs into OOM issues. 
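
To make the arithmetic above concrete, here is a small Python sketch of the 
size calculation. The function names and the record count of 1000 are 
hypothetical, chosen only for illustration. The sketch assumes, as described 
above, that one full-size vector is emitted per input record before 
aggregation.

```python
def bloom_bytes(vector_size_bits):
    """Bytes needed to hold a bit vector of the given size, rounded up."""
    return (vector_size_bits + 7) // 8


def pre_aggregation_bytes(vector_size_bits, num_records):
    """Total bytes emitted if every input record writes out a full vector.

    Assumption for illustration: one full-size vector per record, which is
    the behaviour described for the initial stage above.
    """
    return bloom_bytes(vector_size_bits) * num_records


one_vector = bloom_bytes(100_000_000)                       # 12500000 bytes
thousand_records = pre_aggregation_bytes(100_000_000, 1000) # 12500000000 bytes
```

Even a hypothetical 1000 input records would mean roughly 12.5GB of 
intermediate vectors, which is why the single 12MB estimate was misleading.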

If we added a bloom join implementation that can be combined with hash or 
skewed join, it would boost performance for a lot of jobs. The bloom filter of 
the smaller tables can be sent to the bigger tables as a scalar, and the data 
filtered before the hash or skewed join is applied.
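
A minimal single-process sketch of the idea in Python, not Pig code and not 
the proposed implementation: the BloomFilter class, the bloom_join function, 
and the default sizes are all hypothetical. It builds a bloom filter from the 
smaller input's keys, drops rows of the bigger input whose keys cannot match, 
and then runs an ordinary hash join on the survivors.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Illustrative bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))


def bloom_join(small, big, num_bits=1024, num_hashes=3):
    """Join two lists of (key, value) pairs, pre-filtering `big` by bloom filter."""
    bloom = BloomFilter(num_bits, num_hashes)
    table = defaultdict(list)
    for key, value in small:
        bloom.add(key)
        table[key].append(value)

    # Filter the bigger input first; in a distributed job this is the step
    # that would shrink the data before the hash or skewed join.
    survivors = [(k, v) for k, v in big if bloom.might_contain(k)]

    # Ordinary hash join on the survivors; false positives are dropped here.
    return [(k, sv, bv) for k, bv in survivors for sv in table.get(k, [])]


small = [(1, "a"), (2, "b")]
big = [(1, "x"), (3, "y"), (2, "z"), (1, "w")]
joined = bloom_join(small, big)
```

Because a bloom filter never yields false negatives, the join result is the 
same as without the filter; the filter only reduces how many rows of the 
bigger input reach the join.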



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
