[jira] [Commented] (PIG-4963) Add a Bloom join

Rohini Palaniswamy (JIRA) Wed, 25 Jan 2017 20:41:19 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839173#comment-15839173
 ]


Rohini Palaniswamy commented on PIG-4963:
-----------------------------------------

Will address 1. For 3, I did a quick run of the Join tests, converting them to 
use bloom, and they were fine except for full outer join, which is not 
supported. The tests added for bloom join actually cover all cases in the Join 
group and in fact cover a lot more - tuple keys and more datatypes for keys, 
and more cases for union and split. They also use studentnulltab10k, which 
tests null cases better. The self join case is covered in multiquery.conf.

bq. But I feel it is clearer if the plan shows a filter + regular local 
rearrange. The execution plan of the latter is more understandable.
  I think it is unnecessary overhead to add a separate filter operator just for 
readability. The current Filter operator, which executes a plan for filtering, 
has no relation to the BloomFilter way of filtering, and it does not logically 
make sense to extend it for BloomFilter. The current approach is simpler and 
cleaner in terms of implementation and should also be faster in execution, as 
there is no unnecessary overhead.

> Add a Bloom join
> ----------------
>
>                 Key: PIG-4963
>                 URL: https://issues.apache.org/jira/browse/PIG-4963
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch, 
> PIG-4963-4.patch
>
>
> PIG-4925 added an option to pass a BloomFilter as a scalar to the bloom 
> function. But actually using it for big data that required a huge vector size 
> turned out to be very inefficient and led to OOM.
>    I had initially calculated that it would take around a 12MB bytearray for 
> a 100 million vectorsize ((100000000 + 7) / 8 = 12500000 bytes), and that 
> this would be the scalar value broadcast and would not take much space. But 
> the problem is that 12MB was written out for every input record with 
> BuildBloom$Initial before the aggregation happens and we arrive at the final 
> BloomFilter vector. And with POPartialAgg it runs into OOM issues. 
> If we added a bloom join implementation, which can be combined with hash or 
> skewed join, it would boost performance for a lot of jobs. The Bloom filter 
> of the smaller tables can be sent to the bigger tables as a scalar, and the 
> data filtered before the hash or skewed join is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
