[ https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Satish Subhashrao Saley updated PIG-5342:
-----------------------------------------
Attachment: PIG-5342-3.patch
> Add setting to turn off bloom join combiner
> -------------------------------------------
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were
> the join key. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>
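A minimal Python sketch of the difference described in item 2, for illustration only. It is not Pig's implementation: the function names, NUM_FILTERS, and the modulo-based filter_index are hypothetical, and the flipped layout is assumed to put the join key in the sort key so that duplicates arrive next to each other.

```python
NUM_FILTERS = 4  # hypothetical number of bloom filter partitions


def filter_index(join_key):
    """Hypothetical mapping from an integer join key to a bloom filter index."""
    return join_key % NUM_FILTERS


def combine_bag_per_index(records):
    """Old layout: key = filter index, value = join key.

    Deduplicating means holding every distinct join key for an index
    in memory at once, which is what stops scaling.
    """
    bags = {}
    for index, join_key in records:
        bags.setdefault(index, set()).add(join_key)
    return bags


def combine_distinct(sorted_records):
    """Flipped layout: (join key, filter index) records arrive sorted.

    Duplicates are adjacent, so only the previous record is remembered.
    """
    previous = None
    for record in sorted_records:
        if record != previous:
            yield record
        previous = record


join_keys = [7, 3, 7, 10, 3, 7]
old = combine_bag_per_index((filter_index(k), k) for k in join_keys)
flipped = list(combine_distinct(sorted((k, filter_index(k)) for k in join_keys)))

# Item 1: with all-unique keys the combiner removes nothing.
unique_keys = [1, 2, 3, 4]
unique_out = list(combine_distinct(sorted((k, filter_index(k)) for k in unique_keys)))
```

With the sample keys, the old layout builds one in-memory set per filter index, while the flipped layout keeps only one record of state. For the all-unique input the combiner returns as many records as it received, which is the overhead a setting like pig.bloomjoin.nocombiner would avoid.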
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)