[ https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Satish Subhashrao Saley updated PIG-5342:
-----------------------------------------
    Attachment:     (was: PIG-5342-7.patch)

> Add setting to turn off bloom join combiner
> -------------------------------------------
>
>                 Key: PIG-5342
>                 URL: https://issues.apache.org/jira/browse/PIG-5342
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>         Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch,
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch, PIG-5342-7.patch,
> PIG-5342-8.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values
> were the join key. Combining involved doing a distinct on the bag of values,
> which has memory issues for more than 10 million records. That needs to be
> flipped and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
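
For illustration, points 1) and 3) of the description might look like the following in a Pig script. This is only a sketch: the property name pig.bloomjoin.nocombiner comes from the issue, but the value `true` is an assumption about how the setting would be enabled, and the relation names, schemas, and paths are hypothetical.

```pig
-- Property name is taken from the issue description; the value 'true'
-- is an assumption about how the proposed setting would be switched on.
SET pig.bloomjoin.nocombiner true;

-- Hypothetical inputs: 'big' is the large relation, 'small' the smaller one.
big   = LOAD 'big_input'   AS (id:long, payload:chararray);
small = LOAD 'small_input' AS (id:long, label:chararray);

-- Right outer bloom join with the smaller relation on the right,
-- the case point 3) suggests calling out in the documentation.
joined = JOIN big BY id RIGHT OUTER, small BY id USING 'bloom';

STORE joined INTO 'joined_output';
```

Turning the combiner off would presumably only pay off when the join keys are mostly unique, as point 1) notes; with many duplicate keys the combiner still reduces the data shuffled to build the filter.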