[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-04 Thread Rohini Palaniswamy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638961#comment-16638961
 ] 

Rohini Palaniswamy commented on PIG-5342:
-

Committed the missing file

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch,
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch, PIG-5342-7.patch,
> PIG-5342-8.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were
> the join key. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>  
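
Item 2 of the description contrasts two key/value layouts. The sketch below is only an illustration of why the flipped layout scales better; it uses plain Java collections rather than Pig or Tez types, and the class name, method names, and the hash-based filterIndex helper are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: plain Java collections stand in for Pig/Tez shuffle types.
final class BloomKeyLayoutSketch {

    // Hypothetical helper: picks which of numFilters bloom filters a join key feeds.
    static int filterIndex(String joinKey, int numFilters) {
        return Math.floorMod(joinKey.hashCode(), numFilters);
    }

    // Old layout: key = bloom filter index, value = join key.
    // A distinct per index must hold every unique join key for that index at once.
    static Map<Integer, Set<String>> distinctPerFilterIndex(List<String> joinKeys, int numFilters) {
        Map<Integer, Set<String>> bags = new HashMap<>();
        for (String key : joinKeys) {
            bags.computeIfAbsent(filterIndex(key, numFilters), i -> new HashSet<>()).add(key);
        }
        return bags;
    }

    // Flipped layout: key = join key. When records arrive sorted by key,
    // a distinct only has to remember the previous key.
    // The returned list stands in for records emitted downstream.
    static List<String> distinctOnSortedKeys(List<String> sortedJoinKeys) {
        List<String> out = new ArrayList<>();
        String previous = null;
        for (String key : sortedJoinKeys) {
            if (!key.equals(previous)) {
                out.add(key);
            }
            previous = key;
        }
        return out;
    }
}
```

In the first method the memory held grows with the number of unique join keys per filter index, which matches the memory issue the description reports. In the second, the working state is a single key regardless of input size.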



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-04 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638956#comment-16638956
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Could you please amend the commit? The BloomFilterPartitioner class wasn't
committed, so the build fails:
{code:java}
     [echo] *** Building Main Sources ***
     [echo] *** To compile with all warnings enabled, supply -Dall.warnings=1 on command line ***
     [echo] *** Else, you will only be warned about deprecations ***
     [echo] *** Hadoop version used: 2 ; HBase version used: 1 ; Spark version used: 2 ***
    [javac] Compiling 1106 source files to /Users/saley/src/pig/build/classes
    [javac] /Users/saley/src/pig/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java:113: error: cannot find symbol
    [javac] import org.apache.pig.backend.hadoop.executionengine.tez.runtime.BloomFilterPartitioner;
    [javac]                                                                 ^
    [javac]   symbol:   class BloomFilterPartitioner
    [javac]   location: package org.apache.pig.backend.hadoop.executionengine.tez.runtime
    [javac] /Users/saley/src/pig/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java:1495: error: cannot find symbol
    [javac]             edge.partitionerClass = BloomFilterPartitioner.class;
    [javac]                                     ^
    [javac]   symbol:   class BloomFilterPartitioner
    [javac]   location: class TezCompiler
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 2 errors
{code}



[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-02 Thread Rohini Palaniswamy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635834#comment-16635834
 ] 

Rohini Palaniswamy commented on PIG-5342:
-

The changes in TezPOPackageAnnotator.java also need to be reverted, and the
golden files will have to be regenerated.

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch,
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were
> the join key. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>  





[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-01 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634662#comment-16634662
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Updated patch.



[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-09-26 Thread Rohini Palaniswamy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629371#comment-16629371
 ] 

Rohini Palaniswamy commented on PIG-5342:
-

For the reduce case, we can optimize by always making the keys
NullableBytesWritable and calling DataType.toBytes(key, keyType) in
POBuildBloomRearrangeTez itself, on the map side. The comparator also needs to
be set to PigBytesRawBytesComparator. Can you make that change?

Another optimization would be to use IntWritable instead of NullableTuple for
the value type, but that needs more work. We can do that later in another JIRA.
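
The idea behind the first suggestion (convert keys to bytes once on the map side, then compare the serialized bytes directly) can be sketched in plain Java. This is not Pig's DataType.toBytes or PigBytesRawBytesComparator; the class and method names are hypothetical, and UTF-8 encoding of string keys is an assumption made only for the example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows byte-level key comparison, not Pig's actual classes.
final class RawBytesKeySketch {

    // Stand-in for a map-side key-to-bytes conversion (assumes UTF-8 string keys).
    static byte[] toBytes(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison that treats each byte as unsigned.
    // No key object has to be deserialized to decide the order.
    static int compareRaw(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // Equal prefix: the shorter array sorts first.
        return a.length - b.length;
    }
}
```

Because every key has the same serialized type, the sort and the distinct step can work on bytes alone, whatever the original join key type was.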

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch,
> PIG-5342-4.patch, PIG-5342-5.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were
> the join key. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>  





[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-07-06 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535081#comment-16535081
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Updated patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch,
> PIG-5342-4.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were
> the join key. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>  





[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-07-02 Thread Rohini Palaniswamy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530184#comment-16530184
 ] 

Rohini Palaniswamy commented on PIG-5342:
-

1) Can you add pig.bloomjoin.num.filters to the e2e tests for the reduce type as well?
2) You still need the combiner for the map type.
3) return (int) t.get(0); in BloomPartitioner
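
Item 3 suggests returning the tuple's first field directly as the partition. A minimal sketch of that shape follows, assuming (the thread does not spell this out) that field 0 of the shuffled tuple holds the bloom filter index and that there is one partition per filter. The class name and the List-based tuple are hypothetical stand-ins, not Pig or Tez types.

```java
import java.util.List;

// Illustrative only: a List<Object> stands in for the shuffled tuple.
final class BloomIndexPartitionerSketch {

    // Assumption: field 0 of the tuple is the bloom filter index,
    // and each filter index maps to its own partition.
    static int getPartition(List<Object> tuple, int numPartitions) {
        int filterIndex = (int) tuple.get(0);
        if (filterIndex < 0 || filterIndex >= numPartitions) {
            throw new IllegalArgumentException(
                "filter index " + filterIndex + " outside 0.." + (numPartitions - 1));
        }
        return filterIndex;
    }
}
```

The range check is an addition for the example; the one-line form in the review comment relies on the index already being a valid partition number.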

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were
> the join key. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>  





[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-06-28 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526504#comment-16526504
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Updated patch.

 



[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-06-15 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514304#comment-16514304
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Updated the patch.

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in the documentation that bloom join is also ideal for a right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>  





[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-06-13 Thread Rohini Palaniswamy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511706#comment-16511706
 ] 

Rohini Palaniswamy commented on PIG-5342:
-

Comments:
1) Bloom join is also ideal for a right outer join with the smaller dataset
on the right, which is not supported by replicated join.
2) edge.setCombinerInMap(true); and edge.setCombinerInReducer(true); are
redundant.
3) edge.partitionerClass = BloomFilterPartitioner.class; should be set only for
the reducer case. Same for the key and value types.
4) combineBloomOp is not used anymore and should be removed.
5) resuleWithCombiner -> resultWithCombiner
6) Can avoid the new NullableTuple() in bloomWriter.write(new
NullableIntWritable(i), new NullableTuple(tuple));

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in the documentation that bloom join is also ideal for a right
> outer join with the smaller dataset on the right. Replicated join only
> supports left outer join.
>  


