[jira] [Commented] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-09-22 Thread Satish Subhashrao Saley (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176757#comment-16176757
 ] 

Satish Subhashrao Saley commented on PIG-4120:
--

Updated patch in review board.

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch, PIG-4120-2.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in hdfs and second DAG does the merge join.  Similar to 
> replicate join, we can broadcast the index file and cache it and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-09-19 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172428#comment-16172428
 ] 

Rohini Palaniswamy commented on PIG-4120:
-

Comments:
   - This needs to be implemented for POMergeCoGroup as well.


POMergeJoin.java:
  1) Remove protected transient TupleFactory mTupleFactory. mTupleFactory 
already comes from PhysicalOperator
  2) copy method
 - LRs = copy.LRs; -> this.LRs = copy.LRs;
- It should copy every non-transient field. Currently it is missing 
signature, rightInputFileName, etc

POMergeJoinTez.java:
  1) List logicalInputs is not used anywhere. Can be removed
  2) List keyValueReaders - You only need one KeyValueReader. 
  3) Get rid of the getName() function and refer to super.name(). Currently it 
is missing the case of sparse join.
{code}
@Override
public String name() {
return super.name().replace("MergeJoin", "MergeJoinTez") + 
"\t<-\t " + this.inputKey;
}
{code}
4) Cast to UnorderedKVReader is not required.
5) Tuple copy code can be shorter
{code}
while (reader.next()) {
   Tuple origTuple =(Tuple) reader.getCurrentValue();
   Tuple copy = mTupleFactory.newTuple(origTuple.getAll()); 
   index.add(copy);
}
{code}
6) Creating another copy of index is unnecessary
{code}
 LinkedList indexList = new LinkedList(index);
{code}

TezCompiler.java:
1) We keep else start in the same line as if block end.
2) joinOp.setupRightPipeline(rightPipelinePlan); and 
joinOp.setSignature(rightLoader.getSignature()); not required if copy() is 
fixed.

DefaultIndexableLoader.java:
1) Can you rename setIndex() to loadIndex() 

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in hdfs and second DAG does the merge join.  Similar to 
> replicate join, we can broadcast the index file and cache it and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)