[jira] [Commented] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

Rohini Palaniswamy (JIRA) Tue, 19 Sep 2017 15:10:20 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172428#comment-16172428
 ]


Rohini Palaniswamy commented on PIG-4120:
-----------------------------------------

Comments:
   - This needs to be implemented for POMergeCoGroup as well.


POMergeJoin.java:
  1) Remove protected transient TupleFactory mTupleFactory. mTupleFactory 
already comes from PhysicalOperator
  2) copy method
     - LRs = copy.LRs; -> this.LRs = copy.LRs;
    - It should copy every non-transient field. Currently it is missing 
signature, rightInputFileName, etc

POMergeJoinTez.java:
  1) List<LogicalInput> logicalInputs is not used anywhere. Can be removed
  2) List<KeyValueReader> keyValueReaders - You only need one KeyValueReader. 
  3) Get rid of the getName() function and refer to super.name(). Currently it 
is missing the case of sparse join.
{code}
            @Override
            public String name() {
                return super.name().replace("MergeJoin", "MergeJoinTez") + 
"\t<-\t " + this.inputKey;
            }
{code}
4) Cast to UnorderedKVReader is not required.
5) Tuple copy code can be shorter
{code}
while (reader.next()) {
   Tuple origTuple =(Tuple) reader.getCurrentValue();
   Tuple copy = mTupleFactory.newTuple(origTuple.getAll()); 
   index.add(copy);
}
{code}
6) Creating another copy of index is unnecessary
{code}
 LinkedList<Tuple> indexList = new LinkedList<Tuple>(index);
{code}

TezCompiler.java:
1) We keep else start in the same line as if block end.
2) joinOp.setupRightPipeline(rightPipelinePlan); and 
joinOp.setSignature(rightLoader.getSignature()); not required if copy() is 
fixed.

DefaultIndexableLoader.java:
1) Can you rename setIndex() to loadIndex() 

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> ------------------------------------------------------------------
>
>                 Key: PIG-4120
>                 URL: https://issues.apache.org/jira/browse/PIG-4120
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>            Assignee: Satish Subhashrao Saley
>             Fix For: 0.18.0
>
>         Attachments: PIG-4120-1.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in hdfs and second DAG does the merge join.  Similar to 
> replicate join, we can broadcast the index file and cache it and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

Reply via email to