[
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172428#comment-16172428
]
Rohini Palaniswamy commented on PIG-4120:
-
Comments:
- This needs to be implemented for POMergeCoGroup as well.
POMergeJoin.java:
1) Remove protected transient TupleFactory mTupleFactory. mTupleFactory
already comes from PhysicalOperator
2) copy method
- LRs = copy.LRs; -> this.LRs = copy.LRs;
- It should copy every non-transient field. Currently it is missing
signature, rightInputFileName, etc
POMergeJoinTez.java:
1) List logicalInputs is not used anywhere. Can be removed
2) List keyValueReaders - You only need one KeyValueReader.
3) Get rid of the getName() function and refer to super.name(). Currently it
is missing the case of sparse join.
{code}
@Override
public String name() {
return super.name().replace("MergeJoin", "MergeJoinTez") +
"\t<-\t " + this.inputKey;
}
{code}
4) Cast to UnorderedKVReader is not required.
5) Tuple copy code can be shorter
{code}
while (reader.next()) {
Tuple origTuple =(Tuple) reader.getCurrentValue();
Tuple copy = mTupleFactory.newTuple(origTuple.getAll());
index.add(copy);
}
{code}
6) Creating another copy of index is unnecessary
{code}
LinkedList indexList = new LinkedList(index);
{code}
TezCompiler.java:
1) We keep else start in the same line as if block end.
2) joinOp.setupRightPipeline(rightPipelinePlan); and
joinOp.setSignature(rightLoader.getSignature()); not required if copy() is
fixed.
DefaultIndexableLoader.java:
1) Can you rename setIndex() to loadIndex()
> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
> Issue Type: Sub-task
> Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates
> the index file in hdfs and second DAG does the merge join. Similar to
> replicate join, we can broadcast the index file and cache it and use it in
> merge join and merge cogroup. This will give better performance and also
> eliminate need for the second DAG.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)