[ https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172428#comment-16172428 ]
Rohini Palaniswamy commented on PIG-4120:
-----------------------------------------

Comments:
  - This needs to be implemented for POMergeCoGroup as well.

POMergeJoin.java:
  1) Remove protected transient TupleFactory mTupleFactory. mTupleFactory already comes from PhysicalOperator.
  2) copy method
     - LRs = copy.LRs; -> this.LRs = copy.LRs;
     - It should copy every non-transient field. Currently it is missing signature, rightInputFileName, etc.

POMergeJoinTez.java:
  1) List<LogicalInput> logicalInputs is not used anywhere. Can be removed.
  2) List<KeyValueReader> keyValueReaders - You only need one KeyValueReader.
  3) Get rid of the getName() function and refer to super.name(). Currently it is missing the case of sparse join.
{code}
@Override
public String name() {
    return super.name().replace("MergeJoin", "MergeJoinTez") + "\t<-\t " + this.inputKey;
}
{code}
  4) Cast to UnorderedKVReader is not required.
  5) Tuple copy code can be shorter
{code}
while (reader.next()) {
    Tuple origTuple = (Tuple) reader.getCurrentValue();
    Tuple copy = mTupleFactory.newTuple(origTuple.getAll());
    index.add(copy);
}
{code}
  6) Creating another copy of index is unnecessary
{code}
LinkedList<Tuple> indexList = new LinkedList<Tuple>(index);
{code}

TezCompiler.java:
  1) We keep else start in the same line as if block end.
  2) joinOp.setupRightPipeline(rightPipelinePlan); and joinOp.setSignature(rightLoader.getSignature()); not required if copy() is fixed.
DefaultIndexableLoader.java:
  1) Can you rename setIndex() to loadIndex()?

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> ------------------------------------------------------------------
>
>                 Key: PIG-4120
>                 URL: https://issues.apache.org/jira/browse/PIG-4120
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>            Assignee: Satish Subhashrao Saley
>             Fix For: 0.18.0
>
>         Attachments: PIG-4120-1.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates
> the index file in hdfs and second DAG does the merge join. Similar to
> replicate join, we can broadcast the index file and cache it and use it in
> merge join and merge cogroup. This will give better performance and also
> eliminate need for the second DAG.
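
The idea in the issue description (read the broadcast index once, keep it in memory, and reuse it for the merge join) could be sketched roughly as follows. This is only an illustration and not code from the patch: IndexReader, IndexCache, and ListIndexReader are hypothetical names, plain lists stand in for Pig tuples, and the reader interface only loosely mirrors the next()/getCurrentValue() loop quoted in the review rather than the actual Tez KeyValueReader API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a broadcast input reader (not the Tez API). */
interface IndexReader {
    boolean next();               // advance; returns false when exhausted
    List<Object> currentValue();  // current entry; the reader may reuse this object
}

/** Loads the broadcast index on first use and serves the cached copy afterwards. */
final class IndexCache {
    private List<List<Object>> index;  // null until the first load
    private int loads = 0;             // number of times a reader was drained

    synchronized List<List<Object>> loadIndex(IndexReader reader) {
        if (index == null) {
            List<List<Object>> entries = new ArrayList<>();
            while (reader.next()) {
                // Copy each entry because the reader may reuse its value object.
                entries.add(new ArrayList<>(reader.currentValue()));
            }
            // Keep a single read-only list instead of making a second copy.
            index = Collections.unmodifiableList(entries);
            loads++;
        }
        return index;
    }

    int loadCount() {
        return loads;
    }
}

/** In-memory reader used only to exercise the sketch. */
final class ListIndexReader implements IndexReader {
    private final Iterator<List<Object>> rows;
    private List<Object> current;

    ListIndexReader(List<List<Object>> rows) {
        this.rows = rows.iterator();
    }

    @Override
    public boolean next() {
        if (!rows.hasNext()) {
            return false;
        }
        current = rows.next();
        return true;
    }

    @Override
    public List<Object> currentValue() {
        return current;
    }
}
```

With this shape, a second call to loadIndex() returns the same cached list without draining the reader again, which is the behaviour the description asks for when it suggests caching the broadcast index instead of re-reading it from HDFS.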