Rohini Palaniswamy created PIG-3631:
---------------------------------------

             Summary: Improve performance of replicate-join
                 Key: PIG-3631
                 URL: https://issues.apache.org/jira/browse/PIG-3631
             Project: Pig
          Issue Type: Sub-task
            Reporter: Rohini Palaniswamy


Replicated join is implemented in Tez as follows:
- POFRJoinTez extends POFRJoin. The difference between the two is that the 
replication hash table is built from broadcast edges in Tez instead of from 
files on the distributed cache as in MR.
- TezCompiler adds a vertex per replicated table and connects it to the POFRJoin 
vertex via a broadcast edge.
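
For illustration, a minimal sketch of the build-and-probe logic behind a 
replicated join. This is not the POFRJoin/POFRJoinTez code: the class and 
method names are hypothetical, tuples are modeled as String arrays, and the 
replicated input is assumed to fit in memory no matter how it reaches the task 
(broadcast edge or distributed cache).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and tuple representation are assumptions.
final class ReplicatedJoinSketch {

    // Build phase: index every tuple of the small (replicated) input by its join key.
    public static Map<String, List<String[]>> buildHashTable(
            Iterable<String[]> replicatedInput, int keyIndex) {
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] tuple : replicatedInput) {
            table.computeIfAbsent(tuple[keyIndex], k -> new ArrayList<>()).add(tuple);
        }
        return table;
    }

    // Probe phase: stream the large input and emit one joined tuple per match (inner join).
    public static List<String[]> join(
            Iterable<String[]> largeInput, int keyIndex, Map<String, List<String[]>> table) {
        List<String[]> output = new ArrayList<>();
        for (String[] left : largeInput) {
            List<String[]> matches = table.get(left[keyIndex]);
            if (matches == null) {
                continue; // no match in the replicated table, so the tuple is dropped
            }
            for (String[] right : matches) {
                String[] joined = new String[left.length + right.length];
                System.arraycopy(left, 0, joined, 0, left.length);
                System.arraycopy(right, 0, joined, left.length, right.length);
                output.add(joined);
            }
        }
        return output;
    }
}
```

In this sketch every task repeats the build phase for its copy of the 
replicated input, while the probe phase touches each tuple of the large input 
exactly once.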

Verify no performance regression with MR:
  - The above approach works well when the replicated join is not the first 
vertex of the DAG (i.e., in MR terms, the replicated join is part of a reduce). 
If it is the first vertex of the DAG, we need to benchmark and confirm that 
this approach does not regress against MR's map-only replicated join, which 
uses the distributed cache.

Evaluate:
   - Instead of broadcasting key-value pairs and constructing the hashmap, 
evaluate broadcasting a serialized hashmap (or distributing it via the cache, 
depending on performance) and loading it as is, similar to Hive.
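
A rough sketch of the idea under evaluation: build the hashmap once, 
serialize it, and have consumers deserialize it directly instead of rebuilding 
it from key-value pairs. Java's built-in object serialization is used here 
only as a stand-in; the format Hive uses and the format Pig would adopt are 
not specified in this issue, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;

// Hypothetical sketch; Java serialization is an assumption, not Pig's or Hive's format.
final class SerializedHashTableSketch {

    // Producer side: serialize the fully built hash table into one payload.
    public static byte[] serialize(HashMap<String, ArrayList<String>> table) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(table);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Consumer side: load the hash table as is, without re-inserting each key-value pair by hand.
    @SuppressWarnings("unchecked")
    public static HashMap<String, ArrayList<String>> load(byte[] payload) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return (HashMap<String, ArrayList<String>>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Serialized table refers to an unknown class", e);
        }
    }
}
```

Whether this beats broadcasting raw key-value pairs depends on payload size 
and deserialization cost, which is what the evaluation would need to measure.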




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
