[ https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated PIG-4771: ---------------------------------- Attachment: PIG-4771_3.patch [~mohitSabharwal]: changes in PIG-4771_3.patch 1. modify the replication value from 3 to 10. 2. remove the unnecessary code. > Implement FR Join for spark engine > ---------------------------------- > > Key: PIG-4771 > URL: https://issues.apache.org/jira/browse/PIG-4771 > Project: Pig > Issue Type: Sub-task > Components: spark > Reporter: liyunzhang_intel > Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4771.patch, PIG-4771_2.patch, PIG-4771_3.patch > > > We use regular join to replace FR join in current code base(fd31fda). We need > to implement FR join. > Some info collected from > https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins: > *Replicated Joins* > Fragment replicate join is a special type of join that works well if one or > more relations are small enough to fit into main memory. In such cases, Pig > can perform a very efficient join because all of the hadoop work is done on > the map side. In this type of join the large relation is followed by one or > more small relations. The small relations must be small enough to fit into > main memory; if they don't, the process fails and an error is generated. > *Usage* > Perform a replicated join with the USING clause (see JOIN (inner) and JOIN > (outer)). In this example, a large relation is joined with two smaller > relations. Note that the large relation comes first followed by the smaller > relations; and, all small relations together must fit into main memory, > otherwise an error is generated. > big = LOAD 'big_data' AS (b1,b2,b3); > tiny = LOAD 'tiny_data' AS (t1,t2,t3); > mini = LOAD 'mini_data' AS (m1,m2,m3); > C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated'; > *Conditions* > Fragment replicate joins are experimental; we don't have a strong sense of > how small the small relation must be to fit into memory. In our tests with a > simple query that involves just a JOIN, a relation of up to 100 M can be used > if the process overall gets 1 GB of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)