[jira] [Updated] (PIG-4771) Implement FR Join for spark engine

liyunzhang_intel (JIRA) Mon, 16 May 2016 21:13:22 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


liyunzhang_intel updated PIG-4771:
----------------------------------
    Attachment: PIG-4771_3.patch

[~mohitSabharwal]:
Changes in PIG-4771_3.patch:
1. Changed the replication value from 3 to 10.
2. Removed the unnecessary code.


> Implement FR Join for spark engine
> ----------------------------------
>
>                 Key: PIG-4771
>                 URL: https://issues.apache.org/jira/browse/PIG-4771
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4771.patch, PIG-4771_2.patch, PIG-4771_3.patch
>
>
> The current code base (fd31fda) uses a regular join in place of FR join. We 
> need to implement FR join.
> Some info collected from 
> https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins:
> *Replicated Joins*
> Fragment replicate join is a special type of join that works well if one or 
> more relations are small enough to fit into main memory. In such cases, Pig 
> can perform a very efficient join because all of the Hadoop work is done on 
> the map side. In this type of join the large relation is followed by one or 
> more small relations. The small relations must be small enough to fit into 
> main memory; if they don't, the process fails and an error is generated.
> *Usage*
> Perform a replicated join with the USING clause (see JOIN (inner) and JOIN 
> (outer)). In this example, a large relation is joined with two smaller 
> relations. Note that the large relation comes first followed by the smaller 
> relations; and, all small relations together must fit into main memory, 
> otherwise an error is generated.
> big = LOAD 'big_data' AS (b1,b2,b3);
> tiny = LOAD 'tiny_data' AS (t1,t2,t3);
> mini = LOAD 'mini_data' AS (m1,m2,m3);
> C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
> *Conditions*
> Fragment replicate joins are experimental; we don't have a strong sense of 
> how small the small relation must be to fit into memory. In our tests with a 
> simple query that involves just a JOIN, a relation of up to 100 M can be used 
> if the process overall gets 1 GB of memory. 
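
The mechanism described in the quoted documentation can be illustrated with a
small conceptual sketch in Python. This is not the code in the attached
patches, and it is not Pig's or Spark's implementation; the function names
build_index and replicated_join and the sample rows are hypothetical. It
assumes an inner join on a single key column, with every small relation held
in memory as a hash table while the large relation is streamed row by row.

```python
from collections import defaultdict


def build_index(small_rows, key_index):
    """Hash every row of a small relation by its join key (kept in memory)."""
    index = defaultdict(list)
    for row in small_rows:
        index[row[key_index]].append(row)
    return index


def replicated_join(big_rows, big_key_index, small_indexes):
    """Stream the large relation and probe each in-memory index in turn.

    Yields one concatenated tuple per matching combination (inner join).
    """
    for big_row in big_rows:
        key = big_row[big_key_index]
        partial = [big_row]
        for index in small_indexes:
            # .get() avoids creating empty entries in the defaultdict.
            matches = index.get(key, [])
            partial = [left + right for left in partial for right in matches]
            if not partial:
                break  # no match in this small relation: drop the row
        yield from partial


# Hypothetical sample data mirroring the big/tiny/mini example above.
big = [(1, "a", "x"), (2, "b", "y"), (3, "c", "z")]
tiny = [(1, "t1", "t"), (2, "t2", "t")]
mini = [(1, "m1", "m"), (1, "m1b", "m")]

result = list(
    replicated_join(big, 0, [build_index(tiny, 0), build_index(mini, 0)])
)
```

Only key 1 appears in all three relations, so result holds two rows, one per
matching mini row. The large relation is never shuffled or sorted, which is
why the small relations must all fit in memory together.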



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
