[
https://issues.apache.org/jira/browse/PIG-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188657#comment-13188657
]
Jie Li commented on PIG-2423:
-----------------------------
Thanks, Thejas. For the moment I'll just paste the text here. I added two cases,
and I'm wondering whether they can be made more general. Feel free to improve them.
{code}
1. Use COGROUP to do the join
When a GROUP BY and a JOIN use the same keys, they can usually be combined into
a single COGROUP, which reduces the number of MapReduce jobs.
-- Query 1
A = load 'myfile' as (x, u, v);
B = load 'myotherfile' as (x, y, z);
t1 = group B by x;
t2 = foreach t1 generate group as x, COUNT(B.y) as count_y;
t3 = join A by x, t2 by x;
-- Query 2
A = load 'myfile' as (x, u, v);
B = load 'myotherfile' as (x, y, z);
t1 = cogroup A by x, B by x;
t2 = filter t1 by NOT IsEmpty(A) AND NOT IsEmpty(B); -- an inner join
t3 = foreach t2 generate group, COUNT(B.y);
While Query 1 requires two separate MR jobs, Query 2 requires only one, because
COGROUP does the grouping and the join in the same job.
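-- Query 2, per-record variant (a sketch under an assumption, not taken from
-- the slides). Query 2 as written emits one row per key, whereas the join in
-- Query 1 emits one row per record of A. Assuming the per-record shape is
-- what is wanted, FLATTEN(A) in the final foreach should restore it while
-- still needing only the one COGROUP job. The alias count_y mirrors Query 1.
A = load 'myfile' as (x, u, v);
B = load 'myotherfile' as (x, y, z);
t1 = cogroup A by x, B by x;
t2 = filter t1 by NOT IsEmpty(A) AND NOT IsEmpty(B); -- an inner join
t3 = foreach t2 generate FLATTEN(A), COUNT(B.y) as count_y;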
2. Use GROUP+FLATTEN to do the self join
Sometimes we need a self join to get some additional information. For example,
for each employee, find the average salary in his/her department.
-- Query 1
A = load 'myfile' as (name, salary, department);
t1 = group A by department;
t2 = foreach t1 generate group, AVG(A.salary) as avg_salary;
t3 = join A by department, t2 by group;
-- Query 2
A = load 'myfile' as (name, salary, department);
t1 = group A by department;
t2 = foreach t1 generate FLATTEN(A), AVG(A.salary) as avg_salary;
While Query 1 needs two MR jobs, Query 2 needs only one, because FLATTEN after
GROUP implements the self join.
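-- Checking the job count (a sketch; the exact plan output may differ between
-- Pig versions). EXPLAIN prints the logical, physical, and MapReduce plans
-- for an alias, so running it on the last alias of each query is one way to
-- compare how many MapReduce jobs each one compiles to.
A = load 'myfile' as (name, salary, department);
t1 = group A by department;
t2 = foreach t1 generate FLATTEN(A), AVG(A.salary) as avg_salary;
explain t2;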
{code}
> document use case where co-group is better choice than join
> ------------------------------------------------------------
>
> Key: PIG-2423
> URL: https://issues.apache.org/jira/browse/PIG-2423
> Project: Pig
> Issue Type: Improvement
> Components: documentation
> Reporter: Thejas M Nair
> Fix For: 0.10
>
>
> Optimization rules 2 and 3 suggested in
> https://issues.apache.org/jira/secure/attachment/12506841/pig_tpch.ppt
> (PIG-2397) recommend the use of co-group instead of join in certain cases.
> These should be documented in pig performance page.