[ 
https://issues.apache.org/jira/browse/PIG-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064199#comment-14064199
 ] 

Daniel Dai commented on PIG-4057:
---------------------------------

The problem is that the GFCross instances divide their inputs into different 
numbers of cross groups. If the number of groups is the same for every input, 
the result is correct; whether it equals the number of reducers does not matter.
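
To illustrate why the group counts must agree, here is a minimal Python sketch 
of the idea behind GFCross. It is a simplified model, not Pig's actual code: 
the names gfcross_keys and simulated_cross are hypothetical, and it assumes 
each record picks a random slot in its own dimension and is replicated across 
every slot of the other dimension.

```python
import itertools
import random


def gfcross_keys(input_index, num_inputs, groups_per_dim, rng):
    """Return the group keys one record is sent to (simplified model).

    The record picks one random slot in its own dimension and is
    replicated across every slot of all other dimensions.
    """
    own = rng.randrange(groups_per_dim)
    ranges = [
        [own] if dim == input_index else range(groups_per_dim)
        for dim in range(num_inputs)
    ]
    return list(itertools.product(*ranges))


def simulated_cross(left, right, left_groups, right_groups, seed=0):
    """Cross two inputs whose GFCross calls may disagree on the group count."""
    rng = random.Random(seed)
    buckets = {}
    for rec in left:
        for key in gfcross_keys(0, 2, left_groups, rng):
            buckets.setdefault(key, ([], []))[0].append(rec)
    for rec in right:
        for key in gfcross_keys(1, 2, right_groups, rng):
            buckets.setdefault(key, ([], []))[1].append(rec)
    # Each reducer only pairs up the records that landed in the same group.
    pairs = []
    for lefts, rights in buckets.values():
        pairs.extend(itertools.product(lefts, rights))
    return sorted(pairs)
```

In this model, calling simulated_cross(left, right, 15, 15) returns the full 
cross product, while simulated_cross(left, right, 15, 1) silently drops every 
left record that did not happen to pick slot 0, which is the kind of wrong 
result reported in this issue.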

However, performance is best when the number of cross groups roughly matches 
the number of reducers, because:
1. If there are too many groups, every record is duplicated many times, so 
more intermediate results are generated
2. If there are too few groups, some reducers get no groups to process, which 
causes an imbalanced load
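
The trade-off in the two points above can be put into rough numbers. The 
sketch below is an assumption-laden model, not Pig's code: the helper names 
are hypothetical, and it assumes the groups per input is the smallest n with 
n^k >= parallelism for k inputs, so each record is copied n^(k-1) times.

```python
def groups_per_input(parallelism, num_inputs):
    """Smallest n such that n ** num_inputs >= parallelism (assumed rule)."""
    n = 1
    while n ** num_inputs < parallelism:
        n += 1
    return n


def cross_cost(parallelism, num_inputs, num_reducers):
    """Return (total_groups, copies_per_record, min_idle_reducers)."""
    n = groups_per_input(parallelism, num_inputs)
    total_groups = n ** num_inputs
    # A record is replicated across every slot of the other dimensions.
    copies_per_record = n ** (num_inputs - 1)
    # Lower bound only: uneven hashing can leave more reducers idle.
    min_idle_reducers = max(0, num_reducers - total_groups)
    return total_groups, copies_per_record, min_idle_reducers
```

With 199 reducers and two inputs, cross_cost(199, 2, 199) gives 225 groups and 
15 copies per record, cross_cost(1, 2, 199) leaves at least 198 reducers idle, 
and cross_cost(10000, 2, 199) copies every record 100 times.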

I can fix it by forcing all GFCross instances to use the same number of cross 
groups, set at the frontend. If Tez auto-parallelism is used, we may end up 
with a non-optimal number of cross groups. Auto-parallelism works by 
overestimating the number of reducers at the frontend, so we might end up with 
more cross groups than needed, which means larger intermediate results. But I 
don't see an easy way to look up the number of reducers at runtime and pass it 
to GFCross.

> Group All followed by CROSS with default parallelism produces wrong results
> ---------------------------------------------------------------------------
>
>                 Key: PIG-4057
>                 URL: https://issues.apache.org/jira/browse/PIG-4057
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Daniel Dai
>             Fix For: 0.14.0
>
>
> SET default_parallel 199;
> ......
> by_size = ...
> uniq_vals = .....
> grpd = group uniq_vals all;
> all_vals = FOREACH grpd GENERATE uniq_vals;
> cross_result = CROSS by_size, all_vals;
> store cross_result into '/tmp/roh/cross/out/recipient_asns';
> Job1: grpd, all_vals, cross_result (The plan does GFCross function here for
> all_vals assuming cross parallelism to be 1 taking that of the current job
> even though it should consider default parallelism 199 of Job2. Parallelism
> of Job1 is 1 because of group all)
> Job2: cross_result (Actual CROSS of by_size and all_vals)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
