[jira] [Commented] (PIG-4057) Group All followed by CROSS with default parallelism produces wrong results

Rohini Palaniswamy (JIRA) Mon, 14 Jul 2014 12:04:16 -0700

    [ https://issues.apache.org/jira/browse/PIG-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061036#comment-14061036 ]


Rohini Palaniswamy commented on PIG-4057:
-----------------------------------------

One way to fix this would be to always run the GFCross UDF as part of the map 
task of the actual cross job, and never as part of the previous job's map or 
reduce. I am trying to see if there is a better alternative for the GFCross 
implementation that does not rely on the parallelism of the reducer, since 
that dependence will also cause problems with Tez auto parallelism. 
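
To illustrate why the result depends on both inputs seeing the same 
parallelism, here is a minimal Python sketch. It is a simplified model, not 
Pig's actual code: it assumes GFCross spreads a 2-way cross over a square grid 
of reducer cells whose side is derived from the parallelism, and all function 
names are hypothetical. The values 199 and 1 are taken from the issue 
description quoted below.

```python
import random

def groups_per_input(parallelism, num_inputs=2):
    # Side of the grid: smallest g such that g**num_inputs >= parallelism.
    g = 1
    while g ** num_inputs < parallelism:
        g += 1
    return g

def cross_keys(input_index, parallelism, rng):
    # Grid cells (row, col) that one tuple is replicated to.
    # Input 0 picks a random row and covers every column;
    # input 1 picks a random column and covers every row.
    g = groups_per_input(parallelism)
    fixed = rng.randrange(g)
    if input_index == 0:
        return [(fixed, j) for j in range(g)]
    return [(i, fixed) for i in range(g)]

def simulated_cross(left, right, left_parallelism, right_parallelism, seed=0):
    # Route both inputs to cells, then pair up whatever meets in each cell.
    rng = random.Random(seed)
    cells = {}
    for t in left:
        for key in cross_keys(0, left_parallelism, rng):
            cells.setdefault(key, ([], []))[0].append(t)
    for t in right:
        for key in cross_keys(1, right_parallelism, rng):
            cells.setdefault(key, ([], []))[1].append(t)
    return sorted((l, r) for ls, rs in cells.values() for l in ls for r in rs)

by_size = list(range(100))   # stand-in for the by_size relation
all_vals = ["bag"]           # stand-in for the single group-all tuple

# Both sides use the same parallelism: every pair meets in exactly one cell.
matched = simulated_cross(by_size, all_vals, 199, 199)

# all_vals keyed with parallelism 1 (Job1), by_size with 199 (Job2):
# all_vals only reaches cell (0, 0), so most by_size tuples never meet it.
mismatched = simulated_cross(by_size, all_vals, 199, 1)
```

In this model the matched run returns the full cross product, while the 
mismatched run keeps only the by_size tuples that happened to pick row 0. 
Running GFCross for both inputs inside the cross job itself, as suggested 
above, would make both sides use the same grid.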

> Group All followed by CROSS with default parallelism produces wrong results
> ---------------------------------------------------------------------------
>
>                 Key: PIG-4057
>                 URL: https://issues.apache.org/jira/browse/PIG-4057
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>
> SET default_parallel 199;
> ......
> by_size = ...
> uniq_vals = .....
> grpd = group uniq_vals all;
> all_vals = FOREACH grpd GENERATE uniq_vals;
> cross_result = CROSS by_size, all_vals;
> store cross_result into '/tmp/roh/cross/out/recipient_asns';
> Job1: grpd, all_vals, cross_result (The plan runs the GFCross function here
> for all_vals and assumes a cross parallelism of 1, taken from the current job,
> even though it should use the default parallelism of 199 from Job2. The
> parallelism of Job1 is 1 because of the group all.)
> Job2: cross_result (the actual CROSS of by_size and all_vals)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
