[jira] [Updated] (PIG-3975) Multiple Scalar reference calls leading to missing records

Koji Noguchi (JIRA) Fri, 30 May 2014 16:08:26 -0700

     [ https://issues.apache.org/jira/browse/PIG-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Koji Noguchi updated PIG-3975:
------------------------------

    Attachment: pig-3975-v01_withouttest.patch

Attaching a preliminary patch.
I believe the problem is one of call ordering.
MRCompiler.connectSoftLink() (from PIG-1605) connects the MRPlan based on the 
PhysicalPlan's soft links, but it is called BEFORE 
MRCompiler.aggregateScalarsFiles() (from PIG-1458), which creates an extra 
concatenate mapreduce job.

As a result, only the first scalar reference gets its MRPlan dependency 
updated; the rest are left untouched.

I think there are two approaches for fixing this.
(1) Updating MRPlan dependency INSIDE MRCompiler.aggregateScalarsFiles().
OR
(2) Moving MRCompiler.connectSoftLink() to AFTER 
MRCompiler.aggregateScalarsFiles().

I took the second approach because it has the added benefit that 
MRCompiler.hasTooManyInputFiles() no longer depends on the job with scalar 
outputs.  I'm hoping this will also reduce the number of unnecessary 
concatenate jobs being created.
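
To make the ordering issue concrete, here is a small toy model in Java. It 
is NOT Pig's code: the class, the method names and the job names B, D, F 
and "concat" are hypothetical, and the plan is reduced to a map from each 
job to the jobs it must wait for. It assumes that the aggregation step 
rewires only the first scalar consumer, as described above, and that a 
soft-link pass run afterwards would see the concatenate job as the producer 
of the scalar file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only; none of these names exist in Pig.
public class ScalarLinkOrderSketch {

    // Plan: job name -> names of the jobs it must wait for.
    static Map<String, Set<String>> newPlan() {
        Map<String, Set<String>> preds = new LinkedHashMap<>();
        for (String job : List.of("B", "D", "F")) {
            preds.put(job, new TreeSet<>());
        }
        return preds;
    }

    // Stand-in for the soft-link step: every consumer waits on the producer.
    static void connectSoftLinks(Map<String, Set<String>> preds,
                                 String producer, List<String> consumers) {
        for (String consumer : consumers) {
            preds.get(consumer).add(producer);
        }
    }

    // Stand-in for the aggregation step: adds a concatenate job after the
    // producer and rewires ONLY the first consumer (the assumed behavior).
    static String aggregateScalarFiles(Map<String, Set<String>> preds,
                                       String producer, List<String> consumers) {
        String concat = "concat";
        preds.put(concat, new TreeSet<>(Set.of(producer)));
        String first = consumers.get(0);
        preds.get(first).remove(producer);
        preds.get(first).add(concat);
        return concat;
    }

    // Current order: soft links first, aggregation second.
    static Map<String, Set<String>> linkThenAggregate() {
        Map<String, Set<String>> preds = newPlan();
        List<String> consumers = List.of("D", "F");
        connectSoftLinks(preds, "B", consumers);
        aggregateScalarFiles(preds, "B", consumers);
        return preds;
    }

    // Approach (2): aggregation first, soft links second.
    static Map<String, Set<String>> aggregateThenLink() {
        Map<String, Set<String>> preds = newPlan();
        List<String> consumers = List.of("D", "F");
        String concat = aggregateScalarFiles(preds, "B", consumers);
        connectSoftLinks(preds, concat, consumers);
        return preds;
    }

    public static void main(String[] args) {
        System.out.println("link then aggregate: " + linkThenAggregate());
        System.out.println("aggregate then link: " + aggregateThenLink());
    }
}
```

In this toy, the first ordering leaves F waiting only on B, so nothing 
forces it to run after the concatenate job, while the second ordering makes 
both D and F wait on the concatenate job.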


> Multiple Scalar reference calls leading to missing records
> ----------------------------------------------------------
>
>                 Key: PIG-3975
>                 URL: https://issues.apache.org/jira/browse/PIG-3975
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.2, 0.10.1, 0.11.1, 0.12.2
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Critical
>         Attachments: pig-3975-v01_withouttest.patch
>
>
> We noticed that multiple Pig runs with the same input were producing 
> different outputs.
> A simplified version of the script looked like this:
> {noformat}
> A = load 'input1' as (a1:int);
> B = group A by a1 parallel 200;
> C = load 'input2' as (c1:int);
> D = foreach C generate B.$0;
> store D into '/tmp/deletemeD';
> E = load 'input3' as (c1:int);
> F = foreach E generate B.$0;
> store F into '/tmp/deletemeF';
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
