[ https://issues.apache.org/jira/browse/PIG-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated PIG-3975:
------------------------------
Attachment: pig-3975-v01_withouttest.patch
Attaching a preliminary patch.
I believe this is what's happening:
MRCompiler.connectSoftLink() from PIG-1605 connects the MRPlan based on the
PhysicalPlan's soft links. But it is called BEFORE
MRCompiler.aggregateScalarsFiles() from PIG-1458, which creates an extra
concatenate mapreduce job.
As a result, only the first scalar reference gets its MRPlan dependency
updated, and the rest are left untouched.
I think there are two approaches to fixing this:
(1) Update the MRPlan dependency INSIDE MRCompiler.aggregateScalarsFiles().
OR
(2) Move MRCompiler.connectSoftLink() to AFTER
MRCompiler.aggregateScalarsFiles().
I took the second approach since it has the added benefit that
MRCompiler.hasTooManyInputFiles() no longer depends on the job with scalar
outputs. I'm hoping this will also reduce the number of unnecessary
concatenate jobs being created.
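
To illustrate the ordering problem, here is a toy model in plain Java. It is
NOT Pig's code: the class, its fields, and the job names "B", "D", "F" and
"concat" are hypothetical, and the two methods only borrow the MRCompiler
names for readability. The "only the first reference gets its edge moved"
behavior is hard-coded as an assumption taken from the description above, not
derived from the actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: all names here are hypothetical stand-ins, not Pig's API.
class ScalarLinkOrderDemo {
    // job -> jobs it must wait for (stand-in for MRPlan edges)
    final Map<String, Set<String>> waitsFor = new LinkedHashMap<>();
    // scalar consumer -> job whose output it reads (stand-in for soft links)
    final Map<String, String> readsOutputOf = new LinkedHashMap<>();

    ScalarLinkOrderDemo() {
        for (String job : new String[] {"B", "D", "F"}) {
            waitsFor.put(job, new LinkedHashSet<>());
        }
        // D and F both reference the scalar produced by B.
        readsOutputOf.put("D", "B");
        readsOutputOf.put("F", "B");
    }

    // Add a plan edge from each consumer to the job it currently reads from.
    void connectSoftLink() {
        for (Map.Entry<String, String> e : readsOutputOf.entrySet()) {
            waitsFor.get(e.getKey()).add(e.getValue());
        }
    }

    // Insert a concatenate job after B and point every consumer at its output.
    void aggregateScalarsFiles() {
        Set<String> concatDeps = new LinkedHashSet<>();
        concatDeps.add("B");
        waitsFor.put("concat", concatDeps);
        boolean first = true;
        for (Map.Entry<String, String> e : readsOutputOf.entrySet()) {
            if (!e.getValue().equals("B")) {
                continue;
            }
            e.setValue("concat");
            // Assumption from the description: only the first reference has
            // an already-existing plan edge moved from B to concat.
            if (first && waitsFor.get(e.getKey()).remove("B")) {
                waitsFor.get(e.getKey()).add("concat");
            }
            first = false;
        }
    }

    // Consumers that read a job's output without waiting for that job.
    List<String> consumersMissingDependency() {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> e : readsOutputOf.entrySet()) {
            if (!waitsFor.get(e.getKey()).contains(e.getValue())) {
                missing.add(e.getKey());
            }
        }
        return missing;
    }

    static List<String> run(boolean connectFirst) {
        ScalarLinkOrderDemo plan = new ScalarLinkOrderDemo();
        if (connectFirst) {
            plan.connectSoftLink();
            plan.aggregateScalarsFiles();
        } else {
            plan.aggregateScalarsFiles();
            plan.connectSoftLink();
        }
        return plan.consumersMissingDependency();
    }

    public static void main(String[] args) {
        System.out.println("connect first:   " + run(true));
        System.out.println("aggregate first: " + run(false));
    }
}
```

In this model, connecting first leaves F reading the concatenate job's output
while only waiting for B, so F could start before the concatenation finishes.
Running aggregateScalarsFiles() first, as in approach (2), leaves no consumer
without its dependency.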
> Multiple Scalar reference calls leading to missing records
> ----------------------------------------------------------
>
> Key: PIG-3975
> URL: https://issues.apache.org/jira/browse/PIG-3975
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.1, 0.9.2, 0.10.1, 0.11.1, 0.12.2
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Priority: Critical
> Attachments: pig-3975-v01_withouttest.patch
>
>
> We noticed that multiple Pig runs with the same input were producing different
> outputs.
> A simplified script looks like this:
> {noformat}
> A = load 'input1' as (a1:int);
> B = group A by a1 parallel 200;
> C = load 'input2' as (c1:int);
> D = foreach C generate B.$0;
> store D into '/tmp/deletemeD';
> E = load 'input3' as (c1:int);
> F = foreach E generate B.$0;
> store F into '/tmp/deletemeF';
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.2#6252)