[ 
https://issues.apache.org/jira/browse/PIG-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-4548:
------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 0.17.1
           Status: Resolved  (was: Patch Available)

Thanks for the review, Rohini.
Committed to trunk (0.18) and the 0.17 branch.

Thanks [~brane2] for reporting this critical issue!!!
(and sorry again for missing this for such a long time.)

> Records Lost With Specific Combination of Commands and Streaming Function
> -------------------------------------------------------------------------
>
>                 Key: PIG-4548
>                 URL: https://issues.apache.org/jira/browse/PIG-4548
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon EMR (Elastic Map-Reduce) AMI 3.3.1
>            Reporter: Steve T
>            Assignee: Koji Noguchi
>             Fix For: 0.18.0, 0.17.1
>
>         Attachments: pig-4548-v1.patch
>
>
> The below is the bare minimum I was able to extract from my original
> problem in order to demonstrate the bug.  So, don't expect the following
> code to serve any practical purpose.  :)
> My input file (test_in) is two columns with a tab delimiter:
> 1   F
> 2   F
> My streaming function (sf.py) ignores the actual input and simply generates
> 2 records:
> #!/usr/bin/python
> if __name__ == '__main__':
>     print 'x'
>     print 'y'
> (But I should mention that in my original problem the input to output was
> one-to-one.  I just ignored the input here to get to the bare minimum
> effect.)
> My pig script:
> MY_INPUT = load 'test_in' as ( f1, f2);
> split MY_INPUT into T if (f2 == 'T'), F otherwise;
> T2 = group T by f1;
> store T2 into 'test_out/T2';
> F2 = group F by f1;
> store F2 into 'test_out/F2';  -- (this line is actually optional to demo the bug)
> F3 = stream F2 through `sf.py`;
> store F3 into 'test_out/F3';
> My expected output for test_out/F3 is two records that come directly from
> sf.py:
> x
> y
> However, I only get:
> x
> I've tried all of the following to get the expected behavior:
>    - upgraded Pig from 0.12.0 to 0.14.0
>    - local vs. distributed mode
>    - flush sys.stdout in the streaming function
>    - replace sf.py with sf.sh, a bash script that uses "echo x;
>    echo y" to do the same thing.  In this case, the final contents of
>    test_out/F3 would vary - sometimes I would get both x and y, and sometimes
>    I would just get x.
> Aside from removing the one Pig line that I've marked optional, any other
> attempt to simplify the Pig script or input file causes the bug to not
> manifest.
> Log files can be found at 
> http://www.mail-archive.com/user@pig.apache.org/msg10195.html
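
The description mentions that the original streaming function mapped input to output one-to-one and that flushing sys.stdout was tried without success. A minimal sketch of such a script might look like the following. It is not the reporter's code: the names transform, stream, and main and the upper-casing step are assumptions made purely for illustration, and it is written for Python 3 rather than the Python 2 of sf.py. Per the report, flushing did not prevent the record loss, so this is only useful for reproducing the setup, not as a workaround.

```python
import sys


def transform(record):
    # Hypothetical per-record transformation (an assumption for illustration):
    # upper-case the record and leave the tab-separated layout untouched.
    return record.upper()


def stream(lines, out):
    """Write exactly one output record per input record, flushing after each.

    Returns the number of records written so the caller can compare it
    with the number of records that end up in the stored output.
    """
    count = 0
    for line in lines:
        out.write(transform(line.rstrip("\n")) + "\n")
        out.flush()  # the flush attempt from the report; it did not avoid the bug
        count += 1
    return count


def main():
    # Entry point when the file is used as a streaming command:
    # records arrive on stdin and results go to stdout.
    stream(sys.stdin, sys.stdout)
```

To use the file as a streaming command, main() would be called at the end of the script. Comparing the count returned by stream with the number of lines stored in test_out/F3 is one way to confirm whether records are being dropped after the script has written them.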



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
