[ https://issues.apache.org/jira/browse/PIG-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated PIG-4548:
------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 0.17.1
           Status: Resolved  (was: Patch Available)

Thanks for the review, Rohini.
Committed to Trunk(0.18) and 0.17-branch.

Thanks [~brane2] for reporting this critical issue!!!
(and sorry again for missing this for such a long time.)

> Records Lost With Specific Combination of Commands and Streaming Function
> -------------------------------------------------------------------------
>
>                 Key: PIG-4548
>                 URL: https://issues.apache.org/jira/browse/PIG-4548
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon EMR (Elastic Map-Reduce) AMI 3.3.1
>            Reporter: Steve T
>            Assignee: Koji Noguchi
>             Fix For: 0.18.0, 0.17.1
>
>         Attachments: pig-4548-v1.patch
>
>
> The below is the bare minimum I was able to extract from my original
> problem in order to demonstrate the bug. So, don't expect the following
> code to serve any practical purpose. :)
>
> My input file (test_in) is two columns with a tab delimiter:
> 1	F
> 2	F
>
> My streaming function (sf.py) ignores the actual input and simply
> generates 2 records:
> #!/usr/bin/python
> if __name__ == '__main__':
>     print 'x'
>     print 'y'
> (But I should mention that in my original problem the input to output was
> one-to-one. I just ignored the input here to get to the bare minimum
> effect.)
>
> My pig script:
> MY_INPUT = load 'test_in' as ( f1, f2);
> split MY_INPUT into T if (f2 == 'T'), F otherwise;
> T2 = group T by f1;
> store T2 into 'test_out/T2';
> F2 = group F by f1;
> store F2 into 'test_out/F2'; -- (this line is actually optional to demo the bug)
> F3 = stream F2 through `sf.py`;
> store F3 into 'test_out/F3';
>
> My expected output for test_out/F3 is two records that come directly from
> sf.py:
> x
> y
>
> However, I only get:
> x
>
> I've tried all of the following to get the expected behavior:
> - upgraded Pig from 0.12.0 to 0.14.0
> - local vs. distributed mode
> - flush sys.stdout in the streaming function
> - replace sf.py with sf.sh, which is a bash script that used
>   "echo x; echo y" to do the same thing. In this case, the final
>   contents of test_out/F# would vary - sometimes I would get both x
>   and y, and sometimes I would just get x.
>
> Aside from removing the one Pig line that I've marked optional, any other
> attempts to simplify the Pig script or input file cause the bug to not
> manifest.
>
> Log files can be found at
> http://www.mail-archive.com/user@pig.apache.org/msg10195.html
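
For readers trying to reproduce the report: the reporter mentions that the original streaming function was one-to-one (one output record per input record) and that flushing sys.stdout was among the workarounds tried. A minimal sketch of such a script might look like the following. This is not the reporter's actual code; the function name `stream_records` and the plain echo of each record are assumptions made for illustration, and it is written for Python 3 rather than the Python 2 used in sf.py.

```python
#!/usr/bin/env python3
# Hypothetical one-to-one streaming script (NOT the reporter's original).
# It echoes every input record unchanged and flushes after each write.
import sys


def stream_records(lines, out):
    """Write one output record per input record; return the record count."""
    count = 0
    for line in lines:
        record = line.rstrip("\n")  # drop the trailing newline, keep tabs
        out.write(record + "\n")
        out.flush()                 # explicit flush, as tried in the report
        count += 1
    return count


if __name__ == "__main__":
    stream_records(sys.stdin, sys.stdout)
```

Running it locally against test_in (for example, `python3 sf_echo.py < test_in`, where `sf_echo.py` is a made-up file name) should print both input records, which gives a baseline to compare with whatever ends up in test_out/F3.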