2009/2/19 Alan Gates <[email protected]>:
> The second job in an order by is a sampling job, so the fact that it wrote
> only one record is expected. The one record is a quantiles tuple that
> describes how the next job should set up its partitioner.
>
> The third job should read exactly the same number of records as the first
> job wrote, as its input is the output of the first job.
>
> Are you getting these numbers from hadoop's UI?
Yes. When I first started using pig, I checked actual record counts
against hadoop's UI and they did not differ.

> If so, are you using compression, as that sometimes messes up the
> reporting of the hadoop UI.

Yes, the input data is bzip2-compressed. I will check without compression.

> Also, did you run a separate job to count the number of records on your
> input and output files?

No. I will check the numbers separately.

> We haven't tested current pig with hadoop 19. In fact, I didn't think it
> ran with it at all without applying a patch. I don't know if that could
> contribute to this or not.

I'm using PIG-573.patch from
https://issues.apache.org/jira/browse/PIG-573 to run current pig on
hadoop branch-0.19.

> Alan.
>
> On Feb 18, 2009, at 3:03 PM, <[email protected]> wrote:
>
>> Hi,
>>
>> I passed 3,344,109,862 records to ORDER and got 3,339,587,570 in the
>> output with no noticeable errors.
>>
>> There were three jobs.
>> The first got 3,344,109,862 records (map input) and produced the same
>> number (map output).
>> The second got 248,820 (map input) and produced 1 (reduce output).
>> The third got 3,339,587,570 (map input) and produced the same number
>> (reduce output).
>> So I guess something went wrong in the second job.
>>
>> I used pig from trunk at revision 743989 and hadoop from branch-0.19
>> at revision 745383.
>>
>> I'd be happy to use pig with no data lost and am ready to provide
>> additional details or tests if it helps.
>> Thanks.
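As a side note for anyone following along: one way to cross-check the counts without trusting Hadoop's UI counters (and without running another MapReduce job) is to pull the part files down and count the records locally. This is only a sketch; the path is a placeholder for the actual output location, and it assumes one record per line in the bzip2-compressed files:

```shell
# Placeholder path -- substitute the real part files fetched from HDFS.
# Decompress to stdout and count lines, bypassing the UI's reported counters.
bzip2 -dc part-*.bz2 | wc -l
```

The same check on both the ORDER input and output should show whether the ~4.5M-record discrepancy is real data loss or just a reporting artifact of compression.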
