I counted the number of records on the uncompressed input and the sorted output: 3,344,109,862 input, 3,339,587,570 output.
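(For anyone wanting to reproduce the check: counts independent of the Hadoop UI can be obtained with a trivial GROUP ALL / COUNT job over each file. This is only a sketch; the `'input'` path is a placeholder, and COUNT semantics may vary slightly between Pig versions.)

```pig
-- Count every record in a file, independent of Hadoop's UI counters.
-- 'input' is a hypothetical placeholder path.
recs    = LOAD 'input';
grouped = GROUP recs ALL;
cnt     = FOREACH grouped GENERATE COUNT(recs);
DUMP cnt;
```

Running the same script against the sorted output and comparing the two numbers gives the difference directly.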
So the difference is 4,522,292 records.

2009/2/19 <[email protected]>:
> 2009/2/19 Alan Gates <[email protected]>:
>> The second job in an order by is a sampling job, so the fact that it wrote
>> only one record is expected. The one record is a quantiles tuple that
>> describes how the next job should set up its partitioner.
>>
>> The third job should read exactly the same number of records as the first
>> job wrote, as its input is the output of the first job.
>>
>> Are you getting these numbers from hadoop's UI?
>
> Yes. At the beginning of using pig I used to check actual record
> numbers against hadoop's UI and they did not differ.
>
>> If so, are you using
>> compression, as that sometimes messes up the reporting of the hadoop UI.
>
> Yes, the input data is bzip2 compressed. I will check without compression.
>
>> Also, did you run a separate job to count the number of records on your
>> input and output files?
>
> No. I will check the numbers separately.
>
>> We haven't tested current pig with hadoop 19. In fact, I didn't think it
>> ran with it at all without applying a patch. I don't know if that could
>> contribute to this or not.
>
> I'm using PIG-573.patch from
> https://issues.apache.org/jira/browse/PIG-573 to run current pig on
> hadoop branch-0.19.
>
>>
>> Alan.
>>
>> On Feb 18, 2009, at 3:03 PM, <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I passed 3,344,109,862 records to ORDER and got 3,339,587,570 in the
>>> output with no noticeable errors.
>>>
>>> There were three jobs.
>>> The first got 3,344,109,862 records (map input) and produced the same
>>> number (map output).
>>> The second got 248,820 (map input) and produced 1 (reduce output).
>>> The third got 3,339,587,570 (map input) and produced the same number
>>> (reduce output).
>>> So I guess something went wrong in the second job.
>>>
>>> I used pig from trunk at revision 743989 and hadoop from branch-0.19
>>> at revision 745383.
>>>
>>> I'd be happy to use pig with no data loss, and I'm ready to provide
>>> additional details or tests if it helps.
>>> Thanks.
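(For context, the order by under discussion boils down to a script of this shape. This is a minimal sketch; the load path, schema, and relation names are hypothetical. As Alan notes above, Pig compiles the ORDER into three MapReduce jobs: the main load, a sampling job that emits a single quantiles tuple, and a final partition-and-sort job.)

```pig
-- Hypothetical paths and schema, for illustration only.
raw    = LOAD 'input' AS (key:chararray, val:chararray);
sorted = ORDER raw BY key;        -- compiled into the sample/quantiles/partition jobs
STORE sorted INTO 'sorted_output';
```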
