The second job in an order by is a sampling job, so the fact that it
wrote only one record is expected. The one record is a quantiles
tuple that describes how the next job should set up its partitioner.
The third job should read exactly the same number of records as the
first job wrote, as its input is the output of the first job.
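To make the quantiles mechanism concrete, here's an illustrative sketch (not Pig's actual code; the class and method names are made up). The sampling job emits sorted cut points, and the sort job's partitioner sends each key to the reducer whose range contains it, e.g. by binary search over the cuts:

```java
import java.util.Arrays;

// Illustrative sketch only -- not Pig's real partitioner. With N-1 sorted
// cut points from the sampling job, keys are split across N reducers so
// that reducer i receives keys greater than cuts[i-1] and at most cuts[i].
public class QuantilePartitioner {
    private final long[] cuts; // sorted quantile cut points from the sampling job

    public QuantilePartitioner(long[] cuts) {
        this.cuts = cuts;
    }

    // Returns the reducer (partition) index for a key.
    public int getPartition(long key) {
        int idx = Arrays.binarySearch(cuts, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent
        return idx >= 0 ? idx : -idx - 1;
    }

    public static void main(String[] args) {
        QuantilePartitioner p = new QuantilePartitioner(new long[]{10, 20, 30});
        System.out.println(p.getPartition(5));   // partition 0
        System.out.println(p.getPartition(15));  // partition 1
        System.out.println(p.getPartition(999)); // partition 3 (last reducer)
    }
}
```

Because every reducer's output range is contiguous and ordered, concatenating the reducer outputs yields a totally ordered result.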
Are you getting these numbers from Hadoop's UI? If so, are you using
compression? Compression sometimes throws off the record counts the
Hadoop UI reports. Also, did you run a separate job to count the number
of records in your input and output files?
We haven't tested current Pig with Hadoop 0.19. In fact, I didn't think
it ran on 0.19 at all without applying a patch. I don't know whether
that could contribute to this or not.
Alan.
On Feb 18, 2009, at 3:03 PM, <[email protected]> wrote:
Hi,
I passed 3,344,109,862 records to ORDER and got 3,339,587,570 in the
output with no noticeable errors.
There were three jobs.
First got 3,344,109,862 records (map input) and produced the same
number (map output).
Second got 248,820 (map input) and produced 1 (reduce output).
Third got 3,339,587,570 (map input) and produced the same number
(reduce output).
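For concreteness, the gap between the first job's output and the third job's input works out to 4,522,292 records; a trivial check on the figures above (plain Java):

```java
// Arithmetic on the record counts quoted above: the first job's map
// output vs. the third job's map input.
public class RecordGap {
    static final long FIRST_JOB_OUT = 3_344_109_862L;
    static final long THIRD_JOB_IN  = 3_339_587_570L;

    public static long missing() {
        return FIRST_JOB_OUT - THIRD_JOB_IN;
    }

    public static void main(String[] args) {
        System.out.println(missing()); // prints 4522292
    }
}
```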
So I guess something was wrong in the second job.
I used pig from trunk at revision 743989 and hadoop from branch-0.19
at revision 745383.
I'd be happy to use Pig with no data loss, and I'm ready to provide
additional details or run tests if that helps.
Thanks.