2009/2/19 Alan Gates <[email protected]>:
> The second job in an order by is a sampling job, so the fact that it wrote
> only one record is expected.  The one record is a quantiles tuple that
> describes how the next job should set up its partitioner.
>
> The third job should read exactly the same number of records as the first
> job wrote, as its input is the output of the first job.
>
> Are you getting these numbers from hadoop's UI?

Yes.  When I first started using Pig I compared actual record counts
against Hadoop's UI, and they always matched.
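For anyone following along, here is a toy sketch (not Pig's actual implementation, and the quantile values are made up) of the idea Alan describes: the sampling job emits quantile cut points, and the final job's partitioner uses them to route each key to the reducer whose range contains it.

```python
import bisect

# Hypothetical quantile boundaries produced by a sampling pass;
# one fewer boundary than the number of reducers (here: 4 reducers).
quantiles = [25, 50, 75]

def get_partition(key):
    """Route a key to the reducer whose key range contains it."""
    # Keys < 25 go to reducer 0, 25 <= k < 50 to reducer 1, and so on.
    return bisect.bisect_right(quantiles, key)

print([get_partition(k) for k in (3, 25, 60, 99)])  # [0, 1, 2, 3]
```

Because every reducer receives a contiguous key range, concatenating the sorted reducer outputs in partition order yields a totally ordered result.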

> If so, are you using
> compression, as that sometimes messes up the reporting of the hadoop UI.

Yes, the input data is bzip2-compressed.  I will rerun without compression.

>  Also, did you run a separate job to count the number of records on your
> input and output files?

No.  I will count the input and output records separately.
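To count independently of the Hadoop UI counters, I was planning something like this sketch (paths are placeholders; in practice the file would be a part file pulled out of HDFS with `hadoop fs -get`):

```python
import bz2, os, tempfile

def count_records(path):
    """Count newline-terminated records in a bzip2-compressed file."""
    with bz2.open(path, "rt") as f:
        return sum(1 for _ in f)

# Demo on a throwaway file standing in for a real part-*.bz2.
with tempfile.NamedTemporaryFile(suffix=".bz2", delete=False) as tmp:
    tmp.write(bz2.compress(b"rec1\nrec2\nrec3\n"))
print(count_records(tmp.name))  # 3
os.remove(tmp.name)
```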

> We haven't tested current pig with hadoop 19.  In fact, I didn't think it
> ran with it at all without applying a patch.  I don't know if that could
> contribute to this or not.

I'm using PIG-573.patch from
https://issues.apache.org/jira/browse/PIG-573 to run current pig on
hadoop branch-0.19.

>
> Alan.
>
> On Feb 18, 2009, at 3:03 PM, <[email protected]> wrote:
>
>> Hi,
>>
>> I passed 3,344,109,862 records to ORDER and got 3,339,587,570 in the
>> output with no noticeable errors.
>>
>> There were three jobs.
>> First got 3,344,109,862 records (map input) and produced the same
>> number (map output).
>> Second got 248,820 (map input) and produced 1 (reduce output).
>> Third got 3,339,587,570 (map input) and produced the same number
>> (reduce output).
>> So I guess something was wrong in the second job.
>>
>> I used pig from trunk at revision 743989 and hadoop from branch-0.19
>> at revision 745383.
>>
>> I'd be happy to use pig with no data lost and ready to provide
>> additional details or tests if it helps.
>> Thanks.
>
>
