My best guess is that, at a low level, a string is often terminated by a
null byte.
Perhaps that's where the difference lies.
Perhaps the gz decompressor simply stops at the null byte and the basic
record reader that follows just continues.
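To illustrate what I mean (just a toy sketch in python, not what hadoop
actually does): a byte-oriented reader sees every newline in a buffer, but
anything that treats the same buffer as a C-style NUL-terminated string
stops at the first zero byte.

    import ctypes

    # Toy data: a JSON line, a run of NUL bytes, then another JSON line.
    data = b'{"a": 1}\n' + b"\x00" * 16 + b'\n{"b": 2}\n'

    # A byte-oriented reader still sees every newline.
    print(data.count(b"\n"))            # -> 3

    # Treating the same buffer as a NUL-terminated C string loses
    # everything after the first zero byte.
    print(ctypes.c_char_p(data).value)  # -> b'{"a": 1}\n'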
In this situation your input file contains bytes that should not occur in
an ASCII file (like the JSON file you have), and as such you can expect
the unexpected ;)
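If you want to check the file itself, something like this should show it
(rough sketch, the file names are just placeholders for yours): compare the
newline count of the raw file with what comes back out of gzip, and see
where the first NUL byte sits.

    import gzip

    # Placeholders: point these at your actual file and its gzipped copy.
    with open("input.json", "rb") as f:
        raw = f.read()

    with gzip.open("input.json.gz", "rb") as f:
        unzipped = f.read()

    print("raw newlines:     ", raw.count(b"\n"))
    print("unzipped newlines:", unzipped.count(b"\n"))
    print("first NUL at byte:", raw.find(b"\x00"))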

Niels
On Jun 10, 2013 7:24 PM, "William Oberman" <ober...@civicscience.com> wrote:

> I posted this to the pig mailing list, but it might be more related to
> hadoop itself, I'm not sure.
>
> Quick recap: I had a file of "\n" separated lines of JSON.  I decided to
> compress it to save on storage costs.  After compression I got a different
> answer for a pig query that basically == "count lines".
>
> After a lot of digging, I found an input file that had a line that is a
> huge block of null characters followed by a "\n".  I wrote scripts to
> examine the file directly, and if I stop counting at the weird line, I get
> the same count as what pig claims for that file.  If I count all lines
> (i.e. don't stop at the corrupt line) I get the "uncompressed" count pig
> claims.
>
> I don't know how to debug hadoop/pig itself quite as well, though I'm trying
> now.  But my working theory is that some combination of pig/hadoop aborts
> processing the gz stream on a null character (or something like that), but
> keeps chugging on a non-gz stream.  Does that sound familiar or make sense
> to anyone?
>
> will
>
