I posted this to the Pig mailing list, but it might be more related to
Hadoop itself; I'm not sure.

Quick recap: I had a file of "\n"-separated lines of JSON.  I decided to
compress it to save on storage costs.  After compression I got a different
answer for a Pig query that is basically "count lines".

After a lot of digging, I found an input file containing a line that is a
huge block of null characters followed by a "\n".  I wrote scripts to
examine the file directly, and if I stop counting at that corrupt line, I
get the same count Pig reports for the compressed file.  If I count all
lines (i.e., don't stop at the corrupt line), I get the "uncompressed"
count Pig reports.
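
For reference, here's a minimal sketch of the kind of check I ran (the
path is a placeholder; I'm treating a line as corrupt if it consists
entirely of null bytes):

    #!/usr/bin/env python
    # Count newline-terminated lines two ways: everything, and the
    # number of lines that come before the first all-null line.

    PATH = "part-00000.json"  # placeholder path

    total = 0
    before_corrupt = None
    with open(PATH, "rb") as f:
        for i, line in enumerate(f):
            body = line.rstrip(b"\n")
            # A "corrupt" line here is non-empty and nothing but nulls.
            if before_corrupt is None and body and not body.strip(b"\x00"):
                before_corrupt = i  # i lines precede the corrupt one
            total += 1

    print("all lines:                %d" % total)
    print("lines before corrupt one: %s" % before_corrupt)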

I don't know how to debug Hadoop/Pig internals nearly as well, though I'm
trying now.  My working theory is that some combination of Pig/Hadoop
aborts processing the gzip stream when it hits the null characters (or
something like that), but keeps chugging on a non-gzip stream.  Does that
sound familiar or make sense to anyone?
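
One way I plan to sanity-check the theory is to decompress the .gz outside
Hadoop entirely and count lines; a sketch (again, the path is a
placeholder):

    #!/usr/bin/env python
    # Decompress outside Hadoop and count lines, to check whether the
    # full content survives plain gzip decompression (i.e., whether the
    # truncation is in the Hadoop/Pig read path rather than the data).

    import gzip

    PATH = "part-00000.json.gz"  # placeholder path

    count = 0
    with gzip.open(PATH, "rb") as f:
        for line in f:
            count += 1

    print("lines after gzip decompression: %d" % count)

If that prints the full count, the data itself is intact and the
truncation happens somewhere in the Hadoop/Pig gzip read path.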

will
