I posted this to the Pig mailing list, but it might be more related to Hadoop itself; I'm not sure.
Quick recap: I had a file of "\n"-separated lines of JSON, and I decided to compress it to save on storage costs. After compression, a Pig query that is essentially "count lines" returned a different answer.

After a lot of digging, I found an input file containing a line that is a huge block of NUL characters followed by a "\n". I wrote scripts to examine the file directly: if I stop counting at that corrupt line, I get the same count Pig reports for the compressed file; if I count all lines (i.e. don't stop at the corrupt line), I get the "uncompressed" count Pig reports.

I don't know how to debug Hadoop/Pig internals all that well, though I'm trying now. My working theory is that some combination of Pig/Hadoop aborts processing the gzip stream on a NUL character (or something like that) but keeps chugging on a non-gzip stream. Does that sound familiar or make sense to anyone?

Will
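For reference, this is roughly the kind of check my scripts did, as a minimal sketch (the synthetic data and the helper name are mine, not the actual scripts): count newline-terminated lines once through a plain read and once through Python's gzip codec, and note the first line containing NUL bytes. Both paths should agree on the count; a mismatch would point at the reader, not the data.

```python
import gzip
import io

def count_lines(stream):
    """Count newline-terminated lines and report the first line
    that contains NUL bytes (the corruption described above)."""
    total = 0
    first_nul_line = None
    for lineno, line in enumerate(stream, start=1):
        total += 1
        if first_nul_line is None and b"\x00" in line:
            first_nul_line = lineno
    return total, first_nul_line

# Synthetic stand-in for the real file: two JSON lines with a
# block of NULs between them.
data = b'{"a": 1}\n' + b"\x00" * 1024 + b"\n" + b'{"b": 2}\n'

# Plain read.
plain = count_lines(io.BytesIO(data))

# Gzip round-trip, then read back through the gzip codec.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(data)
buf.seek(0)
gz_result = count_lines(gzip.GzipFile(fileobj=buf, mode="rb"))

print("plain:", plain)      # (3, 2) -- 3 lines, NULs on line 2
print("gzip: ", gz_result)  # (3, 2) -- same counts either way
```

In this sketch both reads agree, which is what I see with my own scripts when I read the files directly; the disagreement only shows up in the Pig/Hadoop counts.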