My best guess is that at a low level a string is often terminated by having
a null byte at the end.
Perhaps that's where the difference lies.
Perhaps the gz decompressor simply stops at the null byte and the basic
record reader that follows simply continues.
In this situation your input file contai
I posted this to the pig mailing list, but it might be more related to
hadoop itself, I'm not sure.
Quick recap: I had a file of "\n" separated lines of JSON. I decided to
compress it to save on storage costs. After compression I got a different
answer for a pig query that basically == "count li