Hi,

I'm having trouble with Hadoop (tested with 0.17 and 0.19) not fully processing 
certain gzipped input files. It reads and processes only the first part of the 
gzipped file and silently ignores the rest, without any kind of warning.

It affects at least (and possibly not only) gzip files that are the result of 
concatenation, which the gzip format explicitly allows:
http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

Repro case, using the "WordCount" example from the hadoop distribution:
$ echo 'one two three' > f1
$ echo 'four five six' > f2
$ gzip -c f1 > combined_file.gz
$ gzip -c f2 >> combined_file.gz

Now, if I run "WordCount" with combined_file.gz as input, it will only find the 
words 'one', 'two', 'three', but not 'four', 'five', 'six'.
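
To illustrate what I think is happening, here is a minimal Python sketch. It is 
not Hadoop code and assumes only Python 3's standard gzip and zlib modules. The 
helper name read_first_member_only is my own. It decodes one gzip member and 
stops, which is how I assume a reader behaves if it treats the end of the first 
member as the end of the file:

```python
import gzip
import zlib

# Build the same two-member data as the repro above, in memory.
member1 = gzip.compress(b"one two three\n")
member2 = gzip.compress(b"four five six\n")
combined = member1 + member2


def read_first_member_only(data):
    """Decode a single gzip member and stop (hypothetical naive reader)."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    text = decoder.decompress(data)
    # Anything after the first member's trailer is left undecoded here.
    return text, decoder.unused_data


first, leftover = read_first_member_only(combined)
print(first)               # b'one two three\n'
print(leftover == member2) # True: the second member was never decoded

# A multi-member aware reader returns both lines.
print(gzip.decompress(combined))
```

The decoder reports no error in this case; the second member simply ends up in 
unused_data, which matches the silent behaviour I'm seeing.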

It seems Java's GZIPInputStream may have a similar issue:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425

Now, if I unzip and re-gzip this 'combined_file.gz' manually, the problem goes 
away.

It's especially dangerous because Hadoop doesn't report any error or complain 
in the least. It just ignores the extra input. The only way to notice is to run 
one's app on the gzipped and the unzipped data side by side and see that the 
record counts differ.
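
A cheaper check than a side-by-side run might be to count the gzip members in 
each input file; anything with more than one member would be a candidate for 
this problem. The following is a rough Python sketch under that assumption. The 
function name count_gzip_members and the chunk size are my own choices, and it 
does not handle trailing padding after the last member:

```python
import sys
import zlib


def count_gzip_members(path, chunk_size=64 * 1024):
    """Return the number of gzip members in the file at `path`."""
    members = 0
    decoder = None
    with open(path, "rb") as f:
        buf = f.read(chunk_size)
        while buf:
            if decoder is None:
                # Start a fresh gzip decoder for the next member.
                decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
            decoder.decompress(buf)  # decompressed output is discarded
            if decoder.eof:
                # Reached this member's trailer; leftover bytes belong
                # to the next member, if there is one.
                members += 1
                buf = decoder.unused_data
                decoder = None
                if not buf:
                    buf = f.read(chunk_size)
            else:
                buf = f.read(chunk_size)
    if decoder is not None:
        raise EOFError("file ends in the middle of a gzip member")
    return members


if __name__ == "__main__":
    # Usage: python count_members.py file1.gz file2.gz ...
    for name in sys.argv[1:]:
        print(name, count_gzip_members(name))
```

For the combined_file.gz from the repro this should report 2, and 1 after the 
file has been unzipped and re-gzipped.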

Is anyone else familiar with this problem? Are there any solutions or 
workarounds short of re-gzipping very large amounts of data?

Thanks!
/ Oscar
