I’m trying to do a simple count() on a large number of GZipped files in S3.
My job is failing with the following message:

14/05/15 19:12:37 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: incorrect header check
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
    at java.io.InputStream.read(InputStream.java:101)

<snipped>
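For what it's worth, that "incorrect header check" message comes straight from zlib: it means the decompressor was handed bytes that don't begin with the header it expects for the format it was told to decode. A minimal sketch reproducing the same message in plain Python (my own illustration, not the Hadoop code path):

```python
import gzip
import io
import zlib

# Build a small gzip stream in memory.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"hello world")
data = buf.getvalue()

# zlib.decompress() with the default wbits expects a zlib header, not a
# gzip header, so handing it gzip bytes fails the same way the Hadoop
# ZlibDecompressor did in the trace above.
try:
    zlib.decompress(data)
except zlib.error as e:
    print(e)  # zlib reports "incorrect header check"
```

So the error suggests a format mismatch between the bytes in S3 and the codec Hadoop chose for them, rather than a problem inside Spark itself.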

I traced this down to HADOOP-5281 (https://issues.apache.org/jira/browse/HADOOP-5281), but I'm not sure 1) whether it's the same issue, or 2) how to go about resolving it.

I gather I need to update some Hadoop jar? Any tips on where to look/what
to do?
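In case it helps anyone hitting the same thing: before swapping jars, it may be worth verifying that the objects really are gzip, since one mis-named or truncated file in the bucket is enough to fail the whole count(). A small check for the gzip magic number (a sketch of my own; the function name is just illustrative):

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream

def looks_like_gzip(first_bytes: bytes) -> bool:
    """True if the stream begins with the gzip magic number."""
    return first_bytes[:2] == GZIP_MAGIC

# A real gzip stream passes; arbitrary bytes (e.g. plain text that was
# merely renamed to .gz) do not.
sample = io.BytesIO()
with gzip.GzipFile(fileobj=sample, mode="wb") as f:
    f.write(b"some log line\n")
print(looks_like_gzip(sample.getvalue()))   # True
print(looks_like_gzip(b"plain text data"))  # False
```

Running that over the first couple of bytes of each object would at least narrow it down to "bad file" versus "Hadoop codec bug".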

I’m running Spark on an EC2 cluster created by spark-ec2 with no special
options used.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
