I think there are a few ways to do this... the simplest might be to
manually build a comma-separated list of paths that excludes the bad
file and pass that to textFile().

When you call textFile(), under the hood it passes your filename
string to hadoopFile(), which calls setInputPaths() on the Hadoop
FileInputFormat:

http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path...)

I think this can accept a comma-separated list of paths.
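
For example (made-up paths), two files in one call:

sc.textFile("s3n://bucket/stuff/a.gz,s3n://bucket/stuff/b.gz")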

So you could do something like this (roughly; untested):

import org.apache.hadoop.fs.Path

val glob = new Path("s3n://bucket/stuff/*.gz")
val fs = glob.getFileSystem(sc.hadoopConfiguration)
// globStatus expands the glob (listStatus does not); drop the bad file here
val files = fs.globStatus(glob).filter(_.getPath.getName != "bad.gz")
val fileStr = files.map(_.getPath.toString).mkString(",")

sc.textFile(fileStr)...

- Patrick




On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> YES, your hunches were correct. I've identified at least one file among
> the hundreds I'm processing that is indeed not a valid gzip file.
>
> Does anyone know of an easy way to exclude a specific file or files when
> calling sc.textFile() on a pattern? e.g. something like:
> sc.textFile('s3n://bucket/stuff/*.gz, exclude:s3n://bucket/stuff/bad.gz')
> 
>
>
> On Wed, May 21, 2014 at 11:50 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for the suggestions, people. I will try to home in on which
>> specific gzipped files, if any, are actually corrupt.
>>
>> Michael,
>>
>> I'm using Hadoop 1.0.4, which I believe is the default version that gets
>> deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281
>> <https://issues.apache.org/jira/browse/HADOOP-5281>, affects Hadoop
>> 0.18.0, is fixed in 0.20.0, and is also related to gzip compression. I
>> know there is some funkiness in how Hadoop is versioned, so I'm not sure if
>> this issue is relevant to 1.0.4.
>>
>> Were you able to resolve your issue by changing your version of Hadoop?
>> How did you do that?
>>
>> Nick
>> 
>>
>>
>> On Wed, May 21, 2014 at 11:38 PM, Andrew Ash <and...@andrewash.com>
>> wrote:
>>
>>> One thing you can try is to pull each file out of S3 and decompress with
>>> "gzip -d" to see if it works.  I'm guessing there's a corrupted .gz file
>>> somewhere in your path glob.
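>>>
>>> If you want to scan the whole bucket from the Spark shell instead, here
>>> is a rough sketch (untested; assumes the same glob used earlier in the
>>> thread) that flags files the JDK's gzip reader can't decompress:
>>>
>>> import java.util.zip.GZIPInputStream
>>> import org.apache.hadoop.fs.Path
>>>
>>> val glob = new Path("s3n://bucket/stuff/*.gz")
>>> val fs = glob.getFileSystem(sc.hadoopConfiguration)
>>> val buf = new Array[Byte](65536)
>>> for (status <- fs.globStatus(glob)) {
>>>   try {
>>>     // The constructor throws on a bad header; reading to EOF
>>>     // catches corruption further into the file.
>>>     val in = new GZIPInputStream(fs.open(status.getPath))
>>>     try { while (in.read(buf) != -1) {} } finally in.close()
>>>   } catch {
>>>     case e: java.io.IOException =>
>>>       println(s"corrupt: ${status.getPath} (${e.getMessage})")
>>>   }
>>> }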
>>>
>>> Andrew
>>>
>>>
>>> On Wed, May 21, 2014 at 12:40 PM, Michael Cutler <mich...@tumra.com>
>>> wrote:
>>>
>>>> Hi Nick,
>>>>
>>>> Which version of Hadoop are you using with Spark?  I spotted an issue
>>>> with the built-in GzipDecompressor while doing something similar with
>>>> Hadoop 1.0.4: all my gzip files were valid and tested, yet certain
>>>> files still blew up in Hadoop/Spark.
>>>>
>>>> The bug affects all Hadoop releases prior to 1.2.x; the following
>>>> JIRA ticket goes into more detail:
>>>> https://issues.apache.org/jira/browse/HADOOP-8900
>>>>
>>>> MC
>>>>
>>>>
>>>>
>>>>
>>>> Michael Cutler
>>>> Founder, CTO
>>>> Mobile: +44 789 990 7847
>>>> Email: mich...@tumra.com
>>>> Web: tumra.com
>>>> Visit us at our offices in Chiswick Park <http://goo.gl/maps/abBxq>
>>>> Registered in England & Wales, 07916412. VAT No. 130595328
>>>>
>>>>
>>>> On 21 May 2014 14:26, Madhu <ma...@madhu.com> wrote:
>>>>
>>>>> Can you identify a specific file that fails?
>>>>> There might be a real bug here, but I have found gzip to be reliable.
>>>>> Every time I have run into a "bad header" error with gzip, I had a
>>>>> non-gzip file with the wrong extension for whatever reason.
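>>>>>
>>>>> One quick way to test that theory (a sketch; assumes the suspect file
>>>>> has been pulled down locally): every real gzip file starts with the
>>>>> magic bytes 0x1f 0x8b.
>>>>>
>>>>> import java.io.FileInputStream
>>>>>
>>>>> // True if the first two bytes match the gzip magic number.
>>>>> def looksLikeGzip(path: String): Boolean = {
>>>>>   val in = new FileInputStream(path)
>>>>>   try in.read() == 0x1f && in.read() == 0x8b
>>>>>   finally in.close()
>>>>> }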
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----
>>>>> Madhu
>>>>> https://www.linkedin.com/in/msiddalingaiah
>>>>>
>>>>
>>>>
>>>
>>
>
