bq. writing the output (to Amazon S3) failed

What's the value of "fs.s3.maxRetries"? Increasing the value should help.
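If you need to set it from code rather than spark-defaults.conf, a minimal sketch (Scala, spark-shell style; the value 10 is just an illustrative guess, tune it to your workload):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3-output"))
    // Raise the Hadoop S3 client's retry limit before writing to S3;
    // a transient DNS hiccup can exhaust a low retry count quickly.
    sc.hadoopConfiguration.set("fs.s3.maxRetries", "10")

Equivalently, setting spark.hadoop.fs.s3.maxRetries in spark-defaults.conf gets copied into the Hadoop configuration for you.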
Cheers

On Wed, Apr 1, 2015 at 8:34 AM, Romi Kuntsman <r...@totango.com> wrote:
> What about communication errors, not just corrupted files? Both when
> reading input and when writing output. We currently experience a failure
> of the entire process if the last stage, writing the output (to Amazon
> S3), fails because of a very temporary DNS resolution issue (easily
> resolved by retrying).
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Wed, Apr 1, 2015 at 12:58 PM, Gil Vernik <g...@il.ibm.com> wrote:
>
> > I actually saw the same issue: we analysed a container with a few
> > hundred GBs of zip files - one was corrupted and Spark exited with an
> > exception on the entire job. I like SPARK-6593, since it can also
> > cover additional cases, not just corrupted zip files.
> >
> > From: Dale Richardson <dale...@hotmail.com>
> > To: "dev@spark.apache.org" <dev@spark.apache.org>
> > Date: 29/03/2015 11:48 PM
> > Subject: One corrupt gzip in a directory of 100s
> >
> > Recently I had an incident reported to me where somebody was analysing
> > a directory of gzipped log files and was struggling to load them into
> > Spark because one of the files was corrupted: calling
> > sc.textFile("hdfs:///logs/*.gz") caused an IOException on the
> > particular executor that was reading that file, which caused the
> > entire job to be cancelled after the retry count was exceeded, without
> > any way of catching and recovering from the error. While normally I
> > think it is entirely appropriate to stop execution if something is
> > wrong with your input, sometimes it is useful to analyse what you can
> > get (as long as you are aware that input has been skipped) and treat
> > corrupt files as acceptable losses.
> >
> > To cater for this particular case I've added SPARK-6593 (PR at
> > https://github.com/apache/spark/pull/5250), which adds an option
> > (spark.hadoop.ignoreInputErrors) to log exceptions raised by the
> > Hadoop input format but continue on with the next task.
> >
> > Ideally in this case you would want to report the corrupt file paths
> > back to the master so they could be dealt with in a particular way
> > (e.g. moved to a separate directory), but that would require a public
> > API change/addition. I was pondering an addition to Spark's Hadoop API
> > that could report processing status back to the master via an optional
> > accumulator that collects filepath/Option(exception message) tuples,
> > so the user has some idea of which files are being processed and which
> > are being skipped.
> >
> > Regards,
> > Dale.
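For anyone who wants to try this once the PR lands, a minimal sketch (Scala) of how the proposed option might be used. Note that spark.hadoop.ignoreInputErrors is the name from the PR and is not in any released Spark, so treat it as an assumption that may change during review:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("gzip-logs")
      // Proposed in SPARK-6593 / PR 5250: log exceptions raised by the
      // Hadoop input format and move on to the next task instead of
      // failing the whole job.
      .set("spark.hadoop.ignoreInputErrors", "true")
    val sc = new SparkContext(conf)

    // With the option on, a corrupt .gz should be logged and skipped
    // rather than exhausting task retries and cancelling the job.
    val lines = sc.textFile("hdfs:///logs/*.gz")
    println(s"Lines successfully read: ${lines.count()}")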