Re: One corrupt gzip in a directory of 100s

Romi Kuntsman Wed, 01 Apr 2015 08:38:17 -0700

What about communication errors, as opposed to corrupted files?
These can occur both when reading input and when writing output.
We currently see the entire process fail if the last stage, writing the
output to Amazon S3, fails because of a transient DNS resolution issue
(one that is easily resolved by retrying).
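
To illustrate the kind of retry meant here, below is a minimal plain-Python
sketch, not Spark or S3 client code. The names TransientError and
with_retries are hypothetical, introduced only for this example, and the
attempt count and delays are arbitrary assumptions.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a temporary failure, e.g. a DNS lookup error."""


def with_retries(action, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call action(); on TransientError, back off and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A final write step wrapped this way would only fail the job once every
attempt has failed, rather than on the first temporary error.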


*Romi Kuntsman*, *Big Data Engineer*
 http://www.totango.com

On Wed, Apr 1, 2015 at 12:58 PM, Gil Vernik <g...@il.ibm.com> wrote:

> I actually saw the same issue, where we analyzed a container with a few
> hundred GBs of zip files - one was corrupted and Spark exited with an
> exception for the entire job.
> I like SPARK-6593, since it can also cover additional cases, not just
> corrupted zip files.
>
>
>
> From:   Dale Richardson <dale...@hotmail.com>
> To:     "dev@spark.apache.org" <dev@spark.apache.org>
> Date:   29/03/2015 11:48 PM
> Subject:        One corrupt gzip in a directory of 100s
>
>
>
> I recently had an incident reported to me where somebody was analysing a
> directory of gzipped log files and was struggling to load them into Spark
> because one of the files was corrupted - calling
> sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
> executor that was reading that file, which caused the entire job to be
> cancelled after the retry count was exceeded, without any way of catching
> and recovering from the error.  While normally I think it is entirely
> appropriate to stop execution if something is wrong with your input,
> sometimes it is useful to analyse what you can get (as long as you are
> aware that input has been skipped), and treat corrupt files as acceptable
> losses.
> To cater for this particular case I've added SPARK-6593 (PR at
> https://github.com/apache/spark/pull/5250), which adds an option
> (spark.hadoop.ignoreInputErrors) to log exceptions raised by the Hadoop
> input format but continue on with the next task.
> Ideally in this case you would want to report the corrupt file paths back
> to the master so they could be dealt with in a particular way (e.g. moved to
> a separate directory), but that would require a public API
> change/addition. I was pondering an addition to Spark's Hadoop API that
> could report processing status back to the master via an optional
> accumulator that collects filepath/Option(exception message) tuples so the
> user has some idea of what files are being processed, and what files are
> being skipped.
> Regards, Dale.
>
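
The idea of collecting filepath/Option(exception message) tuples can be
sketched outside Spark in plain Python. This is only an analogue under
stated assumptions: read_gzip_dir is a hypothetical helper, not part of
Spark or of the PR, and it uses a local directory rather than HDFS.

```python
import gzip
import zlib
from pathlib import Path


def read_gzip_dir(directory):
    """Read every *.gz file in `directory`, skipping files that fail to decompress.

    Returns (lines, status), where status holds one (path, error message or None)
    tuple per file, so the caller can see what was read and what was skipped.
    """
    lines, status = [], []
    for path in sorted(Path(directory).glob("*.gz")):
        try:
            with gzip.open(path, "rt", encoding="utf-8") as handle:
                file_lines = handle.read().splitlines()
        except (OSError, EOFError, zlib.error) as exc:
            status.append((str(path), str(exc)))  # skipped: record the reason
            continue
        lines.extend(file_lines)
        status.append((str(path), None))  # read cleanly
    return lines, status
```

The entries whose second element is not None are the files a user might
then move to a separate directory for inspection.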
