S3n is governed by the same config parameter.

Cheers
> On Apr 2, 2015, at 7:33 AM, Romi Kuntsman <r...@totango.com> wrote:
>
> Hi Ted,
> Not sure what the config value is; I'm using the s3n filesystem and not s3.
>
> The error that I get is the following:
> (so does that mean it's 4 retries?)
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task
> 2.3 in stage 0.0 (TID 11, ip.ec2.internal): java.net.UnknownHostException:
> mybucket.s3.amazonaws.com
>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>     at java.net.Socket.connect(Socket.java:579)
>     at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
>     at sun.security.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:451)
>     at sun.security.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:140)
>     at org.apache.commons.httpclient.protocol.SSLProtocolSocketFactory.createSocket(SSLProtocolSocketFactory.java:82)
>     at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
>     at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
>     at java.lang.Thread.run(Thread.java:745)
>
> Romi Kuntsman, Big Data Engineer
> http://www.totango.com
>
>> On Wed, Apr 1, 2015 at 6:46 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> bq. writing the output (to Amazon S3) failed
>>
>> What's the value of "fs.s3.maxRetries"?
>> Increasing the value should help.
>>
>> Cheers
>>
>>> On Wed, Apr 1, 2015 at 8:34 AM, Romi Kuntsman <r...@totango.com> wrote:
>>>
>>> What about communication errors rather than corrupted files?
>>> Both when reading input and when writing output.
>>> We currently experience a failure of the entire process if the last stage
>>> of writing the output (to Amazon S3) fails because of a very temporary DNS
>>> resolution issue (easily resolved by retrying).
>>>
>>> *Romi Kuntsman*, *Big Data Engineer*
>>> http://www.totango.com
>>>
>>> On Wed, Apr 1, 2015 at 12:58 PM, Gil Vernik <g...@il.ibm.com> wrote:
>>>
>>> > I actually saw the same issue, where we analyzed a container with a few
>>> > hundred GBs of zip files - one was corrupted and Spark exited with an
>>> > exception on the entire job.
>>> > I like SPARK-6593, since it can also cover additional cases, not just
>>> > corrupted zip files.
>>> >
>>> > From: Dale Richardson <dale...@hotmail.com>
>>> > To: "dev@spark.apache.org" <dev@spark.apache.org>
>>> > Date: 29/03/2015 11:48 PM
>>> > Subject: One corrupt gzip in a directory of 100s
>>> >
>>> > Recently had an incident reported to me where somebody was analysing a
>>> > directory of gzipped log files, and was struggling to load them into Spark
>>> > because one of the files was corrupted - calling
>>> > sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
>>> > executor that was reading that file, which caused the entire job to be
>>> > cancelled after the retry count was exceeded, without any way of catching
>>> > and recovering from the error. While normally I think it is entirely
>>> > appropriate to stop execution if something is wrong with your input,
>>> > sometimes it is useful to analyse what you can get (as long as you are
>>> > aware that input has been skipped), and treat corrupt files as acceptable
>>> > losses.
>>> > To cater for this particular case I've added SPARK-6593 (PR at
>>> > https://github.com/apache/spark/pull/5250), which adds an option
>>> > (spark.hadoop.ignoreInputErrors) to log exceptions raised by the Hadoop
>>> > input format but continue on with the next task.
>>> > Ideally in this case you would want to report the corrupt file paths back
>>> > to the master so they could be dealt with in a particular way (e.g. moved
>>> > to a separate directory), but that would require a public API
>>> > change/addition. I was pondering an addition to Spark's Hadoop API that
>>> > could report processing status back to the master via an optional
>>> > accumulator that collects filepath/Option(exception message) tuples, so
>>> > the user has some idea of which files are being processed and which files
>>> > are being skipped.
>>> >
>>> > Regards,
>>> > Dale.