Can you look a bit more into the error logs? The executors could be getting
killed because of OOM, etc. One thing you can try is to set
spark.shuffle.blockTransferService to nio instead of the default netty.
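For example, a minimal sketch assuming you build the SparkConf in your driver
(the app name below is just a placeholder; the same property can also be passed
to spark-submit with --conf spark.shuffle.blockTransferService=nio):

    import org.apache.spark.{SparkConf, SparkContext}

    // Switch the shuffle block transfer service from the default "netty" to "nio"
    val conf = new SparkConf()
      .setAppName("my-job")  // placeholder app name
      .set("spark.shuffle.blockTransferService", "nio")
    val sc = new SparkContext(conf)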

Thanks
Best Regards

On Wed, Jun 24, 2015 at 5:46 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> I have a Spark job that has 7 stages. The first 3 stages complete and the
> fourth stage begins (it joins two RDDs). This stage has multiple task
> failures, all with the exception below.
>
> Multiple tasks (100s of them) get the same exception, on different hosts.
> How can all the hosts suddenly stop responding when a few moments ago 3 stages
> ran successfully? If I re-run the job, the three stages will again run
> successfully. I cannot think of it being a cluster issue.
>
>
> Any suggestions ?
>
>
> Spark Version : 1.3.1
>
> Exception:
>
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to HOST
>       at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>       at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>       at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>       at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>       at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>       at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>       at org.apache.sp
>
>
> --
> Deepak
>
>
