I have a Spark job with 7 stages. The first 3 stages complete and the
fourth stage begins (it joins two RDDs). This stage has multiple task
failures, all with the exception below.
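
For context, the failing stage does essentially the following (a minimal
sketch; the RDD names, key type, and data sizes here are made up for
illustration and are not the real job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object JoinRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JoinRepro"))

    // Two keyed RDDs. Joining them forces a shuffle, so every task in
    // the join stage must fetch map output over the network from the
    // executors that ran the earlier stages. That remote fetch is what
    // fails with the FetchFailedException below.
    val left  = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
    val right = sc.parallelize(1 to 1000000).map(i => (i % 1000, i * 2))

    val joined = left.join(right) // the stage that fails
    joined.count()

    sc.stop()
  }
}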

Multiple tasks (hundreds of them) get the same exception, each naming a
different host. How can all the hosts suddenly stop responding when,
moments earlier, 3 stages ran successfully? If I re-run the job, the
first three stages will again run successfully. I cannot think of it
being a cluster issue.


Any suggestions?


Spark version: 1.3.1

Exception:

org.apache.spark.shuffle.FetchFailedException: Failed to connect to HOST
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
        at org.apache.sp


-- 
Deepak