I have a job that runs fine on relatively small input datasets, but once the input grows past a threshold it consistently fails with "Fetch failure" as the Failure Reason, late in the job, during a saveAsTextFile() operation.
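For context, the job shape is roughly the following sketch (run in spark-shell); the input path and the word-count-style transformation are illustrative assumptions, not our real pipeline, but the failure happens in the same place: the shuffle-reading stage that feeds the final save.

```scala
// Sketch only: paths and the reduceByKey transform are placeholders.
val input = sc.textFile("hdfs:///data/large-input")

// reduceByKey introduces a shuffle; the "Fetch failure" appears while
// the final stage reads this shuffle output (Shuffle Read ~3.3 GB).
val counts = input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

// Fails late in the job once the input crosses the size threshold.
counts.saveAsTextFile("hdfs:///data/output")
```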
The first error we are seeing on the "Details for Stage" page is "ExecutorLostFailure". My Shuffle Read is 3.3 GB, and that's the only metric that seems high. We have three servers, each configured with 5g of memory for this job, and the job is running in spark-shell. The first error in the shell is "Lost executor 2 on (servername): remote Akka client disassociated".

We are still trying to understand how best to diagnose jobs using the web UI, so it's likely there is helpful info there that we just don't know how to interpret. Is there any kind of "troubleshooting guide" beyond the Spark Configuration page?

I don't know if I'm providing enough info here. Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.