I'm running Spark 1.3.1 on AWS. I have a long-running application (Spark context) which accepts and completes jobs fine. However, it crashes at seemingly random times (anywhere from 1 hour up to 6 days in). In the latest case, the context ran and finished hundreds of jobs without an issue and then suddenly crashed, with the following two lines in the executors' logs:
    15/06/24 10:35:44 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-***-**-36-70.us-west-2.compute.internal:59891] -> [akka.tcp://sparkDriver@ip-***-**-42-150.us-west-2.compute.internal:56572] disassociated! Shutting down.
    15/06/24 10:35:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@ip-***-**-42-150.us-west-2.compute.internal:56572] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

Following advice on the forums, I removed the SPARK_PUBLIC_DNS setting and increased the following Akka configs:

    spark.akka.failure-detector.threshold   30000
    spark.akka.heartbeat.interval           100000
    spark.akka.heartbeat.pauses             600000

This resulted in the context crashing after 2 hours, with different warnings/errors *during* the operation:

    [2015-06-25 04:50:16,769] WARN ient.AppClient$ClientActor [] [akka://JobServer/user/context-supervisor/spark-sql-context] - Connection to akka.tcp://sparkMaster@ec2-***.us-west-2.compute.amazonaws.com:7077 failed; waiting for master to reconnect...
    [2015-06-25 04:50:17,400] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/spark-sql-context] - Lost executor 0 on ip-***.us-west-2.compute.internal: remote Akka client disassociated

Despite these, the log continues and even shows a couple of jobs completing afterwards... but then, end of story: the context silently died.

Help with understanding and dealing with this would be greatly appreciated!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Akka-failures-Driver-Disassociated-tp23486.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
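In case it matters for reproduction, this is how I'm applying the settings above (a sketch; the conf/spark-defaults.conf path assumes a standard standalone install, and the spark.akka.* keys only exist on Akka-based Spark versions like 1.3.x):

```
# conf/spark-defaults.conf -- Akka failure-detector tuning (Spark 1.3.x)
spark.akka.failure-detector.threshold   30000
spark.akka.heartbeat.interval           100000
spark.akka.heartbeat.pauses             600000
```

The same properties can also be passed per application at submit time, e.g. spark-submit --conf spark.akka.heartbeat.interval=100000 ...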