The Spark application UI shows "CANNOT FIND ADDRESS" for one of the executors:

Aggregated Metrics by Executor

Executor ID | Address                                | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input | Shuffle Read | Shuffle Write | Shuffle Spill (Memory) | Shuffle Spill (Disk)
0           | mddworker1.c.fi-mdd-poc.internal:42197 | 0 ms      | 0           | 0            | 0               | 0.0 B | 136.1 MB     | 184.9 MB      | 146.8 GB               | 135.4 MB
1           | CANNOT FIND ADDRESS                    | 0 ms      | 0           | 0            | 0               | 0.0 B | 87.4 MB      | 142.0 MB      | 61.4 GB                | 81.4 MB
I also see the following in the log of one of the executors, for which the driver may have lost communication:

14/10/29 13:18:33 WARN : Master_Client Heartbeat last execution took 90859 ms. Longer than the FIXED_EXECUTION_INTERVAL_MS 5000
14/10/29 13:18:33 WARN : WorkerClientToWorkerHeartbeat last execution took 90859 ms. Longer than the FIXED_EXECUTION_INTERVAL_MS 1000
14/10/29 13:18:33 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
        at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)

I have also seen other variations of timeouts:

14/10/29 06:21:05 WARN SendingConnection: Error finishing connection to mddworker1.c.fi-mdd-poc.internal/10.240.179.241:40442
java.net.ConnectException: Connection refused
14/10/29 06:21:05 ERROR BlockManager: Failed to report broadcast_6_piece0 to master; giving up.
or:

14/10/29 07:23:40 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
        at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:218)
        at org.apache.spark.storage.BlockManagerMaster.updateBlockInfo(BlockManagerMaster.scala:58)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$tryToReportBlockStatus(BlockManager.scala:310)
        at org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:190)
        at org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:188)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at org.apache.spark.util.TimeStampedHashMap.foreach(TimeStampedHashMap.scala:107)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.storage.BlockManager.reportAllBlocks(BlockManager.scala:188)
        at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:207)
        at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:366)

How do I track down what is causing this problem? Any suggestions on a solution, debugging steps, or a workaround would be helpful!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
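From the logs, the heartbeat thread took ~90 s against a 5 s interval, so I suspect the executor stalled (possibly a long GC pause) until the driver's default 30 s Akka ask timeout gave up on it. As a diagnostic and tentative workaround, and I am not sure these are the right knobs, I am considering enabling GC logging on the executors and raising the timeouts, roughly like this (values picked arbitrarily on my side):

```
# spark-defaults.conf (Spark 1.x) -- tentative diagnostic/workaround settings,
# not a confirmed fix

# Surface long GC pauses in the executor stdout logs
spark.executor.extraJavaOptions        -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

# Give stalled executors more headroom before driver-side asks time out (seconds)
spark.akka.askTimeout                  120
spark.core.connection.ack.wait.timeout 600
```

If the GC logs show full collections in the tens of seconds, I would then look at executor memory sizing rather than just the timeouts.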