Spark Tasks failing with Cannot find address
I have a Spark stage that has 8 tasks; 7/8 have completed, but 1 task keeps failing with CANNOT FIND ADDRESS.

Aggregated Metrics by Executor:

Executor ID  Address              Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read Size / Records  Shuffle Write Size / Records  Shuffle Spill (Memory)  Shuffle Spill (Disk)
19           CANNOT FIND ADDRESS  24 min     1            1             0                1248.9 MB / 5619400          0.0 B / 0                     0.0 B                   0.0 B
47           CANNOT FIND ADDRESS  2.3 h      1            1             0                1295.3 MB / 5620203          0.0 B / 0                     0.0 B                   0.0 B

Any suggestions?

-- Deepak
Re: Spark Tasks failing with Cannot find address
Spark version: 1.3

Command:

./bin/spark-submit -v --master yarn-cluster \
  --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/hdfs/hadoop-hdfs-2.4.1-company-2.jar \
  --num-executors 100 --driver-memory 4g \
  --driver-java-options -XX:MaxPermSize=4G \
  --executor-memory 8g --executor-cores 1 \
  --queue hdmi-express \
  --class com.company.ep.poc.spark.reporting.SparkApp \
  /home/dvasthimal/spark1.3/spark_reporting-1.0-SNAPSHOT.jar \
  startDate=2015-04-6 endDate=2015-04-7 \
  input=/user/dvasthimal/epdatasets_small/exptsession subcommand=viewItem \
  output=/user/dvasthimal/epdatasets/viewItem

On Wed, Apr 8, 2015 at 2:30 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
> I have a Spark stage that has 8 tasks; 7/8 have completed, but 1 task keeps failing with CANNOT FIND ADDRESS...

-- Deepak
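A quick sanity check on the two failed attempts in the metrics table: both read a near-identical amount of shuffle data (and record count), yet one attempt took nearly six times as long. That pattern points at a struggling executor (long GC pauses, a sick node) rather than data skew. A small sketch of the arithmetic, with the figures transcribed as best they can be read from the flattened UI table:

```python
# Task time vs. shuffle read for the two failed attempts shown in the
# "Aggregated Metrics by Executor" table (values transcribed from the UI).
attempts = [
    {"executor": "19", "task_time_s": 24 * 60,    "shuffle_read_mb": 1248.9},
    {"executor": "47", "task_time_s": 2.3 * 3600, "shuffle_read_mb": 1295.3},
]

for a in attempts:
    rate = a["shuffle_read_mb"] / a["task_time_s"]
    print(f'executor {a["executor"]}: {rate:.3f} MB/s shuffle read')

# Similar input sizes but a ~6x gap in task time suggests the slow attempt
# was starved (GC, swapping, or a bad node), not handed more data.
```

If the record counts had differed by an order of magnitude instead, repartitioning or a better key distribution would be the first thing to try.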
Re: CANNOT FIND ADDRESS
No luck :( Still observing the same behavior!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637p17988.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: CANNOT FIND ADDRESS
Try this:

  spark.storage.memoryFraction  0.9

On 31 Oct 2014 20:27, akhandeshi <ami.khande...@gmail.com> wrote:
> Thanks for the pointers! I did try them, but they didn't seem to help...
Re: CANNOT FIND ADDRESS
Thanks for the pointers! I did try them, but they didn't seem to help... In my latest attempt I am running spark-submit in local mode, but I see the same message in the Spark app UI (port 4040): localhost CANNOT FIND ADDRESS.

In the logs I see a lot of in-memory maps spilling to disk. I don't understand why that is the case: there should be over 35 GB of RAM available, and the input is not significantly large. It may be linked to the performance issues that I am seeing (I have another post asking for advice on that). It seems I am not able to tune Spark sufficiently to execute my process successfully.

14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 206 spilling in-memory map of 1777 MB to disk (2 times so far)
14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 393 MB to disk (1 time so far)
14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 392 MB to disk (2 times so far)
14/10/31 13:45:14 INFO ExternalAppendOnlyMap: Thread 230 spilling in-memory map of 554 MB to disk (2 times so far)
14/10/31 13:45:15 INFO ExternalAppendOnlyMap: Thread 235 spilling in-memory map of 3990 MB to disk (1 time so far)
14/10/31 13:45:15 INFO ExternalAppendOnlyMap: Thread 236 spilling in-memory map of 2667 MB to disk (1 time so far)
14/10/31 13:45:17 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 825 MB to disk (3 times so far)
14/10/31 13:45:24 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 4618 MB to disk (2 times so far)
14/10/31 13:45:26 INFO ExternalAppendOnlyMap: Thread 233 spilling in-memory map of 15869 MB to disk (1 time so far)
14/10/31 13:45:37 INFO ExternalAppendOnlyMap: Thread 206 spilling in-memory map of 3026 MB to disk (3 times so far)
14/10/31 13:45:48 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 401 MB to disk (3 times so far)
14/10/31 13:45:48 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 392 MB to disk (4 times so far)

My Spark properties are:

spark.akka.frameSize                    50
spark.akka.timeout                      900
spark.app.name                          Simple File Merge Application
spark.core.connection.ack.wait.timeout  900
spark.default.parallelism               10
spark.driver.host                       spark-single.c.fi-mdd-poc.internal
spark.driver.memory                     35g
spark.driver.port                       40255
spark.eventLog.enabled                  true
spark.fileserver.uri                    http://10.240.106.135:59255
spark.jars                              file:/home/ami_khandeshi_gmail_com/SparkExample-1.0.jar
spark.master                            local[16]
spark.scheduler.mode                    FIFO
spark.shuffle.consolidateFiles          true
spark.storage.memoryFraction            0.3
spark.tachyonStore.baseDir              /tmp
spark.tachyonStore.folderName           spark-21ad0fd2-2177-48ce-9242-8dbb33f2a1f1
spark.tachyonStore.url                  tachyon://mdd:19998

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637p17824.html
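The spills may be less surprising than they look. In Spark of this era, the shuffle structures (ExternalAppendOnlyMap) do not draw on the whole heap but on a pool of roughly heap x spark.shuffle.memoryFraction x spark.shuffle.safetyFraction, split across concurrently running tasks. A back-of-envelope sketch, assuming that era's default values of 0.2 and 0.8 for those two settings:

```python
# Rough per-task shuffle budget in Spark 1.x, assuming the defaults
# spark.shuffle.memoryFraction = 0.2 and spark.shuffle.safetyFraction = 0.8.
def shuffle_budget_per_task_mb(heap_gb, concurrent_tasks,
                               memory_fraction=0.2, safety_fraction=0.8):
    pool_mb = heap_gb * 1024 * memory_fraction * safety_fraction
    # Each running task is only guaranteed its 1/N share of the pool.
    return pool_mb / concurrent_tasks

# local[16] with a 35 GB driver heap:
print(round(shuffle_budget_per_task_mb(35, 16)))  # 358 (MB per task)
```

That ~358 MB per task lines up with the ~390 MB maps spilling in the log above, so raising spark.shuffle.memoryFraction (not just spark.storage.memoryFraction) or lowering concurrency would be the levers to try.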
CANNOT FIND ADDRESS
The Spark application UI shows that one of the executors Cannot Find Address.

Aggregated Metrics by Executor:

Executor ID  Address                                 Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Input  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            mddworker1.c.fi-mdd-poc.internal:42197  0 ms       0            0             0                0.0 B  136.1 MB      184.9 MB       146.8 GB                135.4 MB
1            CANNOT FIND ADDRESS                     0 ms       0            0             0                0.0 B  87.4 MB       142.0 MB       61.4 GB                 81.4 MB

I also see the following in the log of the one executor with which the driver may have lost communication:

14/10/29 13:18:33 WARN : Master_Client Heartbeat last execution took 90859 ms. Longer than the FIXED_EXECUTION_INTERVAL_MS 5000
14/10/29 13:18:33 WARN : WorkerClientToWorkerHeartbeat last execution took 90859 ms. Longer than the FIXED_EXECUTION_INTERVAL_MS 1000
14/10/29 13:18:33 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
        at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)

I have also seen other variations of timeouts:

14/10/29 06:21:05 WARN SendingConnection: Error finishing connection to mddworker1.c.fi-mdd-poc.internal/10.240.179.241:40442
java.net.ConnectException: Connection refused
14/10/29 06:21:05 ERROR BlockManager: Failed to report broadcast_6_piece0 to master; giving up.

or

14/10/29 07:23:40 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
        at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:218)
        at org.apache.spark.storage.BlockManagerMaster.updateBlockInfo(BlockManagerMaster.scala:58)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$tryToReportBlockStatus(BlockManager.scala:310)
        at org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:190)
        at org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:188)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at org.apache.spark.util.TimeStampedHashMap.foreach(TimeStampedHashMap.scala:107)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.storage.BlockManager.reportAllBlocks(BlockManager.scala:188)
        at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:207)
        at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:366)

How do I track down what is causing this problem? Any suggestions on a solution, debugging steps, or a workaround would be helpful!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637.html
Re: CANNOT FIND ADDRESS
CANNOT FIND ADDRESS occurs when your executor has crashed. I'd look further down, where it shows each task, and see whether any tasks failed. Then you can examine the error log of that executor and see why it died.

On Wed, Oct 29, 2014 at 9:35 AM, akhandeshi <ami.khande...@gmail.com> wrote:
> The Spark application UI shows that one of the executors Cannot Find Address...
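Following that advice: the dead executor's stderr (its YARN container log, or $SPARK_HOME/work/<app-id>/<executor-id>/stderr in standalone mode) usually names the cause directly. A small filter for the usual death signatures; the sample lines below are illustrative stand-ins, not output from this job:

```python
import re

# Signatures that commonly explain a dead executor in Spark 1.x logs.
DEATH_SIGNATURES = re.compile(
    r"OutOfMemoryError|GC overhead limit exceeded|Container killed"
    r"|ExecutorLostFailure|Connection refused|Futures timed out"
)

def suspicious_lines(log_text):
    """Return the log lines matching a known executor-death signature."""
    return [line for line in log_text.splitlines()
            if DEATH_SIGNATURES.search(line)]

# Illustrative sample in place of a real executor stderr:
sample = """\
14/10/29 13:10:01 INFO Executor: Running task 3.0 in stage 2.0
14/10/29 13:18:33 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
14/10/29 13:19:05 ERROR Executor: java.lang.OutOfMemoryError: GC overhead limit exceeded
"""
for line in suspicious_lines(sample):
    print(line)
```

In practice an OutOfMemoryError or a "Container killed" message points at memory sizing, while bare timeouts with no error before them more often mean GC pauses or network trouble.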
Re: CANNOT FIND ADDRESS
Thanks... hmm. It seems to be a timeout issue, perhaps? Not sure what is causing it, or how to debug it. I see the following error message:

14/10/29 13:26:04 ERROR ContextCleaner: Error cleaning broadcast 9
akka.pattern.AskTimeoutException: Timed out
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
        at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)
        at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)
        at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)
        at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)
        at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
        at java.lang.Thread.run(Thread.java:745)
14/10/29 13:26:04 WARN BlockManagerMaster: Failed to remove broadcast 9 with removeFromMaster = true - Timed out

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637p17646.html
Re: CANNOT FIND ADDRESS
Can you try setting the following while creating the SparkContext and see if the issue still exists?

  .set("spark.core.connection.ack.wait.timeout", "900")
  .set("spark.akka.frameSize", "50")
  .set("spark.akka.timeout", "900")

It looks like your executor is stuck in a GC pause.

Thanks
Best Regards

On Wed, Oct 29, 2014 at 9:20 PM, akhandeshi <ami.khande...@gmail.com> wrote:
> Thanks... hmm. It seems to be a timeout issue, perhaps? Not sure what is causing it, or how to debug it...
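The same three settings can also be supplied without touching code, as repeated --conf flags to spark-submit. A tiny sketch (the helper name is made up here) that renders them:

```python
# The three settings suggested above, rendered as spark-submit --conf flags.
# Values are plain strings; Spark parses them itself.
settings = {
    "spark.core.connection.ack.wait.timeout": "900",
    "spark.akka.frameSize": "50",
    "spark.akka.timeout": "900",
}

def to_submit_flags(conf):
    # spark-submit accepts repeated --conf key=value pairs.
    return " ".join(f"--conf {k}={v}" for k, v in sorted(conf.items()))

print(to_submit_flags(settings))
```

Putting them in conf/spark-defaults.conf (one "key value" pair per line) would have the same effect for every submission.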
Executor address issue: CANNOT FIND ADDRESS (Spark 0.9.1)
Hi,

One of the executors in my Spark cluster shows CANNOT FIND ADDRESS as its address, for one of the stages which failed. After that stage, I got cascading failures across all my stages :/ (stages that seem complete but still appear as active in the dashboard; incomplete or failed stages that remain in the active section). Just a note that in the later stages there were no more CANNOT FIND ADDRESS issues.

Did anybody get this address issue and find a solution? Could this problem explain the cascading failures?

Thanks!
Nicolas

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executor-address-issue-CANNOT-FIND-ADDRESS-Spark-0-9-1-tp13748.html