Hi all,

Any idea why I am getting 'Executor heartbeat timed out'? I am fairly new to Spark, so I don't know much about its internals. The job had been running for a day or so on 102 GB of data with 40 workers.

-Best,
Anup.
15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 395987 ms
15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813 in memory! (computed 840.0 B so far)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1782.0 in stage 2713.0 (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 ERROR TaskSetManager: Task 1782 in stage 2713.0 failed 1 times; aborting job
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1791.0 in stage 2713.0 (TID 9101193, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1800.0 in stage 2713.0 (TID 9101202, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1764.0 in stage 2713.0 (TID 9101166, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1773.0 in stage 2713.0 (TID 9101175, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1809.0 in stage 2713.0 (TID 9101211, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1794.0 in stage 2713.0 (TID 9101196, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1740.0 in stage 2713.0 (TID 9101142, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1803.0 in stage 2713.0 (TID 9101205, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1812.0 in stage 2713.0 (TID 9101214, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1785.0 in stage 2713.0 (TID 9101187, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1767.0 in stage 2713.0 (TID 9101169, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1776.0 in stage 2713.0 (TID 9101178, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1797.0 in stage 2713.0 (TID 9101199, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1779.0 in stage 2713.0 (TID 9101181, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1806.0 in stage 2713.0 (TID 9101208, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1788.0 in stage 2713.0 (TID 9101190, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1761.0 in stage 2713.0 (TID 9101163, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1755.0 in stage 2713.0 (TID 9101157, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1796.0 in stage 2713.0 (TID 9101198, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1778.0 in stage 2713.0 (TID 9101180, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1787.0 in stage 2713.0 (TID 9101189, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1805.0 in stage 2713.0 (TID 9101207, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1790.0 in stage 2713.0 (TID 9101192, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1781.0 in stage 2713.0 (TID 9101183, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1808.0 in stage 2713.0 (TID 9101210, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1799.0 in stage 2713.0 (TID 9101201, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1772.0 in stage 2713.0 (TID 9101174, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1763.0 in stage 2713.0 (TID 9101165, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1802.0 in stage 2713.0 (TID 9101204, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1748.0 in stage 2713.0 (TID 9101150, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1775.0 in stage 2713.0 (TID 9101177, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1766.0 in stage 2713.0 (TID 9101168, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1811.0 in stage 2713.0 (TID 9101213, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1793.0 in stage 2713.0 (TID 9101195, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1769.0 in stage 2713.0 (TID 9101171, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1810.0 in stage 2713.0 (TID 9101212, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1801.0 in stage 2713.0 (TID 9101203, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1795.0 in stage 2713.0 (TID 9101197, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1777.0 in stage 2713.0 (TID 9101179, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1786.0 in stage 2713.0 (TID 9101188, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1804.0 in stage 2713.0 (TID 9101206, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1813.0 in stage 2713.0 (TID 9101215, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1807.0 in stage 2713.0 (TID 9101209, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1789.0 in stage 2713.0 (TID 9101191, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1780.0 in stage 2713.0 (TID 9101182, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1798.0 in stage 2713.0 (TID 9101200, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1792.0 in stage 2713.0 (TID 9101194, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1765.0 in stage 2713.0 (TID 9101167, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1774.0 in stage 2713.0 (TID 9101176, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1783.0 in stage 2713.0 (TID 9101185, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1756.0 in stage 2713.0 (TID 9101158, localhost): ExecutorLostFailure (executor driver lost)
[Stage 2713:=========================> (1762 + 51) / 3354]
15/09/29 06:32:03 WARN SparkContext: Killing executors is only supported in coarse-grained mode
15/09/29 06:32:04 ERROR BlockManager: Failed to report rdd_2_3032 to master; giving up.
Traceback (most recent call last):
  File "/data/home/as198/sdword2vec.py", line 139, in <module>
    main()
  File "/data/home/as198/sdword2vec.py", line 136, in main
    tryGensim()
  File "/data/home/as198/sdword2vec.py", line 114, in tryGensim
    model_dm.build_vocab(articles)
  File "/usr/lib/python2.7/site-packages/gensim-0.12.2-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 495, in build_vocab
    self.scan_vocab(sentences, trim_rule=trim_rule) # initial survey
  File "/usr/lib/python2.7/site-packages/gensim-0.12.2-py2.7-linux-x86_64.egg/gensim/models/doc2vec.py", line 620, in scan_vocab
    for document_no, document in enumerate(documents):
  File "/data/home/ass198/sdword2vec.py", line 97, in __iter__
    for article in labeled_rdd.collect():
  File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 773, in collect
  File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1782 in stage 2713.0 failed 1 times, most recent failure: Lost task 1782.0 in stage 2713.0 (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)