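Try increasing the memory available to the executors, either with --executor-memory 3g or with --conf spark.executor.memory=3g. Your logs mention "executor driver" on localhost, which suggests the job is running in local mode; if so, the executor shares the driver JVM and --driver-memory is the setting that takes effect. Here is what I noticed in your logs: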
15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813 in memory! (computed 840.0 B so far)

On Tue, Sep 29, 2015 at 11:02 AM Anup Sawant <anupsatishsaw...@gmail.com> wrote:

> Hi all,
> Any idea why I am getting 'Executor heartbeat timed out' ? I am fairly new
> to Spark so I have less knowledge about the internals of it. The job was
> running for a day or so on 102 Gb of data with 40 workers.
> -Best,
> Anup.
>
> 15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 395987 ms
> 15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
> 15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813 in memory! (computed 840.0 B so far)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1782.0 in stage 2713.0 (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 ERROR TaskSetManager: Task 1782 in stage 2713.0 failed 1 times; aborting job
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1791.0 in stage 2713.0 (TID 9101193, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1800.0 in stage 2713.0 (TID 9101202, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1764.0 in stage 2713.0 (TID 9101166, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1773.0 in stage 2713.0 (TID 9101175, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1809.0 in stage 2713.0 (TID 9101211, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1794.0 in stage 2713.0 (TID 9101196, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1740.0 in stage 2713.0 (TID 9101142, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1803.0 in stage 2713.0 (TID 9101205, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1812.0 in stage 2713.0 (TID 9101214, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1785.0 in stage 2713.0 (TID 9101187, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1767.0 in stage 2713.0 (TID 9101169, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1776.0 in stage 2713.0 (TID 9101178, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1797.0 in stage 2713.0 (TID 9101199, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1779.0 in stage 2713.0 (TID 9101181, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1806.0 in stage 2713.0 (TID 9101208, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1788.0 in stage 2713.0 (TID 9101190, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1761.0 in stage 2713.0 (TID 9101163, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1755.0 in stage 2713.0 (TID 9101157, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1796.0 in stage 2713.0 (TID 9101198, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1778.0 in stage 2713.0 (TID 9101180, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1787.0 in stage 2713.0 (TID 9101189, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1805.0 in stage 2713.0 (TID 9101207, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1790.0 in stage 2713.0 (TID 9101192, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1781.0 in stage 2713.0 (TID 9101183, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1808.0 in stage 2713.0 (TID 9101210, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1799.0 in stage 2713.0 (TID 9101201, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1772.0 in stage 2713.0 (TID 9101174, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1763.0 in stage 2713.0 (TID 9101165, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1802.0 in stage 2713.0 (TID 9101204, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1748.0 in stage 2713.0 (TID 9101150, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1775.0 in stage 2713.0 (TID 9101177, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1766.0 in stage 2713.0 (TID 9101168, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1811.0 in stage 2713.0 (TID 9101213, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1793.0 in stage 2713.0 (TID 9101195, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1769.0 in stage 2713.0 (TID 9101171, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1810.0 in stage 2713.0 (TID 9101212, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1801.0 in stage 2713.0 (TID 9101203, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1795.0 in stage 2713.0 (TID 9101197, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1777.0 in stage 2713.0 (TID 9101179, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1786.0 in stage 2713.0 (TID 9101188, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1804.0 in stage 2713.0 (TID 9101206, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1813.0 in stage 2713.0 (TID 9101215, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1807.0 in stage 2713.0 (TID 9101209, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1789.0 in stage 2713.0 (TID 9101191, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1780.0 in stage 2713.0 (TID 9101182, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1798.0 in stage 2713.0 (TID 9101200, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1792.0 in stage 2713.0 (TID 9101194, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1765.0 in stage 2713.0 (TID 9101167, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1774.0 in stage 2713.0 (TID 9101176, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1783.0 in stage 2713.0 (TID 9101185, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1756.0 in stage 2713.0 (TID 9101158, localhost): ExecutorLostFailure (executor driver lost)
> [Stage 2713:=========================> (1762 + 51) / 3354]
> 15/09/29 06:32:03 WARN SparkContext: Killing executors is only supported in coarse-grained mode
> 15/09/29 06:32:04 ERROR BlockManager: Failed to report rdd_2_3032 to master; giving up.
> Traceback (most recent call last):
>   File "/data/home/as198/sdword2vec.py", line 139, in <module>
>     main()
>   File "/data/home/as198/sdword2vec.py", line 136, in main
>     tryGensim()
>   File "/data/home/as198/sdword2vec.py", line 114, in tryGensim
>     model_dm.build_vocab(articles)
>   File "/usr/lib/python2.7/site-packages/gensim-0.12.2-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 495, in build_vocab
>     self.scan_vocab(sentences, trim_rule=trim_rule) # initial survey
>   File "/usr/lib/python2.7/site-packages/gensim-0.12.2-py2.7-linux-x86_64.egg/gensim/models/doc2vec.py", line 620, in scan_vocab
>     for document_no, document in enumerate(documents):
>   File "/data/home/ass198/sdword2vec.py", line 97, in __iter__
>     for article in labeled_rdd.collect():
>   File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 773, in collect
>   File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1782 in stage 2713.0 failed 1 times, most recent failure: Lost task 1782.0 in stage 2713.0 (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>     at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>     at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>     at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:745)
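
To make the memory suggestion at the top of this reply concrete, here is a minimal sketch of how the two spellings of the setting map onto a spark-submit command line. The helper names memory_args and submit_command are made up for illustration, and the 3g driver value is only an assumption mirroring the executor example; --executor-memory, --driver-memory and --conf are the actual spark-submit options. The script path is the one from the traceback.

```python
import shlex


def memory_args(executor_memory, driver_memory=None, as_conf=False):
    """Return spark-submit arguments that set the executor (and optionally driver) heap.

    Hypothetical helper for illustration; only the flag names come from spark-submit.
    """
    if as_conf:
        # Generic form: a Spark property passed as --conf key=value.
        args = ["--conf", "spark.executor.memory=" + executor_memory]
    else:
        # Dedicated shortcut flag for the same property.
        args = ["--executor-memory", executor_memory]
    if driver_memory is not None:
        # Relevant when the job runs in local mode, where tasks execute in the driver JVM.
        args += ["--driver-memory", driver_memory]
    return args


def submit_command(script, **kwargs):
    """Assemble the full spark-submit argument list for a script."""
    return ["spark-submit"] + memory_args(**kwargs) + [script]


# 3g for the driver is an assumed value, chosen only to match the executor example.
cmd = submit_command(
    "/data/home/as198/sdword2vec.py",
    executor_memory="3g",
    driver_memory="3g",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Both spellings set the same property, so use whichever reads better in your launch script. Note that the memory flags have to be given when the job is launched, since the JVM heap size cannot be changed once the process has started.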