Re: 1TB shuffle failed with executor lost failure
The exit code 52 comes from org.apache.spark.util.SparkExitCode, where it is defined as `val OOM = 52`, i.e. an OutOfMemoryError. See:
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala

On 19 September 2016 at 14:57, Cyanny LIANG wrote:
> My job is a 1 TB join with a 10 GB table on Spark 1.6.1, run in YARN mode.
>
> 1. If I enable the external shuffle service, the error is:
> Job aborted due to stage failure: ShuffleMapStage 2 (writeToDirectory at NativeMethodAccessorImpl.java:-2) has failed the maximum allowable number of times: 4. Most recent failure reason:
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1473819702737_1239, execId=52)
>   at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:105)
>   at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
>   at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
>   at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
>   at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
>
> 2. If I disable the shuffle service and set spark.executor.instances 80, the error is:
> ExecutorLostFailure (executor 71 exited caused by one of the running tasks) Reason: Container marked as failed: container_1473819702737_1432_01_406847560 on host: nmg01-spark-a0021.nmg01.baidu.com. Exit status: 52. Diagnostics: Exception from container-launch: ExitCodeException exitCode=52
>
> These errors are reported in the shuffle stage. My data is skewed: some ids have 400 million rows while others have only 1 million rows. Does anybody have ideas on how to solve this problem?
>
> 3. My config is below. I use tungsten-sort in off-heap mode; in on-heap mode the OOM problem is even worse.
>
> spark.driver.cores 4
> spark.driver.memory 8g
> # used in client mode
> spark.yarn.am.memory 8g
> spark.yarn.am.cores 4
> spark.executor.memory 8g
> spark.executor.cores 4
> spark.yarn.executor.memoryOverhead 6144
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 40
>
> Best & Regards
> Cyanny LIANG
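For reference, the relevant definition in that file looks roughly like the following (an abbreviated paraphrase; see the linked commit for the exact source):

```scala
private[spark] object SparkExitCode {
  /** The default uncaught exception handler was reached,
      and the uncaught exception was an OutOfMemoryError. */
  val OOM = 52
}
```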
1TB shuffle failed with executor lost failure
My job is a 1 TB join with a 10 GB table on Spark 1.6.1, run in YARN mode.

1. If I enable the external shuffle service, the error is:
Job aborted due to stage failure: ShuffleMapStage 2 (writeToDirectory at NativeMethodAccessorImpl.java:-2) has failed the maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1473819702737_1239, execId=52)
  at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:105)
  at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
  at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
  at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
  at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)

2. If I disable the shuffle service and set spark.executor.instances 80, the error is:
ExecutorLostFailure (executor 71 exited caused by one of the running tasks) Reason: Container marked as failed: container_1473819702737_1432_01_406847560 on host: nmg01-spark-a0021.nmg01.baidu.com. Exit status: 52. Diagnostics: Exception from container-launch: ExitCodeException exitCode=52

These errors are reported in the shuffle stage. My data is skewed: some ids have 400 million rows while others have only 1 million rows. Does anybody have ideas on how to solve this problem?

3. My config is below. I use tungsten-sort in off-heap mode; in on-heap mode the OOM problem is even worse.

spark.driver.cores 4
spark.driver.memory 8g
# used in client mode
spark.yarn.am.memory 8g
spark.yarn.am.cores 4
spark.executor.memory 8g
spark.executor.cores 4
spark.yarn.executor.memoryOverhead 6144
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 40

Best & Regards
Cyanny LIANG
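A common mitigation for this kind of join skew (a general technique, not a fix proposed in the thread) is key salting: spread each hot key over N artificial sub-keys on the large side and replicate the small side N times, so no single reducer partition receives all 400 million rows of one id. A minimal sketch, where `bigRdd`, `smallRdd`, and the fan-out `N` are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

val sc = new SparkContext(new SparkConf().setAppName("salted-join"))

// Stand-ins for the 1 TB and 10 GB sides, keyed by id.
val bigRdd   = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("cold", 3)))
val smallRdd = sc.parallelize(Seq(("hot", 10), ("cold", 30)))

val N = 16  // arbitrary fan-out per key

// Big side: append a random salt to each key.
val saltedBig = bigRdd.map { case (k, v) => ((k, Random.nextInt(N)), v) }
// Small side: replicate each row once per salt value.
val saltedSmall = smallRdd.flatMap { case (k, v) => (0 until N).map(i => ((k, i), v)) }
// Join on the salted key, then drop the salt.
val joined = saltedBig.join(saltedSmall).map { case ((k, _), vs) => (k, vs) }
```

The trade-off is N-fold replication of the small side, which is usually acceptable when it is only a few gigabytes.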
Re: Executor Lost Failure
Try increasing memory for executors (--conf spark.executor.memory=3g or --executor-memory). Here is something I noted from your logs:

15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813 in memory! (computed 840.0 B so far)

On Tue, Sep 29, 2015 at 11:02 AM Anup Sawant wrote:
> Hi all,
> Any idea why I am getting 'Executor heartbeat timed out'? I am fairly new to Spark, so I have little knowledge of its internals. The job had been running for a day or so on 102 GB of data with 40 workers.
> -Best,
> Anup.
>
> 15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 395987 ms
> 15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
> 15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813 in memory! (computed 840.0 B so far)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1782.0 in stage 2713.0 (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 ERROR TaskSetManager: Task 1782 in stage 2713.0 failed 1 times; aborting job
> [dozens of further "Lost task ... in stage 2713.0 ... ExecutorLostFailure (executor driver lost)" lines snipped; the full log is in the original message below]
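The two ways of passing the setting mentioned above are equivalent on the spark-submit command line; a minimal sketch (the jar name and the 3g value are placeholders):

```shell
# Short-form flag sets the executor heap size:
spark-submit --executor-memory 3g my-app.jar

# Equivalent long form via --conf:
spark-submit --conf spark.executor.memory=3g my-app.jar
```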
Re: Executor Lost Failure
Can you list the spark-submit command line you used?

Thanks

On Tue, Sep 29, 2015 at 9:02 AM, Anup Sawant wrote:
> Hi all,
> Any idea why I am getting 'Executor heartbeat timed out'? I am fairly new to Spark, so I have little knowledge of its internals. The job had been running for a day or so on 102 GB of data with 40 workers.
> -Best,
> Anup.
>
> 15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 395987 ms
> 15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
> 15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813 in memory! (computed 840.0 B so far)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1782.0 in stage 2713.0 (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 ERROR TaskSetManager: Task 1782 in stage 2713.0 failed 1 times; aborting job
> [dozens of further "Lost task ... in stage 2713.0 ... ExecutorLostFailure (executor driver lost)" lines snipped; the full log is in the original message below]
Executor Lost Failure
Hi all,
Any idea why I am getting 'Executor heartbeat timed out'? I am fairly new to Spark, so I have little knowledge of its internals. The job had been running for a day or so on 102 GB of data with 40 workers.
-Best,
Anup.

15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 395987 ms
15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813 in memory! (computed 840.0 B so far)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1782.0 in stage 2713.0 (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 ERROR TaskSetManager: Task 1782 in stage 2713.0 failed 1 times; aborting job
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1791.0 in stage 2713.0 (TID 9101193, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1800.0 in stage 2713.0 (TID 9101202, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1764.0 in stage 2713.0 (TID 9101166, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1773.0 in stage 2713.0 (TID 9101175, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1809.0 in stage 2713.0 (TID 9101211, localhost): ExecutorLostFailure (executor driver lost)
[dozens of further "Lost task ... in stage 2713.0 ... ExecutorLostFailure (executor driver lost)" lines follow; the message is truncated in the archive]
Re: foreachRDD causing executor lost failure
If you look a bit into the executor logs, you will see the exact reason (mostly an OOM, GC overhead, etc.). Instead of using foreach, try mapPartitions or foreachPartition.

Thanks
Best Regards

On Tue, Sep 8, 2015 at 10:45 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:
> Hello All,
>
> I am using foreachRDD in my code as:
>
> dstream.foreachRDD { rdd =>
>   rdd.foreach { record =>
>     // look up the Cassandra table
>     // save updated rows to the Cassandra table
>   }
> }
>
> This foreachRDD is causing executor lost failure. What is the behavior of this foreachRDD?
>
> Thanks,
> Padma Ch
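The foreachPartition pattern suggested above can be sketched as follows. This is a hedged illustration, not the poster's actual code: `openSession`, `lookup`, and `save` are hypothetical helpers standing in for whatever Cassandra client is in use. The point is that connection setup and teardown happen once per partition rather than once per record:

```scala
// Process each partition with a single Cassandra session, so per-record
// work is just the lookup and write, not connection setup.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val session = openSession()            // hypothetical: one session per partition
    try {
      records.foreach { record =>
        val row = lookup(session, record)  // hypothetical read from Cassandra
        save(session, row)                 // hypothetical write of the updated row
      }
    } finally {
      session.close()
    }
  }
}
```

Per-record connection churn inside a plain `rdd.foreach` is a frequent cause of executor memory and GC pressure in exactly this kind of streaming write path.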
foreachRDD causing executor lost failure
Hello All,

I am using foreachRDD in my code as:

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // look up the Cassandra table
    // save updated rows to the Cassandra table
  }
}

This foreachRDD is causing executor lost failure. What is the behavior of this foreachRDD?

Thanks,
Padma Ch
Re: Executor lost failure
If you're using YARN with Spark 1.3.1, you could be running into https://issues.apache.org/jira/browse/SPARK-8119, although without more information it's impossible to know.

On Tue, Sep 1, 2015 at 11:28 AM, Priya Ch <learnings.chitt...@gmail.com> wrote:
> Hi All,
>
> I have a Spark Streaming application which writes the processed results to Cassandra. In local mode the code seems to work fine, but the moment I start running in distributed mode on YARN, I see executor lost failures. I increased executor memory to occupy the entire node's memory, which is around 12 GB, but I still see the same issue.
>
> What could be the possible causes of executor lost failure?
Executor lost failure
Hi All,

I have a Spark Streaming application which writes the processed results to Cassandra. In local mode the code seems to work fine, but the moment I start running in distributed mode on YARN, I see executor lost failures. I increased executor memory to occupy the entire node's memory, which is around 12 GB, but I still see the same issue.

What could be the possible causes of executor lost failure?
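One frequent cause of executor loss on YARN (a general suggestion, not something diagnosed in this thread) is the container being killed for exceeding its memory limit: off-heap usage is not covered by spark.executor.memory alone, so giving the executor the whole node's memory leaves no headroom for overhead. Reserving explicit overhead often helps; the values below are illustrative only:

```
spark.executor.memory               8g
spark.yarn.executor.memoryOverhead  2048
```

When this is the cause, the YARN NodeManager logs typically say the container was killed for running beyond physical memory limits.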
Re: Fwd: Executor Lost Failure
Yes... found the output on the web UI of the slave. Thanks :)

On Tue, Nov 11, 2014 at 2:48 AM, Ankur Dave ankurd...@gmail.com wrote:
> At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote:
> > Tasks are now getting submitted, but many tasks don't produce output. For example, after opening the spark-shell, I load a text file from disk and try printing its contents as:
> > sc.textFile("/path/to/file").foreach(println)
> > It does not give me any output.
>
> That's because foreach launches tasks on the slaves. When each task tries to print its lines, the output goes to the stdout file on the slave rather than to your console at the driver. You should see the file's contents in each of the slaves' stdout files in the web UI.
>
> This only happens when running on a cluster. In local mode, all the tasks run locally and can output to the driver, so foreach(println) is more useful.
>
> Ankur
Re: Executor Lost Failure
On Mon, Nov 10, 2014 at 10:52 PM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote:

Tasks are now getting submitted, but many tasks don't produce output. For example, after opening the spark-shell, I load a text file from disk and try printing its contents as:

sc.textFile("/path/to/file").foreach(println)

It does not give me any output. Meanwhile, running:

sc.textFile("/path/to/file").count

gives me the right number of lines in the text file. Not sure what the error is, but here is the console output for the print case:

14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(215230) called with curMem=709528, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 210.2 KB, free 441.5 MB)
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(17239) called with curMem=924758, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 16.8 KB, free 441.5 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on gonephishing.local:42648 (size: 16.8 KB, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block broadcast_6_piece0
14/11/10 22:48:02 INFO FileInputFormat: Total input paths to process : 1
14/11/10 22:48:02 INFO SparkContext: Starting job: foreach at console:13
14/11/10 22:48:02 INFO DAGScheduler: Got job 3 (foreach at console:13) with 2 output partitions (allowLocal=false)
14/11/10 22:48:02 INFO DAGScheduler: Final stage: Stage 3(foreach at console:13)
14/11/10 22:48:02 INFO DAGScheduler: Parents of final stage: List()
14/11/10 22:48:02 INFO DAGScheduler: Missing parents: List()
14/11/10 22:48:02 INFO DAGScheduler: Submitting Stage 3 (Desktop/mnd.txt MappedRDD[7] at textFile at console:13), which has no missing parents
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(2504) called with curMem=941997, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 2.4 KB, free 441.4 MB)
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(1602) called with curMem=944501, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1602.0 B, free 441.4 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on gonephishing.local:42648 (size: 1602.0 B, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block broadcast_7_piece0
14/11/10 22:48:02 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3 (Desktop/mnd.txt MappedRDD[7] at textFile at console:13)
14/11/10 22:48:02 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
14/11/10 22:48:02 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
14/11/10 22:48:02 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on gonephishing.local:48857 (size: 1602.0 B, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on gonephishing.local:48857 (size: 16.8 KB, free: 442.3 MB)
14/11/10 22:48:02 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 308 ms on gonephishing.local (1/2)
14/11/10 22:48:02 INFO DAGScheduler: Stage 3 (foreach at console:13) finished in 0.321 s
14/11/10 22:48:02 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 315 ms on gonephishing.local (2/2)
14/11/10 22:48:02 INFO SparkContext: Job finished: foreach at console:13, took 0.376602079 s
14/11/10 22:48:02 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool

===

On Mon, Nov 10, 2014 at 8:01 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

Try adding the following configurations also, might work.

spark.rdd.compress true
spark.storage.memoryFraction 1
spark.core.connection.ack.wait.timeout 600
spark.akka.frameSize 50

Thanks
Best Regards

On Mon, Nov 10, 2014 at 6:51 PM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote:

Hi,

I am trying to submit my application using spark-submit, with the following spark-default.conf params:

spark.master spark://master-ip:7077
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

===

But every time I am getting this error:

14/11/10 18:39:17 ERROR TaskSchedulerImpl: Lost executor 1 on aa.local: remote Akka client disassociated
14/11/10 18:39:17 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:17
Fwd: Executor Lost Failure
-- Forwarded message --
From: Ritesh Kumar Singh riteshoneinamill...@gmail.com
Date: Mon, Nov 10, 2014 at 10:52 PM
Subject: Re: Executor Lost Failure
To: Akhil Das ak...@sigmoidanalytics.com

[forwarded body identical to the "Re: Executor Lost Failure" message above: foreach(println) gives no output while count works, followed by the same console log and the same quoted exchange with Akhil Das]
Re: Fwd: Executor Lost Failure
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote:
> Tasks are now getting submitted, but many tasks don't produce output. For example, after opening the spark-shell, I load a text file from disk and try printing its contents as:
> sc.textFile("/path/to/file").foreach(println)
> It does not give me any output.

That's because foreach launches tasks on the slaves. When each task tries to print its lines, the output goes to the stdout file on the slave rather than to your console at the driver. You should see the file's contents in each of the slaves' stdout files in the web UI.

This only happens when running on a cluster. In local mode, all the tasks run locally and can output to the driver, so foreach(println) is more useful.

Ankur
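The driver-side alternative implied above can be sketched as follows (a minimal example, not from the thread; it assumes you only want a bounded sample, since collecting a large RDD to the driver would itself cause memory trouble):

```scala
// Bring at most 10 lines back to the driver, so the output appears on the
// driver console rather than in each slave's stdout file.
sc.textFile("/path/to/file").take(10).foreach(println)

// Counting runs on the cluster but returns a single number to the driver:
println(sc.textFile("/path/to/file").count())
```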