Re: Lost executors

2014-08-13 Thread rpandya
After a lot of grovelling through logs, I found out that the Nagios monitor process detected that the machine was almost out of memory, and killed the SNAP executor process. So why is the machine running out of memory? Each node has 128GB of RAM, 4 executors, about 40GB of data. It did run out of

Re: Lost executors

2014-08-13 Thread rpandya
I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size would indeed run out of memory (4 x 60GB = 240GB, and the machine has 110GB). And in fact they would get repeatedly restarted and killed until eventually Spark gave up. I'll try with a smaller limit, but it'll be a while - somehow my HDFS got
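
For reference, a smaller per-executor limit can also be set through spark.executor.memory rather than a cluster-wide SPARK_MEMORY. A minimal sketch, with the 24g figure chosen only so that four executors fit comfortably under ~110GB (the app name and value are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object MemoryCapExample {
  def main(args: Array[String]): Unit = {
    // Cap each executor so four of them fit under ~110GB of usable RAM:
    // 4 x 24g = 96g, leaving headroom for the OS, HDFS, and the monitor.
    val conf = new SparkConf()
      .setAppName("snap-pipeline")              // illustrative app name
      .set("spark.executor.memory", "24g")      // per-executor limit
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}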

Re: Lost executors

2014-08-08 Thread rpandya
Hi Avishek, I'm running on a manual cluster setup, and all the code is Scala. The load averages don't seem high when I see these failures (about 12 on a 16-core machine). Ravi

Lost executors

2014-08-07 Thread rpandya
I'm running into a problem with executors failing, and it's not clear what's causing it. Any suggestions on how to diagnose or fix it would be appreciated. There are a variety of errors in the logs, and I don't see a consistent triggering error. I've tried varying the number of executors per

Re: Memory compute-intensive tasks

2014-08-04 Thread rpandya
This one turned out to be another problem with my app configuration, not with Spark. The compute task depended on the local filesystem, and config errors on 8 of the 10 nodes made them fail early. The Spark wrapper was not checking the process exit value, so it appeared as if they were
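
For reference, a minimal sketch of checking the exit value when a task shells out to an external tool via scala.sys.process (the command name and arguments are illustrative, not the actual wrapper):

import scala.sys.process._

// Launch the external tool and fail the task loudly on a non-zero exit code
// instead of silently treating the run as successful.
def runExternal(inputPath: String, outputPath: String): Unit = {
  val exitCode = Seq("snap", "single", inputPath, "-o", outputPath).!
  if (exitCode != 0)
    sys.error(s"external process exited with code $exitCode for $inputPath")
}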

Visualizing stage task dependency graph

2014-08-04 Thread rpandya
Is there a way to visualize the task dependency graph of an application, during or after its execution? The list of stages on port 4040 is useful, but still quite limited. For example, I've found that if I don't cache() the result of one expensive computation, it will get repeated 4 times, but it
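
Short of a full graph view, RDD.toDebugString prints the lineage of an RDD, which makes it easier to see what would be recomputed without cache(). A minimal sketch (the computation is a stand-in):

import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-demo"))

    // Stand-in for an expensive computation; cache() keeps its result so the
    // two actions below reuse it instead of recomputing the whole lineage.
    val expensive = sc.parallelize(1 to 1000000).map(x => x.toLong * x).cache()

    // toDebugString prints the RDD lineage (the chain of dependencies that
    // would be recomputed on each action if the result were not cached).
    println(expensive.toDebugString)

    println(expensive.count())          // first action: computes and caches
    println(expensive.reduce(_ + _))    // second action: reads from the cache

    sc.stop()
  }
}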

Re: Memory compute-intensive tasks

2014-07-29 Thread rpandya
OK, I did figure this out. I was running the app (avocado) using spark-submit, when it was actually designed to take command-line arguments to connect to a Spark cluster. Since I didn't provide any such arguments, it started a nested local Spark cluster *inside* the YARN Spark executor and so of
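
For anyone hitting the same thing, a minimal sketch of the difference (the argument handling is illustrative, not avocado's actual CLI):

import org.apache.spark.{SparkConf, SparkContext}

object MasterExample {
  def main(args: Array[String]): Unit = {
    // Problematic pattern: hard-coding or defaulting the master inside the app.
    // Under spark-submit on YARN this quietly runs a nested local "cluster"
    // inside a single container instead of using the real cluster:
    //   val conf = new SparkConf().setAppName("avocado").setMaster("local[8]")

    // Safer: leave the master unset so spark-submit (--master yarn) supplies it,
    // or accept it explicitly as a command-line argument when launching by hand.
    val conf = new SparkConf().setAppName("avocado")
    args.headOption.filter(_.startsWith("spark://")).foreach(conf.setMaster)

    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}

With no setMaster call in the app, spark-submit fills in the master at launch time.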

Re: Memory compute-intensive tasks

2014-07-18 Thread rpandya
Hi Matei - Changing to coalesce(numNodes, true) still runs all partitions on a single node, which I verified by printing the hostname before I exec the external process.
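
A minimal sketch of that kind of hostname check, with illustrative names and partition counts:

import java.net.InetAddress
import org.apache.spark.{SparkConf, SparkContext}

object HostCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("host-check"))
    val numNodes = 10                                 // illustrative
    val data = sc.parallelize(1 to 1000, numNodes * 4)

    // Reshuffle into one partition per node, then report which host each
    // partition actually lands on before the external process would be run.
    val hosts = data
      .coalesce(numNodes, shuffle = true)
      .mapPartitions { records =>
        val host = InetAddress.getLocalHost.getHostName
        println(s"partition of ${records.size} records running on $host")
        Iterator(host)
      }
      .collect()

    println(hosts.distinct.mkString(", "))
    sc.stop()
  }
}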

Re: Memory compute-intensive tasks

2014-07-18 Thread rpandya
I also tried increasing --num-executors to numNodes * coresPerNode and using coalesce(numNodes*10, true), and it still ran all the tasks on one node. It seems like it is placing all the executors on one node (though not always the same node, which indicates it is aware of more than one!). I'm using

Re: Memory compute-intensive tasks

2014-07-16 Thread rpandya
Matei - I tried using coalesce(numNodes, true), but it then seemed to run too few SNAP tasks - only 2 or 3 when I had specified 46. The job failed, perhaps for unrelated reasons, with some odd exceptions in the log (at the end of this message). But I really don't want to force data movement
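
For context on the second argument: coalesce(n, shuffle = true) is equivalent to repartition(n) and does force a full shuffle, while the default shuffle = false only merges existing partitions in place. A minimal sketch of the distinction (names and sizes are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object CoalesceNotes {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-notes"))
    val numNodes = 10                                 // illustrative
    val rdd = sc.parallelize(1 to 1000, numNodes * 8)

    // shuffle = false: merges existing partitions without moving data, so
    // parallelism is bounded by where the input blocks already live.
    val merged = rdd.coalesce(numNodes)

    // shuffle = true: equivalent to repartition(numNodes); spreads records
    // evenly across the cluster at the cost of a full shuffle.
    val spread = rdd.coalesce(numNodes, shuffle = true)

    println(merged.partitions.length)   // <= numNodes
    println(spread.partitions.length)   // == numNodes

    sc.stop()
  }
}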