After a lot of grovelling through logs, I found out that the Nagios monitor
process detected that the machine was almost out of memory, and killed the
SNAP executor process.
So why is the machine running out of memory? Each node has 128GB of RAM, 4
executors, and about 40GB of data. It did run out of
I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size
would indeed run out of memory (the machine has 110GB). And in fact they
would get repeatedly restarted and killed until eventually Spark gave up.
I'll try with a smaller limit, but it'll be a while - somehow my HDFS got
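For the smaller limit, a sketch of what the sizing might look like: with ~110GB usable per node and 4 executors, something like 24g each leaves headroom for the OS and HDFS daemons. The app name and the 24g figure are illustrative, not tuned values from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: size executors so 4 per node fit in ~110GB usable RAM.
// 24g x 4 = 96GB leaves headroom for the OS and other daemons.
val conf = new SparkConf()
  .setAppName("snap-alignment")          // placeholder name
  .set("spark.executor.memory", "24g")   // illustrative, not tuned
val sc = new SparkContext(conf)
```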
Hi Avishek,
I'm running on a manual cluster setup, and all the code is Scala. The load
averages don't seem high when I see these failures (about 12 on a 16-core
machine).
Ravi
I'm running into a problem with executors failing, and it's not clear what's
causing it. Any suggestions on how to diagnose or fix it would be
appreciated.
There are a variety of errors in the logs, and I don't see a consistent
triggering error. I've tried varying the number of executors per
This one turned out to be another problem with my app configuration, not with
Spark. The compute task was dependent on the local filesystem, and config
errors on 8 of the 10 nodes made them fail early. The Spark wrapper was
not checking the process exit value, so it appeared as if they were
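The fix for a wrapper like that is to check the exit status of the external process and fail loudly. A minimal sketch, where the command line is a placeholder for the real SNAP invocation:

```scala
import scala.sys.process._

// Sketch: propagate failures from an external tool instead of
// silently ignoring a nonzero exit status.
val exitCode = Seq("snap", "align", "input.fq").!  // runs and waits; placeholder command
if (exitCode != 0)
  throw new RuntimeException(s"external process failed with exit code $exitCode")
```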
Is there a way to visualize the task dependency graph of an application,
during or after its execution? The list of stages on port 4040 is useful,
but still quite limited. For example, I've found that if I don't cache() the
result of one expensive computation, it will get repeated 4 times, but it
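On the cache() point, a minimal sketch of the pattern (the RDD and function names are made up for illustration):

```scala
// Sketch: without cache(), each downstream action recomputes the
// expensive lineage from scratch; caching materializes it once.
val expensive = input.map(costlyTransform)  // hypothetical RDD and function
expensive.cache()                           // or persist(StorageLevel.MEMORY_ONLY)

// Each of these actions would otherwise re-run costlyTransform over input.
val total  = expensive.count()
val sample = expensive.take(10)
```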
OK, I did figure this out. I was running the app (avocado) using
spark-submit, when it was actually designed to take command line arguments
to connect to a spark cluster. Since I didn't provide any such arguments, it
started a nested local Spark cluster *inside* the YARN Spark executor and so
of
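The general principle behind the fix (not avocado's actual code, which I haven't inspected in detail): the application should let spark-submit supply the master rather than defaulting to a local one of its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: leave the master unset in the app so spark-submit's
// --master takes effect. Hard-coding setMaster("local[*]"), or
// falling back to it when no cluster argument is given, is what
// nests a local Spark cluster inside the YARN executor.
val conf = new SparkConf().setAppName("avocado-run")  // placeholder name; no setMaster here
val sc = new SparkContext(conf)
```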
Hi Matei-
Changing to coalesce(numNodes, true) still runs all partitions on a single
node, which I verified by printing the hostname before I exec the external
process.
I also tried increasing --num-executors to numNodes * coresPerNode and using
coalesce(numNodes*10,true), and it still ran all the tasks on one node. It
seems like it is placing all the executors on one node (though not always
the same node, which indicates it is aware of more than one!). I'm using
Matei - I tried using coalesce(numNodes, true), but it then seemed to run too
few SNAP tasks - only 2 or 3 when I had specified 46. The job failed,
perhaps for unrelated reasons, with some odd exceptions in the log (at the
end of this message). But I really don't want to force data movement
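For reference, a sketch of the coalesce variants being discussed; the variable names are illustrative. This shows the shuffle/no-shuffle distinction only, and doesn't by itself explain the single-node placement seen above:

```scala
// Sketch: coalesce(n) without shuffle only merges existing partitions
// and cannot spread data more widely; shuffle = true forces a full
// redistribution, at the cost of moving data across the network.
val spread = data.coalesce(numNodes * coresPerNode, shuffle = true)
// repartition(n) is shorthand for coalesce(n, shuffle = true):
// val spread = data.repartition(numNodes * coresPerNode)
```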