In general, Java processes fail with an OutOfMemoryError when your code and data do not fit into the memory allocated to the runtime. In Spark, that memory is controlled through the --executor-memory flag. If you are running Spark on YARN, then the YARN configuration dictates the maximum memory that your Spark executors can request. Here is a pretty good article about setting memory in Spark on YARN: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_running_spark_on_yarn.html
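As a concrete sketch, a spark-shell session against YARN might set these flags explicitly. The values below are illustrative only, not recommendations; the memoryOverhead setting uses the spark.yarn.executor.memoryOverhead name that was current in the Spark 1.x/2.x era, so check your version's docs:

```shell
# Illustrative values only -- tune them to your data size and to the
# maximums your YARN queue allows (yarn.scheduler.maximum-allocation-mb).
spark-shell \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=512
```

Note that the YARN container size is executor-memory plus the memory overhead, so a request that looks like it fits may still be rejected or killed by YARN if the overhead pushes it past the queue limit.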
If the OS were to kill your process because the system has run out of memory, you would see an error printed to standard error that looks like this:

    Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000e2320000, 37601280, 0) failed; error='Cannot allocate memory' (errno=12)
    # There is insufficient memory for the Java Runtime Environment to continue.

On Wed, Jan 18, 2017 at 10:25 AM, David Frese <david.fr...@active-group.de> wrote:
> Hello everybody,
>
> being quite new to Spark, I am struggling a lot with OutOfMemory exceptions
> and "GC overhead limit reached" failures of my jobs, submitted from a
> spark-shell and "master yarn".
>
> Playing with --num-executors, --executor-memory and --executor-cores I
> occasionally get something done. But I'm also not the only one using the
> cluster, and it seems to me that my jobs sometimes fail with the above
> errors because other people have something running, or have a spark-shell
> open at that time; or at least it seems that with the same code, data and
> settings, the job sometimes completes and sometimes fails.
>
> Is that "expected behaviour"?
>
> What options/tools can be used to make the success/failure of a job
> deterministic? There are a lot of things out there like 'dynamic allocation'
> and the Hadoop 'fair scheduler', but it is very hard for a newbie to
> evaluate them (resp. make suggestions to the admins).
>
> If it cannot be made deterministic, how can I reliably distinguish the OOM
> failures that are caused by incorrect settings on my side (e.g. because my
> data does not fit into memory) from those failures that are caused by
> resource consumption/blocking from other jobs?
>
> Thanks for sharing your thoughts and experiences!
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-jobs-fail-because-of-other-users-of-a-cluster-tp28318.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.