In general, Java processes fail with an OutOfMemoryError when your code and
data do not fit into the memory allocated to the runtime.  In Spark, that
memory is controlled through the --executor-memory flag.
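For example, you might request memory explicitly when launching a shell
against YARN. The sizes below are placeholders, not recommendations; tune
them to your cluster and workload:

```shell
# Launch a spark-shell on YARN with explicit resource requests.
# All values here are illustrative; adjust to your cluster's limits.
spark-shell \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g
```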
If you are running Spark on YARN, then the YARN configuration dictates the
maximum memory that your Spark executors can request.  Here is a good
article about configuring memory for Spark on YARN:
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_running_spark_on_yarn.html
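As a rough sketch of the container size Spark actually asks YARN for: the
executor memory plus an off-heap overhead. This assumes the default
spark.yarn.executor.memoryOverhead of max(384 MB, 10% of executor memory);
your cluster may override it, so check your config.

```python
def yarn_container_mb(executor_memory_mb,
                      overhead_fraction=0.10,
                      min_overhead_mb=384):
    """Estimate the YARN container size (MB) for one Spark executor.

    Assumes the default memoryOverhead rule: max(384 MB, 10% of
    executor memory). Clusters can override this setting.
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead

# A 4 GB executor requests roughly 4096 + 409 = 4505 MB from YARN,
# which is what must fit under yarn.scheduler.maximum-allocation-mb.
print(yarn_container_mb(4096))
```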

If the OS were to kill your process because the system has run out of
memory, you would see an error printed to standard error that looks like
this:

Java HotSpot(TM) 64-Bit Server VM warning: INFO:
os::commit_memory(0x00000000e2320000, 37601280, 0) failed;
error='Cannot allocate memory' (errno=12)
# There is insufficient memory for the Java Runtime Environment to continue.



On Wed, Jan 18, 2017 at 10:25 AM, David Frese <david.fr...@active-group.de>
wrote:

> Hello everybody,
>
> Being quite new to Spark, I am struggling a lot with OutOfMemoryError
> exceptions and "GC overhead limit exceeded" failures in my jobs, submitted
> from a spark-shell with "master yarn".
>
> Playing with --num-executors, --executor-memory and --executor-cores, I
> occasionally get something done. But I'm not the only one using the
> cluster, and it seems to me that my jobs sometimes fail with the above
> errors because other people have something running, or have a spark-shell
> open at that time; at least, with the same code, data and settings, the
> job sometimes completes and sometimes fails.
>
> Is that "expected behaviour"?
>
> What options/tools can be used to make the success/failure of a job
> deterministic? There are a lot of things out there, like 'dynamic
> allocation' and the Hadoop 'fair scheduler', but it is very hard for a
> newbie to evaluate them (or to make suggestions to the admins).
>
> If it cannot be made deterministic, how can I reliably distinguish the OOM
> failures that are caused by incorrect settings on my side (e.g. because my
> data does not fit into memory) from those caused by resource
> consumption/blocking by other jobs?
>
> Thanks for sharing your thoughts and experiences!
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-jobs-fail-because-of-other-users-of-a-cluster-tp28318.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
