In general, Java processes fail with an OutOfMemoryError when your code and data do not fit into the memory allocated to the runtime. In Spark, that memory is controlled through the --executor-memory flag. If you are running Spark on YARN, then the YARN configuration dictates the maximum memory that your Spark executors can request. Here is a pretty good article about setting memory in Spark on YARN: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_running_spark_on_yarn.html
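As a concrete sketch, a spark-shell session against YARN might set these flags explicitly. The values below are illustrative only, not recommendations; the memoryOverhead setting uses the spark.yarn.executor.memoryOverhead name that was current in the Spark 1.x/2.x era, so check your version's docs:

```shell
# Illustrative values only -- tune them to your data size and to the
# maximums your YARN queue allows (yarn.scheduler.maximum-allocation-mb).
spark-shell \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=512
```

Note that the YARN container size is executor-memory plus the memory overhead, so a request that looks like it fits may still be rejected or killed by YARN if the overhead pushes it past the queue limit.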
If the OS were to kill your process because the system has run out of memory, you would see an error printed to standard error that looks like this:

    Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000e2320000, 37601280, 0) failed; error='Cannot allocate memory' (errno=12)
    # There is insufficient memory for the Java Runtime Environment to continue.

On Wed, Jan 18, 2017 at 10:25 AM, David Frese <david.fr...@active-group.de> wrote:
> Hello everybody,
>
> being quite new to Spark, I am struggling a lot with OutOfMemory exceptions
> and "GC overhead limit reached" failures of my jobs, submitted from a
> spark-shell and "master yarn".
>
> Playing with --num-executors, --executor-memory and --executor-cores I
> occasionally get something done. But I'm also not the only one using the
> cluster, and it seems to me that my jobs sometimes fail with the above
> errors because other people have something running, or have a spark-shell
> open at that time; or at least it seems that with the same code, data and
> settings, the job sometimes completes and sometimes fails.
>
> Is that "expected behaviour"?
>
> What options/tools can be used to make the success/failure of a job
> deterministic? There are a lot of things out there like 'dynamic allocation'
> and the Hadoop 'fair scheduler', but it is very hard for a newbie to
> evaluate them (resp. make suggestions to the admins).
>
> If it cannot be made deterministic, how can I reliably distinguish the OOM
> failures that are caused by incorrect settings on my side (e.g. because my
> data does not fit into memory) from those failures that are caused by
> resource consumption/blocking from other jobs?
>
> Thanks for sharing your thoughts and experiences!
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-jobs-fail-because-of-other-users-of-a-cluster-tp28318.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.