Hello everybody,

Being quite new to Spark, I am struggling a lot with OutOfMemoryError
exceptions and "GC overhead limit exceeded" failures in my jobs, which I
submit from a spark-shell running with "--master yarn".

Playing with --num-executors, --executor-memory and --executor-cores, I
occasionally get something done. But I am not the only one using the
cluster, and it seems that my jobs sometimes fail with the above errors
because other people have something running, or have a spark-shell open,
at the same time. At least with the same code, data and settings, a job
sometimes completes and sometimes fails.
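For illustration, a typical invocation looks roughly like the following
(the concrete numbers are only examples of the kind of values I have been
trying, not settings I am confident in):

    spark-shell \
      --master yarn \
      --num-executors 8 \
      --executor-memory 4g \
      --executor-cores 2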

Is that "expected behaviour"?

What options or tools can be used to make the success or failure of a job
deterministic? There are a lot of things out there, such as 'dynamic
allocation' or the Hadoop 'fair scheduler', but it is very hard for a
newbie to evaluate them, or to make suggestions to the admins.
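For example, if dynamic allocation is the right direction, I assume it
would be enabled with settings roughly like these (taken from the Spark
configuration documentation; I have not verified them on our cluster, and
as far as I understand the external shuffle service also has to be
running on the NodeManagers for this to work):

    spark-shell \
      --master yarn \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=1 \
      --conf spark.dynamicAllocation.maxExecutors=20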

If it cannot be made deterministic, how can I reliably distinguish the OOM
failures that are caused by incorrect settings on my side (e.g. because my
data does not fit into memory) from those that are caused by resource
consumption or blocking by other jobs?
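So far, the only way I know to dig into a failure is to pull the
aggregated YARN logs for the application and grep for the actual error,
roughly like this (the application id is just a placeholder):

    # fetch the aggregated container logs of the finished application
    yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX > app.log

    # look for the failure reason reported by the executors / YARN
    grep -E "OutOfMemoryError|GC overhead limit exceeded|Container killed" app.log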

Thanks for sharing your thoughts and experiences!





