I am running Spark 1.4.1 on Mesos.

The Spark job does a "cartesian" of 4 RDDs (aRdd, bRdd, cRdd, dRdd) of sizes
100, 100, 7 and 1 respectively. Let's call it productRDD.

Creation of "aRdd" needs pulling data from multiple sources and merging it
into a JavaRDD of tuples, so "aRdd" ends up looking something like
JavaRDD<Tuple2<A1, A2>>.
bRdd, cRdd and dRdd are just List<>s of values.

Then I apply a transformation on productRDD and finally call "saveAsTextFile"
to save the result of the transformation.
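Just to make the scale concrete: chaining cartesian over those four RDDs multiplies their counts, so productRDD has 100 * 100 * 7 * 1 = 70,000 elements (and Spark nests the pairs, so each element is shaped like Tuple2<Tuple2<Tuple2<A, B>, C>, D>). A local plain-Java sketch of that blow-up, with made-up integer stand-ins for the real element types:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CartesianSize {
    public static void main(String[] args) {
        // Stand-ins for aRdd, bRdd, cRdd, dRdd with the sizes from the question.
        List<Integer> a = IntStream.range(0, 100).boxed().collect(Collectors.toList());
        List<Integer> b = IntStream.range(0, 100).boxed().collect(Collectors.toList());
        List<Integer> c = IntStream.range(0, 7).boxed().collect(Collectors.toList());
        List<Integer> d = IntStream.range(0, 1).boxed().collect(Collectors.toList());

        // Each rdd.cartesian(other) call multiplies the element counts,
        // the same way these nested loops do.
        long product = 0;
        for (Integer x : a)
            for (Integer y : b)
                for (Integer z : c)
                    for (Integer w : d)
                        product++;

        System.out.println(product); // 100 * 100 * 7 * 1 = 70000
    }
}
```

70,000 elements is small in absolute terms, which is part of why a crash here is surprising; the cost depends on how heavy each Tuple2<A1, A2> actually is.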

Problem:
With "spark.mesos.coarse=true", creation of "aRdd" works fine but the driver
crashes while doing the cartesian; with "spark.mesos.coarse=false", the job
works like a charm.
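For reference, this is how I toggle the flag per job at submit time (the master address, class and jar names below are placeholders):

```shell
# Coarse-grained mode: Spark keeps one long-lived Mesos task per node
# holding the executors, instead of launching a Mesos task per Spark
# task as fine-grained mode does.
# <mesos-master>, com.example.MyJob and my-job.jar are placeholders.
spark-submit \
  --master mesos://<mesos-master>:5050 \
  --conf spark.mesos.coarse=true \
  --class com.example.MyJob \
  my-job.jar
```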

Comments:
I wanted to understand what role "spark.mesos.coarse" plays in terms of
memory and compute performance. My findings look counter-intuitive since:

   1. "spark.mesos.coarse=true" runs long-lived Mesos tasks that stay up for
   the whole job, so it should avoid the overhead of spinning up a Mesos task
   per Spark task that fine-grained mode pays, and I would expect it to
   perform at least as well.
   2. What setting of "spark.mesos.coarse" is recommended for running Spark
   on Mesos? Or is there no single best answer, and it depends on the use case?
   3. Also, with "spark.mesos.coarse=true", I notice that I get huge GC
   pauses even with a small dataset on a long-running job (but this can be a
   separate discussion).
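On the GC point: assuming the pauses are on the executors, in Spark 1.x you can surface GC activity in the executor logs and size the heap with standard properties (values below are illustrative, not recommendations; placeholders as before):

```shell
# Same submit as above, plus a bigger executor heap and JVM GC logging
# so the pauses show up in the executor stderr logs.
spark-submit \
  --master mesos://<mesos-master>:5050 \
  --conf spark.mesos.coarse=true \
  --conf spark.executor.memory=4g \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class com.example.MyJob \
  my-job.jar
```

The GC log timestamps can then be lined up against task times in the Spark UI to see whether the pauses correlate with the cartesian stage.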

Let me know if I am missing something obvious; we are learning Spark tuning
as we go :)

-- 
Thanks,
-Utkarsh
