Hi, I have a 4-node cluster: one master (which also hosts the HDFS namenode) and 3 workers (each colocated with an HDFS datanode). Each worker has only 2 cores, and spark.executor.memory is 2.3g. The input file spans two HDFS blocks; the block size is configured to 64MB.
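For reference, this is roughly how I pass those settings (a config fragment only; the master URL, script name, and HDFS path below are hypothetical placeholders, not my actual values):

```shell
# Sketch of the submit command; spark.executor.memory matches the 2.3g above,
# and spark.default.parallelism is the knob I vary between runs.
spark-submit \
  --master spark://master:7077 \
  --conf spark.executor.memory=2300m \
  --conf spark.default.parallelism=6 \
  train_rf.py hdfs:///data/input
```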
I train a random forest regression with numTrees=50 and maxDepth=10, and measure the run time for different values of default parallelism. I don't see the expected speedup as parallelism increases. Is the following expected?

parallelism   time (minutes)
2             27.0
3             20.5
4             23.8
5             21.4
6             19.9
12            22.6
24            29.7

From what I've seen, Spark does a pretty good job of scheduling tasks as evenly as possible. For parallelism 2 and 3, the tasks always land on different machines, and for all other parallelism settings the cached RDD blocks are split evenly across the cluster. All data is cached and kept deserialized 1x.

I realise there shouldn't be any boost at parallelism 24 with my setup, but I measured it out of curiosity. I would expect some boost at parallelism 4 and 5, though. There might be some disk contention (the nodes have HDDs), as the shuffle writes are fairly large (300 to 600 MB), but that would apply to every parallelism setting above 3. I've monitored disk usage with the atop utility and haven't noticed any contention there.

I realise this might also just be poor measurement, as I ran each configuration only once and took the time; the recommended way is to measure several times and report mean, sigma, etc.

Has anyone experienced similar behaviour? Could you give me any advice, or an explanation of what's happening?

--
Be well!
Jean Morozov
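To make the two points above concrete, here is a small stdlib-Python sanity check I could run next time: it computes how many scheduling "waves" each parallelism setting needs (assuming one executor per worker with 2 task slots, i.e. 6 slots total), and the mean/sigma over repeated runs. The run times in the list are hypothetical placeholders, not real measurements:

```python
import statistics

# Cluster geometry from the post: 3 workers x 2 cores each.
# Assumption: one executor per worker, each with 2 concurrent task slots.
workers, cores_per_worker = 3, 2
total_slots = workers * cores_per_worker  # 6 tasks can run at once

def waves(parallelism, slots=total_slots):
    """Scheduling waves a stage needs: ceil(tasks / slots)."""
    return -(-parallelism // slots)

for p in (2, 3, 4, 5, 6, 12, 24):
    # e.g. 12 partitions -> 2 waves of 6, but each task is half the size,
    # so wall-clock time stays roughly flat beyond 6 partitions.
    print(p, waves(p))

# Hypothetical repeated timings (minutes) for one parallelism setting;
# real values would come from re-running the same job several times.
runs = [19.9, 20.4, 21.1]
mean = statistics.mean(runs)
sigma = statistics.stdev(runs)  # sample standard deviation
print(f"mean={mean:.2f} sigma={sigma:.2f}")
```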