Hi, I have a 4-node cluster: one master (which also hosts the HDFS namenode) and 3 workers (each colocated with an HDFS datanode). Each worker has only 2 cores, and spark.executor.memory is 2.3g. The input file spans two HDFS blocks; the block size is configured to 64MB.
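For reference, this is roughly how I pass those settings (a config fragment only; the master URL, script name, and HDFS path below are hypothetical placeholders, not my actual values):

```shell
# Sketch of the submit command; spark.executor.memory matches the 2.3g above,
# and spark.default.parallelism is the knob I vary between runs.
spark-submit \
  --master spark://master:7077 \
  --conf spark.executor.memory=2300m \
  --conf spark.default.parallelism=6 \
  train_rf.py hdfs:///data/input
```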
I train a random forest regression with numTrees=50 and maxDepth=10, and measure the run time for different values of default parallelism. I don't see the expected speedup as parallelism increases. Is the following expected?

parallelism   time (minutes)
2             27.0
3             20.5
4             23.8
5             21.4
6             19.9
12            22.6
24            29.7

From what I've seen, Spark does a pretty good job of scheduling tasks as evenly as possible. For parallelism 2 and 3, the tasks always land on different machines, and for all other parallelism settings the cached RDD blocks are split evenly across the cluster. All data is cached and kept deserialized 1x.

I realise there shouldn't be any boost at parallelism 24 with my setup, but I measured it out of curiosity. I would expect some boost at parallelism 4 and 5, though. There might be some disk contention (the nodes have HDDs), as the shuffle writes are fairly large (300 to 600 MB), but that would apply to every parallelism setting above 3. I've monitored disk usage with the atop utility and haven't noticed any contention there.

I realise this might also just be poor measurement, as I ran each configuration only once and took the time; the recommended way is to measure several times and report mean, sigma, etc.

Has anyone experienced similar behaviour? Could you give me any advice, or an explanation of what's happening?

--
Be well!
Jean Morozov
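To make the two points above concrete, here is a small stdlib-Python sanity check I could run next time: it computes how many scheduling "waves" each parallelism setting needs (assuming one executor per worker with 2 task slots, i.e. 6 slots total), and the mean/sigma over repeated runs. The run times in the list are hypothetical placeholders, not real measurements:

```python
import statistics

# Cluster geometry from the post: 3 workers x 2 cores each.
# Assumption: one executor per worker, each with 2 concurrent task slots.
workers, cores_per_worker = 3, 2
total_slots = workers * cores_per_worker  # 6 tasks can run at once

def waves(parallelism, slots=total_slots):
    """Scheduling waves a stage needs: ceil(tasks / slots)."""
    return -(-parallelism // slots)

for p in (2, 3, 4, 5, 6, 12, 24):
    # e.g. 12 partitions -> 2 waves of 6, but each task is half the size,
    # so wall-clock time stays roughly flat beyond 6 partitions.
    print(p, waves(p))

# Hypothetical repeated timings (minutes) for one parallelism setting;
# real values would come from re-running the same job several times.
runs = [19.9, 20.4, 21.1]
mean = statistics.mean(runs)
sigma = statistics.stdev(runs)  # sample standard deviation
print(f"mean={mean:.2f} sigma={sigma:.2f}")
```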