It would help to see the code, or at least to know which transformations you're
using.

Also, you have 15 nodes but you're not using all of them, which means you may be
losing data locality. You can see this in the Spark job UI if any tasks are not
NODE_LOCAL or PROCESS_LOCAL.
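
For example, assuming you're on YARN and executors really are capped at 6 GB, you
could request one executor per node so tasks have a local executor to run on (a
rough starting point, not a tuned recommendation):

    spark-submit \
      --num-executors 15 \
      --executor-cores 2 \
      --executor-memory 6G \
      ...

You can also check the Locality Level column on the stage page of the UI; a lot of
RACK_LOCAL or ANY tasks usually means the executors aren't where the data is.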

From: diplomatic Guru
Date: Friday, July 3, 2015 at 8:58 AM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Spark performance issue

Hello guys,

I'm after some advice on Spark performance.

I have a MapReduce job that reads input, carries out a simple calculation, and
writes the results to HDFS. I've implemented the same logic in a Spark job.
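
The Spark side is roughly of this shape (a simplified sketch with placeholder
paths and a placeholder calculation, not the actual code):

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleCalc {
      // placeholder for the real per-record calculation
      def transform(line: String): String = line

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SimpleCalc"))
        sc.textFile("hdfs:///path/to/input")         // read from HDFS
          .map(transform)                            // simple per-record calculation
          .saveAsTextFile("hdfs:///path/to/output")  // write results back to HDFS
        sc.stop()
      }
    }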

When I run both jobs on the same datasets, I get different execution times,
which is expected.

BUT
......
In my example, the MapReduce job is performing much better than the Spark job.

The difference is that I'm not tuning much in the MR job configuration (memory,
cores, etc.), but that's not the case with Spark, as it's very flexible. So I'm
sure my configuration isn't right, which is why MR is outperforming Spark, but I
need your advice.

For example:

Test 1:
4.5GB data - the MR job took ~55 seconds to compute, but Spark took ~3 minutes
and 20 seconds.

Test 2:
25GB data - MR took 2 minutes and 15 seconds, whereas the Spark job is still
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory that I could allocate to each 
executor is 6GB. Therefore, for Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G --executor-cores 2
(I also set "spark.storage.memoryFraction" to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G --executor-cores 2
(I also set "spark.storage.memoryFraction" to 0.3)
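
Put together, the full submit command for Test 2 looks roughly like this (the
class name, jar, master setting, and paths are placeholders):

    spark-submit \
      --class com.example.SimpleCalc \
      --master yarn-client \
      --num-executors 10 \
      --executor-cores 2 \
      --executor-memory 6G \
      --driver-memory 6G \
      --conf spark.storage.memoryFraction=0.3 \
      simple-calc.jar hdfs:///path/to/input hdfs:///path/to/output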

I tried all possible combinations but couldn't get better performance. Any
suggestions would be much appreciated.





