Hi Adam,

Do you have your Spark confs and your spark-env.sh somewhere where we can see them? If not, can you make them available?
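For anyone else following along, the settings described in the summary below would look roughly like this in spark-defaults.conf (the keys are the standard Spark configuration names; the -X options are IBM JDK flags, and the values are taken from Adam's summary, so treat this as a sketch rather than his actual file):

```
spark.executor.memory            18g
spark.driver.memory              8g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -Xdisableexplicitgc -Xcompressedrefs
```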
Cheers,
Michael

> On Jul 8, 2016, at 3:17 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
>
> Hi, we've been testing the performance of Spark 2.0 compared to previous releases. Unfortunately there are no Spark 2.0-compatible versions of HiBench or SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108).
>
> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean regression with a very small scale factor, so we've generated a couple of profiles comparing 1.5.2 vs 2.0.0: same JDK version and same platform. We will gather a 1.6.2 comparison and increase the scale factor.
>
> Has anybody noticed a similar problem? My changes for SparkPerf and Spark 2.0 are very limited and AFAIK don't interfere with Spark core functionality, so any feedback on the changes would be much appreciated and welcome; I'd much prefer it if my changes are the problem.
>
> A summary for your convenience follows (this matches what I've mentioned on the SparkPerf issue above).
>
> 1. spark-perf/config/config.py: SCALE_FACTOR = 0.05
>    No. of workers: 1
>    Executors per worker: 1
>    Executor memory: 18G
>    Driver memory: 8G
>    Serializer: kryo
>
> 2. $SPARK_HOME/conf/spark-defaults.conf executor Java options: -Xdisableexplicitgc -Xcompressedrefs
>
> Main changes I made for the benchmark itself:
> - Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
> - MLAlgorithmTests use Vectors.fromML
> - For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD, not wordStream.foreach
> - KVDataTest uses awaitTerminationOrTimeout on a Spark StreamingContext instead of awaitTermination
> - Trivial: we use compact, not compact.render, for outputting JSON
>
> In Spark 2.0 the top five methods where we spend our time are as follows; the percentage is how much of the overall processing time was spent in that particular method:
> 1. AppendOnlyMap.changeValue 44%
> 2. SortShuffleWriter.write 19%
> 3. SizeTracker.estimateSize 7.5%
> 4. SizeEstimator.estimate 5.36%
> 5. Range.foreach 3.6%
>
> In 1.5.2 the top five methods are:
> 1. AppendOnlyMap.changeValue 38%
> 2. ExternalSorter.insertAll 33%
> 3. Range.foreach 4%
> 4. SizeEstimator.estimate 2%
> 5. SizeEstimator.visitSingleObject 2%
>
> I see the following scores; the test name is followed by the 1.5.2 time and then the 2.0.0 time:
> scheduling throughput: 5.2s vs 7.08s
> agg by key: 0.72s vs 1.01s
> agg by key int: 0.93s vs 1.19s
> agg by key naive: 1.88s vs 2.02s
> sort by key: 0.64s vs 0.8s
> sort by key int: 0.59s vs 0.64s
> scala count: 0.09s vs 0.08s
> scala count w fltr: 0.31s vs 0.47s
>
> This is only running the Spark core tests (scheduling-throughput through scala-count-w-filtr, including everything in between).
>
> Cheers,
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
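For context on what the regressing "agg by key" tests exercise: they key each record and aggregate the values per key, which is where AppendOnlyMap.changeValue and the shuffle write path show up in the profiles. A rough stand-in using plain Scala collections follows; the real spark-perf tests run map/reduceByKey on an RDD across a cluster, so this only shows the logical shape of the workload, not the benchmark itself.

```scala
// Plain-Scala sketch of the "agg by key" workload shape (hypothetical
// helper, not spark-perf code): assign each record a key, then sum the
// counts per key. In Spark this would be
//   rdd.map(r => (r % numKeys, 1)).reduceByKey(_ + _)
// and the per-key aggregation is what AppendOnlyMap.changeValue does
// inside the shuffle.
object AggByKeySketch {
  def aggByKey(records: Seq[Int], numKeys: Int): Map[Int, Int] =
    records
      .map(r => (r % numKeys, 1))                    // key each record, count 1
      .groupBy(_._1)                                 // group by key (shuffle stand-in)
      .map { case (k, vs) => (k, vs.map(_._2).sum) } // reduce per key

  def main(args: Array[String]): Unit = {
    val counts = aggByKey(1 to 100, 10)
    println(counts(0)) // prints 10: each of the 10 keys sees 10 of the 100 records
  }
}
```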