Hi Adam,

Do you have your Spark confs and your spark-env.sh somewhere where we can see them? If not, can you make them available?
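For anyone else following along, the settings described in the summary below would look roughly like this in spark-defaults.conf (the keys are the standard Spark configuration names; the -X options are IBM JDK flags, and the values are taken from Adam's summary, so treat this as a sketch rather than his actual file):

```
spark.executor.memory            18g
spark.driver.memory              8g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -Xdisableexplicitgc -Xcompressedrefs
```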
Cheers,
Michael

> On Jul 8, 2016, at 3:17 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
>
> Hi, we've been testing the performance of Spark 2.0 compared to previous releases. Unfortunately there are no Spark 2.0-compatible versions of HiBench or SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108).
>
> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean regression with a very small scale factor, so we've generated a couple of profiles comparing 1.5.2 vs 2.0.0: same JDK version and same platform. We will gather a 1.6.2 comparison and increase the scale factor.
>
> Has anybody noticed a similar problem? My changes for SparkPerf and Spark 2.0 are very limited and AFAIK don't interfere with Spark core functionality, so any feedback on the changes would be much appreciated and welcome; I'd much prefer it if my changes are the problem.
>
> A summary for your convenience follows (this matches what I've mentioned on the SparkPerf issue above).
>
> 1. spark-perf/config/config.py: SCALE_FACTOR = 0.05
>    No. of workers: 1
>    Executors per worker: 1
>    Executor memory: 18G
>    Driver memory: 8G
>    Serializer: kryo
>
> 2. $SPARK_HOME/conf/spark-defaults.conf executor Java options: -Xdisableexplicitgc -Xcompressedrefs
>
> Main changes I made for the benchmark itself:
> - Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
> - MLAlgorithmTests use Vectors.fromML
> - For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD, not wordStream.foreach
> - KVDataTest uses awaitTerminationOrTimeout on a Spark StreamingContext instead of awaitTermination
> - Trivial: we use compact, not compact.render, for outputting JSON
>
> In Spark 2.0 the top five methods where we spend our time are as follows; the percentage is how much of the overall processing time was spent in that particular method:
> 1. AppendOnlyMap.changeValue 44%
> 2. SortShuffleWriter.write 19%
> 3. SizeTracker.estimateSize 7.5%
> 4. SizeEstimator.estimate 5.36%
> 5. Range.foreach 3.6%
>
> In 1.5.2 the top five methods are:
> 1. AppendOnlyMap.changeValue 38%
> 2. ExternalSorter.insertAll 33%
> 3. Range.foreach 4%
> 4. SizeEstimator.estimate 2%
> 5. SizeEstimator.visitSingleObject 2%
>
> I see the following scores; the test name is followed by the 1.5.2 time and then the 2.0.0 time:
> scheduling throughput: 5.2s vs 7.08s
> agg by key: 0.72s vs 1.01s
> agg by key int: 0.93s vs 1.19s
> agg by key naive: 1.88s vs 2.02s
> sort by key: 0.64s vs 0.8s
> sort by key int: 0.59s vs 0.64s
> scala count: 0.09s vs 0.08s
> scala count w fltr: 0.31s vs 0.47s
>
> This is only running the Spark core tests (scheduling-throughput through scala-count-w-filtr, including everything in between).
>
> Cheers,
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
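For context on what the regressing "agg by key" tests exercise: they key each record and aggregate the values per key, which is where AppendOnlyMap.changeValue and the shuffle write path show up in the profiles. A rough stand-in using plain Scala collections follows; the real spark-perf tests run map/reduceByKey on an RDD across a cluster, so this only shows the logical shape of the workload, not the benchmark itself.

```scala
// Plain-Scala sketch of the "agg by key" workload shape (hypothetical
// helper, not spark-perf code): assign each record a key, then sum the
// counts per key. In Spark this would be
//   rdd.map(r => (r % numKeys, 1)).reduceByKey(_ + _)
// and the per-key aggregation is what AppendOnlyMap.changeValue does
// inside the shuffle.
object AggByKeySketch {
  def aggByKey(records: Seq[Int], numKeys: Int): Map[Int, Int] =
    records
      .map(r => (r % numKeys, 1))                    // key each record, count 1
      .groupBy(_._1)                                 // group by key (shuffle stand-in)
      .map { case (k, vs) => (k, vs.map(_._2).sum) } // reduce per key

  def main(args: Array[String]): Unit = {
    val counts = aggByKey(1 to 100, 10)
    println(counts(0)) // prints 10: each of the 10 keys sees 10 of the 100 records
  }
}
```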