Re: Spark performance regression test suite

Adam Roberts Mon, 11 Jul 2016 09:27:06 -0700

Agreed, this is something we do regularly when producing our own Spark
distributions at IBM, so it will be beneficial to share updates with the
wider community. So far Spark 1.6.2 looks like the best out of the box on
spark-perf and HiBench (of course this may vary for real workloads,
individual applications and tuning efforts), but we have more 2.0 tests to
perform. We're not aware of any regressions between previous versions,
except perhaps the one described in the Spark 2.0.0 post I made.


I'm looking for testing and feedback from any Spark gurus on my 2.0
changes for spark-perf (have a look at the open issue Holden mentioned:
https://github.com/databricks/spark-perf/issues/108) and the same goes for
HiBench (FWIW we see the same regression on HiBench too:
https://github.com/intel-hadoop/HiBench/issues/221).

One idea is that the benchmarking could be run optionally as part of the
existing contribution process. An ideal solution IMO would involve an
additional parameter for the Jenkins job that, when ticked, results in a
performance run being done with and without the change. As we don't have
direct access to the Jenkins build button in the community, users
contributing a change could mark it with something like @performance or
"jenkins performance test this please".
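
As a rough sketch of the marker idea, assuming the build job can see the
text of a pull request comment: the function name and the two patterns
below are hypothetical and only mirror the phrases suggested above; they
are not an existing Jenkins or Spark pull request builder feature.

```python
import re

# Hypothetical trigger phrases, taken from the suggestion in this thread.
# Neither is an existing Jenkins or Spark pull request builder feature.
TRIGGERS = (
    # "@performance" as a standalone tag (not part of an e-mail address)
    re.compile(r"(?<!\w)@performance\b", re.IGNORECASE),
    # "jenkins performance test this please", with an optional comma
    re.compile(r"\bjenkins,?\s+performance\s+test\s+this\s+please\b",
               re.IGNORECASE),
)


def wants_performance_run(comment: str) -> bool:
    """Return True if a comment asks for an optional performance run."""
    return any(pattern.search(comment) for pattern in TRIGGERS)


if __name__ == "__main__":
    for text in ("Jenkins, performance test this please",
                 "jenkins test this please"):
        print(text, "->", wants_performance_run(text))
```

A job that gets a True result here would then do two runs, one with and
one without the change, as described above.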

Alternatively, the influential Spark folk could notice a change with a
potential performance impact and have it tested accordingly. While
microbenchmarks are useful, it will be important to test the whole of
Spark. Then there's also the use of tags in JIRA - lots for us to work
with if we wanted this.

This probably means adding, and therefore maintaining, dedicated machines
in the build farm, although this would highlight any regressions FAST as
opposed to later on in the development cycle.

If there is indeed a regression we may have the fun task of binary
chopping commits between 1.6.2 and now... again TBC but a real
possibility, so I'm interested to see if anybody else is doing regression
testing and if they see a similar problem.
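
For anyone unfamiliar with binary chopping, here is a minimal sketch of
the search itself. It assumes the commits are ordered oldest to newest,
that the first one is known good and the last one known bad, and that
once the regression appears it stays in every later commit. The
is_regressed callback is a hypothetical stand-in for "build this commit,
run the benchmark, compare against the 1.6.2 number". In practice git
bisect does this bookkeeping for you.

```python
from typing import Callable, Sequence


def first_regressed_commit(commits: Sequence[str],
                           is_regressed: Callable[[str], bool]) -> str:
    """Binary chop over commits ordered oldest to newest.

    Assumes commits[0] is known good, commits[-1] is known bad, and the
    regression persists in every commit after it is introduced.
    """
    if len(commits) < 2:
        raise ValueError("need at least a known-good and a known-bad commit")
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_regressed(commits[mid]):
            bad = mid    # regression is at mid or earlier
        else:
            good = mid   # regression was introduced after mid
    return commits[bad]


if __name__ == "__main__":
    # Made-up commit names; pretend the regression landed in c6.
    history = [f"c{i}" for i in range(10)]
    print(first_regressed_commit(history, lambda c: int(c[1:]) >= 6))
```

Each step halves the range, so a few hundred commits need fewer than ten
benchmark runs.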

If we don't go down the "benchmark as you contribute" route, having such a
suite would be perfect: it would clone the latest version of each
benchmark, build it for the current version of Spark (the release can be
identified from the pom), run the benchmarks we care about (let's say in
Spark standalone mode with a couple of executors) and produce a geomean
score, highlighting any significant deviations.
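
The scoring step could look something like the sketch below. It assumes
each run is reduced to a mapping from benchmark name to runtime in
seconds; the function names and the 5% threshold are illustrative
assumptions, not something spark-perf or HiBench provides.

```python
import math
from typing import Dict, Iterable, List, Tuple


def geomean(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values."""
    vals = list(values)
    if not vals or any(v <= 0 for v in vals):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


def compare_runs(baseline: Dict[str, float],
                 candidate: Dict[str, float],
                 threshold: float = 0.05) -> Tuple[float, List[str]]:
    """Compare two runs given as {benchmark name: runtime in seconds}.

    Returns the geomean of the candidate/baseline runtime ratios (above
    1.0 means slower overall) and the benchmarks whose ratio deviates
    from 1.0 by more than `threshold` (an assumed 5% by default).
    """
    ratios = {name: candidate[name] / baseline[name]
              for name in baseline if name in candidate}
    flagged = sorted(name for name, ratio in ratios.items()
                     if abs(ratio - 1.0) > threshold)
    return geomean(ratios.values()), flagged


if __name__ == "__main__":
    # Made-up numbers purely to show the call shape.
    old = {"sort": 10.0, "kmeans": 20.0}
    new = {"sort": 10.0, "kmeans": 40.0}
    print(compare_runs(old, new))
```

Taking the geomean of per-benchmark ratios rather than of raw runtimes
keeps one long-running benchmark from dominating the score.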

I'm happy to help with designing/reviewing this.

Cheers,

From:   Michael Gummelt <mgumm...@mesosphere.io>
To:     Eric Liang <e...@databricks.com>
Cc:     Holden Karau <hol...@pigscanfly.ca>, Ted Yu <yuzhih...@gmail.com>, 
Michael Allman <mich...@videoamp.com>, dev <dev@spark.apache.org>
Date:   11/07/2016 17:00
Subject:        Re: Spark performance regression test suite



I second any effort to update, automate, and communicate the results of 
spark-perf (https://github.com/databricks/spark-perf)

On Fri, Jul 8, 2016 at 12:28 PM, Eric Liang <e...@databricks.com> wrote:
Something like speed.pypy.org or the Chrome performance dashboards would 
be very useful.

On Fri, Jul 8, 2016 at 9:50 AM Holden Karau <hol...@pigscanfly.ca> wrote:
There are also the spark-perf and spark-sql-perf projects in the 
Databricks github (although I see an open issue for Spark 2.0 support in 
one of them).

On Friday, July 8, 2016, Ted Yu <yuzhih...@gmail.com> wrote:
Found a few issues:

[SPARK-6810] Performance benchmarks for SparkR
[SPARK-2833] performance tests for linear regression
[SPARK-15447] Performance test for ALS in Spark 2.0

Haven't found one for Spark core.

On Fri, Jul 8, 2016 at 8:58 AM, Michael Allman <mich...@videoamp.com> 
wrote:
Hello,

I've seen a few messages on the mailing list regarding Spark performance 
concerns, especially regressions from previous versions. It got me 
thinking that perhaps an automated performance regression suite would be a 
worthwhile contribution? Is anyone working on this? Do we have a Jira 
issue for it?

I cannot commit to taking charge of such a project. I just thought it 
would be a great contribution for someone who does have the time and the 
chops to build it.

Cheers,

Michael

-- 
Michael Gummelt
Software Engineer
Mesosphere

