Agreed. This is something we do regularly when producing our own Spark distributions at IBM, so it would be beneficial to share updates with the wider community. So far it looks like Spark 1.6.2 performs best out of the box on spark-perf and HiBench (of course this may vary for real workloads, individual applications and tuning efforts), but we still have more 2.0 tests to run. We're not aware of any regressions between previous versions, except perhaps the one I posted about for Spark 2.0.0.
I'm looking for testing and feedback from any Spark gurus on my 2.0 changes for spark-perf (have a look at the open issue Holden mentioned: https://github.com/databricks/spark-perf/issues/108), and the same goes for HiBench (FWIW we see the same regression on HiBench too: https://github.com/intel-hadoop/HiBench/issues/221).

One idea is that the benchmarking could be run optionally as part of the existing contribution process. An ideal solution IMO would involve an additional parameter for the Jenkins job that, when ticked, results in a performance run being done with and without the change. As we in the community don't have direct access to the Jenkins build button, users contributing a change could mark it with something like @performance or "jenkins performance test this please". Alternatively, the influential Spark folk could notice a change with a potential performance impact and have it tested accordingly. While microbenchmarks are useful, it will be important to test the whole of Spark. Then there's also the use of tags in the JIRA - lots for us to work with if we wanted this.

This approach probably means adding, and therefore maintaining, dedicated machines in the build farm, although it would highlight any regressions FAST as opposed to later on in the development cycle.

If there is indeed a regression we may have the fun task of binary chopping commits between 1.6.2 and now... again TBC but a real possibility, so I'm interested to see if anybody else is doing regression testing and if they see a similar problem.

If we don't go down the "benchmark as you contribute" route, having such a suite would be perfect. It would clone the latest versions of each benchmark, build them for the current version of Spark (we can identify the release from the pom), run the benchmarks we care about (let's say in Spark standalone mode with a couple of executors) and produce a geomean score, highlighting any significant deviations.
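To make the scoring step a bit more concrete, here is a minimal sketch in Python of how such a suite might read the Spark release from the pom and turn two sets of benchmark timings into a geomean score with flagged deviations. Everything named here is my own assumption for illustration: the function names, the idea of comparing wall-clock seconds per benchmark, and the 5% threshold are not taken from spark-perf or HiBench.

```python
import math
import xml.etree.ElementTree as ET

# Standard Maven POM namespace; assumed to match the pom being parsed.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def spark_version_from_pom(pom_xml):
    """Return the project's own <version> from pom XML given as a string."""
    root = ET.fromstring(pom_xml)
    # Only direct children of <project> are searched, so the <version>
    # nested inside <parent> is not picked up by mistake.
    version = root.find("m:version", POM_NS)
    if version is None or not version.text:
        raise ValueError("no top-level <version> element found in pom")
    return version.text.strip()


def geomean(values):
    """Geometric mean of strictly positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs at least one value, all > 0")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def compare_runs(baseline, candidate, threshold=0.05):
    """Compare two {benchmark name: seconds} mappings.

    Returns (score, flagged). score is the geomean of candidate/baseline
    time ratios, so a score above 1.0 means the candidate is slower
    overall. flagged holds the benchmarks whose ratio differs from 1.0
    by more than the (hypothetical) threshold, in either direction.
    """
    common = sorted(set(baseline) & set(candidate))
    if not common:
        raise ValueError("no benchmarks in common between the two runs")
    ratios = {name: candidate[name] / baseline[name] for name in common}
    flagged = {n: r for n, r in ratios.items() if abs(r - 1.0) > threshold}
    return geomean(ratios.values()), flagged
```

Using ratios against a baseline rather than raw times keeps one long-running benchmark from dominating the score, which is the usual reason for picking a geomean in the first place. A single noisy run would still trip the threshold, so a real suite would presumably want repeated runs per benchmark before flagging anything.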
I'm happy to help with designing/reviewing this.

Cheers,

From: Michael Gummelt <mgumm...@mesosphere.io>
To: Eric Liang <e...@databricks.com>
Cc: Holden Karau <hol...@pigscanfly.ca>, Ted Yu <yuzhih...@gmail.com>, Michael Allman <mich...@videoamp.com>, dev <dev@spark.apache.org>
Date: 11/07/2016 17:00
Subject: Re: Spark performance regression test suite

I second any effort to update, automate, and communicate the results of spark-perf (https://github.com/databricks/spark-perf)

On Fri, Jul 8, 2016 at 12:28 PM, Eric Liang <e...@databricks.com> wrote:

Something like speed.pypy.org or the Chrome performance dashboards would be very useful.

On Fri, Jul 8, 2016 at 9:50 AM Holden Karau <hol...@pigscanfly.ca> wrote:

There are also the spark-perf and spark-sql-perf projects in the Databricks github (although I see an open issue for Spark 2.0 support in one of them).

On Friday, July 8, 2016, Ted Yu <yuzhih...@gmail.com> wrote:

Found a few issues:

[SPARK-6810] Performance benchmarks for SparkR
[SPARK-2833] performance tests for linear regression
[SPARK-15447] Performance test for ALS in Spark 2.0

Haven't found one for Spark core.

On Fri, Jul 8, 2016 at 8:58 AM, Michael Allman <mich...@videoamp.com> wrote:

Hello,

I've seen a few messages on the mailing list regarding Spark performance concerns, especially regressions from previous versions. It got me thinking that perhaps an automated performance regression suite would be a worthwhile contribution? Is anyone working on this? Do we have a Jira issue for it?

I cannot commit to taking charge of such a project. I just thought it would be a great contribution for someone who does have the time and the chops to build it.

Cheers,
Michael