> On Jun 12, 2017, at 1:45 PM, Pavol Vaskovic via swift-dev 
> <swift-dev@swift.org> wrote:
> 
> On Tue, May 16, 2017 at 9:10 PM, Dave Abrahams via swift-dev 
> <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:
> 
> on Thu May 11 2017, Pavol Vaskovic <swift-dev-AT-swift.org> wrote:
> 
> I have run Benchmark_O with --num-iters=100 on my machine for the
> whole performance test suite, to get a feel for the distribution of
> benchmark samples, because I also want to move the Benchmark_Driver to
> use MEAN instead of MIN in the analysis.
> 
> I'm concerned about that, especially for microbenchmarks; it seems to me
> as though MIN is the right measurement.  Can you explain why MEAN is
> better?
> 
> 
> On Wed, May 17, 2017 at 1:26 AM, Andrew Trick <atr...@apple.com 
> <mailto:atr...@apple.com>> wrote:
> Using MEAN wasn’t part of the aforementioned SR-4669. The purpose of that 
> task is to reduce the time CI takes to get useful results (e.g. by using 3 
> runs as a baseline). MEAN isn’t useful if you’re only gathering 3 data points.
> 
> 
> The current approach to detecting performance changes is fragile for tests
> that have a very low absolute runtime, as they easily cross the 5%
> improvement/regression threshold when the test machine gets a little bit
> noisy. For example, in the benchmark on PR #9806
> <https://github.com/apple/swift/pull/9806#issuecomment-303370149>:
> 
> TEST                   OLD   NEW   DELTA    SPEEDUP
> BitCount                12    14   +16.7%   0.86x
> SuffixCountableRange    10    11   +10.0%   0.91x
> MapReduce              303   331    +9.2%   0.92x
> These are all false changes (and there are quite a few more there).

The current design assumes that in such cases the workload will be increased,
so that this is not an issue.
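To illustrate why a fixed relative threshold is fragile for tiny workloads,
here is a toy sketch (hypothetical numbers; this is not the actual
Benchmark_Driver logic):

    THRESHOLD = 0.05  # the 5% improvement/regression threshold

    def is_significant(old_min, new_min, threshold=THRESHOLD):
        """Naive relative-threshold check, as described above."""
        ratio = new_min / float(old_min)
        return abs(ratio - 1.0) > threshold

    # One tick of timer noise on a 12-tick benchmark is an 8.3% "change"...
    print(is_significant(12, 13))        # True  (false positive)
    # ...while a 1000-tick drift on a 30000-tick benchmark stays below 5%.
    print(is_significant(30000, 31000))  # False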

The reason why we use the min is that, statistically, we are not interested in
estimating the "mean" or "center" of the distribution. Rather, we are actually
interested in the "speed of light" of the computation, implying that we are
looking for the min.
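To make that concrete, a toy model (made-up numbers): if each sample is the
true cost plus non-negative noise from the OS scheduler, caches, etc., the
minimum converges on the true cost, while the mean is pulled upward by every
noisy sample:

    import random

    TRUE_COST = 100  # hypothetical "speed of light" of the computation

    # Model: measured time = true cost + non-negative noise.
    samples = [TRUE_COST + max(0, int(random.gauss(5, 10)))
               for _ in range(100)]

    print(min(samples))                        # approaches 100
    print(sum(samples) / float(len(samples)))  # biased upward by the noise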

> 
> To partially address this issue (I'm guessing), the last SPEEDUP column
> sometimes features a mysterious question mark in brackets. It's emitted when
> the new MIN falls inside the (MIN..MAX) range of the OLD baseline. The check
> is not performed the other way around.
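If I read that description right, the check amounts to something like this (a
sketch of the described logic, not the actual reporting code):

    def dubious_marker(old_min, old_max, new_min):
        """Emit "(?)" when the new MIN falls inside the OLD baseline's
        (MIN..MAX) range; the symmetric check is not performed."""
        return "(?)" if old_min < new_min < old_max else ""

    print(dubious_marker(old_min=10, old_max=15, new_min=11))  # "(?)"
    print(dubious_marker(old_min=10, old_max=15, new_min=16))  # ""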
> 
> I'm suggesting to use the MEAN value in combination with SD (standard
> deviation) to detect the changes (improvements/regressions). At the moment,
> this is hard to do, because the aggregate test results reported by
> Benchmark_O (and co.) can include anomalous results in the sample population
> that mess up the MEAN and SD, too. Currently this is visible only in the wide
> sample range, i.e. the difference between the reported MIN and MAX, but it is
> not clear how many results are anomalous.
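For concreteness, a mean-and-SD comparison along those lines might look like
this (a sketch assuming two runs count as unchanged when their mean +/- SD
intervals overlap; not the actual proposal):

    import math

    def mean_sd(samples):
        m = sum(samples) / float(len(samples))
        var = sum((x - m) ** 2 for x in samples) / (len(samples) - 1)
        return m, math.sqrt(var)

    def changed(old_samples, new_samples):
        """Flag a change only when the mean +/- SD intervals do not overlap."""
        old_m, old_sd = mean_sd(old_samples)
        new_m, new_sd = mean_sd(new_samples)
        return abs(new_m - old_m) > old_sd + new_sd

    print(changed([12, 13, 12, 14], [12, 14, 13, 12]))  # False: just noise
    print(changed([12, 13, 12, 14], [25, 26, 24, 25]))  # True: real change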

What do you mean by anomalous results?

> 
> Currently I'm working on an improved sample-filtering algorithm. Stay tuned
> for a demonstration in Benchmark_Driver (Python); if it pans out, it might be
> time to change the adaptive sampling in DriverUtil.swift.
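The filtering algorithm itself isn't shown here, but one common approach it
could resemble is discarding samples outside Tukey's fences on the
interquartile range (purely illustrative):

    def filter_outliers(samples, k=1.5):
        """Keep samples within Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
        s = sorted(samples)
        q1 = s[len(s) // 4]
        q3 = s[(3 * len(s)) // 4]
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        return [x for x in samples if lo <= x <= hi]

    # The anomalous 350 spike is dropped; the rest survive.
    print(filter_outliers([100, 101, 102, 100, 350, 101]))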

Have you looked at using the Mann-Whitney U test? (I am not sure whether we
are using it or not.)
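For reference, SciPy ships an implementation; a minimal sketch of comparing
two sample sets with it (made-up numbers):

    from scipy.stats import mannwhitneyu

    old = [12, 13, 12, 14, 12, 13]
    new = [14, 15, 14, 16, 15, 14]

    # Nonparametric test: no normality assumption, robust to outliers,
    # which suits skewed benchmark timing distributions.
    stat, p = mannwhitneyu(old, new, alternative='two-sided')
    print(p < 0.05)  # True suggests a statistically significant change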

> 
> Best regards
> Pavol Vaskovic
> 
> 

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev
