> On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic via swift-dev <swift-dev@swift.org> wrote:
>
> Hi Andrew,
>
> On Mon, Jun 12, 2017 at 11:55 PM, Andrew Trick <atr...@apple.com> wrote:
>> To partially address this issue (I'm guessing) the last SPEEDUP column sometimes features a mysterious question mark in brackets. It's emitted when the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is not checked the other way around.
>
> That bug must have been introduced during one of the rewrites. Is that in the driver or compare script? Why not fix that bug?
>
> That is in the compare script. It looks like the else branch got lost during a rewrite <https://github.com/apple/swift/commit/cb23837bb932f21b61d2a79c936d88c167fd91d0#diff-5ca4ab28608a4259eff23c72eed7ae8d> (search for "(?)" in that diff). I could certainly fix that too, but I'm not sure that would be enough to fix all our problems.
>
> We clearly don’t want to see any false changes. The ‘?’ is a signal to me to avoid reporting those results. They should either be ignored as flaky benchmarks or rerun. I thought rerunning them was the fix you were working on.
>
> If you have some other proposal for fixing this then please, in a separate proposal, explain your new approach, why your new approach works, and demonstrate its effectiveness with results that you’ve gathered over time on the side. Please don’t change how the driver computes performance changes on a whim while introducing other features.
> ...
> I honestly don’t know what MEAN/SD has to do with the problem you’re pointing to above. The benchmark harness is already set up to compute the average iteration time, and our benchmarks are not currently designed to measure cache effects or any other phenomenon that would have a statistically meaningful sample distribution.
> Statistical methods might be interesting if you’re analyzing benchmark results over a long period of time or system noise levels across benchmarks.
>
> The primary purpose of the benchmark suite is identifying performance bugs/regressions at the point they occur. It should be no more complicated than necessary to do that. The current approach is simple: run a microbenchmark long enough in a loop to factor out benchmark startup time, cache/cpu warmup effects, and timer resolution, then compute the average iteration time. Throw away any run that was apparently impacted by system noise.
>
> We really have two problems:
> 1. spurious results
> 2. the turnaround time for the entire benchmark suite
>
>
> I don't think we can get more consistent test results just from re-running tests that were detected as changes in the first pass, as described in SR-4669 <https://bugs.swift.org/browse/SR-4669>, because that improves accuracy only for one side of the comparison - the branch. When the measurement error is with the baseline from the master, re-running the branch would not help.
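For reference, the "(?)" check discussed earlier in this thread could be made symmetric along the following lines. This is only a minimal sketch and not the code in the compare script: the function names and parameters are hypothetical, and it assumes each side of the comparison is summarized by the MIN and MAX of its samples.

```python
def ranges_overlap(old_min, old_max, new_min, new_max):
    """Return True when either MIN falls inside the other side's (MIN..MAX) range.

    Hypothetical helper: the first test is the check described in the thread,
    the second is the "other way around" case that is reported as missing.
    """
    new_min_in_old = old_min < new_min < old_max
    old_min_in_new = new_min < old_min < new_max
    return new_min_in_old or old_min_in_new


def speedup_marker(old_min, old_max, new_min, new_max):
    """Return the " (?)" suffix for a dubious result, or an empty string."""
    return " (?)" if ranges_overlap(old_min, old_max, new_min, new_max) else ""
```

With this version a result is flagged regardless of whether the noisy side is the baseline or the branch.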
When we are benchmarking, we can always have access to the baseline compiler by stashing the build directory. So we can always take more samples (in fact when I was talking about re-running I always assumed we would).

> I have sketched an algorithm for getting more consistent test results, so far it's in Numbers. I have run the whole test suite for 100 samples and observed the varying distribution of test results. The first result is quite often an outlier, with subsequent results being quicker. Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variation from the sample population and purging anomalous results when it exceeds 5%. This results in a solid sample population where standard deviation is a meaningful value that can be used in judging the significance of change between master and branch.
>
> This week I'm working on transferring this algorithm to Python and putting it probably somewhere around `Benchmark_Driver`. It is possible this would ultimately land in Swift (DriverUtil.swift), but to demonstrate the soundness of this approach to you all, I wanted to do the Python implementation first.
>
> Depending on how this behaves, my hunch is we could speed up the benchmark suite, not by running test samples for 1 second and taking many samples, but by adaptively sampling each benchmark until we get a stable sample population. In the worst case this would degrade to the current (1s/sample)*num_samples. This could be further improved by running multiple passes through the test suite, to eliminate anomalies caused by other background processes. That is the core idea from --rerun (SR-4669).
>
> --Pavol
>
> _______________________________________________
> swift-dev mailing list
> swift-dev@swift.org
> https://lists.swift.org/mailman/listinfo/swift-dev
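The purging step described in the quoted text (drop anomalous samples while the coefficient of variation exceeds 5%) might look roughly like this in Python. This is a sketch under stated assumptions, not Pavol's actual algorithm, which the thread does not spell out: it assumes anomalies are always slow outliers, so it discards the slowest remaining sample each round, and the names `purge_anomalies`, `max_cv` and `min_samples` are made up for illustration.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (samples are positive timings)."""
    return statistics.stdev(samples) / statistics.mean(samples)


def purge_anomalies(samples, max_cv=0.05, min_samples=3):
    """Drop the slowest samples until the coefficient of variation is at most max_cv.

    Assumption: system noise only ever makes a run slower, so the largest
    value is the most likely anomaly. Stops early once only min_samples remain.
    """
    kept = sorted(samples)
    while len(kept) > min_samples and coefficient_of_variation(kept) > max_cv:
        kept.pop()  # remove the slowest remaining sample
    return kept


# Example: the first sample is a slow outlier, as the thread describes.
timings = [150, 101, 100, 102, 99, 100, 101]
stable = purge_anomalies(timings)
```

The standard deviation of the remaining samples could then be used when judging whether a difference between master and branch is significant; how exactly to do that is left open here, as it is in the thread.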