Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Andrew Trick via swift-dev

> On Jun 12, 2017, at 10:36 PM, Pavol Vaskovic  wrote:
> 
> As the next two paragraphs after the part you quoted go on explaining, I'm 
> hoping that with this approach we could adaptively sample the benchmark until 
> we get a stable population, but starting from a lower iteration count. 
> 
> If the Python implementation bears this out, the proper solution would be to 
> change the implementation in DriverUtil.swift, from the current ~1s 
> adaptive num-iters runs to finer-grained runs. We'd be gathering many smaller 
> samples, tossing out anomalies as we go until we gather a stable sample 
> population (with a low coefficient of variation) or run out of the allotted 
> time.

~1s might be longer than necessary for the benchmarks with cheap setup. Another 
option is for the benchmark to call back to the Driver’s “start button” after 
setup. With no setup work, I think 200 ms is a bare minimum if we care about 
changes in the 1% range.

I’m confused though because I thought we agreed that all samples need to run 
with exactly the same number of iterations. So, there would be one short run to 
find the desired num_iters for each benchmark, then each subsequent invocation 
of the benchmark harness would be handed num_iters as input.
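A minimal sketch of that flow (run_benchmark here is a hypothetical stand-in 
for one harness invocation, e.g. a Benchmark_O run with --num-iters=N, 
returning the reported average per-iteration time; the real driver and harness 
interfaces may differ):

    def calibrate_num_iters(name, run_benchmark, target_time=0.2):
        """One short calibration run picks num_iters so that a sample
        lasts roughly target_time seconds."""
        per_iter = run_benchmark(name, num_iters=1)
        return max(1, int(target_time / per_iter))

    def collect_samples(name, run_benchmark, num_samples=3):
        """Every subsequent sample is taken with exactly the same num_iters."""
        num_iters = calibrate_num_iters(name, run_benchmark)
        return [run_benchmark(name, num_iters=num_iters)
                for _ in range(num_samples)]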

-Andy

> This has the potential to speed up the benchmark suite with more intelligent 
> management of the measurements, instead of using the brute force of super-long 
> runtime to drown out the errors as we do currently. 
> 
> (I am aware of various aspects of this approach that have the potential to 
> mess with the caching: the time measurement itself, more frequent logging - 
> this would currently rely on --verbose mode - and invoking Benchmark_O 
> from Python…)
> 
> The proof is in the pudding, so I guess we'll learn this week whether this 
> approach works, when I hammer out the implementation in Python for 
> demonstration. 
> 
> --Pavol
> 
> On Tue, 13 Jun 2017 at 03:19, Andrew Trick wrote:
> 
>> On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic wrote:
>> 
>> I have sketched an algorithm for getting more consistent test results; so 
>> far it's in Numbers. I have run the whole test suite for 100 samples and 
>> observed the varying distribution of test results. The first result is quite 
>> often an outlier, with subsequent results being quicker. Depending on the 
>> "weather" on the test machine, you sometimes measure anomalies. So I'm 
>> tracking the coefficient of variation of the sample population and purging 
>> anomalous results when it exceeds 5%. This results in a solid sample 
>> population where the standard deviation is a meaningful value that can be 
>> used in judging the significance of change between master and branch.
> 
> That’s a reasonable approach for running 100 samples. I’m not sure how it 
> fits with the goal of minimizing turnaround time. Typically you don’t need 
> more than 3 samples (keeping in mind we’re usually averaging over thousands of 
> iterations per sample).
> 
> -Andy



Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Pavol Vaskovic via swift-dev
As the next two paragraphs after the part you quoted go on explaining, I'm
hoping that with this approach we could adaptively sample the benchmark
until we get a stable population, but starting from a lower iteration count.

If the Python implementation bears this out, the proper solution would be
to change the implementation in DriverUtil.swift, from the current ~1s
adaptive num-iters runs to finer-grained runs. We'd be gathering many
smaller samples, tossing out anomalies as we go until we gather a stable
sample population (with a low coefficient of variation) or run out of the
allotted time.

This has the potential to speed up the benchmark suite with more intelligent
management of the measurements, instead of using the brute force of super-long
runtime to drown out the errors as we do currently.

(I am aware of various aspects of this approach that have the potential to
mess with the caching: the time measurement itself, more frequent logging -
this would currently rely on --verbose mode - and invoking Benchmark_O
from Python…)

The proof is in the pudding, so I guess we'll learn this week whether this
approach works, when I hammer out the implementation in Python for
demonstration.

--Pavol

On Tue, 13 Jun 2017 at 03:19, Andrew Trick  wrote:

>
> On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic  wrote:
>
> I have sketched an algorithm for getting more consistent test results; so
> far it's in Numbers. I have run the whole test suite for 100 samples and
> observed the varying distribution of test results. The first result is
> quite often an outlier, with subsequent results being quicker. Depending on
> the "weather" on the test machine, you sometimes measure anomalies. So I'm
> tracking the coefficient of variation of the sample population and purging
> anomalous results when it exceeds 5%. This results in a solid sample
> population where the standard deviation is a meaningful value that can be
> used in judging the significance of change between master and branch.
>
>
> That’s a reasonable approach for running 100 samples. I’m not sure how it
> fits with the goal of minimizing turnaround time. Typically you don’t need
> more than 3 samples (keeping in mind we’re usually averaging over thousands
> of iterations per sample).
>
> -Andy
>


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Andrew Trick via swift-dev

> On Jun 12, 2017, at 5:29 PM, Michael Gottesman  wrote:
> 
>> I don't know what that is. 
> 
> Check it out: https://en.wikipedia.org/wiki/Mann–Whitney_U_test. It is a 
> non-parametric test of whether two sets of samples come from the same 
> distribution. As a bonus, it does not assume that our data is from a normal 
> distribution (a problem with using mean/standard deviation, which assumes a 
> normal distribution).

This is a fairly important point that I didn’t stress enough. In my experience 
with other benchmark suites the sample distribution is nothing close to normal, 
which is why I’ve always thought MEAN/SD was silly. But the “noise” I was 
dealing with was in the underlying H/W and OS mode transitions. General system 
noise from other processes might lead to a more normal distribution… but as 
I’ve said, benchmarking on a noisy system is something to be avoided.

-Andy


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Andrew Trick via swift-dev

> On Jun 12, 2017, at 5:55 PM, Pavol Vaskovic  wrote:
> 
> 
> 
> On Tue, Jun 13, 2017 at 2:31 AM, Michael Gottesman wrote:
>> I don't think we can get more consistent test results just from re-running 
>> tests that were detected as changes in the first pass, as described in 
>> SR-4669, because that improves 
>> accuracy only for one side of the comparison - the branch. When the 
>> measurement error is with the baseline from the master, re-running the 
>> branch would not help.
> 
> When we are benchmarking, we can always have access to the baseline compiler 
> by stashing the build directory. So we can always take more samples (in fact 
> when I was talking about re-running I always assumed we would).
> 
> Well, if I understand correctly how the swift-CI builds perf-PR, then 
> switching between master and branch from Benchmark_Driver is not possible...
> 
> Or are you thinking about a manual benchmarking scenario?
> 
> --Pavol

I was thinking (hoping) Benchmark_Driver would support this and we could ask 
for support from CI to call the driver that way.

-Andy


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Andrew Trick via swift-dev

> On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic  wrote:
> 
> I have sketched an algorithm for getting more consistent test results; so far 
> it's in Numbers. I have run the whole test suite for 100 samples and observed 
> the varying distribution of test results. The first result is quite often an 
> outlier, with subsequent results being quicker. Depending on the "weather" on 
> the test machine, you sometimes measure anomalies. So I'm tracking the 
> coefficient of variation of the sample population and purging anomalous 
> results when it exceeds 5%. This results in a solid sample population where 
> the standard deviation is a meaningful value that can be used in judging the 
> significance of change between master and branch.

That’s a reasonable approach for running 100 samples. I’m not sure how it fits 
with the goal of minimizing turnaround time. Typically you don’t need more than 
3 samples (keeping in mind we’re usually averaging over thousands of iterations 
per sample).

-Andy


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Andrew Trick via swift-dev

> On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic  wrote:
> 
> We really have two problems:
> 1. spurious results 
> 2. the turnaround time for the entire benchmark suite
> 
> 
> I don't think we can get more consistent test results just from re-running 
> tests that were detected as changes in the first pass, as described in 
> SR-4669, because that improves accuracy only 
> for one side of the comparison - the branch. When the measurement error is 
> with the baseline from the master, re-running the branch would not help.

My understanding of this feature is that it would rerun both branches (or 
possibly whichever is slower or more jittery, but that’s probably 
overcomplicating it).

-Andy


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Pavol Vaskovic via swift-dev
On Tue, Jun 13, 2017 at 2:31 AM, Michael Gottesman wrote:
>
> I don't think we can get more consistent test results just from re-running
> tests that were detected as changes in the first pass, as described in
> SR-4669, because that improves
> accuracy only for one side of the comparison - the branch. When the
> measurement error is with the baseline from the master, re-running the
> branch would not help.
>
>
> When we are benchmarking, we can always have access to the baseline
> compiler by stashing the build directory. So we can always take more
> samples (in fact when I was talking about re-running I always assumed we
> would).
>

Well, if I understand correctly how the swift-CI builds perf-PR, then
switching between master and branch from Benchmark_Driver is not possible...

Or are you thinking about a manual benchmarking scenario?

--Pavol


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Pavol Vaskovic via swift-dev
Hi Andrew,

I wrote the draft of this e-mail a few weeks ago, and the following sentence
is not true:
>
> It's emitted when the new MIN falls inside the (MIN..MAX) range of the OLD
> baseline. It is not checked the other way around.
>
See below...

On Tue, Jun 13, 2017 at 1:45 AM, Pavol Vaskovic  wrote:

> Hi Andrew,
>
> On Mon, Jun 12, 2017 at 11:55 PM, Andrew Trick  wrote:
>
>> To partially address this issue (I'm guessing) the last SPEEDUP column
>> sometimes features a mysterious question mark in brackets. It's emitted when
>> the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is
>> not checked the other way around.
>>
>>
>> That bug must have been introduced during one of the rewrites. Is that in
>> the driver or compare script? Why not fix that bug?
>>
>
> That is in the compare script. It looks like the else branch got lost during
> a rewrite (search for "(?)" in that diff). I could certainly fix that too,
> but I'm not sure that would be enough to fix all our problems.
>

I even wrote tests for this, and my implementation is pretty clear too…
somehow I forgot this.

> # Add ' (?)' to the speedup column as indication of dubious changes:
> # result's MIN falls inside the (MIN, MAX) interval of result they are
> # being compared with.
> self.is_dubious = (
>     ' (?)' if ((old.min < new.min and new.min < old.max) or
>                (new.min < old.min and old.min < new.max))
>     else '')

I'm sorry for the confusion.

--Pavol


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Michael Gottesman via swift-dev

> On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic via swift-dev wrote:
> 
> Hi Andrew,
> 
> On Mon, Jun 12, 2017 at 11:55 PM, Andrew Trick wrote:
>> To partially address this issue (I'm guessing) the last SPEEDUP column 
>> sometimes features a mysterious question mark in brackets. It's emitted when 
>> the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is not 
>> checked the other way around.
> 
> That bug must have been introduced during one of the rewrites. Is that in the 
> driver or compare script? Why not fix that bug?
> 
> That is in the compare script. It looks like the else branch got lost during 
> a rewrite (search for "(?)" in that diff). I could certainly fix that too, 
> but I'm not sure that would be enough to fix all our problems.
>  
> We clearly don’t want to see any false changes. The ‘?’ is a signal to me to 
> avoid reporting those results. They should either be ignored as flaky 
> benchmarks or rerun. I thought rerunning them was the fix you were working on.
> 
> If you have some other proposal for fixing this then please, in a separate 
> proposal, explain your new approach, why your new approach works, and 
> demonstrate its effectiveness with results that you’ve gathered over time on 
> the side. Please don’t change how the driver computes performance changes on 
> a whim while introducing other features.
> ... 
> I honestly don’t know what MEAN/SD has to do with the problem you’re pointing 
> to above. The benchmark harness is already set up to compute the average 
> iteration time, and our benchmarks are not currently designed to measure 
> cache effects or any other phenomenon that would have a statistically 
> meaningful sample distribution. Statistical methods might be interesting if 
> you’re analyzing benchmark results over a long period of time or system noise 
> levels across benchmarks.
> 
> The primary purpose of the benchmark suite is identifying performance 
> bugs/regressions at the point they occur. It should be no more complicated 
> than necessary to do that. The current approach is simple: run a 
> microbenchmark long enough in a loop to factor out benchmark startup time, 
> cache/cpu warmup effects, and timer resolution, then compute the average 
> iteration time. Throw away any run that was apparently impacted by system 
> noise.
> 
> We really have two problems:
> 1. spurious results 
> 2. the turnaround time for the entire benchmark suite
> 
> 
> I don't think we can get more consistent test results just from re-running 
> tests that were detected as changes in the first pass, as described in 
> SR-4669, because that improves 
> accuracy only for one side of the comparison - the branch. When the 
> measurement error is with the baseline from the master, re-running the branch 
> would not help.

When we are benchmarking, we can always have access to the baseline compiler by 
stashing the build directory. So we can always take more samples (in fact when 
I was talking about re-running I always assumed we would).

> 
> I have sketched an algorithm for getting more consistent test results; so far 
> it's in Numbers. I have run the whole test suite for 100 samples and observed 
> the varying distribution of test results. The first result is quite often an 
> outlier, with subsequent results being quicker. Depending on the "weather" on 
> the test machine, you sometimes measure anomalies. So I'm tracking the 
> coefficient of variation of the sample population and purging anomalous 
> results when it exceeds 5%. This results in a solid sample population where 
> the standard deviation is a meaningful value that can be used in judging the 
> significance of change between master and branch.
> 
> This week I'm working on transferring this algorithm to Python and putting it 
> probably somewhere around `Benchmark_Driver`. It is possible this would 
> ultimately land in Swift (DriverUtil.swift), but to demonstrate the soundness 
> of this approach to you all, I wanted to do the Python implementation first.
> 
> Depending on how this behaves, my hunch is we could speed up the benchmark 
> suite by not running test samples for 1 second and taking many samples, but 
> by adaptively sampling each benchmark until we get a stable sample population. 
> In the worst case this would degrade to the current (1s/sample)*num_samples. 
> This could be further improved by running multiple passes through the test 
> suite, to eliminate anomalies caused by other background processes. That is 
> the core idea from --rerun (SR-4669). 
> 
> --Pavol
> 


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Michael Gottesman via swift-dev

> On Jun 12, 2017, at 4:54 PM, Pavol Vaskovic  wrote:
> 
> 
> 
> On Mon, Jun 12, 2017 at 11:55 PM, Michael Gottesman wrote:
> 
> The current design assumes that in such cases, the workload will be increased 
> so that is not an issue.
> 
> I understand. But clearly some part of our process is failing, because there 
> are multiple benchmarks in the 10ms range that have been in the tree for 
> months without being fixed.

I think that is just inertia and being busy. Patch? I'll review = ).

>  
> The reason why we use the min is that statistically we are not interested in 
> estimating the "mean" or "center" of the distribution. Rather, we are actually 
> interested in the "speed of light" of the computation, implying that we are 
> looking for the min.
> 
> I understand that. But all measurements have a certain degree of error 
> associated with them. Our issue is two-fold: we need to differentiate between 
> normal variation between measured samples under "perfect" conditions and 
> samples that are worse because of interference from other background 
> processes.

I disagree. CPUs are inherently messy but disruptions tend to be due to 
temporary spikes most of the time once you have quieted down your system by 
unloading a few processes.

>  
> What do you mean by anomalous results?
> 
> I mean results that significantly stand out from the measured sample 
> population.

What that could mean is that we need to run a couple of extra iterations to 
warm up the cpu/cache/etc before we start gathering samples.

> 
>> Currently I'm working on an improved sample filtering algorithm. Stay tuned 
>> for a demonstration in Benchmark_Driver (Python); if it pans out, it might be 
>> time to change adaptive sampling in DriverUtil.swift.
> 
> Have you looked at using the Mann-Whitney U algorithm? (I am not sure if we 
> are using it or not)
> 
> I don't know what that is.

Check it out: https://en.wikipedia.org/wiki/Mann–Whitney_U_test. It is a 
non-parametric test of whether two sets of samples come from the same 
distribution. As a bonus, it does not assume that our data is from a normal 
distribution (a problem with using mean/standard deviation, which assumes a 
normal distribution).

We have been using Mann-Whitney internally for a while successfully to reduce 
the noise.
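A minimal sketch of how such a check could look from the Python driver
(assuming SciPy is available; this is not the internal tooling mentioned
above):

    from scipy.stats import mannwhitneyu

    def same_distribution(old_samples, new_samples, alpha=0.05):
        """Non-parametric check whether two sample sets plausibly come from
        the same distribution; no normality assumption needed."""
        _, p_value = mannwhitneyu(old_samples, new_samples,
                                  alternative="two-sided")
        return p_value >= alpha  # True -> no statistically significant change

    # Jittery but equivalent samples should not be flagged as a change.
    old = [103, 101, 99, 104, 102, 100, 98, 105]
    new = [102, 100, 101, 99, 103, 104, 98, 100]
    print(same_distribution(old, new))  # expected: True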

> Here's what I've been doing:
> 
> Depending on the "weather" on the test machine, you sometimes measure 
> anomalies. So I'm tracking the coefficient of variation of the sample 
> population and purging anomalous results (1 sigma from max) when it exceeds 
> 5%. This results in a quite solid sample population where the standard 
> deviation is a meaningful value that can be used in judging the significance 
> of change between master and branch.
> 
> --Pavol



Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Pavol Vaskovic via swift-dev
On Mon, Jun 12, 2017 at 11:55 PM, Michael Gottesman wrote:

>
> The current design assumes that in such cases, the workload will be
> increased so that is not an issue.
>

I understand. But clearly some part of our process is failing, because
there are multiple benchmarks in the 10ms range that have been in the tree
for months without being fixed.


> The reason why we use the min is that statistically we are not interested
> in estimating the "mean" or "center" of the distribution. Rather, we are
> actually interested in the "speed of light" of the computation, implying
> that we are looking for the min.
>

I understand that. But all measurements have a certain degree of error
associated with them. Our issue is two-fold: we need to differentiate
between normal variation between measured samples under "perfect"
conditions and samples that are worse because of interference from other
background processes.


> What do you mean by anomalous results?
>

I mean results that significantly stand out from the measured sample
population.

> Currently I'm working on an improved sample filtering algorithm. Stay tuned
> for a demonstration in Benchmark_Driver (Python); if it pans out, it might be
> time to change adaptive sampling in DriverUtil.swift.
>
>
> Have you looked at using the Mann-Whitney U algorithm? (I am not sure if
> we are using it or not)
>

I don't know what that is. Here's what I've been doing:

Depending on the "weather" on the test machine, you sometimes measure
anomalies. So I'm tracking the coefficient of variation of the sample
population and purging anomalous results (1 sigma from max) when it exceeds
5%. This results in a quite solid sample population where the standard
deviation is a meaningful value that can be used in judging the significance
of change between master and branch.

--Pavol


[swift-dev] Spanish Translation

2017-06-12 Thread Luis Leos via swift-dev
I'm available to translate if needed.

Thanks!

-Luis Leos
Sent from my iPhone


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Pavol Vaskovic via swift-dev
Hi Andrew,

On Mon, Jun 12, 2017 at 11:55 PM, Andrew Trick  wrote:

> To partially address this issue (I'm guessing) the last SPEEDUP column
> sometimes features a mysterious question mark in brackets. It's emitted when
> the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is
> not checked the other way around.
>
>
> That bug must have been introduced during one of the rewrites. Is that in
> the driver or compare script? Why not fix that bug?
>

That is in the compare script. It looks like the else branch got lost during
a rewrite (search for "(?)" in that diff). I could certainly fix that too,
but I'm not sure that would be enough to fix all our problems.


> We clearly don’t want to see any false changes. The ‘?’ is a signal to me
> to avoid reporting those results. They should either be ignored as flaky
> benchmarks or rerun. I thought rerunning them was the fix you were working
> on.
>
> If you have some other proposal for fixing this then please, in a separate
> proposal, explain your new approach, why your new approach works, and
> demonstrate its effectiveness with results that you’ve gathered over time
> on the side. Please don’t change how the driver computes performance
> changes on a whim while introducing other features.
>
...

> I honestly don’t know what MEAN/SD has to do with the problem you’re
> pointing to above. The benchmark harness is already set up to compute the
> average iteration time, and our benchmarks are not currently designed to
> measure cache effects or any other phenomenon that would have a
> statistically meaningful sample distribution. Statistical methods might be
> interesting if you’re analyzing benchmark results over a long period of
> time or system noise levels across benchmarks.
>
> The primary purpose of the benchmark suite is identifying performance
> bugs/regressions at the point they occur. It should be no more complicated
> than necessary to do that. The current approach is simple: run a
> microbenchmark long enough in a loop to factor out benchmark startup time,
> cache/cpu warmup effects, and timer resolution, then compute the average
> iteration time. Throw away any run that was apparently impacted by system
> noise.
>
> We really have two problems:
> 1. spurious results
> 2. the turnaround time for the entire benchmark suite
>
>
I don't think we can get more consistent test results just from re-running
tests that were detected as changes in the first pass, as described in
SR-4669, because that improves
accuracy only for one side of the comparison - the branch. When the
measurement error is with the baseline from the master, re-running the
branch would not help.

I have sketched an algorithm for getting more consistent test results; so
far it's in Numbers. I have run the whole test suite for 100 samples and
observed the varying distribution of test results. The first result is
quite often an outlier, with subsequent results being quicker. Depending on
the "weather" on the test machine, you sometimes measure anomalies. So I'm
tracking the coefficient of variation of the sample population and purging
anomalous results when it exceeds 5%. This results in a solid sample
population where the standard deviation is a meaningful value that can be
used in judging the significance of change between master and branch.

This week I'm working on transferring this algorithm to Python and putting
it probably somewhere around `Benchmark_Driver`. It is possible this would
ultimately land in Swift (DriverUtil.swift), but to demonstrate the
soundness of this approach to you all, I wanted to do the Python
implementation first.

Depending on how this behaves, my hunch is we could speed up the benchmark
suite by not running test samples for 1 second and taking many samples,
but by adaptively sampling each benchmark until we get a stable sample
population. In the worst case this would degrade to the current
(1s/sample)*num_samples. This could be further improved by running
multiple passes through the test suite, to eliminate anomalies caused by
other background processes. That is the core idea from --rerun (SR-4669).

--Pavol


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Andrew Trick via swift-dev

> On Jun 12, 2017, at 2:55 PM, Michael Gottesman  wrote:
> 
>> The current approach to detecting performance changes is fragile for tests 
>> that have very low absolute runtime, as they are easily over the 5% 
>> improvement/regression threshold when the test machine gets a little bit 
>> noisy. For example, in the benchmark on PR #9806:
>> 
>> BitCount 12  14  +16.7%  0.86x
>> SuffixCountableRange 10  11  +10.0%  0.91x
>> MapReduce303 331 +9.2%   0.92x
>> These are all false changes (and there are quite a few more there).
> 
> The current design assumes that in such cases, the workload will be increased 
> so that is not an issue.

That is also a valid fix for the problem, which I forgot to mention.
-Andy


Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Andrew Trick via swift-dev

> On Jun 12, 2017, at 1:45 PM, Pavol Vaskovic  wrote:
> 
> On Tue, May 16, 2017 at 9:10 PM, Dave Abrahams via swift-dev 
> <swift-dev@swift.org> wrote:
> 
> on Thu May 11 2017, Pavol Vaskovic  wrote:
> 
> I have run Benchmark_O with --num-iters=100 on my machine for the
> whole performance test suite, to get a feeling for the distribution of
> benchmark samples, because I also want to move the Benchmark_Driver to
> use MEAN instead of MIN in the analysis.
> 
> I'm concerned about that, especially for microbenchmarks; it seems to me
> as though MIN is the right measurement.  Can you explain why MEAN is
> better?
> 
> 
> On Wed, May 17, 2017 at 1:26 AM, Andrew Trick wrote:
> Using MEAN wasn’t part of the aforementioned SR-4669. The purpose of that 
> task is to reduce the time CI takes to get useful results (e.g. by using 3 
> runs as a baseline). MEAN isn’t useful if you’re only gathering 3 data points.
> 
> 
> The current approach to detecting performance changes is fragile for tests 
> that have very low absolute runtime, as they are easily over the 5% 
> improvement/regression threshold when the test machine gets a little bit 
> noisy. For example, in the benchmark on PR #9806:
> 
> BitCount  12  14  +16.7%  0.86x
> SuffixCountableRange  10  11  +10.0%  0.91x
> MapReduce 303 331 +9.2%   0.92x
> These are all false changes (and there are quite a few more there).
> 
> To partially address this issue (I'm guessing) the last SPEEDUP column 
> sometimes features a mysterious question mark in brackets. It's emitted when 
> the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is not 
> checked the other way around.

That bug must have been introduced during one of the rewrites. Is that in the 
driver or compare script? Why not fix that bug?

We clearly don’t want to see any false changes. The ‘?’ is a signal to me to 
avoid reporting those results. They should either be ignored as flaky 
benchmarks or rerun. I thought rerunning them was the fix you were working on.

If you have some other proposal for fixing this then please, in a separate 
proposal, explain your new approach, why your new approach works, and 
demonstrate its effectiveness with results that you’ve gathered over time on 
the side. Please don’t change how the driver computes performance changes on a 
whim while introducing other features.

> I'm suggesting using the MEAN value in combination with SD (standard 
> deviation) to detect changes (improvements/regressions). At the moment, this 
> is hard to do, because the aggregate test results reported by Benchmark_O 
> (and co.) can include anomalous results in the sample population that mess up 
> the MEAN and SD, too. Currently it is only visible in the high sample range - 
> the difference between the reported MIN and MAX. But it is not clear how many 
> results are anomalous.

I honestly don’t know what MEAN/SD has to do with the problem you’re pointing 
to above. The benchmark harness is already set up to compute the average 
iteration time, and our benchmarks are not currently designed to measure cache 
effects or any other phenomenon that would have a statistically meaningful 
sample distribution. Statistical methods might be interesting if you’re 
analyzing benchmark results over a long period of time or system noise levels 
across benchmarks.

The primary purpose of the benchmark suite is identifying performance 
bugs/regressions at the point they occur. It should be no more complicated than 
necessary to do that. The current approach is simple: run a microbenchmark long 
enough in a loop to factor out benchmark startup time, cache/cpu warmup 
effects, and timer resolution, then compute the average iteration time. Throw 
away any run that was apparently impacted by system noise.

We really have two problems:
1. spurious results 
2. the turnaround time for the entire benchmark suite

Running benchmarks on a noisy machine is a losing proposition because you won’t 
be able to address problem #1 without making problem #2 much worse.

-Andy

> Currently I'm working on an improved sample filtering algorithm. Stay tuned 
> for a demonstration in Benchmark_Driver (Python); if it pans out, it might be 
> time to change adaptive sampling in DriverUtil.swift.
> 
> Best regards
> Pavol Vaskovic
> 
> 



Re: [swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Michael Gottesman via swift-dev

> On Jun 12, 2017, at 1:45 PM, Pavol Vaskovic via swift-dev wrote:
> 
> On Tue, May 16, 2017 at 9:10 PM, Dave Abrahams via swift-dev 
> <swift-dev@swift.org> wrote:
> 
> on Thu May 11 2017, Pavol Vaskovic  wrote:
> 
> I have run Benchmark_O with --num-iters=100 on my machine for the
> whole performance test suite, to get a feeling for the distribution of
> benchmark samples, because I also want to move the Benchmark_Driver to
> use MEAN instead of MIN in the analysis.
> 
> I'm concerned about that, especially for microbenchmarks; it seems to me
> as though MIN is the right measurement.  Can you explain why MEAN is
> better?
> 
> 
> On Wed, May 17, 2017 at 1:26 AM, Andrew Trick wrote:
> Using MEAN wasn’t part of the aforementioned SR-4669. The purpose of that 
> task is to reduce the time CI takes to get useful results (e.g. by using 3 
> runs as a baseline). MEAN isn’t useful if you’re only gathering 3 data points.
> 
> 
> The current approach to detecting performance changes is fragile for tests 
> that have very low absolute runtime, as they are easily over the 5% 
> improvement/regression threshold when the test machine gets a little bit 
> noisy. For example, in the benchmark on PR #9806:
> 
> BitCount  12  14  +16.7%  0.86x
> SuffixCountableRange  10  11  +10.0%  0.91x
> MapReduce 303 331 +9.2%   0.92x
> These are all false changes (and there are quite a few more there).

The current design assumes that in such cases, the workload will be increased 
so that is not an issue.

The reason why we use the min is that statistically we are not interested in 
estimating the "mean" or "center" of the distribution. Rather, we are actually 
interested in the "speed of light" of the computation, implying that we are 
looking for the min.
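For illustration (a toy example, not data from the suite): one-sided noise
inflates the mean but leaves the min untouched.

    samples = [12, 12, 12, 13, 19]      # one sample hit by background noise
    print(min(samples))                 # 12   -> the "speed of light" estimate
    print(sum(samples) / len(samples))  # 13.6 -> pulled up by the outlier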

> 
> To partially address this issue (I'm guessing) the last SPEEDUP column 
> sometimes features a mysterious question mark in brackets. It's emitted when 
> the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is not 
> checked the other way around.
> 
> I'm suggesting using the MEAN value in combination with SD (standard 
> deviation) to detect changes (improvements/regressions). At the moment, this 
> is hard to do, because the aggregate test results reported by Benchmark_O 
> (and co.) can include anomalous results in the sample population that mess up 
> the MEAN and SD, too. Currently it is only visible in the high sample range - 
> the difference between the reported MIN and MAX. But it is not clear how many 
> results are anomalous.

What do you mean by anomalous results?

> 
> Currently I'm working on an improved sample filtering algorithm. Stay tuned 
> for a demonstration in Benchmark_Driver (Python); if it pans out, it might be 
> time to change adaptive sampling in DriverUtil.swift.

Have you looked at using the Mann-Whitney U algorithm? (I am not sure if we are 
using it or not)

> 
> Best regards
> Pavol Vaskovic
> 
> 



[swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

2017-06-12 Thread Pavol Vaskovic via swift-dev
On Tue, May 16, 2017 at 9:10 PM, Dave Abrahams via swift-dev <
swift-dev@swift.org> wrote:

>
> on Thu May 11 2017, Pavol Vaskovic  wrote:
>
>> I have run Benchmark_O with --num-iters=100 on my machine for the
>> whole performance test suite, to get a feeling for the distribution of
>> benchmark samples, because I also want to move the Benchmark_Driver to
>> use MEAN instead of MIN in the analysis.
>
>
> I'm concerned about that, especially for microbenchmarks; it seems to me
> as though MIN is the right measurement.  Can you explain why MEAN is
> better?
>
>
On Wed, May 17, 2017 at 1:26 AM, Andrew Trick  wrote:

> Using MEAN wasn’t part of the aforementioned SR-4669. The purpose of that
> task is to reduce the time CI takes to get useful results (e.g. by using 3
> runs as a baseline). MEAN isn’t useful if you’re only gathering 3 data
> points.
>


The current approach to detecting performance changes is fragile for tests that
have very low absolute runtime, as they are easily over the 5%
improvement/regression threshold when the test machine gets a little bit
noisy. For example, in the benchmark on PR #9806:

BitCount              12   14  +16.7%  0.86x
SuffixCountableRange  10   11  +10.0%  0.91x
MapReduce            303  331   +9.2%  0.92x

These are all false changes (and there are quite a few more there).
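A sketch of why the small benchmarks trip the threshold so easily (an assumed
form of the check, not the actual compare script): a change is flagged
whenever the relative delta of the reported MIN exceeds 5%.

    def is_significant_change(old_min, new_min, threshold=0.05):
        delta = (new_min - old_min) / float(old_min)
        return abs(delta) > threshold

    # With the BitCount numbers above, a 2-unit jitter on a 12-unit runtime
    # already reads as a 16.7% regression:
    print(is_significant_change(12, 14))  # True, (14 - 12) / 12 ~ 0.167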

To partially address this issue (I'm guessing) the last SPEEDUP column
sometimes features a mysterious question mark in brackets. It's emitted when
the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is
not checked the other way around.

I'm suggesting using the MEAN value in combination with SD
(standard deviation) to detect changes (improvements/regressions). At
the moment, this is hard to do, because the aggregate test results reported
by Benchmark_O (and co.) can include anomalous results in the sample
population that mess up the MEAN and SD, too. Currently it is only
visible in the high sample range - the difference between the reported MIN
and MAX. But it is not clear how many results are anomalous.
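One minimal sketch of what such a MEAN/SD check could look like (a common
two-sample formulation, assuming anomalies were already filtered out; not a
settled design): flag a change when the difference of the means is large
relative to the combined standard error.

    import math
    import statistics

    def significant_change(old_samples, new_samples, z=2.0):
        """Rough check: the means differ by more than ~z combined
        standard errors."""
        mean_old = statistics.mean(old_samples)
        mean_new = statistics.mean(new_samples)
        se = math.sqrt(statistics.variance(old_samples) / len(old_samples) +
                       statistics.variance(new_samples) / len(new_samples))
        return abs(mean_new - mean_old) > z * se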

Currently I'm working on an improved sample filtering algorithm. Stay tuned
for a demonstration in Benchmark_Driver (Python); if it pans out, it might be
time to change adaptive sampling in DriverUtil.swift.

Best regards
Pavol Vaskovic