henrikingo opened a new issue, #99:
URL: https://github.com/apache/otava/issues/99
I thought this could be a fun task for a new contributor interested in
change point detection. This is focused on generating test cases, but in
addition could be an interesting blog post to educate a wider public on what
exactly change detection does with benchmarking results. Third, this set of
test cases will be an excellent benchmark with which we can compare Otava
against competing tools and algorithms. (e.g. anomaly detection is frequently
offered in services like Datadog or Grafana.)
Sub-tasks:
- Create the below "basic building block" timeseries, as well as
combinations of them. Preferably use Python. (Generating a file with CSV data,
for example with Excel, is also possible in a pinch...)
- Create Otava test cases with pytest that use the data sets.
- Also test other algorithms and tools.
- Write a blog post with your findings.
0. Assumptions:
The problem space is one where a performance test is run repeatedly,
generating a time series of measurements. Ideally this would produce an
infinitely long constant timeseries, but in practice the output of the
performance test includes a random noise component. Strictly speaking we do not
know the distribution of this noise, and therefore also not of the test
results, but in practice assuming a normal distribution seems to work well.
Finally, there will be performance regressions and improvements in the system
under test. These are discrete steps up or down, after which the timeseries
again continues as a function of constant + noise.
In other words, things we do NOT encounter in the domain are for example
trending, where the timeseries keeps increasing or decreasing at a constant
rate, or cyclic behavior, such as seasonality at specific time of day, week or
year.
1. The basic building blocks
These timeseries can themselves already be used as test cases:
Parameter L = [50, 500] is the length of all these timeseries. For
simplicity, most of the below also have a negative counterpart, which is
omitted. (e.g. decrease vs increase)
Let's begin:
Constant: S = x, x, x, x...
Noise, normally distributed: S = x1, x2, x3, ... where each xi ~ N(0, sigma)
- for benchmarking purposes, we can define a maximum range where all values
are within the 99.99th percentile, or roughly 4 standard deviations. This makes
it possible for an algorithm to correctly detect 100% of cases, as the noise
component is then bounded.
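A minimal sketch of that bounded-noise generator, assuming NumPy; the function name, the default seed, and clipping at 4 sigma (approximating the 99.99th percentile bound above) are my choices, not part of the issue:

```python
import numpy as np

def bounded_normal_noise(length, sigma=1.0, bound_sigmas=4.0, rng=None):
    """Normally distributed noise, clipped to +/- bound_sigmas standard
    deviations so the noise component is bounded and 100% detection of
    larger changes is possible in principle."""
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(0.0, sigma, size=length)
    return np.clip(noise, -bound_sigmas * sigma, bound_sigmas * sigma)
```

Clipping is the simplest way to enforce the bound; rejection sampling (redrawing out-of-range values) would preserve the tails' shape better if that matters for a given test case.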
Noise, uniformly distributed, aka white noise or static noise: S = random(min, max)
Outlier, aka anomaly, is a single deviating point: S = x,x,x,x,x,x',x,x...
where x' != x
Step function, aka single change point: S = x1,x1,x1,x2,x2,x2,x2.....
Regression + fix soon after: S = x1,x1...x2,...x2, x3, x3, x3...
- The number of x2 points is small compared to x1 and x3, but at least 2 points
- Special case: x1 == x3
- Special case: x2 is a single point. This can happen, but mathematically
this is indistinguishable from a single outlier (see above).
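The basic building blocks above could be generated along these lines (a sketch assuming NumPy; function names, default levels, and the midpoint placement of changes are illustrative, including the `x3 is None` default implementing the "x1 == x3" special case):

```python
import numpy as np

def constant(length, x=100.0):
    """Constant: S = x, x, x, ..."""
    return np.full(length, x)

def outlier(length, x=100.0, x_prime=110.0, at=None):
    """Single deviating point x' in an otherwise constant series."""
    s = np.full(length, x)
    s[at if at is not None else length // 2] = x_prime
    return s

def step(length, x1=100.0, x2=110.0, at=None):
    """Step function, aka single change point."""
    s = np.full(length, x1)
    s[(at if at is not None else length // 2):] = x2
    return s

def regression_and_fix(length, x1=100.0, x2=110.0, x3=None, width=3, at=None):
    """Short x2 segment (the regression) between x1 and x3; x3 defaults
    to x1, i.e. the special case where the fix restores the baseline."""
    s = np.full(length, x1)
    start = at if at is not None else length // 2
    s[start:start + width] = x2
    s[start + width:] = x1 if x3 is None else x3
    return s
```

Adding `bounded_normal_noise`-style noise on top of any of these yields the realistic variants; the negative counterparts fall out of choosing x2 < x1.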
2. More interesting phenomena
Banding is a form of noise (unwanted change) where the results oscillate
randomly between two values.
S = x1, x2, x2, x1, x2, x1, x1, x1, x2, x2, x1, x2, x2...
Typically:
abs(x2 - x1) << x1
and also:
abs(x2 - x1) > std dev when random noise is mixed in
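A banding generator might look like this (a sketch assuming NumPy; the levels, band probability, and sigma defaults are mine, chosen so that the band gap is small relative to x1 but larger than the mixed-in noise):

```python
import numpy as np

def banding(length, x1=100.0, x2=100.5, p=0.5, sigma=0.1, rng=None):
    """Oscillate randomly between two close levels x1 and x2 (picked
    with probability p for x2), plus small normal noise on top."""
    rng = rng or np.random.default_rng(0)
    levels = rng.choice([x1, x2], size=length, p=[1 - p, p])
    return levels + rng.normal(0.0, sigma, size=length)
```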
Constant mean, change in variance: S = N(0, sigma1)..., N(0, sigma2)...
sigma2 > sigma1
(sigma1 > sigma2 is the negative case)
Constant mean and variance, but phase changes: S = cos(x)..., sin(x)...
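Both of these can be sketched briefly (assuming NumPy; the mean, sigmas, sample spacing, and midpoint change location are illustrative defaults of mine):

```python
import numpy as np

def variance_change(length, mu=100.0, sigma1=1.0, sigma2=3.0, at=None, rng=None):
    """Constant mean mu throughout, standard deviation jumps from
    sigma1 to sigma2 at the change point."""
    rng = rng or np.random.default_rng(0)
    at = at if at is not None else length // 2
    return np.concatenate([rng.normal(mu, sigma1, size=at),
                           rng.normal(mu, sigma2, size=length - at)])

def phase_change(length, at=None):
    """cos before the change point, sin after: same mean and variance,
    only the phase of the oscillation shifts."""
    at = at if at is not None else length // 2
    t = np.arange(length) * 0.25  # sample spacing in radians
    return np.where(np.arange(length) < at, np.cos(t), np.sin(t))
```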
Multiple consecutive changes: S = x0, x0, x0... x1, x2, ... xn, xn, xn...
where x0 < x1 < x2 ... < xn
- It depends on the specific problem space what the right behavior is here.
For perf testing it is possible that multiple independent improvements were
merged back to back.
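A sketch of that case, assuming NumPy and equal-length segments (the level values and segmentation are my illustrative choices, matching the back-to-back independent improvements described above):

```python
import numpy as np

def consecutive_changes(length, levels=(100.0, 103.0, 106.0, 110.0)):
    """Split the series into len(levels) equal segments, each held at
    the next strictly increasing level."""
    edges = np.linspace(0, length, len(levels) + 1).astype(int)
    s = np.empty(length)
    for lo, hi, x in zip(edges[:-1], edges[1:], levels):
        s[lo:hi] = x
    return s
```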
3. Generate all possible combinations of the above, including with themselves.
(e.g. more than a single outlier, more than a single phase change...)
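One way to enumerate the combinations is to sum the building blocks elementwise and iterate over all generator pairs, including a generator paired with itself (a sketch; the helper names and the pairwise-sum composition are my assumptions, since the issue leaves the combination scheme open):

```python
import itertools
import numpy as np

def all_pairs(generators, length=500):
    """Yield (name, series) for every pair of named generator functions,
    including each generator combined with itself (e.g. two outliers).

    `generators` maps a name to a function taking a length argument."""
    for (n1, g1), (n2, g2) in itertools.combinations_with_replacement(
            generators.items(), 2):
        yield f"{n1}+{n2}", g1(length) + g2(length)
```

Higher-order combinations (triples and beyond) would follow the same pattern with a larger replacement count.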
4. Verify that Otava handles all of the above.
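The pytest cases from the sub-task list could take a shape like the following. Note that `detect_change_points` here is a placeholder stand-in, not Otava's real API; the actual tests would call into Otava (or a competing tool) instead:

```python
import numpy as np
import pytest

def detect_change_points(series):
    """Placeholder detector: reports indices where the value jumps by
    more than 5. Replace with a call into Otava's actual entry point."""
    return list(np.flatnonzero(np.abs(np.diff(series)) > 5.0) + 1)

@pytest.mark.parametrize("at", [10, 25, 40])
def test_single_step_is_detected(at):
    """A step from 100 to 110 at index `at` must be found exactly once."""
    series = np.concatenate([np.full(at, 100.0), np.full(50 - at, 110.0)])
    assert detect_change_points(series) == [at]
```

Parametrizing over the generators and lengths (L = 50 and 500) from section 1 would turn each building block into its own test case.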
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]