henrikingo opened a new issue, #99:
URL: https://github.com/apache/otava/issues/99

   I thought this could be a fun task for a new contributor interested in 
change point detection. This is focused on generating test cases, but in 
addition could be an interesting blog post to educate a wider public on what 
exactly change detection does with benchmarking results. Third, this set of 
test cases will be an excellent benchmark with which we can compare Otava 
against competing tools and algorithms. (e.g. anomaly detection is frequently 
offered in services like Datadog or Grafana.)
   
   Sub-tasks:
    - Create the below "basic building block" timeseries, as well as 
combinations of them. Use python preferably. (Generating a file with CSV data, 
for example with Excel, is kind of possible...)
    - Create Otava test cases with pytest that used the data sets.
    - Test also other alogirhtms and tools
    - Write a blog post with your findings
   
   
   
   0. Assumptions:
   
   The problem space is one where a performance test is run repeatedly, 
generating a time series of measurements. Ideally this would produce an 
infinitely long constant timeseries, but in practice the output of the 
performance test includes a random noise component. Strictly speaking we do not 
know the distribution of this noise, and therefore also not of the test 
results, but in practice assuming a normal distribution seems to work well. 
Finally, there will be performance regressions and improvements in the system 
under test. These are discrete steps up or down, after which the the timeseries 
again continues as a function of constant + noise.
   
   In other words, things we do NOT encounter in the domain are for example 
trending, where the timeseries keeps increasing or decreasing at a constant 
rate, or cyclic behavior, such as seasonality at specific time of day, week or 
year.
   
   1. The basic building blocks
   
   These timeseries can themselves already be used as test cases:
   
   Parameter L = [50, 500] is the length of all these timeseries. For 
simplicity, most of the below also have a negative counterpart, which is 
omitted. (e.g. decrease vs increase)
   
   Let's begin:
   
   Constant: S = x, x, x, x...
   
   Noise, normally distributed:  S = x1, x2, x3, ... where X = N(0, sigma)
    - for benchmarking purposes, we can define a maximum range where all values 
are within 99.99% percentile, or roughly 4 standard deviations. This makes it 
possible for an algorithm to correctly detect 100% of cases, as this noise 
component is not unbounded.
   
   Noise, uniform distributed, aka white noise or static noise: random(min, max)
   
   Outlier, aka anomaly, is a single deviating point: S = x,x,x,x,x,x',x,x... 
where x' != x
   
   
   Step function, aka single change point:  S = x1,x1,x1,x2,x2,x2,x2.....
   
   
   Regression + fix soon after: S = x1,x1...x2,...x2, x3, x3, x3...
    - Amount of x2 points is small compared to x1 and x3, but at least 2 points
    - Special case: x1 == x3
    - Special case: x2 is a single point. This can happen, but mathematically 
this is indistinguishable from a single outlier (see above).
   
   
   2. More interesting phenomena
   
   Banding, is a form of noise (unwanted change) where the results oscillate 
randomly between two values. 
   S = x1, x2, x,2, x1, x2, x1, x1, x1, x2, x2, x1, x2, x2...
   Typically:
        abs(x2 - x1) << x1
   and also:
        x1,x2 > std dev when random noise is mixed in
   
   
   
   Constant mean, change in variance: S =  N(0, mu1).... , N(0, mu2)...
   mu2 > mu1
   (mu1 > mu2 is the negative case)
   
   
   Constant mean and variance, but phase changes:  S = cos(x)..., sin(x)...
   
   Multiple consecutive changes: S= x0,x0,x0... x1, x2, ... xn, xn, xn....
   Where x0 <x1 < x2 ... < xn
   - It depends on the specific problem space what the right behavior is here. 
For perf testing it is possible that multiple independent improvements were 
merged back to back.
   
   3. Generate all possible combinations of the above, including with itself. 
(e.g. more than a single outlier, more than a singe phase change...)
   
   4. Verify that otava handles all of the above.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to