Sowiks opened a new pull request, #96:
URL: https://github.com/apache/otava/pull/96

   This is a PR request to replace mongodb `signal_processing_algorithms` 
package with internal implementation as discussed in 
https://lists.apache.org/thread/4vwp79kmsjd3zbf4fjcgkggf33jot65c . I tried to 
do minimal changes to existing code (`analysis.py` and `series.py`). Better 
integration is possible, but it can wait till next PR.
   
   There is, however, an issue that I identified during my implementation of 
the methodology from _**A Nonparametric Approach for Multiple Change Point 
Analysis of Multivariate Data [Matteson and 
James]**_(https://arxiv.org/abs/1306.4933). Long story short, it doesn't seem 
that `signal_processing_algorithms==1.3.5` has correct implementation. Namely, 
let's look at section **2.2 Estimating the Location of a Change Point** from 
the paper, more specifically formula (7) and the following discussion in the 
last paragraph of that section. I add them here for your convenience:
   
   > ... Let $Z_1, \cdots Z_t \in ℝ^d$ be an independent sequence of 
observations and let $1 \leq \tau < \kappa \leq T$ be constants. Now define the 
following sets $X_\tau = \\{ Z_1, Z_2, \cdots , Z_\tau \\}$ and $Y_\tau(\kappa) 
= \\{ Z_{\tau + 1}, Z_{\tau + 2} , \cdots , Z_\kappa \\}$. A change point 
location $\hat{\tau}$ is then estimated as
   $$(\hat{\tau}, \hat{\kappa}) = \text{arg}\max\limits_{(\tau, \kappa)} 
\hat{Q} (X_\tau, Y_\tau(\kappa); \alpha).$$
   ... If it is known that at most one change point exists, we $\kappa = T$. 
Otherwise, the variable $\kappa$ is introduced to alleviate a weakness of 
bisection, as mentioned in Venkatraman (1992), in which it may be more 
difficult to detect certain types of distributional changes in the multiple 
change point setting using only bisection. For example, if we fix $\kappa = T$ 
and the set $Y_\tau(T)$ contains observations across multiple change points 
(e.g., distinct distributions), then it is possible that the resulting mixture 
distribution in $Y_\tau(T)$ is indistinguishable from the distribution of the 
observations in $X_\tau$, even when $\tau$ corresponds to a valid change point. 
We avoid this confounding by allowing $\kappa$ to vary, with minimal 
computational cost by storing the distances mentioned above. This modification 
to bisection is similar to that taken in Olshen and Venkatraman (2004).
   
   The main idea of that section is to allow $\kappa$ to vary, not to be simply 
set to the end of the series $\kappa=T$. However, when I implemented the 
methodology as in the paper (allowing $\tau < \kappa \leq T$ to vary) 
`tigerbeetle` tests failed. With little experimentation I found that erroneous 
implementation (with fixed $\kappa=T$) resolves the issues with `tigerbeetle` 
tests, which is unlikely to be a coincidence. I think this is because 
`signal_processing_algorithms` package has a mistake/typo in it where $\kappa$ 
is fixed at $T$ (at least version 1.3.5). Moreover, this would also explain 
arguments from **_Hunter: Using Change Point Detection to Hunt for Performance 
Regressions [Fleming et al.]_** that caught my eye. In the section **3.3 
Fixed-Sized Windows** the authors say:
   
   > As we began using Hunter on larger and larger data series, we discovered 
that change points identified in previous runs would suddenly disappear from 
Hunter’s results. This issue turned out to be caused by performance regressions 
that were fixed shortly after being introduced. This is a known issue with 
E-divisive means and is discussed in [5]. Because E-divisive means divides the 
time series into two parts, most of the data points on either side of the split 
showed similar values. The algorithm, therefore, by design, would treat the two 
nearby changes as a temporary anomaly, rather than a persistent change, and 
therefore filter it out.
   
   The issues they discussed in that section seems to be related to the same 
idea: if you don't allow $\kappa$ to vary, the algorithm might miss some change 
points if they are within the interval. I wonder if they also used fixed 
$\kappa$ instead, which lead to the described issues.
   
   Nevertheless, going back to this PR. It contains two commits:
   
   1. First commit matches the output of `signal_processing_algorithms` in all 
tests. If there is an error in my logic somewhere, this is the commit to 
replace the  `signal_processing_algorithms` package.
   2. Second commit also corrects fixed $\kappa$ issues. This results in 
different results in `tigerbeetle` tests, which was corrected here.
   
   Finally, I included some visualization of `tigerbeetle` tests for commit-1 
vs commit-2 for you to see if they make sense.
   
   ```python
   import matplotlib.pyplot as plt
   import numpy as np
   ```
   
   
   ```python
   series = [26705, 26475, 26641, 26806, 26835, 26911, 26564, 26812, 26874, 
26682, 15672, 26745, 26460, 26977, 2685
    23547, 23674, 23519, 23670, 23662, 23462, 23750, 23717, 23524, 23588, 
23687, 23793, 23937, 23715, 23570, 23730, 23690, 23699, 23670, 23860, 23988, 
23652, 23681, 23798, 23728, 23604, 23523, 23412, 23685, 23773, 23771, 23718, 
23409, 23739, 23674, 23597, 23682, 23680, 23711, 23660, 23990, 23938, 23742, 
23703, 23536, 24363, 24414, 24483, 24509, 24944, 24235, 24560, 24236, 24667, 
24730, 28346, 28437, 28436, 28057, 28217, 28456, 28427, 28398, 28250, 28331, 
28222, 28726, 28578, 28345, 28274, 28514, 28590, 28449, 28305, 28411, 28788, 
28404, 28821, 28580, 27483, 26805, 27487, 27124, 26898, 27295, 26951, 27312, 
27660, 27154, 27050, 26989, 27193, 27503, 27326, 27375, 27513, 27057, 27421, 
27574, 27609, 27123, 27824, 27644, 27394, 27836, 27949, 27702, 27457, 27272, 
28207, 27802, 27516, 27586, 28005, 27768, 28543, 28237, 27915, 28437, 28342, 
27733, 28296, 28524, 28687, 28258, 28611, 29360, 28590, 29641, 28965, 29474, 
29256, 28611, 28205, 28539, 27962, 28398, 28509, 28240, 28592, 28102, 28461, 
 28578, 28669, 28507, 28535, 28226, 28536, 28561, 28087, 27953, 28398, 28007, 
28518, 28337, 28242, 28607, 28545, 28514, 28377, 28010, 28412, 28633, 28576, 
28195, 28637, 28724, 28466, 28287, 28719, 28425, 28860, 28842, 28604, 28327, 
28216, 28946, 28918, 29287, 28725, 29148, 29541, 29137, 29628, 29087, 28612, 
29154, 29108, 28884, 29234, 28695, 28969, 28809, 28695, 28634, 28916, 29852, 
29389, 29757, 29531, 29363, 29251, 29552, 29561, 29046, 29795, 29022, 29395, 
28921, 29739, 29257, 29455, 29376, 29528, 28909, 29492, 28984, 29621, 29026, 
29457, 29102, 29114, 28924, 29162, 29259, 29554, 29616, 29211, 29367, 29460, 
28836, 29645, 29586, 28848, 29324, 28969, 29150, 29243, 29081, 29312, 28923, 
29272, 29117, 29072, 29529, 29737, 29652, 29612, 29856, 29012, 30402, 29969, 
29309, 29439, 29285, 29421, 29023, 28772, 29692, 29416, 29267, 29542, 29904, 
30045, 29739, 29945, 29141, 29163, 29765, 29197, 29441, 28910, 29504, 29614, 
29643, 29506, 29420, 29672, 29432, 29784, 29888, 29309, 29247, 29816, 292
 54, 29813, 29451, 29382, 29618, 28558, 29845, 29499, 29283, 29184, 29246, 
28790, 29952, 29145, 29415, 30437, 29227, 29605, 29859, 29156, 29807, 29406, 
29734, 29861, 29140, 29983, 29832, 29919, 29896, 29991, 29266, 29001, 29459, 
29548, 29310, 29042, 29303, 29894, 29091, 29018, 29537, 29614, 29180, 29736, 
29500, 29218, 29581, 28906, 28542, 29306, 28987, 29878, 28865, 30272, 29707, 
29662, 29815, 30492, 29347, 30096, 29054, 30238, 28813, 31895, 28915]
   ```
   
   
   ```python
   def plot(old, new):
       plt.style.use('ggplot')
       plt.plot(series)
       plt.plot(old, np.take(series, old), 'o')
       plt.plot(new, np.take(series, new), 'kx')
       plt.legend(['Data', 'Old', 'New'])
       plt.show()
   
   ```
   
   
   ```python
   # window_len=30, max_pvalue=0.01, min_magnitude=0.05
   plot(old=[27, 71], new=[15, 71])
   ```
   <img width="570" height="413" alt="output_3_0" 
src="https://github.com/user-attachments/assets/123b6169-a15a-444c-852c-cdfc1b1f6315";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.05, min_magnitude=0.05
   plot(old=[16, 71], new=[15, 71])
   ```
   <img width="570" height="413" alt="output_4_0" 
src="https://github.com/user-attachments/assets/eae611c1-edc4-4a95-8172-36f20ac4702b";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.1, min_magnitude=0.05
   plot(old=[16, 71], new=[10, 11, 15, 71, 363])
   ```
   <img width="570" height="413" alt="output_5_0" 
src="https://github.com/user-attachments/assets/90592766-a0e6-41b4-aeb0-1f0c7697f6e0";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.2, min_magnitude=0.05
   plot(old=[16, 71], new=[10, 11, 15, 71])
   ```
   <img width="570" height="413" alt="output_6_0" 
src="https://github.com/user-attachments/assets/7e6f15ac-d65c-4779-97a4-f67ddca8ad85";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.2, min_magnitude=0.0
   plot(
       old=[16, 27, 29, 56, 58, 60, 61, 69, 71, 82, 83, 91, 95, 108, 114, 116, 
117, 131, 138, 142, 148, 165, 167, 178, 187, 189, 190, 192, 206, 212, 213, 220, 
241, 243, 244, 246, 247, 249, 260, 266, 268, 272, 274, 275, 278, 282, 284, 288, 
295, 297, 311, 314, 325, 330, 347, 351],
       new=[3, 6, 7, 10, 11, 13, 15, 16, 28, 29, 35, 37, 39, 41, 44, 48, 49, 
56, 58, 61, 65, 66, 69, 71, 74, 76, 82, 95, 108, 117, 125, 126, 129, 131, 136, 
137, 142, 148, 165, 169, 187, 190, 192, 197, 200, 212, 220, 241, 243, 246, 247, 
249, 250, 260, 265, 266, 268, 278, 282, 288, 305, 306, 325, 330, 337, 338, 340, 
347, 349, 363]
   )
   ```
   <img width="570" height="413" alt="output_7_0" 
src="https://github.com/user-attachments/assets/c6990c40-07df-430f-8b82-308188b69217";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.1, min_magnitude=0.0
   plot(
       old=[16, 27, 29, 56, 58, 61, 71, 82, 95, 113, 116, 117, 131, 138, 142, 
148, 157, 165, 167, 178, 187, 189, 192, 206, 212, 213, 220, 246, 247, 249, 260, 
266, 268, 272, 278, 282, 311, 312, 325, 330, 347, 351],
       new=[3, 6, 10, 11, 15, 16, 28, 29, 35, 37, 39, 41, 44, 48, 49, 61, 71, 
95, 117, 131, 142, 148, 165, 169, 192, 206, 212, 260, 265, 268, 278, 282, 288, 
305, 363]
   )
   ```
   <img width="570" height="413" alt="output_8_0" 
src="https://github.com/user-attachments/assets/a0e6421a-21b9-46e7-9088-19e064e23bdb";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.01, min_magnitude=0.0
   plot(
       old=[27, 61, 71, 82, 95, 131, 142, 148, 192, 212, 249, 260, 265, 353],
       new=[15, 26, 61, 71, 95, 117, 131, 142, 148, 165, 169, 192, 212, 260]
   )
   ```
   <img width="570" height="413" alt="output_9_0" 
src="https://github.com/user-attachments/assets/757adce2-5475-433e-a696-2004f2453a2f";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.001, min_magnitude=0.0
   plot(
       old=[71, 95, 113, 131, 142, 148, 192, 212, 260],
       new=[15, 61, 71, 95, 117, 131, 142, 148, 192, 212, 260]
   )
   ```
   <img width="570" height="413" alt="output_10_0" 
src="https://github.com/user-attachments/assets/c949b76f-bda8-445e-ae6b-bc5428c32bcc";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.0001, min_magnitude=0.0
   plot(
       old=[71, 95, 113, 131, 192, 212],
       new=[71, 95, 117, 131, 142, 148, 192, 212]
   )
   ```
   <img width="570" height="413" alt="output_11_0" 
src="https://github.com/user-attachments/assets/6536bd3e-89ff-4eec-b330-eeda19bebd4c";
 />
   
   
   ```python
   # window_len=30, max_pvalue=0.00001, min_magnitude=0.0
   plot(old=[71, 95, 131, 192, 212], new=[71, 95, 131, 192, 212])
   ```
   <img width="570" height="413" alt="output_12_0" 
src="https://github.com/user-attachments/assets/023d6735-1c4d-4bcf-888e-781a107ca650";
 />


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to