Sowiks opened a new pull request, #96: URL: https://github.com/apache/otava/pull/96
This is a PR request to replace mongodb `signal_processing_algorithms` package with internal implementation as discussed in https://lists.apache.org/thread/4vwp79kmsjd3zbf4fjcgkggf33jot65c . I tried to do minimal changes to existing code (`analysis.py` and `series.py`). Better integration is possible, but it can wait till next PR. There is, however, an issue that I identified during my implementation of the methodology from _**A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data [Matteson and James]**_(https://arxiv.org/abs/1306.4933). Long story short, it doesn't seem that `signal_processing_algorithms==1.3.5` has correct implementation. Namely, let's look at section **2.2 Estimating the Location of a Change Point** from the paper, more specifically formula (7) and the following discussion in the last paragraph of that section. I add them here for your convenience: > ... Let $Z_1, \cdots Z_t \in ℝ^d$ be an independent sequence of observations and let $1 \leq \tau < \kappa \leq T$ be constants. Now define the following sets $X_\tau = \\{ Z_1, Z_2, \cdots , Z_\tau \\}$ and $Y_\tau(\kappa) = \\{ Z_{\tau + 1}, Z_{\tau + 2} , \cdots , Z_\kappa \\}$. A change point location $\hat{\tau}$ is then estimated as $$(\hat{\tau}, \hat{\kappa}) = \text{arg}\max\limits_{(\tau, \kappa)} \hat{Q} (X_\tau, Y_\tau(\kappa); \alpha).$$ ... If it is known that at most one change point exists, we $\kappa = T$. Otherwise, the variable $\kappa$ is introduced to alleviate a weakness of bisection, as mentioned in Venkatraman (1992), in which it may be more difficult to detect certain types of distributional changes in the multiple change point setting using only bisection. For example, if we fix $\kappa = T$ and the set $Y_\tau(T)$ contains observations across multiple change points (e.g., distinct distributions), then it is possible that the resulting mixture distribution in $Y_\tau(T)$ is indistinguishable from the distribution of the observations in $X_\tau$, even when $\tau$ corresponds to a valid change point. We avoid this confounding by allowing $\kappa$ to vary, with minimal computational cost by storing the distances mentioned above. This modification to bisection is similar to that taken in Olshen and Venkatraman (2004). The main idea of that section is to allow $\kappa$ to vary, not to be simply set to the end of the series $\kappa=T$. However, when I implemented the methodology as in the paper (allowing $\tau < \kappa \leq T$ to vary) `tigerbeetle` tests failed. With little experimentation I found that erroneous implementation (with fixed $\kappa=T$) resolves the issues with `tigerbeetle` tests, which is unlikely to be a coincidence. I think this is because `signal_processing_algorithms` package has a mistake/typo in it where $\kappa$ is fixed at $T$ (at least version 1.3.5). Moreover, this would also explain arguments from **_Hunter: Using Change Point Detection to Hunt for Performance Regressions [Fleming et al.]_** that caught my eye. In the section **3.3 Fixed-Sized Windows** the authors say: > As we began using Hunter on larger and larger data series, we discovered that change points identified in previous runs would suddenly disappear from Hunter’s results. This issue turned out to be caused by performance regressions that were fixed shortly after being introduced. This is a known issue with E-divisive means and is discussed in [5]. Because E-divisive means divides the time series into two parts, most of the data points on either side of the split showed similar values. The algorithm, therefore, by design, would treat the two nearby changes as a temporary anomaly, rather than a persistent change, and therefore filter it out. The issues they discussed in that section seems to be related to the same idea: if you don't allow $\kappa$ to vary, the algorithm might miss some change points if they are within the interval. I wonder if they also used fixed $\kappa$ instead, which lead to the described issues. Nevertheless, going back to this PR. It contains two commits: 1. First commit matches the output of `signal_processing_algorithms` in all tests. If there is an error in my logic somewhere, this is the commit to replace the `signal_processing_algorithms` package. 2. Second commit also corrects fixed $\kappa$ issues. This results in different results in `tigerbeetle` tests, which was corrected here. Finally, I included some visualization of `tigerbeetle` tests for commit-1 vs commit-2 for you to see if they make sense. ```python import matplotlib.pyplot as plt import numpy as np ``` ```python series = [26705, 26475, 26641, 26806, 26835, 26911, 26564, 26812, 26874, 26682, 15672, 26745, 26460, 26977, 2685 23547, 23674, 23519, 23670, 23662, 23462, 23750, 23717, 23524, 23588, 23687, 23793, 23937, 23715, 23570, 23730, 23690, 23699, 23670, 23860, 23988, 23652, 23681, 23798, 23728, 23604, 23523, 23412, 23685, 23773, 23771, 23718, 23409, 23739, 23674, 23597, 23682, 23680, 23711, 23660, 23990, 23938, 23742, 23703, 23536, 24363, 24414, 24483, 24509, 24944, 24235, 24560, 24236, 24667, 24730, 28346, 28437, 28436, 28057, 28217, 28456, 28427, 28398, 28250, 28331, 28222, 28726, 28578, 28345, 28274, 28514, 28590, 28449, 28305, 28411, 28788, 28404, 28821, 28580, 27483, 26805, 27487, 27124, 26898, 27295, 26951, 27312, 27660, 27154, 27050, 26989, 27193, 27503, 27326, 27375, 27513, 27057, 27421, 27574, 27609, 27123, 27824, 27644, 27394, 27836, 27949, 27702, 27457, 27272, 28207, 27802, 27516, 27586, 28005, 27768, 28543, 28237, 27915, 28437, 28342, 27733, 28296, 28524, 28687, 28258, 28611, 29360, 28590, 29641, 28965, 29474, 29256, 28611, 28205, 28539, 27962, 28398, 28509, 28240, 28592, 28102, 28461, 28578, 28669, 28507, 28535, 28226, 28536, 28561, 28087, 27953, 28398, 28007, 28518, 28337, 28242, 28607, 28545, 28514, 28377, 28010, 28412, 28633, 28576, 28195, 28637, 28724, 28466, 28287, 28719, 28425, 28860, 28842, 28604, 28327, 28216, 28946, 28918, 29287, 28725, 29148, 29541, 29137, 29628, 29087, 28612, 29154, 29108, 28884, 29234, 28695, 28969, 28809, 28695, 28634, 28916, 29852, 29389, 29757, 29531, 29363, 29251, 29552, 29561, 29046, 29795, 29022, 29395, 28921, 29739, 29257, 29455, 29376, 29528, 28909, 29492, 28984, 29621, 29026, 29457, 29102, 29114, 28924, 29162, 29259, 29554, 29616, 29211, 29367, 29460, 28836, 29645, 29586, 28848, 29324, 28969, 29150, 29243, 29081, 29312, 28923, 29272, 29117, 29072, 29529, 29737, 29652, 29612, 29856, 29012, 30402, 29969, 29309, 29439, 29285, 29421, 29023, 28772, 29692, 29416, 29267, 29542, 29904, 30045, 29739, 29945, 29141, 29163, 29765, 29197, 29441, 28910, 29504, 29614, 29643, 29506, 29420, 29672, 29432, 29784, 29888, 29309, 29247, 29816, 292 54, 29813, 29451, 29382, 29618, 28558, 29845, 29499, 29283, 29184, 29246, 28790, 29952, 29145, 29415, 30437, 29227, 29605, 29859, 29156, 29807, 29406, 29734, 29861, 29140, 29983, 29832, 29919, 29896, 29991, 29266, 29001, 29459, 29548, 29310, 29042, 29303, 29894, 29091, 29018, 29537, 29614, 29180, 29736, 29500, 29218, 29581, 28906, 28542, 29306, 28987, 29878, 28865, 30272, 29707, 29662, 29815, 30492, 29347, 30096, 29054, 30238, 28813, 31895, 28915] ``` ```python def plot(old, new): plt.style.use('ggplot') plt.plot(series) plt.plot(old, np.take(series, old), 'o') plt.plot(new, np.take(series, new), 'kx') plt.legend(['Data', 'Old', 'New']) plt.show() ``` ```python # window_len=30, max_pvalue=0.01, min_magnitude=0.05 plot(old=[27, 71], new=[15, 71]) ``` <img width="570" height="413" alt="output_3_0" src="https://github.com/user-attachments/assets/123b6169-a15a-444c-852c-cdfc1b1f6315" /> ```python # window_len=30, max_pvalue=0.05, min_magnitude=0.05 plot(old=[16, 71], new=[15, 71]) ``` <img width="570" height="413" alt="output_4_0" src="https://github.com/user-attachments/assets/eae611c1-edc4-4a95-8172-36f20ac4702b" /> ```python # window_len=30, max_pvalue=0.1, min_magnitude=0.05 plot(old=[16, 71], new=[10, 11, 15, 71, 363]) ``` <img width="570" height="413" alt="output_5_0" src="https://github.com/user-attachments/assets/90592766-a0e6-41b4-aeb0-1f0c7697f6e0" /> ```python # window_len=30, max_pvalue=0.2, min_magnitude=0.05 plot(old=[16, 71], new=[10, 11, 15, 71]) ``` <img width="570" height="413" alt="output_6_0" src="https://github.com/user-attachments/assets/7e6f15ac-d65c-4779-97a4-f67ddca8ad85" /> ```python # window_len=30, max_pvalue=0.2, min_magnitude=0.0 plot( old=[16, 27, 29, 56, 58, 60, 61, 69, 71, 82, 83, 91, 95, 108, 114, 116, 117, 131, 138, 142, 148, 165, 167, 178, 187, 189, 190, 192, 206, 212, 213, 220, 241, 243, 244, 246, 247, 249, 260, 266, 268, 272, 274, 275, 278, 282, 284, 288, 295, 297, 311, 314, 325, 330, 347, 351], new=[3, 6, 7, 10, 11, 13, 15, 16, 28, 29, 35, 37, 39, 41, 44, 48, 49, 56, 58, 61, 65, 66, 69, 71, 74, 76, 82, 95, 108, 117, 125, 126, 129, 131, 136, 137, 142, 148, 165, 169, 187, 190, 192, 197, 200, 212, 220, 241, 243, 246, 247, 249, 250, 260, 265, 266, 268, 278, 282, 288, 305, 306, 325, 330, 337, 338, 340, 347, 349, 363] ) ``` <img width="570" height="413" alt="output_7_0" src="https://github.com/user-attachments/assets/c6990c40-07df-430f-8b82-308188b69217" /> ```python # window_len=30, max_pvalue=0.1, min_magnitude=0.0 plot( old=[16, 27, 29, 56, 58, 61, 71, 82, 95, 113, 116, 117, 131, 138, 142, 148, 157, 165, 167, 178, 187, 189, 192, 206, 212, 213, 220, 246, 247, 249, 260, 266, 268, 272, 278, 282, 311, 312, 325, 330, 347, 351], new=[3, 6, 10, 11, 15, 16, 28, 29, 35, 37, 39, 41, 44, 48, 49, 61, 71, 95, 117, 131, 142, 148, 165, 169, 192, 206, 212, 260, 265, 268, 278, 282, 288, 305, 363] ) ``` <img width="570" height="413" alt="output_8_0" src="https://github.com/user-attachments/assets/a0e6421a-21b9-46e7-9088-19e064e23bdb" /> ```python # window_len=30, max_pvalue=0.01, min_magnitude=0.0 plot( old=[27, 61, 71, 82, 95, 131, 142, 148, 192, 212, 249, 260, 265, 353], new=[15, 26, 61, 71, 95, 117, 131, 142, 148, 165, 169, 192, 212, 260] ) ``` <img width="570" height="413" alt="output_9_0" src="https://github.com/user-attachments/assets/757adce2-5475-433e-a696-2004f2453a2f" /> ```python # window_len=30, max_pvalue=0.001, min_magnitude=0.0 plot( old=[71, 95, 113, 131, 142, 148, 192, 212, 260], new=[15, 61, 71, 95, 117, 131, 142, 148, 192, 212, 260] ) ``` <img width="570" height="413" alt="output_10_0" src="https://github.com/user-attachments/assets/c949b76f-bda8-445e-ae6b-bc5428c32bcc" /> ```python # window_len=30, max_pvalue=0.0001, min_magnitude=0.0 plot( old=[71, 95, 113, 131, 192, 212], new=[71, 95, 117, 131, 142, 148, 192, 212] ) ``` <img width="570" height="413" alt="output_11_0" src="https://github.com/user-attachments/assets/6536bd3e-89ff-4eec-b330-eeda19bebd4c" /> ```python # window_len=30, max_pvalue=0.00001, min_magnitude=0.0 plot(old=[71, 95, 131, 192, 212], new=[71, 95, 131, 192, 212]) ``` <img width="570" height="413" alt="output_12_0" src="https://github.com/user-attachments/assets/023d6735-1c4d-4bcf-888e-781a107ca650" /> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
