This discussion has popped up in a few tickets, so I thought I'd start a
new email thread to approach it in a top-down manner.
On versioning:
- The 0.7 branch is an ASF packaging of Hunter/Otava as we knew it
until we joined the ASF incubator program. It depends on the original
signal_processing module and supports Python 3.8-3.10. It is
possible we will not make more releases from this branch, but if necessary,
we could release new 0.7.x versions.
- The current master branch is heading toward a series of 0.8.x and
possibly 0.9.x releases. It already contains upgrades to support all Python
versions up to 3.14, and the core algorithm is a fresh rewrite that
completely replaces the external dependency.
- As such, I think it would be valuable to get a 0.8.0 release out
with those improvements. Since we want to rotate the release manager role
(a requirement in the incubator program), this is largely up to myself and
Sean. For me it will be a couple of weeks until I have bandwidth to do
it, but it is in my backlog unless Sean beats me to it.
- In my thinking, the refactoring proposed in this email is the major
remaining piece of work, after which we would release 1.0.0 (and very
likely then start discussing graduating from the incubator program, though
that is out of scope for the current discussion).
- API and all kinds of other breakage is allowed and encouraged between
0.7.x and 1.0.0.
So, pulling together a few discussions from the past 3-6 months... It seems
we could refactor Otava into the following modular components:
() = interface / schema
* = new, doesn't currently exist
------------------------
| releases & packaging |
------------------------
( standard data structure ) ---
- tabular/csv and json | o |
- one can be primary format | t |
- python container classes | a |
- replace Series, AnalyzedSeries | v |
------------------- | a |
| integrations | | |
| - data storage | | m |
| - ingest/events* | | a |
| - notifications | | n |
-------------------- | a |
| g |-(* HTTP API)
-------------- | e |
| otava-cli | | r |
| - csv | | |
| - json* | | |
-------------- | |
| (otava-lib) | | |
| - edivisive | ---
| - multidim* |
| - incremental|
| - core params|
--------------
If you start reading from the bottom left, we have the core e-divisive
implementation, including the incremental mode (an optimization invented
purely within Otava). By my count this is pretty much there, except we
never seem to have implemented the multivariate version of e-divisive,
which should provide some marginal improvement in the algorithm's
accuracy.
The core library does not depend on any database or other external system.
Thus it has a limited set of configuration parameters:
- From the e-divisive implementation: p-value and alpha.
(The latter we essentially set to 1 in the previous implementation,
but it can have a float value 0 < alpha < 2. My understanding is that it
affects the "geometry" of the distance between points, analogous to how
the least-squares line fitting algorithm uses a power of two to emphasize
the weight of large distances/differences. So I'm guessing that a large
alpha would make the algorithm more sensitive to large change points, and a
small alpha would give more weight to small differences?)
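To make that guess a bit more concrete, here is a toy sketch of the
divergence statistic behind e-divisive as I understand it: the mean
pairwise distances, raised to the power alpha. This is purely
illustrative, not the Otava implementation, and the function name is made
up:

```python
from itertools import product

def alpha_divergence(xs, ys, alpha):
    """Toy version of the e-divisive divergence: mean pairwise
    |x - y|**alpha between and within two samples (0 < alpha < 2)."""
    def mean_dist(a, b):
        return sum(abs(x - y) ** alpha for x, y in product(a, b)) / (len(a) * len(b))
    # Large between-sample distances vs. small within-sample distances.
    return 2 * mean_dist(xs, ys) - mean_dist(xs, xs) - mean_dist(ys, ys)

before = [10.0, 10.1, 9.9, 10.0]
after = [12.0, 12.1, 11.9, 12.0]  # a ~20% shift

# Raising alpha amplifies the (large) between-sample distances more
# than the (small) within-sample noise, so the divergence grows:
for a in (0.5, 1.0, 1.5):
    print(a, alpha_divergence(before, after, a))
```

If that intuition is right, a larger alpha boosts the statistic for large
shifts while a smaller alpha keeps small differences relatively more
visible, which matches my guess above.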
- window_size, if we decide to keep the window approach
- minimum threshold, even if it's not actually a parameter to the
algorithm itself
- Note that it would also be nice to support an option to test against
different implementations and versions of e-divisive, as we currently do
with the --orig-edivisive option. But this becomes impractical: we cannot
copy the old signal_processing dependency into the project, and on the
other hand it will not work on modern Python versions. So in practice, if
someone wants to compare different versions of MongoDB e-divisive,
Datastax Hunter and ASF Otava, they would from now on have to actually
install multiple versions and run them separately.
The otava-cli is then a minimalist command line executable that exposes,
via ConfigArgParse, the options available in otava-lib. Note that this is
a subset of what is currently available in otava and a subset of the
functionality exposed in the otava manager component. The operations at
this level are just to feed data in, find change points, and get results
out. For example, the proposal in one ticket to allow using a different
p-value for different tests or metrics is not relevant at this level: you
would in any case just run otava-cli two separate times, and use whatever
p-value you want each time.
Also, in otava-cli, ConfigArgParse would be the entire space of
configuration options. Anything too complicated to express via
ConfigArgParse would by definition move into the larger modules.
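As a sketch of how small that flat configuration surface could be, here
are the parameters from the list above as CLI options. All option names
are hypothetical, and I'm using stdlib argparse as a stand-in for
ConfigArgParse (which offers the same add_argument API plus config-file
and environment-variable support):

```python
import argparse  # stand-in for ConfigArgParse in this sketch

def build_parser():
    # Hypothetical flags; the point is that this flat namespace would be
    # the *entire* configuration surface of otava-cli.
    p = argparse.ArgumentParser(prog="otava-cli")
    p.add_argument("--p-value", type=float, default=0.001,
                   help="significance threshold for accepting a change point")
    p.add_argument("--alpha", type=float, default=1.0,
                   help="distance exponent, 0 < alpha < 2")
    p.add_argument("--window-size", type=int, default=None,
                   help="optional sliding-window length")
    p.add_argument("--min-magnitude", type=float, default=0.0,
                   help="ignore change points smaller than this threshold")
    p.add_argument("input", help="CSV or JSON file with test results")
    return p

# Running with a different p-value is just a different invocation:
args = build_parser().parse_args(["--p-value", "0.01", "results.csv"])
print(args.p_value, args.input)
```

Which also illustrates the per-test p-value point: two tests with two
p-values is simply two invocations with different arguments.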
The integrations module, which could actually be three separate modules,
is responsible for connectivity to data stores, but also for
notifications. I believe we currently have the code for Slack
notifications in Otava? In Nyrkiƶ we also implemented GitHub
notifications, which is quite cool. There have also been email
notifications, but I don't think that code is in Otava at all? (In
practice, a GitHub or hypothetical Jira integration will indirectly cause
emails to arrive in your inbox...)
The main work item here is to design a plugin mechanism so that Otava
doesn't need to depend on every database and issue tracker in the world.
Second, I've observed that the level of code re-use in the current data
store importers is poor, so just a basic refactoring into a better
object-oriented hierarchy should make a big difference here.
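To illustrate what I mean by a better object-oriented hierarchy: the
shared flow lives in a base class once, and each concrete importer only
implements the store-specific part. All class and method names below are
made up for illustration, not a concrete proposal:

```python
from abc import ABC, abstractmethod

class DataStoreImporter(ABC):
    """Hypothetical base class for data store importers.

    Common post-processing is written once here; concrete importers
    (Postgres, BigQuery, CSV, ...) only implement fetch_rows()."""

    @abstractmethod
    def fetch_rows(self, test_name):
        """Return a list of dicts for one test, store-specific."""

    def load_series(self, test_name, metric):
        # Shared logic: extract one metric as a float series.
        rows = self.fetch_rows(test_name)
        return [float(r[metric]) for r in rows if metric in r]

class CsvImporter(DataStoreImporter):
    """Trivial in-memory example standing in for a real CSV reader."""
    def __init__(self, rows):
        self.rows = rows

    def fetch_rows(self, test_name):
        return [r for r in self.rows if r.get("test") == test_name]

imp = CsvImporter([{"test": "tpcc", "latency": "5.0"},
                   {"test": "other", "latency": "9.0"}])
print(imp.load_series("tpcc", "latency"))  # [5.0]
```

A plugin mechanism could then be as simple as third-party packages
registering their own DataStoreImporter subclasses, so Otava itself never
has to depend on every database driver.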
Final point: currently the data store is typically the entry point for
test results coming into Otava for analysis (= finding change points).
There could be a need for some kind of event or streaming API where new
results arrive. This is especially meaningful if you want to use
incremental e-divisive: you need to know which is the old data and which
are the new point(s).
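A strawman of what such an event-style entry point could look like (names
are hypothetical, and the detection logic here is a trivial placeholder,
not e-divisive): history is held by the analyzer, and each new result
arrives as an event, which is exactly the old-data/new-point split the
incremental mode needs.

```python
from dataclasses import dataclass, field

@dataclass
class IncrementalAnalyzer:
    """Hypothetical ingest API: the old data is `history`, each call to
    on_new_result() delivers one new point."""
    history: list = field(default_factory=list)

    def on_new_result(self, value):
        # Placeholder check: flag a change if the new point deviates from
        # the historical mean by more than 20%. Stands in for the
        # incremental e-divisive step, which would go here.
        if self.history:
            mean = sum(self.history) / len(self.history)
            changed = abs(value - mean) > 0.2 * abs(mean)
        else:
            changed = False
        self.history.append(value)
        return changed
```

The same interface would slot naturally under an HTTP API or a streaming
consumer later, since both just deliver one new point at a time.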
Finally, on top of all of this would be a more comprehensive user
interface, which binds all of these together. This is where you manage
passwords to the database and issue tracker, where you can define which
test names and metrics have which parameters, where their data is stored
and what SQL is used to fetch it.
Also, this layer could grow into some kind of service/daemon if we feel
that is a useful direction. Currently it is just a CLI, so in a way this
is just an extension of the otava-cli component, but I feel there's still
a clear separation between the core otava-cli, focused on the math, and
the larger manager component, focused on the messy realities of the real
world.
One more thing: I think we should define and document the data structure
and field names we use for this data. For example, each data store could
then use a compliant schema and column names. But this would also benefit
downstream users, like Nyrkiƶ, who could then directly use and extend a
well-known data structure. I will return to this one in a separate message
another day,
henrik
--
*nyrkio.com <http://nyrkio.com/>* ~ *git blame for performance*
Henrik Ingo, CEO
[email protected] LinkedIn:
www.linkedin.com/in/heingo
+358 40 569 7354 Twitter: twitter.com/h_ingo