[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408256#comment-15408256 ]
Pat Ferrel commented on MAHOUT-1853:
------------------------------------
Is rootLLR normally distributed (the positive half)? If so, we'd have to
calculate all rootLLR scores and fit the normal parameters to get the 10% or
other adaptive threshold, right?
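
To make the question concrete, here is a minimal Python sketch of that fit. It
assumes the non-negative rootLLR scores really are approximately half-normal,
which is exactly the open question above. The function name
half_normal_threshold and the keep_fraction parameter are hypothetical and
are not part of Mahout.

```python
import math
from statistics import NormalDist


def half_normal_threshold(scores, keep_fraction=0.10):
    """Fit a half-normal to the non-negative scores and return the cutoff
    above which roughly keep_fraction of those scores fall.

    Assumption: the non-negative scores are approximately half-normal.
    """
    positives = [s for s in scores if s >= 0.0]
    if not positives:
        raise ValueError("no non-negative scores to fit")
    # Maximum-likelihood estimate of the half-normal scale parameter:
    # the square root of the mean of the squared scores.
    sigma = math.sqrt(sum(s * s for s in positives) / len(positives))
    # For X = sigma * |Z|, P(X > t) = 2 * (1 - Phi(t / sigma)),
    # so t = sigma * Phi^-1(1 - keep_fraction / 2).
    return sigma * NormalDist().inv_cdf(1.0 - keep_fraction / 2.0)
```

Note that this needs every score in hand before the fit, which is the cost the
question is asking about.
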
I understand that O(n^2) never occurs in practice. Even for cases where O(k
k_max n) is high, intuition says that this threshold could be calculated
once and applied for some time, since it will tend to stay the same for any
specific type of indicator. Calculating it may be a once-in-a-great-while
operation, and the threshold would usually be used in #2 above.
I'm somewhat ignorant of t-digest other than having read your anomaly detection
book. I think it's in Mahout, but the docs are here:
https://github.com/tdunning/t-digest. I assume that using t-digest would remove
the need to do any separate distribution parameter fitting (as long as we use
rootLLR) and could even be applied as online learning, producing an adaptive
threshold to feed into #2 above? I imagine it can also be applied periodically
on P'X in batch.
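
As a rough illustration of the online idea, the Python sketch below keeps an
adaptive threshold from a stream of scores without fitting any distribution.
It is NOT t-digest: it swaps in a plain fixed-size reservoir sample
(Algorithm R) as a stand-in for a streaming quantile sketch, and the class
name ReservoirThreshold and its parameters are hypothetical.

```python
import random


class ReservoirThreshold:
    """Keep a fixed-size uniform sample of a score stream and report a
    quantile of that sample as the current threshold.

    Stand-in for a streaming quantile sketch; this is not t-digest.
    """

    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, score):
        """Offer one score from the stream to the reservoir."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(score)
        else:
            # Algorithm R: the new score replaces a random slot with
            # probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = score

    def threshold(self, keep_fraction=0.10):
        """Return the sampled score at the (1 - keep_fraction) quantile."""
        if not self.sample:
            raise ValueError("no scores seen yet")
        ordered = sorted(self.sample)
        idx = min(len(ordered) - 1, int((1.0 - keep_fraction) * len(ordered)))
        return ordered[idx]
```

A caller would feed each score to add() as it is produced and read threshold()
whenever the cutoff is needed, either continuously or in a periodic batch
pass.
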
No need to respond if I'm on the right track.
> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.12.0
> Reporter: Andrew Palumbo
> Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold
> calculation for LLR downsampling, and possible multiple fixed thresholds for
> A’A, A’B etc. This is to account for the vast difference in dimensionality
> between indicator types.