[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408256#comment-15408256 ]
Pat Ferrel commented on MAHOUT-1853:
------------------------------------

Is rootLLR normally distributed (the positive half)? If so, we'd have to calculate all rootLLR scores and fit the normal params to get the 10% or other adaptive threshold, right?

I understand that O(n^2) never occurs in practice. Even where O(k k_max n) is high, intuition says the threshold could be calculated once and applied for some time, since it will tend to stay the same for any specific type of indicator. Calculating it may be a once-in-a-great-while operation, and the threshold would usually be used in #2 above.

I'm somewhat ignorant of t-digest other than having read your anomaly detection book. I think it's in Mahout, but the docs are here: https://github.com/tdunning/t-digest. I assume that using t-digest would remove the need to do any separate distribution param fitting (as long as we use rootLLR), and that it could even be applied as online learning, producing an adaptive threshold to feed into #2 above? I imagine it can also be applied periodically on P’X in batch.

No need to respond if I'm on the right track.

> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
>                 Key: MAHOUT-1853
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1853
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Andrew Palumbo
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold
> calculation for LLR downsampling, and possible multiple fixed thresholds for
> A’A, A’B etc. This is to account for the vast difference in dimensionality
> between indicator types.
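
For illustration only, the adaptive-threshold idea in the comment can be sketched as an empirical quantile over the positive rootLLR scores, which needs no normal-parameter fitting. This is a minimal sketch and not Mahout's code: the function names and the `keep_fraction` parameter are hypothetical, the LLR formula is the usual entropy form for a 2x2 contingency table, and an exact sort stands in for t-digest (which would estimate the same quantile approximately in a streaming or online setting).

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table of co-occurrence counts.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    # Signed square root of LLR: negative when the pair co-occurs
    # less often than the rest of the table would suggest.
    score = math.sqrt(llr(k11, k12, k21, k22))
    if k11 + k12 > 0 and k21 + k22 > 0:
        if k11 / (k11 + k12) < k21 / (k21 + k22):
            score = -score
    return score


def adaptive_threshold(scores, keep_fraction=0.10):
    # Hypothetical helper: smallest positive score that is still within
    # the top `keep_fraction` of positive scores (exact, via sorting).
    positive = sorted(s for s in scores if s > 0)
    if not positive:
        return float("inf")
    n_keep = min(len(positive), max(1, round(len(positive) * keep_fraction)))
    return positive[-n_keep]
```

Under these assumptions, the threshold would be computed occasionally from a batch of scores and then reused as a fixed cutoff, keeping only pairs whose rootLLR is at or above it.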