T-digest is still in mahout-math. I believe it is still shipped to the backend 
in spark-dependency-reduced.jar.

-------- Original message --------
From: "Pat Ferrel (JIRA)" <j...@apache.org>
Date: 08/04/2016 2:19 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: [jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated 
Cross-Occurrence)


    [ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408256#comment-15408256
 ]

Pat Ferrel commented on MAHOUT-1853:
------------------------------------

Is rootLLR normally distributed (the positive half)? If so, we'd have to 
calculate all rootLLR scores and fit the normal params to get the 10% or other 
adaptive threshold, right?
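
To make the question concrete, here is a minimal sketch of that fitting step in Python. It assumes the positive rootLLR scores really do follow a half-normal distribution, which is exactly the open question above; the function names and the synthetic scores are hypothetical, not anything in Mahout. It fits the scale by maximum likelihood, derives the cut-off that keeps the top 10%, and compares it with the plain empirical percentile.

```python
import math
import random
from statistics import NormalDist


def half_normal_threshold(scores, keep_fraction=0.10):
    """Fit a half-normal to non-negative scores and return the cut-off
    that keeps roughly the top `keep_fraction` of them."""
    # Maximum-likelihood scale for a half-normal: sqrt(mean(x^2)).
    sigma = math.sqrt(sum(s * s for s in scores) / len(scores))
    # P(|Z| <= t) = 2 * Phi(t / sigma) - 1, so the p-quantile is
    # sigma * Phi^{-1}((1 + p) / 2).
    p = 1.0 - keep_fraction
    return sigma * NormalDist().inv_cdf((1.0 + p) / 2.0)


def empirical_threshold(scores, keep_fraction=0.10):
    """Same cut-off taken directly from the sorted scores (no model)."""
    ordered = sorted(scores)
    idx = min(len(ordered) - 1, int((1.0 - keep_fraction) * len(ordered)))
    return ordered[idx]


# Synthetic data standing in for rootLLR scores (assumption, not real output).
rng = random.Random(42)
scores = [abs(rng.gauss(0.0, 2.0)) for _ in range(20000)]

fitted = half_normal_threshold(scores)
empirical = empirical_threshold(scores)
```

If the two numbers diverge on real data, that would suggest the half-normal assumption does not hold and a distribution-free quantile estimate is the safer route.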

I understand that O(n^2) never occurs in practice. Even for cases where O(k 
k_max n) is high, intuition says that this threshold could be calculated 
once and applied for some time, since it will tend to stay the same for any 
specific type of indicator. Calculating it may be a once-in-a-great-while 
operation, and the threshold would usually be used in #2 above.

I'm somewhat ignorant of t-digest other than having read your anomaly detection 
book. I think it's in Mahout but the docs are here: 
https://github.com/tdunning/t-digest. I assume that using t-digest would remove 
the need to do any separate distribution param fitting (as long as we use 
rootLLR) and could even be applied as online learning producing an adaptive 
threshold to feed into #2 above? I imagine it can also be applied periodically 
on P’X in batch.
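
A rough sketch of the online idea, in Python. This does not use t-digest itself: it swaps in a plain reservoir sample as a stand-in, only to show the add-then-query-quantile pattern with no distribution fitting. The class name `ReservoirQuantile`, its parameters, and the synthetic scores are all hypothetical.

```python
import random


class ReservoirQuantile:
    """Stand-in for a quantile sketch: keeps a uniform random sample of the
    stream and answers quantile queries from that sample."""

    def __init__(self, capacity=2000, seed=0):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, x):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            # Algorithm R: the new value replaces a random slot with
            # probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    def quantile(self, q):
        if not self.sample:
            raise ValueError("no values added yet")
        ordered = sorted(self.sample)
        idx = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[idx]


# Synthetic scores standing in for a stream of rootLLR values (assumption).
rng = random.Random(7)
sketch = ReservoirQuantile()
for _ in range(50000):
    sketch.add(abs(rng.gauss(0.0, 1.0)))

threshold = sketch.quantile(0.90)  # cut-off that keeps roughly the top 10%
```

The threshold can be re-queried at any point as more scores arrive, which is the adaptive behaviour described above; a real t-digest would give tighter tail accuracy in bounded memory.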

No need to respond if I'm on the right track.



> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
>                 Key: MAHOUT-1853
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1853
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Andrew Palumbo
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
