[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408148#comment-15408148
 ] 

Ted Dunning commented on MAHOUT-1853:
-------------------------------------

First, I think that the signed root LLR function would be more appropriate, so 
that you can drop indicators that cooccur less often than expected.
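
A minimal sketch of what a signed root LLR looks like, assuming the usual 2x2 
contingency counts (k11 = both events together, k12 and k21 = one without the 
other, k22 = neither). The function names are illustrative and the entropy 
formulation is modeled on the one I believe Mahout's LogLikelihood class uses; 
treat it as a reading aid, not as the Mahout code itself.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 table; always >= 0.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    # Square root of LLR, negated when the cooccurrence rate in the first row
    # is lower than in the second row (i.e. less often than expected).
    score = math.sqrt(llr(k11, k12, k21, k22))
    # Cross-multiplied rate comparison avoids dividing by an empty row.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        score = -score
    return score


over = root_llr(10, 90, 1, 899)     # cooccurs more often than expected
under = root_llr(1, 99, 100, 800)   # cooccurs less often than expected
```

With a signed score, a single positive cutoff removes both the weak 
associations and the negative ones, which plain LLR cannot tell apart.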

Regarding the threshold, significance is monotonic in the LLR score, so 
thresholding on either is equivalent. The only question is which value to pick. 
Picking it from a significance level has no strong motivation because a vast 
number of repeated and correlated comparisons are in play.
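
To illustrate the monotonic relationship only: if one assumes the standard 
asymptotic approximation in which LLR for a 2x2 table follows a chi-squared 
distribution with one degree of freedom, the nominal p-value is a strictly 
decreasing function of the score. The function name below is hypothetical, and 
the p-values should not be read literally given the repeated comparisons 
mentioned above.

```python
import math


def nominal_p_value(llr_score):
    # Survival function of chi-squared with 1 degree of freedom,
    # which reduces to erfc(sqrt(x / 2)). Assumes llr_score >= 0.
    return math.erfc(math.sqrt(llr_score / 2.0))


scores = [0.0, 1.0, 3.841, 10.0, 25.0]
p_values = [nominal_p_value(s) for s in scores]

# Larger score always means smaller nominal p-value, so a cutoff on one
# selects exactly the same indicators as the matching cutoff on the other.
is_monotonic = all(a > b for a, b in zip(p_values, p_values[1:]))
```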

As such, I would simply use something like t-digest (available in Mahout as 
part of the OnlineSummarizer if that has survived, otherwise available as a 
simple dependency) to aggregate the scores you get in these cases and pick, 
say, the top 1-10%. The knob should be turned based on how sparse you want the 
indicators to be on average. If you have the distribution of all the scores 
available, then picking the cutoff is trivial.
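
A rough sketch of picking the cutoff from the score distribution. It sorts the 
scores exactly as a stand-in for t-digest, which would give an approximate 
quantile in bounded memory over a stream; the function name and the 
keep_fraction parameter are made up for illustration.

```python
import math


def score_cutoff(scores, keep_fraction):
    # Return the score such that roughly keep_fraction of all scores are
    # greater than or equal to it. Exact sort; a t-digest quantile query
    # would replace this for large or streaming inputs.
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ordered = sorted(scores, reverse=True)
    if not ordered:
        raise ValueError("no scores to summarize")
    # Small tolerance guards against floating-point round-up in the product.
    n_keep = max(1, math.ceil(keep_fraction * len(ordered) - 1e-9))
    return ordered[n_keep - 1]


# Toy scores 1.0 .. 100.0; keep the top 5%.
scores = [float(i) for i in range(1, 101)]
cutoff = score_cutoff(scores, 0.05)
kept = [s for s in scores if s >= cutoff]
```

Here keep_fraction is the knob: lowering it makes the indicators sparser on 
average, raising it makes them denser.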

Note that this isn't really n^2. Instead, it is k n = O(n), where k is the 
number of categories. This is different from the case of text or general 
viewing behaviors because the vocabulary there is unbounded and grows with n. 
This means that the computation of the indicators is only O(k n) for the 
counting and O(k^2) for the cooccurrence counting. If k_max is the interaction 
cut in some other behavior that has unbounded size, then the cost is 
O(k k_max n) for counting and scoring. Both are scalable because k is finite 
and the interaction cut imposes an artificial limit.




> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
>                 Key: MAHOUT-1853
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1853
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Andrew Palumbo
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
