[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302371#comment-15302371
 ] 

Pat Ferrel edited comment on MAHOUT-1853 at 5/26/16 5:03 PM:
-------------------------------------------------------------

Steps:

1) Allow an array of absolute LLR value thresholds, one for each matrix pair.
2) Allow thresholds to be given as a confidence of correlation (actually the 
confidence with which non-correlation is rejected) or as the fraction of total 
cross-occurrences retained after downsampling. To reduce how often this must be 
done, the absolute value thresholds should be output after calculation for 
later re-use in #1.
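
A rough sketch of how a confidence in #2 could be turned into an absolute 
threshold for #1. This assumes the LLR score is the usual G^2 statistic on a 
2x2 contingency table, which is asymptotically chi-squared with 1 degree of 
freedom. The function name is hypothetical, and the approximation may be poor 
for small counts.

```python
from statistics import NormalDist


def llr_threshold_for_confidence(confidence: float) -> float:
    """Absolute LLR cutoff at which non-correlation is rejected.

    Assumption: the LLR score behaves like a chi-squared variable with one
    degree of freedom. Such a variable is the square of a standard normal,
    so its quantile is the square of the two-sided normal quantile.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return z * z


if __name__ == "__main__":
    for confidence in (0.90, 0.95, 0.99):
        print(confidence, round(llr_threshold_for_confidence(confidence), 3))
```

The value returned here is what would be recorded and passed back in as the 
absolute threshold for that matrix pair.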

#1 is very easy but not all that useful on its own, since LLR values will vary 
quite a bit. #1 also retains the O(n) computational complexity. I imagine #1 
would be used together with #2, since #2 is much more computationally complex 
and can output thresholds for #1.
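
To make #1 concrete, here is a minimal sketch of a single pass that applies one 
absolute threshold per matrix pair. The representation (a list of 
(row, column, LLR score) tuples per pair) and the function name are assumptions 
made for illustration, not the existing Mahout data structures.

```python
def downsample_with_thresholds(scored_matrices, thresholds):
    """Keep only the entries whose LLR score meets that pair's threshold.

    scored_matrices: one list of (row, col, llr_score) tuples per matrix pair,
                     for example [A'A entries, A'B entries].
    thresholds:      one absolute LLR cutoff per matrix pair, in the same order.
    """
    if len(scored_matrices) != len(thresholds):
        raise ValueError("need exactly one threshold per matrix pair")
    return [
        [(row, col, score) for (row, col, score) in entries if score >= cutoff]
        for entries, cutoff in zip(scored_matrices, thresholds)
    ]


if __name__ == "__main__":
    ata = [(0, 1, 25.0), (0, 2, 3.0)]   # sparse items-by-items scores
    atb = [(0, 0, 1.5), (1, 1, 0.2)]    # dense items-by-gender scores
    print(downsample_with_thresholds([ata, atb], [10.0, 1.0]))
```

Each entry is visited once, so the filter itself is linear in the number of 
scored cross-occurrences.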

#2 requires worst-case O(n^2) complexity. Some matrix pairs will have low 
dimensionality in one direction or both. In fact, this low dimensionality is 
the reason we need a different kind of downsampling for these pairs. Imagine a 
conversion A'A, which is items by items and may be very large but sparse; then 
A'B may be products by gender, so it has only 2 columns but is much denser. 

The calculation for #2 would, I believe, require performing the un-downsampled 
A'A, then determining the threshold from the LLR scores, then making another 
pass to downsample. This will add significant computation time and could make 
it impractical except for rare re-calculation tasks, in which case the absolute 
threshold would be recorded and used for subsequent A'A and A'B using #1.
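
The two passes described above can be sketched roughly as follows, for the 
"fraction retained" form of the threshold. The LLR function follows the 
standard entropy-based G^2 formulation; the input layout (one tuple of 2x2 
contingency counts per cross-occurrence) and all function names are 
assumptions for illustration only.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp round-off below zero


def threshold_for_fraction(scores, fraction):
    """Score of the last entry kept when retaining the top `fraction`."""
    if not scores:
        raise ValueError("no scores to rank")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[keep - 1]


def two_pass_downsample(cooccurrences, fraction):
    """cooccurrences: list of (row, col, k11, k12, k21, k22) tuples."""
    # Pass 1: score every cross-occurrence without any downsampling.
    scored = [(r, c, llr(k11, k12, k21, k22))
              for (r, c, k11, k12, k21, k22) in cooccurrences]
    threshold = threshold_for_fraction([s for _, _, s in scored], fraction)
    # Pass 2: downsample with the derived absolute threshold.
    # Ties at the threshold are all kept, so slightly more may be retained.
    kept = [entry for entry in scored if entry[2] >= threshold]
    return threshold, kept
```

The returned threshold is the absolute value that would be recorded and 
re-used with #1 on later runs.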

Since it is likely to be impractical to calculate #2 very often, it may be 
better done as an analytics job rather than as part of the A'A job.
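
If the thresholds do come from a separate analytics job, they need to be handed 
to the regular job somehow. A minimal sketch, assuming a JSON document keyed by 
a matrix pair name; the format, the key names, and the function names are all 
hypothetical.

```python
import json


def dump_thresholds(thresholds):
    """Serialize {matrix pair name: absolute LLR threshold} for later re-use."""
    return json.dumps(thresholds, sort_keys=True)


def load_thresholds(text):
    """Read the thresholds back, coercing every value to float."""
    return {name: float(value) for name, value in json.loads(text).items()}


if __name__ == "__main__":
    saved = dump_thresholds({"A'A": 19.27, "A'B": 3.84})
    print(saved)
    print(load_thresholds(saved))
```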

For most recommender cases the current downsampling method is fine, but for 
other uses of the CCO algorithm #2 may be required for occasional threshold 
re-calculation. In some sense we won't know until we try.

Any comments from [~tdunning] or [~dlyubimov] would be welcome.



> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
>                 Key: MAHOUT-1853
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1853
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Andrew Palumbo
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
