[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302371#comment-15302371 ]
Pat Ferrel commented on MAHOUT-1853:
------------------------------------
Steps:
1) allow an array of absolute LLR value thresholds, one for each matrix pair
2) allow thresholds to be given as a confidence of correlation (actually the
confidence that non-correlation is rejected) or as the fraction of total
cross-occurrences retained after downsampling. To reduce how often this must
be done, the absolute value thresholds should be output after calculation for
later re-use in #1
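As a rough illustration of step 1, the sketch below scores a 2x2 contingency
table with the entropy form of the log-likelihood ratio and then filters each
matrix pair against its own absolute threshold. The function names, the
dictionary layout and the threshold values are hypothetical and only meant to
show the shape of the idea, not how the Mahout job is implemented.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of co-occurrence counts.

    k11: both events, k12: first only, k21: second only, k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp small negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def downsample(scores, threshold):
    """Keep only the cells whose LLR score meets the absolute threshold."""
    return {cell: s for cell, s in scores.items() if s >= threshold}


# Hypothetical thresholds, one per matrix pair (illustrative values only).
thresholds = {"A'A": 10.0, "A'B": 3.0}

# Hypothetical LLR scores keyed by (row item, column item).
scores_by_pair = {
    "A'A": {("ipad", "iphone"): 27.7, ("ipad", "socks"): 0.4},
    "A'B": {("ipad", "female"): 4.2, ("ipad", "male"): 1.1},
}

kept = {
    pair: downsample(scores, thresholds[pair])
    for pair, scores in scores_by_pair.items()
}
```

Each pair is filtered in a single pass over its scored cells, which is why
this variant keeps the linear cost mentioned in the comment.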
#1 is very easy but not all that useful on its own, since LLR values will vary
quite a bit. #1 also retains the O(n) computational complexity. I imagine #1
would be used together with #2, since #2 is much more computationally expensive
and can output thresholds for #1.
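One possible way to turn a confidence level into an absolute threshold for #1
is sketched below. It assumes the LLR score of a 2x2 table is approximately
chi-square distributed with one degree of freedom when the two events are
independent, and it ignores any correction for the very large number of
comparisons made in a cross-occurrence matrix. The function name is
hypothetical.

```python
from statistics import NormalDist


def llr_threshold_for_confidence(confidence):
    """Absolute LLR threshold for rejecting non-correlation at `confidence`.

    Assumption: LLR ~ chi-square with 1 degree of freedom under independence.
    A chi-square(1) variable is the square of a standard normal, so its
    quantile at p is the squared normal quantile at (1 + p) / 2.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return z * z


# The resulting values could be recorded and re-used as thresholds for #1.
threshold_95 = llr_threshold_for_confidence(0.95)
threshold_99 = llr_threshold_for_confidence(0.99)
```

Under that assumption the conversion itself is cheap; the expensive part of #2
is producing the un-downsampled scores when the threshold is instead derived
from a retained fraction.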
#2 requires worst-case O(n^2) complexity. Some matrix pairs will have low
dimensionality in one direction or both. In fact this low dimensionality is the
reason we need a different kind of downsampling for these pairs. Imagine a
conversion A'A, which is items by items and may be very large but sparse; then
A'B may be products by gender, so only 2 columns wide but much denser.
The calculation for #2 would, I believe, require performing the un-downsampled
A'A, then determining the threshold from the LLR scores, then making another
pass to downsample. This will add significant computation time and could make
it impractical except for rare re-calculation tasks, in which case the absolute
threshold would be recorded and used for subsequent A'A and A'B using #1.
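The two-pass idea described above might look roughly like the following
sketch, which takes already-computed, un-downsampled LLR scores, derives the
threshold that retains a given fraction of cross-occurrences, and then filters
in a second pass. All names are hypothetical, and the in-memory sort stands in
for whatever distributed quantile computation a real job would need.

```python
import math


def threshold_for_fraction(scores, fraction):
    """Return the LLR value at which roughly `fraction` of scores are kept."""
    ranked = sorted(scores, reverse=True)
    if not ranked:
        raise ValueError("no scores to derive a threshold from")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, math.ceil(fraction * len(ranked)))
    # The threshold is the lowest score that still makes the cut.
    return ranked[keep - 1]


def two_pass_downsample(scored, fraction):
    """Pass 1: derive the threshold. Pass 2: drop cells below it.

    Returns the retained cells and the absolute threshold, so the threshold
    can be recorded and re-used later without repeating pass 1.
    Tied scores at the threshold are all kept, so slightly more than
    `fraction` of the cells may survive.
    """
    threshold = threshold_for_fraction(scored.values(), fraction)
    kept = {cell: s for cell, s in scored.items() if s >= threshold}
    return kept, threshold


# Hypothetical un-downsampled LLR scores for one matrix pair.
scored = {("a", "b"): 1.0, ("a", "c"): 5.0, ("b", "c"): 3.0, ("c", "d"): 9.0}
kept, recorded_threshold = two_pass_downsample(scored, 0.5)
```

Returning the threshold alongside the retained cells reflects the suggestion
that the expensive calculation be run rarely and its output fed back into #1.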
Since it is likely to be impractical to calculate #2 very often, it may be
better done as an analytics job rather than as part of the A'A job.
For most recommender cases the current downsampling method is fine, but for
other uses of the CCO algorithm #2 may be required for occasional threshold
re-calculation. In some sense we won't know until we try.
Any comments from [~tdunning] or [~dlyubimov] would be welcome.
> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.12.0
> Reporter: Andrew Palumbo
> Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold
> calculation for LLR downsampling, and possible multiple fixed thresholds for
> A’A, A’B etc. This is to account for the vast difference in dimensionality
> between indicator types.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)