[ 
https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965052#comment-13965052
 ] 

Ted Dunning commented on MAHOUT-1422:
-------------------------------------

{quote}
One last question, how do I get the four counts for LLR (X & Y, X but not Y, Y 
but not X, neither of them) in the cross-co-occurrence case?
{quote}
You nearly have them already.  If you look at the cooccurrence counts of the 
adjoined history matrix, the counts that contribute to A_1' A_2 are just a 
subset of the counts that contribute to A' A (where A is A_1 | A_2).  There are 
four such subsets corresponding to A_1' A_1, A_1'A_2, A_2'A_1, and A_2'A_2 and 
these are collected in the corresponding four quadrants of A'A.  

This view of things makes it clear that cross-occurrence analysis is a 
simplification of the co-occurrence analysis of A'A: it looks at only one 
quadrant of that matrix.
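
To make the quadrant picture concrete, here is a minimal sketch in plain Python. The two small binary matrices and the helper names (transpose, matmul, gram) are made up for illustration and are not Mahout code. It adjoins A_1 and A_2 column-wise, forms A'A, and slices out the four quadrants.

```python
# Toy binary history matrices (hypothetical data): rows are users,
# columns are items. A1 has 2 columns, A2 has 3 columns.
A1 = [
    [1, 0],
    [1, 1],
    [0, 1],
]
A2 = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
]


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain dense matrix product a * b."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def gram(a, b):
    """Return a' b, the (cross-)cooccurrence counts of a's and b's columns."""
    return matmul(transpose(a), b)


# Adjoin the matrices column-wise: A = A_1 | A_2.
A = [r1 + r2 for r1, r2 in zip(A1, A2)]
AtA = gram(A, A)

n1 = len(A1[0])  # number of columns contributed by A_1

# The four quadrants of A'A.
upper_left = [row[:n1] for row in AtA[:n1]]   # A_1' A_1
upper_right = [row[n1:] for row in AtA[:n1]]  # A_1' A_2
lower_left = [row[:n1] for row in AtA[n1:]]   # A_2' A_1
lower_right = [row[n1:] for row in AtA[n1:]]  # A_2' A_2

for name, quadrant in [("A_1'A_1", upper_left), ("A_1'A_2", upper_right),
                       ("A_2'A_1", lower_left), ("A_2'A_2", lower_right)]:
    print(name, quadrant)
```

The upper-right quadrant is the cross-occurrence matrix A_1' A_2, and the lower-left quadrant is its transpose.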

SOOO ...

If X is the event that column i of A_1 is non-zero and Y is the event that 
column j of A_2 is non-zero, the four numbers we start with are:
{noformat}
k_{X and Y} = (A_1' A_2)[i, j]  (the accumulated cross-occurrences of i and j)
k_X = columnSum(A_1)[i]         (the count of the unique rows of A_1 that have
                                 non-zero in column i)
k_Y = columnSum(A_2)[j]         (the count of the unique rows of A_2 that have
                                 non-zero in column j)
N = A_1.rowSize = A_2.rowSize   (the max possible value for k_X or k_Y)
{noformat}

The four numbers we need for LLR are

{noformat}
k_11 = k_{X and Y}
k_12 = k_X - k_{X and Y}
k_21 = k_Y - k_{X and Y}
k_22 = N - k_X - k_Y + k_{X and Y}
{noformat}
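
As a worked illustration of the formulas above, here is a minimal Python sketch. The matrices are made-up binary data, and the function names (contingency, llr, and so on) are hypothetical rather than Mahout's API. The LLR is computed with the usual entropy form of the G^2 statistic. Rows are counted directly, which matches (A_1' A_2)[i, j] and the column sums only when the matrices are binary.

```python
import math

# Toy binary history matrices (hypothetical data) with the same rows (users).
A1 = [
    [1, 0],
    [1, 1],
    [0, 1],
]
A2 = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
]


def contingency(a1, a2, i, j):
    """Return (k_11, k_12, k_21, k_22) for column i of a1 vs column j of a2."""
    n = len(a1)                                    # N: number of rows
    k_xy = sum(1 for r1, r2 in zip(a1, a2)
               if r1[i] != 0 and r2[j] != 0)       # (A_1' A_2)[i, j] if binary
    k_x = sum(1 for r in a1 if r[i] != 0)          # columnSum(A_1)[i] if binary
    k_y = sum(1 for r in a2 if r[j] != 0)          # columnSum(A_2)[j] if binary
    return (k_xy,                  # X and Y
            k_x - k_xy,            # X but not Y
            k_y - k_xy,            # Y but not X
            n - k_x - k_y + k_xy)  # neither


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))


counts = contingency(A1, A2, 0, 2)
print("k_11, k_12, k_21, k_22 =", counts)
print("LLR =", llr(*counts))
```

The four counts always sum to N, which is a quick sanity check on any implementation.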


> Make a version of RSJ that uses two inputs
> ------------------------------------------
>
>                 Key: MAHOUT-1422
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1422
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 1.0
>         Environment: mapreduce
>            Reporter: Pat Ferrel
>              Labels: recommender, similarity
>             Fix For: 1.0
>
>
> Currently the RowSimilarityJob uses a similarity measure to pairwise compare 
> all rows in a DistributedRowMatrix.
> For many applications including a cross-action recommender we need something 
> like RSJ that takes two DRMs and compares matching rows of each.  The output 
> would be the same form as RSJ, and ideally would allow the use of any 
> similarity type already defined--especially LLR.
> There are two implementations of a Cross-Recommender, one based on the Mahout 
> RecommenderJob and another based on Solr, that can immediately benefit from 
> a Cross-RSJ. 
> A modification of the matrix multiply job may be a place to start since the 
> current RSJ seems to rely heavily on self-similarity.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
