Turning itemsimilarity into rowsimilarity is fairly simple, but it requires 
altering CooccurrenceAnalysis.cooccurrence to swap the transposes and calculate 
the LLR values for rows rather than columns. The input will be something like a 
DRM. Row similarity does something like AA’ with LLR weighting and, as I take it 
from the Hadoop code, uses similar downsampling. Let me know if I’m on the wrong 
track here.
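
To make the AA’ with LLR weighting idea concrete, here is a minimal Python sketch. It is not the Mahout code: the function names, the dict-of-sets stand-in for a DRM, and the entropy-based form of the LLR statistic are my assumptions for illustration, and it leaves out downsampling entirely.

```python
import math
from itertools import combinations


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table, clamped at 0
    # to absorb small negative values from floating-point rounding.
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))


def row_similarity(rows, num_cols):
    # rows: hypothetical stand-in for a binary DRM, row ID -> set of column indices.
    # Returns {(row_a, row_b): llr_score} for row pairs sharing at least one column,
    # i.e. the nonzero off-diagonal entries of AA' reweighted by LLR.
    scores = {}
    for a, b in combinations(sorted(rows), 2):
        k11 = len(rows[a] & rows[b])   # columns present in both rows
        if k11 == 0:
            continue
        k12 = len(rows[a]) - k11       # columns only in row a
        k21 = len(rows[b]) - k11       # columns only in row b
        k22 = num_cols - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


example = {"r1": {0, 1, 2}, "r2": {0, 1, 3}, "r3": {4, 5}}
scores = row_similarity(example, num_cols=6)
```

The contingency counts are taken over columns for each pair of rows, which is the same calculation itemsimilarity does over rows for each pair of columns, just with the roles swapped.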


With the new application-ID-preserving code, the following input could be 
directly processed (it’s my test case):

doc1\tNow is the time for all good people to come to aid of their party
doc2\tNow is the time for all good people to come to aid of their country
doc3\tNow is the time for all good people to come to aid of their hood
doc4\tNow is the time for all good people to come to aid of their friends
doc5\tNow is the time for all good people to come to aid of their looser brother
doc6\tThe quick brown fox jumped over the lazy dog
doc7\tThe quick brown fox jumped over the lazy boy
doc8\tThe quick brown fox jumped over the lazy cat
doc9\tThe quick brown fox jumped over the lazy wolverine
doc10\tThe quick brown fox jumped over the lazy cantelope
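
As a rough sketch of how input like this could become rows of a binary doc-by-term matrix while keeping the application IDs, something like the following Python would do. The parse_docs name, the whitespace-and-lowercase tokenization, and the returned structures are assumptions for illustration, not what the ID-preserving code actually does.

```python
def parse_docs(lines):
    # Each line is "<doc ID><TAB><text>". Returns:
    #   doc_ids    - application doc IDs in input order (row index -> ID)
    #   term_index - term -> column index, assigned in order of first appearance
    #   rows       - one set of column indices per doc (a binary row)
    doc_ids = []
    term_index = {}
    rows = []
    for line in lines:
        doc_id, text = line.rstrip("\n").split("\t", 1)
        cols = set()
        # Assumed tokenization: lowercase, split on whitespace.
        for token in text.lower().split():
            cols.add(term_index.setdefault(token, len(term_index)))
        doc_ids.append(doc_id)
        rows.append(cols)
    return doc_ids, term_index, rows


sample = [
    "doc1\tNow is the time for all good people to come to aid of their party",
    "doc6\tThe quick brown fox jumped over the lazy dog",
]
doc_ids, term_index, rows = parse_docs(sample)
```

Keeping doc_ids alongside the rows is what lets the output be written with the original IDs rather than internal row numbers.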

The output will be something like the following, with or without LLR strengths:
doc1\tdoc2 doc3 doc4 doc5
…
doc6\tdoc7 doc8 doc9 doc10
...
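
For illustration, one way to write a row in that shape from scored pairs is sketched below in Python. The format_row name and the "id:strength" notation for the with-strengths case are my assumptions, and the scores in the usage lines are made up rather than computed.

```python
def format_row(row_id, scored, with_strengths=False):
    # scored: list of (other_row_id, llr_strength) pairs for this row.
    # Strongest matches come first; ties are broken by ID for stable output.
    ranked = sorted(scored, key=lambda pair: (-pair[1], pair[0]))
    if with_strengths:
        # Assumed notation for strengths: "id:score" with two decimals.
        items = [f"{other}:{score:.2f}" for other, score in ranked]
    else:
        items = [other for other, _ in ranked]
    return row_id + "\t" + " ".join(items)


# Made-up strengths, only to show the two output shapes.
plain = format_row("doc1", [("doc3", 1.5), ("doc2", 2.0)])
weighted = format_row("doc1", [("doc3", 1.5), ("doc2", 2.0)], with_strengths=True)
```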
 
It would be pretty easy to tack on a text analyzer from Lucene to turn this 
into a full-function doc similarity job, since LLR doesn’t need TF-IDF. 

One question: is there any reason to do the cross-similarity in RSJ, i.e. 
[AB’]? I can’t picture what this would mean, so I am assuming the answer is no.

