Turning itemsimilarity into rowsimilarity is fairly simple, but it requires altering CooccurrenceAnalysis.cooccurrence to swap the transposes and calculate the LLR values for rows rather than columns. The input will be something like a DRM. Row similarity computes something like AA’ with LLR weighting and, as far as I can tell from the Hadoop code, uses similar downsampling. Let me know if I’m on the wrong track here.
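
As a rough illustration of the AA’ idea (plain Python, not the Mahout Scala code): each pair of rows is scored from a 2x2 contingency table of shared and unshared columns, using the entropy form commonly used for the log-likelihood ratio. The function names, the dict-of-sets input, and the sample rows are all assumptions made for this sketch, and downsampling is left out entirely.

```python
import math
from itertools import combinations


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # k11: columns set in both rows, k12: only in row a,
    # k21: only in row b, k22: in neither.
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))


def row_similarity(rows, num_columns):
    # rows: dict mapping row ID -> set of column IDs with a nonzero entry.
    # Returns LLR scores for every pair of rows sharing at least one column,
    # i.e. the nonzero off-diagonal cells of AA'.
    scores = {}
    for a, b in combinations(sorted(rows), 2):
        k11 = len(rows[a] & rows[b])
        if k11 == 0:
            continue
        k12 = len(rows[a]) - k11
        k21 = len(rows[b]) - k11
        k22 = num_columns - k11 - k12 - k21
        score = llr(k11, k12, k21, k22)
        scores[(a, b)] = score
        scores[(b, a)] = score
    return scores


# Hypothetical 3-row, 6-column binary matrix.
rows = {"r1": {0, 1, 2}, "r2": {0, 1, 3}, "r3": {4, 5}}
scores = row_similarity(rows, num_columns=6)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 4))
```

Note that LLR on its own flags any departure from independence, so a real job would still need the thresholding and downsampling that the existing cooccurrence code applies.
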
With the new application ID preserving code, the following input could be processed directly (it’s my test case):

doc1\tNow is the time for all good people to come to aid of their party
doc2\tNow is the time for all good people to come to aid of their country
doc3\tNow is the time for all good people to come to aid of their hood
doc4\tNow is the time for all good people to come to aid of their friends
doc5\tNow is the time for all good people to come to aid of their looser brother
doc6\tThe quick brown fox jumped over the lazy dog
doc7\tThe quick brown fox jumped over the lazy boy
doc8\tThe quick brown fox jumped over the lazy cat
doc9\tThe quick brown fox jumped over the lazy wolverine
doc10\tThe quick brown fox jumped over the lazy cantelope

The output will be something like the following, with or without LLR strengths:

doc1\tdoc2 doc3 doc4 doc5
…
doc6\tdoc7 doc8 doc9 doc10
...

It would be pretty easy to tack on a text analyzer from Lucene to turn this into a full-function doc similarity job, since LLR doesn’t need TF-IDF.

One question: is there any reason to do the cross-similarity in RSJ, i.e. [AB’]? I can’t picture what this would mean, so I am assuming the answer is no.
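
For reference, the shape of the test case can be mimicked outside Mahout with a few lines of Python. This is only a sketch under stated assumptions: a lowercase whitespace split stands in for a Lucene analyzer, the scores are raw shared-token counts (the unweighted AA’ entries, so the "without LLR strengths" variant), and the `min_shared` cutoff is a hypothetical parameter rather than anything in the actual job.

```python
# Test input from the post; "\t" separates the preserved doc ID from the text.
LINES = [
    "doc1\tNow is the time for all good people to come to aid of their party",
    "doc2\tNow is the time for all good people to come to aid of their country",
    "doc3\tNow is the time for all good people to come to aid of their hood",
    "doc4\tNow is the time for all good people to come to aid of their friends",
    "doc5\tNow is the time for all good people to come to aid of their looser brother",
    "doc6\tThe quick brown fox jumped over the lazy dog",
    "doc7\tThe quick brown fox jumped over the lazy boy",
    "doc8\tThe quick brown fox jumped over the lazy cat",
    "doc9\tThe quick brown fox jumped over the lazy wolverine",
    "doc10\tThe quick brown fox jumped over the lazy cantelope",
]


def parse(lines):
    # Keep the application's own doc ID as the row key; each row becomes
    # the set of tokens in the doc (stand-in for a real text analyzer).
    docs = {}
    for line in lines:
        doc_id, text = line.split("\t", 1)
        docs[doc_id] = set(text.lower().split())
    return docs


def similar_rows(docs, min_shared=2):
    # For each doc, list the other docs sharing at least min_shared tokens.
    # min_shared is a hypothetical cutoff; here it filters out pairs that
    # share only the single token "the".
    result = {}
    for a in docs:
        result[a] = [b for b in docs
                     if b != a and len(docs[a] & docs[b]) >= min_shared]
    return result


docs = parse(LINES)
neighbors = similar_rows(docs)
for doc_id, others in neighbors.items():
    print(doc_id + "\t" + " ".join(others))
```

Swapping the raw count for an LLR score would change the ranking within each group but not the input or output format.
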
