spark-rowsimilarity is implemented with LLR. It produces exactly what is shown below; it’s in the test case. It is not really made for textual doc similarity yet since, as you say, more is needed. For text it would be better to:
1) run the docs through a Lucene analyzer
2) use LLR to filter unneeded terms
3) TF-IDF weight the remaining terms
4) use cosine to determine similarity strengths

Which is what I believe you said. Eventually I’ll get to this. As-is it’s more like user similarity.

On Sep 18, 2014, at 11:15 AM, Ted Dunning <[email protected]> wrote:

LLR with text is commonly done (that is where it comes from). The simple approach would be to have sentences be users and words be items. This will result in word-word connections.

This doesn't directly give document-document similarities. That could be done by transposing the original data (word is user, document is item) but I don't quite understand how to interpret that.

Another approach is simply using term weighting and document normalization and scoring every doc against every other. That comes down to a matrix multiplication which is very similar to the transposed LLR problem, so that may give an interpretation.

On Mon, Aug 25, 2014 at 10:15 AM, Pat Ferrel <[email protected]> wrote:

LLR with text or non-interaction data: what do we use for counts? Do we care how many times a token is seen in a doc, for instance, or do we just look to see whether it was seen? I assume the latter, which means we need a new numNonZeroElementsPerRow in several places in math-scala, right? All the same questions are going to come up over this as did for numNonZeroElementsPerColumn, so please speak now or I’ll just mirror its implementation.

On Aug 25, 2014, at 9:38 AM, Pat Ferrel <[email protected]> wrote:

Turning itemsimilarity into rowsimilarity is fairly simple but requires altering CooccurrenceAnalysis.cooccurrence to swap the transposes and calculate the LLR values for rows rather than columns. The input will be something like a DRM. Row similarity does something like AA’ with LLR weighting and uses similar downsampling, as I take it from the Hadoop code. Let me know if I’m on the wrong track here.
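For reference, the LLR weighting discussed throughout this thread is Dunning's G² statistic over a 2x2 contingency table of counts. Here is an illustrative Python sketch of it (not Mahout's actual CooccurrenceAnalysis/LogLikelihood code, just the same math):

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy: N * H(counts), where N = sum(counts).
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 for a 2x2 contingency table.

    k11: times A and B were seen together
    k12: times A was seen without B
    k21: times B was seen without A
    k22: times neither was seen
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative round-off to zero; independent events score ~0,
    # strongly associated events score high.
    return max(0.0, 2.0 * (row + col - mat))
```

Note that the inputs are plain event counts, which is why the "do we just look to see if it was seen" question above matters: with binary counts, k11 is just the number of rows where both items are nonzero.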
With the new application-ID-preserving code the following input could be directly processed (it’s my test case):

doc1\tNow is the time for all good people to come to aid of their party
doc2\tNow is the time for all good people to come to aid of their country
doc3\tNow is the time for all good people to come to aid of their hood
doc4\tNow is the time for all good people to come to aid of their friends
doc5\tNow is the time for all good people to come to aid of their looser brother
doc6\tThe quick brown fox jumped over the lazy dog
doc7\tThe quick brown fox jumped over the lazy boy
doc8\tThe quick brown fox jumped over the lazy cat
doc9\tThe quick brown fox jumped over the lazy wolverine
doc10\tThe quick brown fox jumped over the lazy cantelope

The output will be something like the following, with or without LLR strengths:

doc1\tdoc2 doc3 doc4 doc5 …
doc6\tdoc7 doc8 doc9 doc10 ...

It would be pretty easy to tack on a text analyzer from Lucene to turn this into a full-function doc-similarity job, since LLR doesn’t need TF-IDF. One question is: is there any reason to do the cross-similarity in RSJ, so [AB’]? I can’t picture what this would mean, so am assuming the answer is no.
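To make the test case above concrete, here is a toy Python sketch (not the actual Spark job) that parses that input, binarizes the terms per doc as discussed above, and ranks docs by shared-term count, i.e. the raw k11 cell that the LLR weighting would then be computed from:

```python
raw = """doc1\tNow is the time for all good people to come to aid of their party
doc2\tNow is the time for all good people to come to aid of their country
doc3\tNow is the time for all good people to come to aid of their hood
doc4\tNow is the time for all good people to come to aid of their friends
doc5\tNow is the time for all good people to come to aid of their looser brother
doc6\tThe quick brown fox jumped over the lazy dog
doc7\tThe quick brown fox jumped over the lazy boy
doc8\tThe quick brown fox jumped over the lazy cat
doc9\tThe quick brown fox jumped over the lazy wolverine
doc10\tThe quick brown fox jumped over the lazy cantelope"""

# Binarize: record only whether a term was seen in a doc, not how often.
docs = {}
for line in raw.split("\n"):
    doc_id, text = line.split("\t", 1)
    docs[doc_id] = set(text.lower().split())

def similar(doc_id, k=4):
    """Rank other docs by shared-term count (the k11 of the 2x2 LLR table)."""
    overlaps = [(other, len(docs[doc_id] & terms))
                for other, terms in docs.items() if other != doc_id]
    overlaps.sort(key=lambda p: -p[1])
    return [d for d, _ in overlaps[:k]]

print("doc1\t" + " ".join(similar("doc1")))  # doc1    doc2 doc3 doc4 doc5
print("doc6\t" + " ".join(similar("doc6")))  # doc6    doc7 doc8 doc9 doc10
```

This reproduces the two clusters in the expected output; the real job would additionally apply LLR weighting and downsampling rather than raw overlap counts.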
