spark-rowsimilarity is implemented with LLR. It produces exactly what is shown below; it’s in the test case. It is not really made for textual doc similarity yet since, as you say, more is needed. For text it would be better to:
1) run the docs through a Lucene analyzer
2) use LLR to filter unneeded terms
3) TF-IDF weight the remaining terms
4) use cosine to determine similarity strengths

Which is what I believe you said. Eventually I’ll get to this. As-is it’s more like user similarity.

On Sep 18, 2014, at 11:15 AM, Ted Dunning <[email protected]> wrote:

LLR with text is commonly done (that is where it comes from). The simple approach would be to have sentences be users and words be items. This will result in word-word connections.

This doesn't directly give document-document similarities. That could be done by transposing the original data (word is user, document is item) but I don't quite understand how to interpret that.

Another approach is simply using term weighting and document normalization and scoring every doc against every other. That comes down to a matrix multiplication which is very similar to the transposed LLR problem, so that may give an interpretation.

On Mon, Aug 25, 2014 at 10:15 AM, Pat Ferrel <[email protected]> wrote:

LLR with text or non-interaction data: what do we use for counts? Do we care how many times a token is seen in a doc, for instance, or do we just look to see whether it was seen? I assume the latter, which means we need a new numNonZeroElementsPerRow in several places in math-scala, right? All the same questions are going to come up over this as did for numNonZeroElementsPerColumn, so please speak now or I’ll just mirror its implementation.

On Aug 25, 2014, at 9:38 AM, Pat Ferrel <[email protected]> wrote:

Turning itemsimilarity into rowsimilarity is fairly simple but requires altering CooccurrenceAnalysis.cooccurrence to swap the transposes and calculate the LLR values for rows rather than columns. The input will be something like a DRM. Row similarity does something like AA’ with LLR weighting and uses similar downsampling, as I take it from the Hadoop code. Let me know if I’m on the wrong track here.
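For reference, the LLR weighting discussed throughout this thread is Dunning's G² statistic over a 2x2 contingency table of counts. Here is an illustrative Python sketch of it (not Mahout's actual CooccurrenceAnalysis/LogLikelihood code, just the same math):

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy: N * H(counts), where N = sum(counts).
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 for a 2x2 contingency table.

    k11: times A and B were seen together
    k12: times A was seen without B
    k21: times B was seen without A
    k22: times neither was seen
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative round-off to zero; independent events score ~0,
    # strongly associated events score high.
    return max(0.0, 2.0 * (row + col - mat))
```

Note that the inputs are plain event counts, which is why the "do we just look to see if it was seen" question above matters: with binary counts, k11 is just the number of rows where both items are nonzero.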
With the new application-ID-preserving code the following input could be directly processed (it’s my test case):

doc1\tNow is the time for all good people to come to aid of their party
doc2\tNow is the time for all good people to come to aid of their country
doc3\tNow is the time for all good people to come to aid of their hood
doc4\tNow is the time for all good people to come to aid of their friends
doc5\tNow is the time for all good people to come to aid of their looser brother
doc6\tThe quick brown fox jumped over the lazy dog
doc7\tThe quick brown fox jumped over the lazy boy
doc8\tThe quick brown fox jumped over the lazy cat
doc9\tThe quick brown fox jumped over the lazy wolverine
doc10\tThe quick brown fox jumped over the lazy cantelope

The output will be something like the following, with or without LLR strengths:

doc1\tdoc2 doc3 doc4 doc5 …
doc6\tdoc7 doc8 doc9 doc10 ...

It would be pretty easy to tack on a text analyzer from Lucene to turn this into a full-function doc-similarity job, since LLR doesn’t need TF-IDF. One question is: is there any reason to do the cross-similarity in RSJ, so [AB’]? I can’t picture what this would mean, so am assuming the answer is no.
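To make the test case above concrete, here is a toy Python sketch (not the actual Spark job) that parses that input, binarizes the terms per doc as discussed above, and ranks docs by shared-term count, i.e. the raw k11 cell that the LLR weighting would then be computed from:

```python
raw = """doc1\tNow is the time for all good people to come to aid of their party
doc2\tNow is the time for all good people to come to aid of their country
doc3\tNow is the time for all good people to come to aid of their hood
doc4\tNow is the time for all good people to come to aid of their friends
doc5\tNow is the time for all good people to come to aid of their looser brother
doc6\tThe quick brown fox jumped over the lazy dog
doc7\tThe quick brown fox jumped over the lazy boy
doc8\tThe quick brown fox jumped over the lazy cat
doc9\tThe quick brown fox jumped over the lazy wolverine
doc10\tThe quick brown fox jumped over the lazy cantelope"""

# Binarize: record only whether a term was seen in a doc, not how often.
docs = {}
for line in raw.split("\n"):
    doc_id, text = line.split("\t", 1)
    docs[doc_id] = set(text.lower().split())

def similar(doc_id, k=4):
    """Rank other docs by shared-term count (the k11 of the 2x2 LLR table)."""
    overlaps = [(other, len(docs[doc_id] & terms))
                for other, terms in docs.items() if other != doc_id]
    overlaps.sort(key=lambda p: -p[1])
    return [d for d, _ in overlaps[:k]]

print("doc1\t" + " ".join(similar("doc1")))  # doc1    doc2 doc3 doc4 doc5
print("doc6\t" + " ".join(similar("doc6")))  # doc6    doc7 doc8 doc9 doc10
```

This reproduces the two clusters in the expected output; the real job would additionally apply LLR weighting and downsampling rather than raw overlap counts.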
