Thinking a bit more about the use of LLR only for similarity. Imagine the case 
where you are doing text analysis and have TF-IDF weights in the input matrix. 
LLR has one trait that makes me wonder about settling on it alone for general 
similarity; this is more an observation, since I have no data to address it: 
LLR ignores the input weights. Using document terminology now, LLR has some of 
the same characteristics as TF-IDF + cosine. It tends to mitigate the effect of 
large documents or extremely popular terms. But has there been any analysis 
that suggests LLR beats the traditional TF-IDF + cosine for document 
similarity?
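
To make the "LLR ignores the input weights" point concrete, here is a minimal 
sketch of the raw log-likelihood ratio over a 2x2 cooccurrence table, following 
the shape of Mahout's LogLikelihood.logLikelihoodRatio (natural log; the helper 
names here are my own). Notice the signature takes counts only — any TF-IDF or 
other weighting in the input matrix never enters the score.

```python
import math

def _entropy(*counts):
    # Unnormalized Shannon entropy: -sum(c * log(c / total)), skipping zeros
    total = sum(counts)
    return -sum(c * math.log(c / total) for c in counts if c > 0)

def llr(k11, k12, k21, k22):
    """Raw log-likelihood ratio (G^2) for a 2x2 cooccurrence table.

    k11 = both events occurred, k12/k21 = one but not the other,
    k22 = neither. Raw counts only: weights are invisible to this score.
    """
    row_entropy = _entropy(k11 + k12, k21 + k22)
    col_entropy = _entropy(k11 + k21, k12 + k22)
    mat_entropy = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values arising from floating-point error
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))
```

Strongly cooccurring pairs score high; statistically independent pairs score 
near zero, no matter how heavily weighted the underlying interactions were.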

BTW we once used the Mahout Item-based recommender with an ecom dataset of 2.5M 
users, 500K items over a year of fairly active interactions. We tried several 
of the similarity measures in Mahout and computed MAP for cross validation on 
each. LLR was the clear winner, so I understand its singular use in 
recommenders. 



On Aug 6, 2014, at 5:38 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

BTW the cooccurrence code is going into RSJ too and there are uses of that 
where cosine is expected. I don’t know how to think about cross-cosine. Is 
there an argument for LLR only in RSJ?

> On Aug 6, 2014, at 5:20 PM, Sebastian Schelter <ssc.o...@googlemail.com> 
> wrote:
> 
> I chose against porting all the similarity measures to the dsl version of
> the cooccurrence analysis for two reasons. First, adding the measures in a
> generalizable way makes the code superhard to read. Second, in practice, I
> have never seen anything give better results than llr. As Ted pointed
> out, a lot of the foundations of using similarity measures come from
> wanting to predict ratings, which people never do in practice. I think we
> should restrict ourselves to approaches that work with implicit, count-like
> data.
> 
> -s
