Thinking a bit more about using LLR as the only similarity measure. Imagine you are doing text analysis and have TF-IDF weights in the input matrix. LLR has one trait that makes me wonder about settling on it alone for general similarity, though this is more an observation than a claim since I have no data to back it up: LLR ignores the input weights and works only from occurrence counts. In document terms, LLR shares some characteristics with TF-IDF + cosine, since it tends to mitigate the effect of large documents and extremely popular terms. But has there been any analysis suggesting that LLR beats the traditional TF-IDF + cosine for document similarity?
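To make the "LLR ignores the input weights" point concrete, here is a minimal Python sketch. It is not Mahout code: the function names and the toy vectors are made up for illustration, and the LLR is the standard G-squared statistic on a 2x2 contingency table, written in the entropy form. It assumes LLR is computed from the nonzero pattern of the two vectors, while cosine uses the weights themselves.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # G^2 statistic for a 2x2 contingency table of counts.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

def llr_from_vectors(a, b):
    # Only presence/absence is used; the weights are discarded.
    k11 = sum(1 for x, y in zip(a, b) if x != 0 and y != 0)
    k12 = sum(1 for x, y in zip(a, b) if x != 0 and y == 0)
    k21 = sum(1 for x, y in zip(a, b) if x == 0 and y != 0)
    k22 = len(a) - k11 - k12 - k21
    return llr(k11, k12, k21, k22)

def cosine(a, b):
    # Cosine similarity over the raw (e.g. TF-IDF) weights.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical TF-IDF-like weights over six terms.
doc_a = [0.9, 0.1, 0.0, 0.0, 0.3, 0.0]
doc_b = [0.8, 0.2, 0.0, 0.0, 0.0, 0.0]
# Same nonzero pattern as doc_a, but with the first two weights swapped.
doc_a_reweighted = [0.1, 0.9, 0.0, 0.0, 0.3, 0.0]

print("LLR:   ", llr_from_vectors(doc_a, doc_b),
      llr_from_vectors(doc_a_reweighted, doc_b))
print("cosine:", cosine(doc_a, doc_b), cosine(doc_a_reweighted, doc_b))
```

The two LLR scores are identical because the nonzero pattern did not change, while the two cosine scores differ. Whether that loss of weight information hurts document similarity in practice is exactly the open question.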
BTW we once used the Mahout Item-based recommender on an ecom dataset of 2.5M users and 500K items, covering a year of fairly active interactions. We tried several of the similarity measures in Mahout and computed MAP under cross-validation for each. LLR was the clear winner, so I understand its singular use in recommenders.

On Aug 6, 2014, at 5:38 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

BTW the cooccurrence code is going into RSJ too and there are uses of that where cosine is expected. I don’t know how to think about cross-cosine. Is there an argument for LLR only in RSJ?

> On Aug 6, 2014, at 5:20 PM, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
>
> I chose against porting all the similarity measures to the dsl version of
> the cooccurrence analysis for two reasons. First, adding the measures in a
> generalizable way makes the code superhard to read. Second, in practice, I
> have never seen something giving better results than llr. As Ted pointed
> out, a lot of the foundations of using similarity measures comes from
> wanting to predict ratings, which people never do in practice. I think we
> should restrict ourselves to approaches that work with implicit, count-like
> data.
>
> -s