The ItemSimilarityJob cannot be used directly, as it does not work on a DistributedRowMatrix but on data structures specific to collaborative filtering, so I'd say a separate job would be required.
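Roughly, what such a separate job would have to compute can be sketched on a single machine as follows. This is only a minimal sketch, not Mahout code: it uses neither ItemSimilarityJob nor DistributedRowMatrix, and the class name PairwiseCosineSketch, the method similarities and the nested-map input format are all made up for illustration. Documents play the role of items and terms the role of users; each term contributes a partial dot product to every pair of documents that contain it, in the spirit of the inverted-index approach. It assumes the term weights (e.g. tf-idf) are already computed and non-zero.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class PairwiseCosineSketch {

    // docs: document id -> (term -> weight). Weights are assumed to be non-zero.
    // Returns: document id -> (other document id -> cosine similarity),
    // containing only pairs of documents that share at least one term.
    static Map<String, Map<String, Double>> similarities(Map<String, Map<String, Double>> docs) {
        // Step 1: Euclidean norm of every document vector.
        Map<String, Double> norms = new HashMap<>();
        // Step 2: inverted index, term -> list of (document id, weight) postings.
        Map<String, List<Map.Entry<String, Double>>> postings = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> doc : docs.entrySet()) {
            double sumOfSquares = 0.0;
            for (Map.Entry<String, Double> tw : doc.getValue().entrySet()) {
                sumOfSquares += tw.getValue() * tw.getValue();
                postings.computeIfAbsent(tw.getKey(), t -> new ArrayList<>())
                        .add(new AbstractMap.SimpleImmutableEntry<>(doc.getKey(), tw.getValue()));
            }
            norms.put(doc.getKey(), Math.sqrt(sumOfSquares));
        }

        // Step 3: every term adds a partial dot product to each pair of
        // documents in its postings list; sum those partial products per pair.
        Map<String, Map<String, Double>> result = new HashMap<>();
        for (List<Map.Entry<String, Double>> list : postings.values()) {
            for (int i = 0; i < list.size(); i++) {
                for (int j = i + 1; j < list.size(); j++) {
                    Map.Entry<String, Double> a = list.get(i);
                    Map.Entry<String, Double> b = list.get(j);
                    double partial = a.getValue() * b.getValue();
                    result.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                          .merge(b.getKey(), partial, Double::sum);
                    result.computeIfAbsent(b.getKey(), k -> new HashMap<>())
                          .merge(a.getKey(), partial, Double::sum);
                }
            }
        }

        // Step 4: turn the summed dot products into cosines by dividing
        // by the product of the two document norms.
        for (Map.Entry<String, Map<String, Double>> row : result.entrySet()) {
            double normA = norms.get(row.getKey());
            row.getValue().replaceAll((other, dot) -> dot / (normA * norms.get(other)));
        }
        return result;
    }
}
```

Keeping only the 25 largest entries of each row would give the "top 25 most related documents" list. In a MapReduce version the postings lists and the per-pair sums would become separate map and reduce phases rather than in-memory maps.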
If you want to give it a try, a good starting point for understanding how the pairwise cosine similarities are computed is the example in the comment of org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (starting at line 59). Just think of items as documents and users as terms.

-sebastian

2010/6/9 Kris Jack <[email protected]>

> Hi Sebastian,
>
> Thanks for the reference. I had a look through the paper and it's certainly
> very relevant to the problem that I'm trying to solve. Do you think the CF
> functionality could be co-opted to output such document similarities as it
> stands or will it require modification? If it can be used straight off, say
> to give the top 25 most related documents for each document, then how would
> you suggest that I go about this?
>
> Thanks,
> Kris
>
> 2010/6/8 Sebastian Schelter <[email protected]>
>
> > Hi Kris,
> >
> > actually the code to compute the item-to-item similarities in the
> > collaborative filtering part of mahout (which at the first look seems to be
> > a totally different problem than yours) is based on a paper that deals with
> > computing the pairwise similarity of text documents in a very simple way.
> > Maybe that could be helpful to you:
> >
> > Elsayed et al: Pairwise Document Similarity in Large Collections with
> > MapReduce
> >
> > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> >
> > -sebastian
> >
> > 2010/6/8 Kris Jack <[email protected]>
> >
> > > Hi everyone,
> > >
> > > I currently use lucene's moreLikeThis function through solr to find
> > > documents that are related to one another. A single call, however, takes
> > > around 4 seconds to complete and I would like to reduce this. I got to
> > > thinking that I might be able to use Mahout to generate a document
> > > similarity matrix offline that could then be looked-up in real time for
> > > serving. Is this a reasonable use of Mahout? If so, what functions will
> > > generate a document similarity matrix? Also, I would like to be able to
> > > keep the text processing advantages provided through lucene so it would
> > > help if I could still use my lucene index. If not, then could you
> > > recommend any alternative solutions please?
> > >
> > > Many thanks,
> > > Kris
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
