Re: Generating a Document Similarity Matrix

Sebastian Schelter Wed, 09 Jun 2010 10:57:00 -0700

The ItemSimilarityJob cannot be used directly, as it does not operate on a
DistributedRowMatrix but on data structures specific to collaborative
filtering, so I'd say that a separate job would be required.


If you want to give it a try, a good starting point for understanding how
the pairwise cosine similarities are computed is the example in the comment
of org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
(starting at line 59). Just think of items as documents and users as terms.
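
To make the "items as documents, users as terms" mapping concrete, here is a
minimal in-memory Python sketch of the idea. It is not Mahout code and not
the ItemSimilarityJob implementation; it only mimics the general shape of
the computation (invert to term postings, sum per-term products for each
document pair, normalize by vector lengths). It assumes every document is
already available as a sparse term-weight vector, and the function names
pairwise_cosine and top_related are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pairwise_cosine(doc_vectors):
    """doc_vectors: dict doc_id -> dict term -> weight (hypothetical input format).

    Returns dict (doc_a, doc_b) -> cosine similarity, with doc_a < doc_b.
    Pairs that share no term are omitted (their similarity is 0).
    """
    # Step 1: invert the document vectors into term -> [(doc_id, weight), ...].
    postings = defaultdict(list)
    for doc_id, terms in doc_vectors.items():
        for term, weight in terms.items():
            postings[term].append((doc_id, weight))

    # Step 2: every term contributes a partial dot product to each pair of
    # documents that contain it; summing over terms gives the full dot product.
    dots = defaultdict(float)
    for plist in postings.values():
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            dots[(a, b)] += wa * wb

    # Step 3: normalize by the Euclidean length of each document vector.
    norms = {
        doc_id: sqrt(sum(w * w for w in terms.values()))
        for doc_id, terms in doc_vectors.items()
    }
    return {
        (a, b): dot / (norms[a] * norms[b])
        for (a, b), dot in dots.items()
        if norms[a] > 0 and norms[b] > 0
    }


def top_related(similarities, n=25):
    """Keep the n most similar documents for each document."""
    related = defaultdict(list)
    for (a, b), sim in similarities.items():
        related[a].append((b, sim))
        related[b].append((a, sim))
    return {
        doc_id: sorted(neighbours, key=lambda pair: pair[1], reverse=True)[:n]
        for doc_id, neighbours in related.items()
    }
```

In a MapReduce setting, step 1 and step 2 would roughly correspond to
separate passes over the data, and very frequent terms would need special
care because each posting list generates a quadratic number of pairs.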

-sebastian

2010/6/9 Kris Jack <[email protected]>

> Hi Sebastian,
>
> Thanks for the reference.  I had a look through the paper and it's
> certainly very relevant to the problem that I'm trying to solve.  Do you
> think the CF functionality could be co-opted to output such document
> similarities as it stands or will it require modification?  If it can be
> used straight off, say to give the top 25 most related documents for each
> document, then how would you suggest that I go about this?
>
> Thanks,
> Kris
>
>
>
> 2010/6/8 Sebastian Schelter <[email protected]>
>
> > Hi Kris,
> >
> > Actually, the code to compute the item-to-item similarities in the
> > collaborative filtering part of Mahout (which at first glance seems to
> > be a totally different problem from yours) is based on a paper that
> > deals with computing the pairwise similarity of text documents in a very
> > simple way.  Maybe that could be helpful to you:
> >
> > Elsayed et al: Pairwise Document Similarity in Large Collections with
> > MapReduce
> >
> >
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> >
> > -sebastian
> >
> >
> > 2010/6/8 Kris Jack <[email protected]>
> >
> > > Hi everyone,
> > >
> > > I currently use Lucene's moreLikeThis function through Solr to find
> > > documents that are related to one another.  A single call, however,
> > > takes around 4 seconds to complete and I would like to reduce this.  I
> > > got to thinking that I might be able to use Mahout to generate a
> > > document similarity matrix offline that could then be looked up in real
> > > time for serving.  Is this a reasonable use of Mahout?  If so, what
> > > functions will generate a document similarity matrix?  Also, I would
> > > like to be able to keep the text processing advantages provided through
> > > Lucene so it would help if I could still use my Lucene index.  If not,
> > > then could you recommend any alternative solutions please?
> > >
> > > Many thanks,
> > > Kris
> > >
> >
>
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
>
