Hi Kris,

  So you already know how to make a sparse feature matrix out of
your Solr index, based on Grant's instructions?  Once you have that
matrix loaded into HDFS, the following Java code should create a
document-document similarity matrix:

//
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.mahout.math.hadoop.DistributedRowMatrix;

  String inputPath = "/path/to/matrix/on/hdfs";
  String tmpPath = "/tmp/matrixmultiplyspace";
  int numDocuments = // whatever your numDocuments is
  int numTerms = // total number of terms in the matrix

  DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
    tmpPath, numDocuments, numTerms);
  JobConf conf = new JobConf();
  conf.setJobName("similarity job");
  text.configure(conf);

  // transpose is (numTerms x numDocuments)
  DistributedRowMatrix transpose = text.transpose();

  // this step produces the (numDocuments x numDocuments)
  // document-document similarity matrix
  DistributedRowMatrix similarity = transpose.times(transpose);
  System.out.println("Similarity matrix lives at: " + similarity.getRowPath());
//

Now, the rows of this similarity matrix are going to be way too dense, so
you'll want to write a small map-reduce job (well, no reduce is necessary)
to run over the matrix and trim out all of the low-value entries in each
row, but that shouldn't be too hard to do.
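
To make that concrete, here's a rough, untested sketch of the kind of
map-only pass I mean.  It assumes the similarity rows sit on disk the way
DistributedRowMatrix normally keeps them (a SequenceFile of IntWritable row
ids to VectorWritable rows), and the class name and the top-N cutoff are
just placeholders to adjust:

//
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.Iterator;
  import java.util.List;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  public class TopNPerRowMapper extends MapReduceBase
      implements Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    private static final int N = 50;  // how many similar docs to keep per row

    public void map(IntWritable row, VectorWritable value,
                    OutputCollector<IntWritable, VectorWritable> out,
                    Reporter reporter) throws IOException {
      out.collect(row, new VectorWritable(keepTopN(value.get(), N, row.get())));
    }

    // Keep only the n largest entries of v; selfIndex (the diagonal, i.e. the
    // doc's similarity to itself) is dropped.
    static Vector keepTopN(Vector v, int n, int selfIndex) {
      List<double[]> entries = new ArrayList<double[]>();
      Iterator<Vector.Element> it = v.iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        if (e.index() != selfIndex) {
          entries.add(new double[] { e.index(), e.get() });
        }
      }
      // sort by similarity value, descending, and keep only the top n
      Collections.sort(entries, new Comparator<double[]>() {
        public int compare(double[] a, double[] b) {
          return Double.compare(b[1], a[1]);
        }
      });
      Vector trimmed = new RandomAccessSparseVector(v.size());
      for (int i = 0; i < Math.min(n, entries.size()); i++) {
        trimmed.set((int) entries.get(i)[0], entries.get(i)[1]);
      }
      return trimmed;
    }
  }
//

Run that as a map-only job (conf.setNumReduceTasks(0)) with
SequenceFileInputFormat over similarity.getRowPath() and you'll get the
sparsified rows back out as another SequenceFile.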

Of course, to do it really efficiently, that trimming could be folded
into the reducer of the matrix multiply job, so that it happens in the same
pass over the data.
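
If you did want to fold it in, the reduce side might look roughly like the
following.  Again, this is just an untested sketch: it assumes the multiply
job's reducer sees partial row vectors keyed by row index (which is how I'd
expect an outer-product style multiply to behave), and it reuses the
keepTopN() helper from the mapper sketch above:

//
  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  public class SumAndTrimReducer extends MapReduceBase
      implements Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    private static final int N = 50;  // same top-N cutoff as in the mapper sketch

    public void reduce(IntWritable row, Iterator<VectorWritable> partials,
                       OutputCollector<IntWritable, VectorWritable> out,
                       Reporter reporter) throws IOException {
      // sum up the partial outer-product rows for this document...
      Vector sum = null;
      while (partials.hasNext()) {
        Vector partial = partials.next().get();
        sum = (sum == null)
            ? new RandomAccessSparseVector(partial)  // copy, don't alias the reused writable
            : sum.plus(partial);
      }
      if (sum == null) {
        return;  // nothing came in for this row
      }
      // ...then prune before writing, so the trimming happens in the same pass
      out.collect(row, new VectorWritable(TopNPerRowMapper.keepTopN(sum, N, row.get())));
    }
  }
//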

  -jake



On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack <mrkrisj...@gmail.com> wrote:

> Hi Jake,
>
> Thanks for that.  The first solution that you suggest is more like what I
> was imagining.
>
> Please excuse me, I'm new to Mahout and don't know how to use it to
> generate the full document-document similarity matrix.  I would rather
> not have to re-implement the moreLikeThis algorithm which, although
> rather straightforward, may take time for a newbie to MapReduce like me.
> Could you guide me a little in finding the relevant Mahout code for
> generating the matrix, or is it not really designed for that?
>
> For the moment, I would be happy to have an off-line batch version
> working.  Also, it is desirable to take advantage of the text processing
> features that I have already configured using solr, so I would prefer to
> read in the feature vectors for the documents from a lucene index, as I
> am doing at present (e.g.
> http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
> ).
>
> Thanks,
> Kris
>
>
>
> 2010/6/8 Jake Mannix <jake.man...@gmail.com>
>
> > Hi Kris,
> >
> >  If you generate a full document-document similarity matrix offline,
> > and then make sure to sparsify the rows (trim off all similarities
> > below a threshold, or only take the top N for each row, etc...), then
> > encoding these values directly in the index would indeed allow for
> > *superfast* MoreLikeThis functionality, because you've already
> > computed all of the similar results offline.
> >
> >  The only downside is that it won't apply to newly indexed documents.
> > If your indexing setup is such that you don't fold in new documents live,
> > but do so in batch, then this should be fine.
> >
> >  An alternative is to use something like a Locality Sensitive Hash
> > (something one of my co-workers is writing up a nice implementation
> > of now, and I'm going to get him to contribute it once it's fully
> > tested), to reduce the search space (as a lucene Filter) and speed
> > up the query.
> >
> >  -jake
> >
> > On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <mrkrisj...@gmail.com> wrote:
> >
> > > Hi Olivier,
> > >
> > > Thanks for your suggestions.  I have over 10 million documents and
> > > they have quite a lot of meta-data associated with them, including
> > > rather large text fields.  It is possible to tweak the moreLikeThis
> > > function from solr.  I have tried changing the parameters
> > > (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get
> > > results in under 300ms without sacrificing the quality of the results
> > > too much.
> > >
> > > I suspect that there would be gains to be made from reducing the
> > > dimensionality of the feature vectors before indexing with lucene, so
> > > I may give that a try.  I'll keep you posted if I come up with other
> > > solutions.
> > >
> > > Thanks,
> > > Kris
> > >
> > >
> > >
> > > 2010/6/8 Olivier Grisel <olivier.gri...@ensta.org>
> > >
> > > > 2010/6/8 Kris Jack <mrkrisj...@gmail.com>:
> > > > > Hi everyone,
> > > > >
> > > > > I currently use lucene's moreLikeThis function through solr to
> > > > > find documents that are related to one another.  A single call,
> > > > > however, takes around 4 seconds to complete and I would like to
> > > > > reduce this.  I got to thinking that I might be able to use Mahout
> > > > > to generate a document similarity matrix offline that could then
> > > > > be looked up in real time for serving.  Is this a reasonable use
> > > > > of Mahout?  If so, what functions will generate a document
> > > > > similarity matrix?  Also, I would like to be able to keep the text
> > > > > processing advantages provided through lucene, so it would help if
> > > > > I could still use my lucene index.  If not, then could you
> > > > > recommend any alternative solutions please?
> > > >
> > > > How many documents do you have in your index? Have you tried to
> > > > tweak the MoreLikeThis parameters? (I don't know if it's possible
> > > > using the solr interface; I use it directly using the lucene java
> > > > API.)
> > > >
> > > > For instance, you can trade off recall for speed by decreasing the
> > > > number of terms to use in the query, and trade recall for precision
> > > > and speed by increasing the percentage of terms that should match.
> > > >
> > > > You could also use Mahout's implementation of SVD to build low
> > > > dimensional semantic vectors representing your documents (a.k.a.
> > > > Latent Semantic Indexing) and then index those transformed frequency
> > > > vectors in a dedicated lucene index (or document field, provided you
> > > > name the resulting terms with something that does not match real
> > > > life terms present in other fields). However, using standard SVD
> > > > will probably result in dense (as opposed to sparse) low dimensional
> > > > semantic vectors. I don't think lucene's lookup performance is good
> > > > with dense frequency vectors, even though the number of terms is
> > > > greatly reduced by SVD. Hence it would probably be better to either
> > > > threshold the top 100 absolute values of each semantic vector before
> > > > indexing (probably the simpler solution) or use a sparsifying,
> > > > penalty-constrained variant of SVD / LSI. You should have a look at
> > > > the literature on sparse coding or sparse dictionary learning,
> > > > Sparse-PCA and, more generally, L1 penalty regression methods such
> > > > as the Lasso and LARS. I don't know about any library for sparse
> > > > semantic coding of documents that works automatically with lucene.
> > > > Probably some non-trivial coding is needed there.
> > > >
> > > > Another alternative is finding low dimensional (64 or 32 component)
> > > > dense codes, binary-thresholding them, storing the integer codes in
> > > > the DB or the lucene index, and then building smart exact-match
> > > > queries to find all documents lying in the hamming ball of size 1 or
> > > > 2 around the reference document's binary code. But I think this
> > > > approach, while promising for web scale document collections, is
> > > > even more experimental and requires very good low dimensional
> > > > encoders (I don't think linear models such as SVD are good enough
> > > > for reducing sparse 10e6-component vectors to dense 64-component
> > > > vectors; non linear encoders such as Stacked Restricted Boltzmann
> > > > Machines are probably a better choice).
> > > >
> > > > In any case, let us know about your results; I am really interested
> > > > in practical yet scalable solutions to this problem.
> > > >
> > > > --
> > > > Olivier
> > > > http://twitter.com/ogrisel - http://github.com/ogrisel
> > > >
> > >
> > >
> > >
> > > --
> > > Dr Kris Jack,
> > > http://www.mendeley.com/profiles/kris-jack/
> > >
> >
>
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
>
