Hi,

Thanks very much for the code.  In implementing it, I got a little stuck on
the line that creates the JobConf for the similarity job:

 JobConf conf = new JobConf("similarity job");

I assume that this is where I should define how I would like two vectors to
be compared with one another?  Please do correct me if that's wrong.  If it
is the right place, could you point me to any examples of what this code
should look like (e.g. cosine similarity)?  I'm sure that these kinds of
similarity jobs must already exist in Mahout, but being new to both Mahout
and MapReduce, I'm not sure where to find them.
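
For what it's worth, the kind of comparison I have in mind is plain cosine
similarity.  This is only my naive guess at what such a comparison would look
like between two Mahout vectors (not code from your example, and I may well
be misunderstanding where it belongs):

 // untested sketch, not existing Mahout code
 import org.apache.mahout.math.Vector;

 public class CosineSketch {
   // cosine similarity: dot product divided by the product of the L2 norms
   public static double cosine(Vector a, Vector b) {
     double denominator = a.norm(2) * b.norm(2);
     return denominator == 0.0 ? 0.0 : a.dot(b) / denominator;
   }
 }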

Thanks,
Kris



2010/6/8 Jake Mannix <jake.man...@gmail.com>

> Hi Kris,
>
>  So you already know how to make a sparse feature matrix out of
> your Solr index, based on Grant's instructions?  Once you have that
> matrix loaded into HDFS, then the following Java code should
> create a document-document similarity matrix:
>
> //
>  String inputPath = "/path/to/matrix/on/hdfs";
>  String tmpPath = "/tmp/matrixmultiplyspace";
>  int numDocuments = // whatever your numDocuments is
>  int numTerms = // total number of terms in the matrix
>
>  DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
>    tmpPath, numDocuments, numTerms);
>  JobConf conf = new JobConf("similarity job");
>  text.configure(conf);
>
>  DistributedRowMatrix transpose = text.transpose();
>
>  DistributedRowMatrix similarity = transpose.times(transpose);
>  System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> //
>
> Now, the rows of this similarity matrix are going to be way too dense, so
> you'll want to write a small map-reduce job (well, no reduce is necessary)
> to run over the matrix and trim out all of the less useful entries in each
> row, but that shouldn't be too hard to do.
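>
> Something like the following untested, map-only sketch is what I have in
> mind (all class, field, and variable names here are made up, and N = 50 is
> arbitrary): read the similarity matrix rows as <IntWritable, VectorWritable>
> pairs and keep only the top N entries of each row:
>
>  // untested sketch, not existing Mahout code
>  import java.io.IOException;
>  import java.util.Comparator;
>  import java.util.Iterator;
>  import java.util.PriorityQueue;
>
>  import org.apache.hadoop.io.IntWritable;
>  import org.apache.hadoop.mapred.MapReduceBase;
>  import org.apache.hadoop.mapred.Mapper;
>  import org.apache.hadoop.mapred.OutputCollector;
>  import org.apache.hadoop.mapred.Reporter;
>  import org.apache.mahout.math.RandomAccessSparseVector;
>  import org.apache.mahout.math.Vector;
>  import org.apache.mahout.math.VectorWritable;
>
>  public class TopNSimilarityMapper extends MapReduceBase
>      implements Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
>
>    private static final int N = 50; // how many similar docs to keep per row
>
>    // copy of (index, value), since Mahout's iterators may reuse Element objects
>    private static final class Entry {
>      final int index;
>      final double value;
>      Entry(int index, double value) { this.index = index; this.value = value; }
>    }
>
>    public void map(IntWritable row, VectorWritable value,
>                    OutputCollector<IntWritable, VectorWritable> out,
>                    Reporter reporter) throws IOException {
>      Vector denseRow = value.get();
>      // min-heap holding the N largest similarity values seen so far
>      PriorityQueue<Entry> topN = new PriorityQueue<Entry>(N, new Comparator<Entry>() {
>        public int compare(Entry a, Entry b) { return Double.compare(a.value, b.value); }
>      });
>      Iterator<Vector.Element> it = denseRow.iterateNonZero();
>      while (it.hasNext()) {
>        Vector.Element e = it.next();
>        topN.add(new Entry(e.index(), e.get()));
>        if (topN.size() > N) {
>          topN.poll(); // evict the smallest of the current candidates
>        }
>      }
>      Vector sparseRow = new RandomAccessSparseVector(denseRow.size());
>      for (Entry e : topN) {
>        sparseRow.setQuick(e.index, e.value);
>      }
>      out.collect(row, new VectorWritable(sparseRow));
>    }
>  }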
>
> Of course, to do it really efficiently, that trimming could be folded
> into the reducer of the matrix multiply job and done in the same pass over
> the data.
>
>  -jake
>
>
>
> On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack <mrkrisj...@gmail.com> wrote:
>
> > Hi Jake,
> >
> > Thanks for that.  The first solution that you suggest is more like what I
> > was imagining.
> >
> > Please excuse me, I'm new to Mahout and don't know how to use it to
> > generate the full document-document similarity matrix.  I would rather
> > not have to re-implement the moreLikeThis algorithm which, although
> > rather straightforward, may take time for a newbie to MapReduce like me.
> > Could you guide me a little in finding the relevant Mahout code for
> > generating the matrix, or is it not really designed for that?
> >
> > For the moment, I would be happy to have an off-line batch version
> > working.  Also, it is desirable to take advantage of the text processing
> > features that I have already configured using solr, so I would prefer to
> > read in the feature vectors for the documents from a lucene index, as I
> > am doing at present (e.g.
> > http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
> > ).
> >
> > Thanks,
> > Kris
> >
> >
> >
> > 2010/6/8 Jake Mannix <jake.man...@gmail.com>
> >
> > > Hi Kris,
> > >
> > >  If you generate a full document-document similarity matrix offline
> > > and then sparsify the rows (trim off all similarities below a
> > > threshold, or only take the top N for each row, etc.), then encoding
> > > these values directly in the index would indeed allow for *superfast*
> > > MoreLikeThis functionality, because you've already computed all
> > > of the similar results offline.
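> > >
> > >  As a rough, untested sketch of what "encoding these values directly in
> > > the index" could mean with the plain Lucene API (the field name and
> > > helper class here are just placeholders, not anything that exists in
> > > Mahout or Solr):
> > >
> > >  // untested sketch, not existing Mahout or Solr code
> > >  import java.io.IOException;
> > >  import org.apache.lucene.document.Document;
> > >  import org.apache.lucene.document.Field;
> > >  import org.apache.lucene.index.IndexReader;
> > >
> > >  public class SimilarDocsField {
> > >
> > >    // at (re)index time: store the precomputed neighbour ids on the doc
> > >    public static void addSimilarDocs(Document doc, int[] topNeighbors) {
> > >      for (int neighborId : topNeighbors) {
> > >        doc.add(new Field("similar_docs", Integer.toString(neighborId),
> > >                          Field.Store.YES, Field.Index.NOT_ANALYZED));
> > >      }
> > >    }
> > >
> > >    // at query time, "more like this" becomes a stored-field lookup
> > >    public static String[] similarDocs(IndexReader reader, int docId)
> > >        throws IOException {
> > >      return reader.document(docId).getValues("similar_docs");
> > >    }
> > >  }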
> > >
> > >  The only downside is that it won't apply to newly indexed documents.
> > > If your indexing setup is such that you don't fold in new documents
> > > live, but do so in batch, then this should be fine.
> > >
> > >  An alternative is to use something like a Locality Sensitive Hash
> > > (something one of my co-workers is writing up a nice implementation
> > > of now, and I'm going to get him to contribute it once it's fully
> > > tested), to reduce the search space (as a lucene Filter) and speed up
> > > the query.
> > >
> > >  -jake
> > >
> > > On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <mrkrisj...@gmail.com> wrote:
> > >
> > > > Hi Olivier,
> > > >
> > > > Thanks for your suggestions.  I have over 10 million documents and
> > > > they have quite a lot of meta-data associated with them including
> > > > rather large text fields.  It is possible to tweak the moreLikeThis
> > > > function from solr.  I have tried changing the parameters
> > > > (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get
> > > > results in under 300ms without sacrificing the quality of the results
> > > > too much.
> > > >
> > > > I suspect that there would be gains to be made from reducing the
> > > > dimensionality of the feature vectors before indexing with lucene so
> > > > I may give that a try.  I'll keep you posted if I come up with other
> > > > solutions.
> > > >
> > > > Thanks,
> > > > Kris
> > > >
> > > >
> > > >
> > > > 2010/6/8 Olivier Grisel <olivier.gri...@ensta.org>
> > > >
> > > > > 2010/6/8 Kris Jack <mrkrisj...@gmail.com>:
> > > > > > Hi everyone,
> > > > > >
> > > > > > I currently use lucene's moreLikeThis function through solr to
> > > > > > find documents that are related to one another.  A single call,
> > > > > > however, takes around 4 seconds to complete and I would like to
> > > > > > reduce this.  I got to thinking that I might be able to use Mahout
> > > > > > to generate a document similarity matrix offline that could then
> > > > > > be looked up in real time for serving.  Is this a reasonable use
> > > > > > of Mahout?  If so, what functions will generate a document
> > > > > > similarity matrix?  Also, I would like to be able to keep the text
> > > > > > processing advantages provided through lucene, so it would help if
> > > > > > I could still use my lucene index.  If not, then could you
> > > > > > recommend any alternative solutions please?
> > > > >
> > > > > How many documents do you have in your index? Have you tried to
> > > > > tweak the MoreLikeThis parameters? (I don't know if it's possible
> > > > > using the solr interface; I use it directly via the lucene java API.)
> > > > >
> > > > > For instance, you can trade off recall for speed by decreasing the
> > > > > number of terms to use in the query, and trade recall for precision
> > > > > and speed by increasing the percentage of terms that should match.
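> > > > >
> > > > > If you do end up calling it through the java API, here is a rough,
> > > > > untested sketch (method names from memory, so check them against
> > > > > your Lucene version; the field names and numbers are arbitrary):
> > > > >
> > > > >  // untested sketch against the contrib MoreLikeThis API
> > > > >  import java.io.IOException;
> > > > >  import org.apache.lucene.index.IndexReader;
> > > > >  import org.apache.lucene.search.IndexSearcher;
> > > > >  import org.apache.lucene.search.Query;
> > > > >  import org.apache.lucene.search.TopDocs;
> > > > >  import org.apache.lucene.search.similar.MoreLikeThis;
> > > > >
> > > > >  public class MltTuning {
> > > > >    public static TopDocs relatedDocs(IndexReader reader,
> > > > >        IndexSearcher searcher, int docId) throws IOException {
> > > > >      MoreLikeThis mlt = new MoreLikeThis(reader);
> > > > >      mlt.setFieldNames(new String[] {"title", "text"}); // your fields
> > > > >      mlt.setMaxQueryTerms(25); // fewer terms => faster, lower recall
> > > > >      mlt.setMinTermFreq(2);    // ignore terms rare within the source doc
> > > > >      mlt.setMinDocFreq(5);     // ignore terms rare in the whole index
> > > > >      Query query = mlt.like(docId);
> > > > >      // if I remember correctly, the "percentage of terms to match"
> > > > >      // knob lives on MoreLikeThisQuery.setPercentTermsToMatch(...)
> > > > >      return searcher.search(query, 10);
> > > > >    }
> > > > >  }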
> > > > >
> > > > > You could also use Mahout's implementation of SVD to build low
> > > > > dimensional semantic vectors representing your documents (a.k.a.
> > > > > Latent Semantic Indexing) and then index those transformed frequency
> > > > > vectors in a dedicated lucene index (or document field, provided you
> > > > > name the resulting terms with something that does not match real
> > > > > life terms present in other fields). However, using standard SVD
> > > > > will probably result in dense (as opposed to sparse) low dimensional
> > > > > semantic vectors. I don't think lucene's lookup performance is good
> > > > > with dense frequency vectors, even though the number of terms is
> > > > > greatly reduced by SVD. Hence it would probably be better either to
> > > > > threshold the top 100 absolute values of each semantic vector before
> > > > > indexing (probably the simpler solution) or to use a sparsifying,
> > > > > penalty-constrained variant of SVD / LSI. You should have a look at
> > > > > the literature on sparse coding or sparse dictionary learning,
> > > > > Sparse-PCA, and more generally L1 penalty regression methods such as
> > > > > the Lasso and LARS. I don't know of any library for sparse semantic
> > > > > coding of documents that works automatically with lucene; probably
> > > > > some non-trivial coding is needed there.
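> > > > >
> > > > > To make the thresholding option concrete, here is a rough, untested
> > > > > sketch (all names here are made up) that keeps only the k
> > > > > largest-magnitude components of a Mahout vector:
> > > > >
> > > > >  // untested sketch, not existing Mahout code
> > > > >  import java.util.ArrayList;
> > > > >  import java.util.Collections;
> > > > >  import java.util.Comparator;
> > > > >  import java.util.Iterator;
> > > > >  import java.util.List;
> > > > >  import org.apache.mahout.math.RandomAccessSparseVector;
> > > > >  import org.apache.mahout.math.Vector;
> > > > >
> > > > >  public class SemanticVectorSparsifier {
> > > > >
> > > > >    private static final class Component {
> > > > >      final int index;
> > > > >      final double value;
> > > > >      Component(int index, double value) { this.index = index; this.value = value; }
> > > > >    }
> > > > >
> > > > >    // keep only the k components with the largest absolute value
> > > > >    public static Vector keepTopByMagnitude(Vector dense, int k) {
> > > > >      List<Component> components = new ArrayList<Component>();
> > > > >      Iterator<Vector.Element> it = dense.iterateNonZero();
> > > > >      while (it.hasNext()) {
> > > > >        Vector.Element e = it.next();
> > > > >        components.add(new Component(e.index(), e.get()));
> > > > >      }
> > > > >      Collections.sort(components, new Comparator<Component>() {
> > > > >        public int compare(Component a, Component b) {
> > > > >          return Double.compare(Math.abs(b.value), Math.abs(a.value));
> > > > >        }
> > > > >      });
> > > > >      Vector sparse = new RandomAccessSparseVector(dense.size());
> > > > >      int keep = Math.min(k, components.size());
> > > > >      for (Component c : components.subList(0, keep)) {
> > > > >        sparse.setQuick(c.index, c.value);
> > > > >      }
> > > > >      return sparse;
> > > > >    }
> > > > >  }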
> > > > >
> > > > > Another alternative is finding low dimensional (64 or 32
> > > > > components) dense codes, binary thresholding them, storing the
> > > > > integer code in the DB or the lucene index, and then building smart
> > > > > exact match queries to find all documents lying in the hamming ball
> > > > > of size 1 or 2 of the reference document's binary code. But I think
> > > > > this approach, while promising for web scale document collections,
> > > > > is even more experimental and requires very good low dim code
> > > > > encoders (I don't think linear models such as SVD are good enough
> > > > > for reducing sparse 10e6 components vectors to dense 64 components
> > > > > vectors; non linear encoders such as Stacked Restricted Boltzmann
> > > > > Machines are probably a better choice).
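> > > > >
> > > > > Purely as an illustration of the binary code part (an untested,
> > > > > plain-Java sketch with made up names, nothing from Mahout),
> > > > > sign-thresholding a 64-component code into a long and enumerating
> > > > > the radius-1 hamming ball could look like:
> > > > >
> > > > >  // untested sketch, not existing library code
> > > > >  import java.util.ArrayList;
> > > > >  import java.util.List;
> > > > >
> > > > >  public class BinaryCodes {
> > > > >
> > > > >    // sign-threshold a 64-component dense code into a 64-bit integer
> > > > >    public static long toBinaryCode(double[] denseCode) {
> > > > >      long code = 0L;
> > > > >      for (int i = 0; i < 64 && i < denseCode.length; i++) {
> > > > >        if (denseCode[i] > 0.0) {
> > > > >          code |= 1L << i;
> > > > >        }
> > > > >      }
> > > > >      return code;
> > > > >    }
> > > > >
> > > > >    // the reference code plus every single-bit flip (hamming ball of
> > > > >    // radius 1); radius 2 would add the 64 * 63 / 2 two-bit flips
> > > > >    public static List<Long> hammingBallRadiusOne(long code) {
> > > > >      List<Long> ball = new ArrayList<Long>();
> > > > >      ball.add(code);
> > > > >      for (int i = 0; i < 64; i++) {
> > > > >        ball.add(code ^ (1L << i));
> > > > >      }
> > > > >      return ball;
> > > > >    }
> > > > >  }
> > > > >
> > > > > Each code in the ball would then become one exact-match clause
> > > > > against the stored code field.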
> > > > >
> > > > > In any case, let us know about your results; I am really
> > > > > interested in practical yet scalable solutions to this problem.
> > > > >
> > > > > --
> > > > > Olivier
> > > > > http://twitter.com/ogrisel - http://github.com/ogrisel
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Dr Kris Jack,
> > > > http://www.mendeley.com/profiles/kris-jack/
> > > >
> > >
> >
> >
> >
> > --
> > Dr Kris Jack,
> > http://www.mendeley.com/profiles/kris-jack/
> >
>



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
