Sorry, copied the previous link from the wrong tab :/  Meant to be
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html for reading
in lucene vectors.
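In case it's useful, here is a rough Java sketch of the same idea: pulling
term frequency vectors out of a Lucene index and turning them into Mahout
sparse vectors.  It is not the actual lucene.vector driver from that page;
the "text" field name, the on-the-fly dictionary and the missing output step
are assumptions, and the field has to have been indexed with term vectors
enabled.

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class LuceneVectorSketch {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    Map<String, Integer> dictionary = new HashMap<String, Integer>();

    for (int doc = 0; doc < reader.maxDoc(); doc++) {
      if (reader.isDeleted(doc)) {
        continue;
      }
      // requires the field to have been indexed with TermVector.YES
      TermFreqVector tfv = reader.getTermFreqVector(doc, "text");
      if (tfv == null) {
        continue;
      }
      String[] terms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      Vector v = new RandomAccessSparseVector(Integer.MAX_VALUE);
      for (int i = 0; i < terms.length; i++) {
        Integer idx = dictionary.get(terms[i]);
        if (idx == null) {
          idx = dictionary.size();
          dictionary.put(terms[i], idx);   // grow the term dictionary on the fly
        }
        v.setQuick(idx, freqs[i]);         // raw term frequency; TF-IDF weighting omitted
      }
      // ... write v to a SequenceFile keyed by the document id for the Mahout jobs ...
    }
    reader.close();
  }
}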



2010/6/8 Kris Jack <mrkrisj...@gmail.com>

> Hi Jake,
>
> Thanks for that.  The first solution that you suggest is more like what I
> was imagining.
>
> Please excuse me, I'm new to Mahout and don't know how to use it to
> generate the full document-document similarity matrix.  I would rather not
> have to re-implement the moreLikeThis algorithm, which, although rather
> straightforward, may take time for a newbie to MapReduce like me.  Could
> you guide me a little in finding the relevant Mahout code for generating
> the matrix, or is it not really designed for that?
>
> For the moment, I would be happy to have an off-line batch version
> working.  Also, it is desirable to take advantage of the text processing
> features that I have already configured using solr, so I would prefer to
> read in the feature vectors for the documents from a lucene index, as I am
> doing at present (e.g.
> http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
> ).
>
> Thanks,
> Kris
>
>
>
> 2010/6/8 Jake Mannix <jake.man...@gmail.com>
>
>> Hi Kris,
>>
>>  If you generate a full document-document similarity matrix offline and
>> then sparsify the rows (trim off all similarities below a threshold, or
>> only take the top N for each row, etc.), then encoding these values
>> directly in the index would indeed allow for *superfast* MoreLikeThis
>> functionality, because you've already computed all of the similar
>> results offline.
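>>  For what it's worth, here is a tiny sketch of what "encode these values
>> directly in the index" could look like at the Lucene level.  The
>> "similar_docs" field name and the "id:score" string encoding are made up
>> for illustration, and the offline job is assumed to have already produced
>> the top-N neighbour ids and scores for each document:
>>
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.document.Field;
>> import org.apache.lucene.index.IndexReader;
>>
>> public class PrecomputedNeighbours {
>>
>>   /** At (re)index time: attach the offline-computed top-N neighbours. */
>>   public static void addNeighbours(Document doc, int[] neighbourIds, float[] scores) {
>>     StringBuilder encoded = new StringBuilder();
>>     for (int i = 0; i < neighbourIds.length; i++) {
>>       encoded.append(neighbourIds[i]).append(':').append(scores[i]).append(' ');
>>     }
>>     // stored but not indexed: it is only ever read back, never searched on
>>     doc.add(new Field("similar_docs", encoded.toString().trim(),
>>                       Field.Store.YES, Field.Index.NO));
>>   }
>>
>>   /** At query time: "MoreLikeThis" becomes a single stored-field read. */
>>   public static String[] neighboursOf(IndexReader reader, int docId) throws Exception {
>>     Document doc = reader.document(docId);
>>     String encoded = doc.get("similar_docs");
>>     return encoded == null ? new String[0] : encoded.split(" ");
>>   }
>> }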
>>
>>  The only downside is that it won't apply to newly indexed documents.
>> If your indexing setup is such that you don't fold in new documents live,
>> but do so in batch, then this should be fine.
>>
>>  An alternative is to use something like a Locality-Sensitive Hash
>> (something one of my co-workers is writing a nice implementation of
>> right now, and I'm going to get him to contribute it once it's fully
>> tested) to reduce the search space (as a lucene Filter) and speed up the
>> query.
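>>  Just to make the idea concrete, here is a very rough sketch of
>> random-hyperplane LSH (this is not the implementation I mentioned): each
>> document vector is reduced to a small bit signature, documents sharing a
>> signature fall into the same bucket, and the bucket id can be indexed as
>> a keyword field and used as a Filter at query time.  The bit count and
>> the dense double[] input are arbitrary choices here:
>>
>> import java.util.Random;
>>
>> public class RandomHyperplaneLsh {
>>   private final double[][] hyperplanes;   // one random direction per signature bit
>>
>>   public RandomHyperplaneLsh(int numBits, int dimensions, long seed) {
>>     // numBits is assumed to be <= 64 so the signature fits in a long
>>     Random rng = new Random(seed);
>>     hyperplanes = new double[numBits][dimensions];
>>     for (int b = 0; b < numBits; b++) {
>>       for (int d = 0; d < dimensions; d++) {
>>         hyperplanes[b][d] = rng.nextGaussian();
>>       }
>>     }
>>   }
>>
>>   /** Bit b is set iff the vector lies on the positive side of hyperplane b. */
>>   public long signature(double[] vector) {
>>     long sig = 0L;
>>     for (int b = 0; b < hyperplanes.length; b++) {
>>       double dot = 0.0;
>>       for (int d = 0; d < vector.length; d++) {
>>         dot += hyperplanes[b][d] * vector[d];
>>       }
>>       if (dot >= 0.0) {
>>         sig |= 1L << b;
>>       }
>>     }
>>     return sig;
>>   }
>> }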
>>
>>  -jake
>>
>> On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <mrkrisj...@gmail.com> wrote:
>>
>> > Hi Olivier,
>> >
>> > Thanks for your suggestions.  I have over 10 million documents and
>> > they have quite a lot of meta-data associated with them including
>> > rather large text fields.  It is possible to tweak the moreLikeThis
>> > function from solr.  I have tried changing the parameters
>> > (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get
>> > results in under 300ms without sacrificing the quality of the results
>> > too much.
>> >
>> > I suspect that there would be gains to be made from reducing the
>> > dimensionality of the feature vectors before indexing with lucene so I
>> > may give that a try.  I'll keep you posted if I come up with other
>> > solutions.
>> >
>> > Thanks,
>> > Kris
>> >
>> >
>> >
>> > 2010/6/8 Olivier Grisel <olivier.gri...@ensta.org>
>> >
>> > > 2010/6/8 Kris Jack <mrkrisj...@gmail.com>:
>> > > > Hi everyone,
>> > > >
>> > > > I currently use lucene's moreLikeThis function through solr to find
>> > > > documents that are related to one another.  A single call, however,
>> > > > takes around 4 seconds to complete and I would like to reduce this.
>> > > > I got to thinking that I might be able to use Mahout to generate a
>> > > > document similarity matrix offline that could then be looked up in
>> > > > real time for serving.  Is this a reasonable use of Mahout?  If so,
>> > > > what functions will generate a document similarity matrix?  Also, I
>> > > > would like to be able to keep the text processing advantages
>> > > > provided through lucene, so it would help if I could still use my
>> > > > lucene index.  If not, then could you recommend any alternative
>> > > > solutions please?
>> > >
>> > > How many documents do you have in your index? Have you tried to tweak
>> > > the MoreLikeThis parameters? (I don't know if it's possible through the
>> > > solr interface; I use it directly via the lucene java API.)
>> > >
>> > > For instance, you can trade off recall for speed by decreasing the
>> > > number of terms to use in the query, and trade recall for precision and
>> > > speed by increasing the percentage of terms that should match.
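>> > >
>> > > Something along these lines with the raw lucene MoreLikeThis class
>> > > (from contrib/queries); the field name, the concrete numbers and the
>> > > 30% figure are only examples to tune against your own latency budget:
>> > >
>> > > import org.apache.lucene.index.IndexReader;
>> > > import org.apache.lucene.search.BooleanQuery;
>> > > import org.apache.lucene.search.IndexSearcher;
>> > > import org.apache.lucene.search.Query;
>> > > import org.apache.lucene.search.TopDocs;
>> > > import org.apache.lucene.search.similar.MoreLikeThis;
>> > >
>> > > public class TunedMoreLikeThis {
>> > >   public static TopDocs related(IndexReader reader, IndexSearcher searcher,
>> > >                                 int docId) throws Exception {
>> > >     MoreLikeThis mlt = new MoreLikeThis(reader);
>> > >     mlt.setFieldNames(new String[] {"text"});
>> > >     mlt.setMaxQueryTerms(15);   // fewer query terms: faster, lower recall
>> > >     mlt.setMinTermFreq(2);      // ignore terms that are rare in the source doc
>> > >     mlt.setMinDocFreq(5);       // ignore terms that are rare in the corpus
>> > >
>> > >     Query query = mlt.like(docId);
>> > >     if (query instanceof BooleanQuery) {
>> > >       BooleanQuery bq = (BooleanQuery) query;
>> > >       // require roughly 30% of the extracted terms to match:
>> > >       // trades recall for precision and speed
>> > >       bq.setMinimumNumberShouldMatch((int) (bq.clauses().size() * 0.3));
>> > >     }
>> > >     return searcher.search(query, 20);
>> > >   }
>> > > }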
>> > >
>> > > You could also use Mahout's implementation of SVD to build
>> > > low-dimensional semantic vectors representing your documents (a.k.a.
>> > > Latent Semantic Indexing) and then index those transformed frequency
>> > > vectors in a dedicated lucene index (or document field, provided you
>> > > name the resulting terms with something that does not match real-life
>> > > terms present in the other fields). However, using standard SVD will
>> > > probably result in dense (as opposed to sparse) low-dimensional
>> > > semantic vectors. I don't think lucene's lookup performance is good
>> > > with dense frequency vectors, even though the number of terms is
>> > > greatly reduced by SVD. Hence it would probably be better to either
>> > > keep only the top 100 absolute values of each semantic vector before
>> > > indexing (probably the simpler solution) or use a sparsifying,
>> > > penalty-constrained variant of SVD / LSI. You should have a look at
>> > > the literature on sparse coding or sparse dictionary learning,
>> > > Sparse-PCA and, more generally, L1-penalized regression methods such
>> > > as the Lasso and LARS. I don't know of any library for sparse semantic
>> > > coding of documents that works automatically with lucene; probably
>> > > some non-trivial coding is needed there.
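>> > >
>> > > As a rough illustration of the "top 100 absolute values" route (plain
>> > > Java, nothing Mahout-specific; k = 100 and the dense double[] produced
>> > > by the SVD are assumptions):
>> > >
>> > > import java.util.Arrays;
>> > >
>> > > public class SemanticVectorSparsifier {
>> > >
>> > >   /** Keeps only the k components with the largest absolute value. */
>> > >   public static void keepTopKByMagnitude(double[] semanticVector, int k) {
>> > >     if (k >= semanticVector.length) {
>> > >       return;
>> > >     }
>> > >     double[] magnitudes = new double[semanticVector.length];
>> > >     for (int i = 0; i < semanticVector.length; i++) {
>> > >       magnitudes[i] = Math.abs(semanticVector[i]);
>> > >     }
>> > >     Arrays.sort(magnitudes);
>> > >     // smallest magnitude that survives the cut (ties may keep a few extra)
>> > >     double cutoff = magnitudes[magnitudes.length - k];
>> > >     for (int i = 0; i < semanticVector.length; i++) {
>> > >       if (Math.abs(semanticVector[i]) < cutoff) {
>> > >         semanticVector[i] = 0.0;
>> > >       }
>> > >     }
>> > >   }
>> > > }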
>> > >
>> > > Another alternative is to find low-dimensional (64 or 32 components)
>> > > dense codes, binarize them by thresholding, store the integer codes
>> > > in the DB or the lucene index, and then build smart exact-match
>> > > queries to find all documents lying in the hamming ball of radius 1
>> > > or 2 around the reference document's binary code. But I think this
>> > > approach, while promising for web-scale document collections, is even
>> > > more experimental and requires very good low-dimensional encoders (I
>> > > don't think linear models such as SVD are good enough for reducing
>> > > sparse 10e6-component vectors to dense 64-component vectors;
>> > > non-linear encoders such as Stacked Restricted Boltzmann Machines are
>> > > probably a better choice).
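>> > >
>> > > For the 64-bit case, the hamming-ball query itself is cheap to build:
>> > > enumerate the reference code plus its 64 single-bit flips and OR exact
>> > > matches together (radius 2 already means roughly 2000 clauses).  Just
>> > > a sketch; the "code64" keyword field and the decimal string encoding
>> > > are made up:
>> > >
>> > > import java.util.ArrayList;
>> > > import java.util.List;
>> > >
>> > > import org.apache.lucene.index.Term;
>> > > import org.apache.lucene.search.BooleanClause;
>> > > import org.apache.lucene.search.BooleanQuery;
>> > > import org.apache.lucene.search.TermQuery;
>> > >
>> > > public class HammingBallQuery {
>> > >
>> > >   /** OR query matching every document whose code is within hamming distance 1. */
>> > >   public static BooleanQuery withinDistanceOne(long referenceCode) {
>> > >     List<Long> candidates = new ArrayList<Long>();
>> > >     candidates.add(referenceCode);                  // distance 0
>> > >     for (int bit = 0; bit < 64; bit++) {
>> > >       candidates.add(referenceCode ^ (1L << bit));  // flip one bit: distance 1
>> > >     }
>> > >     BooleanQuery query = new BooleanQuery();
>> > >     for (long code : candidates) {
>> > >       query.add(new TermQuery(new Term("code64", Long.toString(code))),
>> > >                 BooleanClause.Occur.SHOULD);
>> > >     }
>> > >     return query;
>> > >   }
>> > > }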
>> > >
>> > > In any case, let us know about your results; I am really interested in
>> > > practical yet scalable solutions to this problem.
>> > >
>> > > --
>> > > Olivier
>> > > http://twitter.com/ogrisel - http://github.com/ogrisel
>> > >
>> >
>> >
>> >
>> > --
>> > Dr Kris Jack,
>> > http://www.mendeley.com/profiles/kris-jack/
>> >
>>
>
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
>



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
