LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Chris Hokamp
Hi all, I'm trying to implement Latent Semantic Indexing using the mahout ssvd tool, and I'm having trouble understanding how I can use the output of Mahout ssvd to 'fold' new queries (documents) into the LSI space. Specifically, I can't find a way to multiply a vector representing a query

Re: LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Sean Owen
Chris Hokamp chris.hok...@gmail.com wrote: Hi all, I'm trying to implement Latent Semantic Indexing using the mahout ssvd tool, and I'm having trouble understanding how I can use the output of Mahout ssvd to 'fold' new queries (documents) into the LSI space. Specifically, I can't find

Re: LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Chris Hokamp
Chris Hokamp chris.hok...@gmail.com wrote: Hi all, I'm trying to implement Latent Semantic Indexing using the mahout ssvd tool, and I'm having trouble understanding how I can use the output of Mahout ssvd to 'fold' new queries (documents) into the LSI space. Specifically, I can't

Re: LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Dmitriy Lyubimov
Mahout to 'fold' new queries (documents) into the LSI space. Specifically, I can't find a way to multiply a vector representing a query by the inverse of the matrix of singular values - I can't find a way to solve for the inverse of the diagonal matrix of singular values. I can

Re: LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Sean Owen
Latent Semantic Indexing using the mahout ssvd tool, and I'm having trouble understanding how I can use the output of Mahout ssvd to 'fold' new queries (documents) into the LSI space. Specifically, I can't find a way to multiply a vector representing a query by the inverse of the matrix

Re: LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Dmitriy Lyubimov
Chris Hokamp chris.hok...@gmail.com wrote: Hi all, I'm trying to implement Latent Semantic Indexing using the mahout ssvd tool, and I'm having trouble understanding how I can use the output of Mahout ssvd to 'fold' new queries (documents) into the LSI space

Re: LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Chris Hokamp
into the LSI space. Specifically, I can't find a way to multiply a vector representing a query by the inverse of the matrix of singular values - I can't find a way to solve for the inverse of the diagonal matrix of singular values. I can generate the output matrices using ssvd
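
What this thread is circling around is the standard LSI fold-in formula: with a rank-k decomposition A ~= U * Sigma * V^T (documents as rows, terms as columns), a new query or document vector q in term space is projected as qHat = q^T * V * Sigma^(-1). Because Sigma is diagonal, "multiplying by its inverse" never needs a solver: you just divide by each singular value. The sketch below shows that computation with plain Java arrays; the docs-as-rows convention and the array layout are illustrative assumptions, not a description of the ssvd output format.

    // Minimal sketch of the fold-in math discussed above, using plain Java arrays
    // rather than Mahout's matrix classes. Assumes documents were rows of the
    // ssvd input, so v is (numTerms x k) and sigma holds the k singular values.
    public final class LsiFoldIn {

      /**
       * Projects a term-space query/document vector into the k-dimensional LSI
       * space: qHat = q^T * V * Sigma^(-1). Sigma is diagonal, so its "inverse"
       * is just the reciprocal of each singular value.
       */
      public static double[] foldIn(double[] q, double[][] v, double[] sigma) {
        int k = sigma.length;
        double[] qHat = new double[k];
        for (int j = 0; j < k; j++) {
          double sum = 0.0;
          for (int t = 0; t < q.length; t++) {
            sum += q[t] * v[t][j];   // (q^T * V), column j
          }
          qHat[j] = sum / sigma[j];  // ... * Sigma^(-1)
        }
        return qHat;
      }
    }

The folded-in query can then be compared (e.g. by cosine) against the document rows of U produced by ssvd; conventions differ on whether those rows are also scaled by Sigma first.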

Re: lsi

2011-11-17 Thread Grant Ingersoll
I've never implemented LSI. Is there a way to incrementally build the model (by simply indexing documents) or is it something that one only runs after the fact once one has built up the much bigger matrix? If it's the former, I bet it wouldn't be that hard to just implement the appropriate

Re: lsi

2011-11-17 Thread Ted Dunning
implemented LSI. Is there a way to incrementally build the model (by simply indexing documents) or is it something that one only runs after the fact once one has built up the much bigger matrix? If it's the former, I bet it wouldn't be that hard to just implement the appropriate new codecs

Re: lsi

2011-11-17 Thread Dmitriy Lyubimov
never implemented LSI.  Is there a way to incrementally build the model (by simply indexing documents) or is it something that one only runs after the fact once one has built up the much bigger matrix?  If it's the former, I bet it wouldn't be that hard to just implement the appropriate new codecs

Re: lsi

2011-11-17 Thread Dmitriy Lyubimov
Grant Ingersoll gsing...@apache.org wrote: I've never implemented LSI.  Is there a way to incrementally build the model (by simply indexing documents) or is it something that one only runs after the fact once one has built up the much bigger matrix?  If it's the former, I bet

Re: lsi

2011-11-14 Thread Ken Krugler
you are *already* scoring, and add to that scoring function an LSI cosine as one feature among many. Hopefully it will improve precision, even if it will do nothing for recall (as it's only being applied to results already retrieved by the text query). Alternatively, to improve recall

Re: lsi

2011-11-14 Thread Dmitriy Lyubimov
index running out of memory can handle a LOT of documents and still meet the 10ms goal. This means that LSI as a raw search engine on large corpora is going to be less cost effective by a couple of orders of magnitude. That leaves lots of room to use synthetic tokens, query expansion

Re: lsi

2011-11-14 Thread Grant Ingersoll
Might be useful: https://github.com/algoriffic/lsa4solr Looks like it hasn't been kept up to date. On Nov 13, 2011, at 1:47 PM, Sebastian Schelter wrote: Is there some documentation/tutorial available on how to build an LSI pipeline with mahout and lucene? --sebastian

Re: lsi

2011-11-14 Thread Ted Dunning
off. The issue of memory bandwidth isn't getting better very quickly so there isn't much hope for making this better. A good inverted index running out of memory can handle a LOT of documents and still meet the 10ms goal. This means that LSI as a raw search engine on large corpora

lsi

2011-11-13 Thread Sebastian Schelter
Is there some documentation/tutorial available on how to build an LSI pipeline with mahout and lucene? --sebastian

Re: lsi

2011-11-13 Thread Ted Dunning
Essentially not. And I would worry about how to push the LSI vectors back into lucene in a coherent and usable way. On Sun, Nov 13, 2011 at 10:47 AM, Sebastian Schelter s...@apache.org wrote: Is there some documentation/tutorial available on how to build an LSI pipeline with mahout and lucene

Re: lsi

2011-11-13 Thread Jake Mannix
Store the vectors as binary payloads and keep the projection matrix in memory with the queryBuilder, to add an LSI cosine between query and doc as a scoring feature? On Nov 13, 2011 12:59 PM, Ted Dunning ted.dunn...@gmail.com wrote: Essentially not. And I would worry about how to push the LSI

Re: lsi

2011-11-13 Thread Dmitriy Lyubimov
than classic SVD-based LSI. On Nov 13, 2011 10:48 AM, Sebastian Schelter s...@apache.org wrote: Is there some documentation/tutorial available on how to build an LSI pipeline with mahout and lucene? --sebastian

Re: lsi

2011-11-13 Thread Jake Mannix
for added relevance: if you are already doing Lucene for your scoring needs, you already are getting some good precision and recall. The idea is this: you take results you are *already* scoring, and add to that scoring function an LSI cosine as one feature among many. Hopefully it will improve precision

Re: lsi

2011-11-13 Thread Lance Norskog
retrieval. It's not just projection, it's for added relevance: if you are already doing Lucene for your scoring needs, you already are getting some good precision and recall. The idea is this: you take results you are *already* scoring, and add to that scoring function an LSI cosine as one

Re: lsi

2011-11-13 Thread Ted Dunning
to that scoring function an LSI cosine as one feature among many. Hopefully it will improve precision, even if it will do nothing for recall (as it's only being applied to results already retrieved by the text query). I have done this with Lucene (some time ago) and had a hell of a time getting decent

Re: lsi

2011-11-13 Thread Jake Mannix
a *good* idea to use LSI in this way (or, in fact, to use LSI at all), just that if you *do* have a good scoring model (like some kind of strongly predictive static prior, like PageRank), then doing even fairly dumb recall-enhancing techniques can improve things quite nicely, and discretized LSI like
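
The recurring suggestion in this thread is not to replace the inverted index with LSI but to re-rank what Lucene already retrieves: keep the text-query score, compute the cosine between the folded-in query vector and each candidate document's precomputed LSI vector (stored e.g. as a binary payload, with the projection matrix held in memory), and add that cosine as one feature among many. A hedged sketch of the blending step follows; the linear combination and the lsiWeight parameter are illustrative assumptions, and in practice the combination would be tuned or learned rather than hard-coded.

    // Sketch of the re-ranking idea from this thread: Lucene's own score stays
    // the primary signal, and an LSI cosine between the projected query and the
    // projected document is blended in as one extra feature.
    public final class LsiRescorer {

      /** Cosine similarity between two dense LSI-space vectors. */
      static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          na  += a[i] * a[i];
          nb  += b[i] * b[i];
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
      }

      /** Blends the base text-query score with the LSI cosine feature. */
      static double rescore(double luceneScore, double[] queryLsi, double[] docLsi, double lsiWeight) {
        return luceneScore + lsiWeight * cosine(queryLsi, docLsi);
      }
    }

As Ted and Jake note above, this tends to help precision rather than recall, since it only touches documents the text query already returned.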

Re: seq2sparse and lsi fold-in

2011-01-06 Thread Jake Mannix
Dmitriy, I'm not sure if you figured this out on your own and I didn't see the email, but if not: On Thu, Dec 30, 2010 at 3:57 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Also, if i have a bunch of new documents to fold-in, it looks like i'd need to run a matrix multiplication job between

Re: seq2sparse and lsi fold-in

2011-01-06 Thread Dmitriy Lyubimov
Thank you, Jake. Yes, i have figured that, and it seems that DRM.times does just that. I was just not sure of the production quality of this code. It seems DRM experiences a lot of fixes and discussions lately, including simple multiplication. On a side note one needs to compute Cx V^t x

seq2sparse and lsi fold-in

2010-12-30 Thread Dmitriy Lyubimov
Hi, I would like to try LSI processing of results produced by seq2sparse. What's more, I need to be able to fold-in a bunch of new documents afterwards. Is there any support for fold-in indexing in Mahout? if not, is there a quick way for me to gain the understanding of seq2sparse output

Re: seq2sparse and lsi fold-in

2010-12-30 Thread Dmitriy Lyubimov
to try LSI processing of results produced by seq2sparse. What's more, I need to be able to fold-in a bunch of new documents afterwards. Is there any support for fold-in indexing in Mahout? if not, is there a quick way for me to gain the understanding of seq2sparse output? In particular, if i

Re: seq2sparse and lsi fold-in

2010-12-30 Thread Ted Dunning
in memory. On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, I would like to try LSI processing of results produced by seq2sparse. What's more, I need to be able to fold-in a bunch of new documents afterwards. Is there any support for fold-in indexing in Mahout

Re: seq2sparse and lsi fold-in

2010-12-30 Thread Ted Dunning
Hi, I would like to try LSI processing of results produced by seq2sparse. What's more, I need to be able to fold-in a bunch of new documents afterwards. Is there any support for fold-in indexing in Mahout? if not, is there a quick way for me to gain the understanding of seq2sparse output
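
Dmitriy's side note about the exact product to compute is cut off in the snippets above, but under the same docs-as-rows assumption as the earlier fold-in sketch, folding in a batch of new documents is just the single-vector projection applied to every new row: folded = C * V * Sigma^(-1), where C holds the new documents (term space) as rows. At scale this becomes the distributed matrix multiplication job discussed in the thread; the small in-memory version below is only a sketch, with the same illustrative assumptions about layout.

    // In-memory sketch of batch fold-in: projects each row of c (new documents
    // in term space) to c * V * Sigma^(-1) under the docs-as-rows assumption.
    // For large batches the thread points at running this as a distributed
    // matrix multiplication instead.
    public final class LsiBatchFoldIn {

      public static double[][] foldInAll(double[][] c, double[][] v, double[] sigma) {
        int k = sigma.length;
        double[][] folded = new double[c.length][k];
        for (int i = 0; i < c.length; i++) {
          for (int j = 0; j < k; j++) {
            double sum = 0.0;
            for (int t = 0; t < c[i].length; t++) {
              sum += c[i][t] * v[t][j];
            }
            folded[i][j] = sum / sigma[j];
          }
        }
        return folded;
      }
    }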