On ... 29, 2012 at 10:27 PM, Chris Hokamp chris.hok...@gmail.com wrote:
Hi all,
I'm trying to implement Latent Semantic Indexing using the Mahout ssvd tool, and I'm having trouble understanding how I can use the output of Mahout's ssvd to 'fold' new queries (documents) into the LSI space. Specifically, I can't find a way to multiply a vector representing a query by the inverse of the matrix of singular values; that is, I can't find a way to solve for the inverse of the diagonal matrix of singular values.
I can generate the output matrices using ssvd.
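For reference, the fold-in step described above is the standard LSI projection q_hat = inv(Sigma_k) * U_k^T * q, and because Sigma_k is diagonal its inverse is just the elementwise reciprocal of the singular values, so no solver is needed. A small NumPy sketch of the algebra (toy data, nothing Mahout-specific; all names are illustrative):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Rank-k truncated SVD: A ~= U_k @ diag(s_k) @ Vt_k
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# The "inverse of the diagonal matrix of singular values" is just the
# elementwise reciprocal; no linear solve required.
q = np.array([1.0, 0.0, 1.0, 0.0])   # query as a raw term vector
q_hat = (U_k.T @ q) / s_k            # same as inv(diag(s_k)) @ U_k.T @ q

# Existing documents live in the same k-dim space as columns of Vt_k;
# rank them by cosine similarity to the folded-in query.
docs = Vt_k.T                        # shape (num_docs, k)
sims = (docs @ q_hat) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(sims)
```

The same reciprocal trick is what a Mahout-side implementation would do with the singular-value output of ssvd; only the matrix plumbing differs.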
On ... 2011 at 1:47 PM, Grant Ingersoll gsing...@apache.org wrote:
I've never implemented LSI. Is there a way to incrementally build the model (by simply indexing documents), or is it something that one only runs after the fact, once one has built up the much bigger matrix? If it's the former, I bet it wouldn't be that hard to just implement the appropriate new codecs.
Alternatively, to improve recall ...
The issue of memory bandwidth isn't getting better very quickly, so there isn't much hope for making this better. A good inverted index running out of memory can handle a LOT of documents and still meet the 10ms goal. This means that LSI as a raw search engine on large corpora is going to be less cost effective by a couple of orders of magnitude. That leaves lots of room to use synthetic tokens, query expansion ...
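One way to read the synthetic-token / query-expansion suggestion: use the LSI term vectors only to pick extra terms to add to the text query, and let the inverted index do the retrieval. A hedged NumPy sketch with made-up toy vectors (the vocabulary, vectors, and `expand` helper are all illustrative, not anything from Mahout):

```python
import numpy as np

terms = ["cat", "feline", "dog", "car"]
# Toy LSI term vectors (in practice, rows of U_k scaled by the singular values).
term_vecs = np.array([[0.90, 0.10],
                      [0.85, 0.15],
                      [0.20, 0.80],
                      [-0.70, 0.30]])

def expand(query_terms, top=1):
    """Add the `top` nearest non-query terms in LSI space as synthetic tokens."""
    q = term_vecs[[terms.index(t) for t in query_terms]].mean(axis=0)
    sims = term_vecs @ q / (np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    extra = [terms[i] for i in order if terms[i] not in query_terms][:top]
    return query_terms + extra

print(expand(["cat"]))  # adds the nearest neighbour term to the query
```

The expanded term list is then just handed to the ordinary text query, so recall improves while retrieval stays inverted-index fast.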
Might be useful: https://github.com/algoriffic/lsa4solr
Looks like it hasn't been kept up to date.
On Nov 13, 2011, at 1:47 PM, Sebastian Schelter wrote:
Is there some documentation/tutorial available on how to build an LSI pipeline with Mahout and Lucene?
--sebastian
Essentially not.
And I would worry about how to push the LSI vectors back into Lucene in a coherent and usable way.
On Sun, Nov 13, 2011 at 10:47 AM, Sebastian Schelter s...@apache.org wrote:
Is there some documentation/tutorial available on how to build an LSI pipeline with Mahout and Lucene?
Store the vectors as binary payloads and keep the projection matrix in memory with the queryBuilder, to add an LSI cosine between query and doc as a scoring feature?
On Nov 13, 2011 12:59 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Essentially not.
And I would worry about how to push the LSI vectors back into Lucene in a coherent and usable way.
... than classic SVD-based LSI.
On Nov 13, 2011 10:48 AM, Sebastian Schelter s...@apache.org wrote:
Is there some documentation/tutorial available on how to build an LSI pipeline with Mahout and Lucene?
--sebastian
It's not just projection, it's for added relevance: if you are already doing Lucene for your scoring needs, you are already getting some good precision and recall.
The idea is this: you take results you are *already* scoring, and add to that scoring function an LSI cosine as one feature among many. Hopefully it will improve precision, even if it will do nothing for recall (as it's only being applied to results already retrieved by the text query).
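The re-scoring idea above can be sketched in a few lines: blend the text score each candidate already has with an LSI cosine. Everything here (the mixing weight, the function and variable names, the toy vectors) is made up for illustration:

```python
import numpy as np

def rerank(candidates, lsi_query, lsi_docs, weight=0.3):
    """Blend an existing text score with an LSI cosine feature.

    candidates: list of (doc_id, text_score) already retrieved by the
                text query; LSI only re-scores here, it cannot add recall.
    lsi_query:  the query folded into the k-dim LSI space.
    lsi_docs:   dict of doc_id -> k-dim LSI document vector.
    weight:     illustrative mixing weight, not tuned.
    """
    qn = lsi_query / np.linalg.norm(lsi_query)
    rescored = []
    for doc_id, text_score in candidates:
        d = lsi_docs[doc_id]
        cosine = float(d @ qn / np.linalg.norm(d))
        rescored.append((doc_id, (1 - weight) * text_score + weight * cosine))
    return sorted(rescored, key=lambda t: t[1], reverse=True)

# Tiny worked example with made-up scores and 2-d LSI vectors.
cands = [("d1", 0.9), ("d2", 0.8)]
docs = {"d1": np.array([0.0, 1.0]), "d2": np.array([1.0, 0.0])}
q = np.array([1.0, 0.0])
print(rerank(cands, q, docs))
```

Note how d2, despite a lower text score, can overtake d1 once its LSI cosine with the query is factored in; that is the precision gain being described.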
I have done this with Lucene (some time ago) and had a hell of a time getting decent ...
... a *good* idea to use LSI in this way (or, in fact, to use LSI at all), just that if you *do* have a good scoring model (like some kind of strongly predictive static prior, like PageRank), then doing even fairly dumb recall-enhancing techniques can improve things quite nicely, and discretized LSI like ...
Dmitriy,
I'm not sure if you figured this out on your own and I didn't see the email, but if not:
On Thu, Dec 30, 2010 at 3:57 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Also, if I have a bunch of new documents to fold-in, it looks like I'd need to run a matrix multiplication job between ...
Thank you, Jake.
Yes, I have figured that out, and it seems that DRM.times does just that. I was just not sure of the production quality of this code. It seems DRM has seen a lot of fixes and discussions lately, including simple multiplication.
On a side note, one needs to compute C x V^t x ...
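The side note above is garbled in the archive, but the batch fold-in being discussed is a single matrix multiply: with rows as documents (the DRM orientation), new rows are projected by the term basis and the reciprocal singular values, which is presumably the multiplication DRM.times would carry out. A NumPy sketch under that assumption (all matrix names illustrative):

```python
import numpy as np

# Existing corpus in DRM orientation: rows = documents, columns = terms.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0]])

# Truncated SVD: A ~= D_k @ diag(s_k) @ T_k.T, where T_k spans term space.
k = 2
D, s, Tt = np.linalg.svd(A, full_matrices=False)
D_k, s_k, T_k = D[:, :k], s[:k], Tt[:k].T

# Batch fold-in: one multiply of the new doc-term rows by T_k diag(1/s_k).
C = np.array([[0.0, 1.0, 0.0, 1.0],     # two new documents
              [1.0, 0.0, 1.0, 1.0]])
folded = C @ T_k / s_k                  # shape (2, k): new docs in LSI space

# Sanity check: folding the original rows recovers their k-dim coordinates.
print(np.allclose(A @ T_k / s_k, D_k))
```

As a MapReduce job this is embarrassingly parallel, since each new document row folds in independently of the others.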
On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Hi,
I would like to try LSI processing of results produced by seq2sparse. What's more, I need to be able to fold-in a bunch of new documents afterwards.
Is there any support for fold-in indexing in Mahout? If not, is there a quick way for me to gain an understanding of the seq2sparse output? In particular, if I ... in memory.