That handles coherent.

IT doesn't handle usable.

Storing the vectors as binary payloads handles the situation for
projection-like applications, but that doesn't help retrieval.

The problem with retrieval is that you need to do a dot product for all
vectors.  THat used to require that you keep a low resolution version of
all of the vectors in memory and implement a very efficient dot product for
speed.  If you can batch up several queries (10 or more at a time), you can
get almost an order of magnitude improvement in retrieval speed since the
queries are almost always memory bound on the corpus vectors.

With modern memory sizes, you can definitely hold some pretty big corpora.
 With 100 dimensions of floats, you can store two documents per kB which
gives 2 million documents per GB.  The problem really has more to do with
retrieval speed.  While the ALU can do multiplications and additions at a
pretty high rate, main memory can typically only be accessed at a rate of a
GB per second or so (or a few times that) which means that retrieval time
is going to be a significant fraction of a second or more for many corpora.
 That might be good enough, but high volume search engines typically need
search times of 10ms or better for the raw search engine which doesn't
allow for many vectors even if I am an order of magnitude off.

The issue of memory bandwidth isn't getting better very quickly so there
isn't much hope for making this better.  A good inverted index running out
of memory can handle a LOT of documents and still meet the 10ms goal.  This
means that LSI as a raw search engine on large corpora is going to be less
cost effective by a couple of orders of magnitude.  That leaves lots of
room to use synthetic tokens, query expansion and other magic tricks in the
original inverted index search engine to achieve similar effects at much
less cost.


On Sun, Nov 13, 2011 at 3:56 PM, Jake Mannix <[email protected]> wrote:

> Store the vectors as binary payloads and keep the projection matrix in
> memory with the qureryBuilder, to add an "lsi cosine" between query and doc
> scoring feature?
>
> On Nov 13, 2011 12:59 PM, "Ted Dunning" <[email protected]> wrote:
>
> Essentially not.
>
> And I would worry about how to push the LSI vectors back into lucene in a
> coherent and usable way.
>
> On Sun, Nov 13, 2011 at 10:47 AM, Sebastian Schelter <[email protected]>
> wrote: > Is there some docum...
>

Reply via email to