Btw, are there any known limitations on using in-memory kd-trees to do kNN
scans?

Their operational characteristics don't seem that bad.  Probably not as
good as some LSH-based schemes, but still quite good for many practical
uses where N is in the millions and k is perhaps a thousand or so.

For cosine and Tanimoto stuff, it may need a preliminary conversion of the
vectors onto the unit hypersphere.
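
Here's a minimal sketch of that conversion (plain Java, names mine). Once
everything is unit length, ||u - v||^2 = 2 * (1 - cos(u, v)), so Euclidean
kNN in the kd-tree returns exactly the cosine kNN; Tanimoto needs a bit
more care.

    // Project a vector onto the unit hypersphere so that cosine kNN
    // reduces to Euclidean kNN: for unit vectors u and v,
    // ||u - v||^2 = 2 - 2*dot(u, v) = 2 * (1 - cos(u, v)).
    static double[] toUnitSphere(double[] v) {
      double norm = 0;
      for (double x : v) norm += x * x;
      norm = Math.sqrt(norm);
      double[] u = new double[v.length];
      for (int i = 0; i < v.length; i++) u[i] = v[i] / norm;
      return u;
    }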

Also, with some improvements (max-distance pruning, best-bin-first
search), perhaps it gets to the point where it's somewhat acceptable?
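
Roughly what I mean, as a sketch (all names mine; the per-bin bound is the
simple per-split one, a tighter variant would accumulate per-axis offsets):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Best-bin-first kNN over a kd-tree: unexplored branches go into a
    // priority queue keyed by a lower bound on their distance to the
    // query, and the scan stops after a fixed number of bins, degrading
    // gracefully to approximate kNN.
    class BestBinFirst {

      static final class Node {
        int axis; double split; double[] point; Node left, right;
      }

      static final class Bin {
        final double bound; final Node node;
        Bin(double bound, Node node) { this.bound = bound; this.node = node; }
      }

      static Node build(List<double[]> pts, int depth) {
        if (pts.isEmpty()) return null;
        int axis = depth % pts.get(0).length;
        pts.sort(Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = pts.size() / 2;
        Node n = new Node();
        n.axis = axis;
        n.point = pts.get(mid);
        n.split = n.point[axis];
        n.left = build(new ArrayList<>(pts.subList(0, mid)), depth + 1);
        n.right = build(new ArrayList<>(pts.subList(mid + 1, pts.size())), depth + 1);
        return n;
      }

      static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
      }

      // k nearest neighbours of q, visiting at most maxBins bins;
      // returns the candidates in no particular order.
      static List<double[]> knn(Node root, double[] q, int k, int maxBins) {
        PriorityQueue<Bin> bins =          // nearest lower bound first
            new PriorityQueue<>(Comparator.comparingDouble((Bin b) -> b.bound));
        PriorityQueue<double[]> best =     // farthest candidate at the head
            new PriorityQueue<>(Comparator.comparingDouble((double[] p) -> -sqDist(p, q)));
        bins.add(new Bin(0, root));
        for (int visited = 0; !bins.isEmpty() && visited < maxBins; visited++) {
          Bin bin = bins.poll();
          // max-distance pruning: stop once no bin can beat the kth best
          if (best.size() == k && bin.bound >= sqDist(best.peek(), q)) break;
          for (Node n = bin.node; n != null; ) {
            best.add(n.point);
            if (best.size() > k) best.poll();         // evict current farthest
            double diff = q[n.axis] - n.split;
            Node far = diff <= 0 ? n.right : n.left;  // branch not taken
            if (far != null) bins.add(new Bin(diff * diff, far));
            n = diff <= 0 ? n.left : n.right;
          }
        }
        return new ArrayList<>(best);
      }
    }
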
 On Nov 13, 2011 10:10 PM, "Ted Dunning" <[email protected]> wrote:

> That handles coherent.
>
> It doesn't handle usable.
>
> Storing the vectors as binary payloads handles the situation for
> projection-like applications, but that doesn't help retrieval.
>
> The problem with retrieval is that you need to do a dot product for all
> vectors.  That used to require that you keep a low-resolution version of
> all of the vectors in memory and implement a very efficient dot product for
> speed.  If you can batch up several queries (10 or more at a time), you can
> get almost an order of magnitude improvement in retrieval speed since the
> queries are almost always memory bound on the corpus vectors.
>
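The batching trick, sketched (my code, not Ted's): one streaming pass over
the corpus scores the whole batch, so the memory-bound corpus traffic is
paid once per batch rather than once per query.

    // One pass over the corpus scores all queries at once; each document
    // vector is fetched from (slow) main memory exactly once and reused
    // from cache against every query in the batch.
    static float[][] batchDotProducts(float[][] corpus, float[][] queries) {
      int d = corpus[0].length;
      float[][] scores = new float[queries.length][corpus.length];
      for (int i = 0; i < corpus.length; i++) {
        float[] doc = corpus[i];
        for (int b = 0; b < queries.length; b++) {
          float s = 0;
          for (int j = 0; j < d; j++) s += doc[j] * queries[b][j];
          scores[b][i] = s;
        }
      }
      return scores;
    }
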
> With modern memory sizes, you can definitely hold some pretty big corpora.
>  With 100 dimensions of floats, you can store two documents per kB which
> gives 2 million documents per GB.  The problem really has more to do with
> retrieval speed.  While the ALU can do multiplications and additions at a
> pretty high rate, main memory can typically only be accessed at a rate of a
> GB per second or so (or a few times that) which means that retrieval time
> is going to be a significant fraction of a second or more for many corpora.
>  That might be good enough, but high volume search engines typically need
> search times of 10ms or better for the raw search engine which doesn't
> allow for many vectors even if I am an order of magnitude off.
>
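(Spelling the arithmetic out: 100 float dimensions is 400 bytes per
document, so roughly two documents per kB once you allow some overhead,
and about 2 million per GB; a full scan of that GB at ~1 GB/s of
effective bandwidth is on the order of a second per query, against a
10 ms budget.)
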
> The issue of memory bandwidth isn't getting better very quickly so there
> isn't much hope for making this better.  A good inverted index running out
> of memory can handle a LOT of documents and still meet the 10ms goal.  This
> means that LSI as a raw search engine on large corpora is going to be less
> cost effective by a couple of orders of magnitude.  That leaves lots of
> room to use synthetic tokens, query expansion and other magic tricks in the
> original inverted index search engine to achieve similar effects at much
> less cost.
>
>
> On Sun, Nov 13, 2011 at 3:56 PM, Jake Mannix <[email protected]>
> wrote:
>
> > Store the vectors as binary payloads and keep the projection matrix in
> > memory with the queryBuilder, to add an "lsi cosine" between query and
> doc
> > scoring feature?
> >
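A rough sketch of the scoring side of that idea (my names, not actual
Lucene API): the reduced document vector travels as a byte[] payload, the
projection matrix stays in RAM, and the feature is a cosine in the
reduced space.

    import java.nio.ByteBuffer;
    import java.util.Map;

    // Sketch only: decode a per-document payload written as k big-endian
    // floats, project the sparse term-space query with the in-memory
    // matrix, and score by cosine in the reduced space.
    class LsiCosineFeature {

      final float[][] projection;  // dense k x numTerms matrix, in memory

      LsiCosineFeature(float[][] projection) { this.projection = projection; }

      static float[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        float[] v = new float[payload.length / 4];
        for (int i = 0; i < v.length; i++) v[i] = buf.getFloat();
        return v;
      }

      // query given as termId -> weight
      float[] project(Map<Integer, Float> query) {
        float[] q = new float[projection.length];
        for (Map.Entry<Integer, Float> e : query.entrySet())
          for (int i = 0; i < q.length; i++)
            q[i] += projection[i][e.getKey()] * e.getValue();
        return q;
      }

      static float cosine(float[] a, float[] b) {
        float dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt((double) na * nb));
      }
    }
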
> > On Nov 13, 2011 12:59 PM, "Ted Dunning" <[email protected]> wrote:
> >
> > Essentially not.
> >
> > And I would worry about how to push the LSI vectors back into lucene in a
> > coherent and usable way.
> >
> > On Sun, Nov 13, 2011 at 10:47 AM, Sebastian Schelter <[email protected]>
> > wrote:
> >
> > > Is there some docum...
> >
>
