Re: Autocompletion on large index

Elmer Thu, 07 Jul 2011 06:42:31 -0700

I just tested my autocompleter in a clean environment (instead of
sharing a lot of resources with the Java servlet) and I was able to run
the autocompletion in memory, using a RAMDirectory.


This morning I posted a modified TST implementation in this thread. I
have compared my autocompleter with the TST (with max prefix length set
to 20 chars).

Unlike the TST implementation, my autocompleter matches tokens anywhere
in a title (titles are tokenized on whitespace), e.g. 'inf' will match
the titles:
"information retrieval"
"best practices in information retrieval"
It also ranks the lookup results by popularity, using the frequency of
the terms in the source index.
If anybody is interested, I can provide my autocompletion class (based
on the spellcheck class in Lucene 3.1.0).
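
To illustrate the matching and ranking described above, here is a
minimal in-memory sketch. It is not the class mentioned in this post and
does not use Lucene: the class name, method names and the linear scan
over all titles are assumptions made purely for the example (a real
implementation would query an index instead of scanning).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the Lucene-based autocompleter from this thread. */
class TokenPrefixSuggester {
    // title -> popularity (assumed to be a frequency taken from the source index)
    private final Map<String, Integer> titles = new LinkedHashMap<String, Integer>();

    void add(String title, int frequency) {
        titles.put(title.toLowerCase(), frequency);
    }

    /**
     * Returns at most max titles that contain a whitespace-separated token
     * starting with the given prefix, most frequent first.
     */
    List<String> lookup(String prefix, int max) {
        String p = prefix.toLowerCase();
        List<Map.Entry<String, Integer>> hits = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, Integer> e : titles.entrySet()) {
            for (String token : e.getKey().split("\\s+")) {
                if (token.startsWith(p)) {
                    hits.add(e);
                    break; // one matching token is enough for this title
                }
            }
        }
        // Sort by descending frequency; the sort is stable, so ties keep insertion order.
        Collections.sort(hits, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < hits.size() && i < max; i++) {
            out.add(hits.get(i).getKey());
        }
        return out;
    }
}
```

With the two example titles added, lookup("inf", 20) returns both, with
the more frequent one first.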

How I tested:
Both autocompletion implementations use the same index, holding 1.32M
titles. I generated 1000 random prefixes of length 1, 2 or 3 chars. Both
implementations were 'warmed up' by looking up 1000 prefixes before I
measured the time it takes to perform 1000 lookups that each return at
most 20 results. Heap space was set to 2.5GB.
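
A test along these lines could be driven by a small harness like the one
sketched below. This is not the code that produced the numbers in this
post: the Suggester interface is a stand-in (not a Lucene type), and the
a-z alphabet, the fixed seed and the separate warm-up list are all
assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical timing harness; names are illustrative only. */
class LookupBenchmark {
    /** Stand-in for whichever autocompleter is being measured. */
    interface Suggester {
        List<String> lookup(String prefix, int maxResults);
    }

    /** Generates count random lowercase prefixes of length 1, 2 or 3. */
    static List<String> randomPrefixes(int count, long seed) {
        Random rnd = new Random(seed);
        List<String> prefixes = new ArrayList<String>(count);
        for (int i = 0; i < count; i++) {
            int len = 1 + rnd.nextInt(3); // 1..3
            StringBuilder sb = new StringBuilder(len);
            for (int j = 0; j < len; j++) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            }
            prefixes.add(sb.toString());
        }
        return prefixes;
    }

    /** Runs the warm-up lookups untimed, then returns elapsed milliseconds for the measured lookups. */
    static long timeLookups(Suggester s, List<String> warmup, List<String> measured, int maxResults) {
        for (String p : warmup) {
            s.lookup(p, maxResults);
        }
        long start = System.nanoTime();
        for (String p : measured) {
            s.lookup(p, maxResults);
        }
        return (System.nanoTime() - start) / 1000000L;
    }
}
```

Calling timeLookups(s, randomPrefixes(1000, 1L), randomPrefixes(1000, 2L), 20)
would mirror the setup above: 1000 warm-up lookups, then 1000 timed
lookups with at most 20 results each.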

Result:
-TST uses at least 600MB of memory, with ~10 GC activities
-My autocompleter uses at least 407MB, with 1 GC activity
-TST runs 1000 completions in 7262ms
-My implementation runs 1000 completions in 18617ms
-Both used ~100% CPU of 1 core during the test
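
The post does not say how memory use and GC activity were measured. One
way to collect comparable numbers from inside the JVM, using only the
standard library, is sketched below; the class and method names are
made up, and the heap figure is only an approximation of what an
external monitor would report.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper; not the measurement method used for the numbers above. */
class HeapStats {
    /** Approximate used heap in MB, as seen by the running JVM. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    /** Sum of collection counts over all garbage collectors since JVM start. */
    static long gcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 if the collector does not report a count
            if (c > 0) {
                total += c;
            }
        }
        return total;
    }
}
```

Reading gcCount() before and after the timed lookups and taking the
difference gives the number of collections during the test.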

For now, I think I'm going to stick with my own autocompleter until TST
can be used 'efficiently' and can sort by popularity based on frequency.
It seems that the current TSTLookup implementation doesn't use term
frequencies from a source dictionary. Also, I'd like to match tokens
within each term in the source index. I don't think that's possible
without changing the inner workings of TSTLookup?

BR,
Elmer

On Wed, 2011-07-06 at 20:02 +0200, Elmer wrote:
> > You could try storing your autocomplete index in a RAMDirectory?
> 
> I forgot to mention. I tried this previously, but that also resulted in heap 
> space problems. That's why I was interested in using the new suggest classes 
> :)
> 
> BR,
> Elmer
> 
> -----Original Message-----
> From: Michael McCandless
> Sent: Wednesday, July 06, 2011 6:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Autocompletion on large index
> 
> You could try storing your autocomplete index in a RAMDirectory?
> 
> But: I'm surprised you see the FST suggest impl using up so much RAM;
> very low memory usage is one of the strengths of the FST approach.
> Can you share the text (titles) you are feeding to the suggest module?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchaste...@gmail.com> wrote:
> > Hi again.
> >
> > I have created my own autocompleter based on the spellchecker. This
> > works well in the sense that it is able to create an autocompletion
> > index from my 'publication' index. However, integrated in my web
> > application, each keypress asks the autocompleter to search the index,
> > which is stored on disk (not in memory), just like the spellchecker
> > does (except that the spellchecker is not invoked on every keypress).
> > With Lucene 3.3.0, autocompletion modules are included, which load
> > their trees/fsa/... in memory. I'd like to use these modules, but the
> > problem is that they use more than 2.5GB, causing heap space exceptions.
> > This happens when I try to build a Lookup index (fst, jaspell or tst,
> > doesn't matter) from my 'publication' index consisting of 1.3M
> > publications. The field I use for autocompletion holds the titles of the
> > publications, indexed untokenized (but lowercased).
> >
> > Code:
> > Lookup autoCompleter = new TSTLookup();
> > FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
> > LuceneDictionary dict =
> >     new LuceneDictionary(IndexReader.open(dir), "title_suggest");
> > autoCompleter.build(dict);
> >
> > Is it possible to get the autocompletion module to work in memory on
> > such a dataset without increasing Java's heap space?
> > FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, whereas
> > my own autocompleter index is stored on disk using about 300MB.
> >
> > BR,
> > Elmer
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >