I just profiled the application and tst.TernaryTreeNode takes 99.99..% of the memory.
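
For anyone following along, here is a rough sketch of why a ternary search
tree can be so node-heavy. It is a simplified tree written for illustration,
not Lucene's TSTLookup code; the class and method names are made up. It
allocates one node for every character that is not part of an already stored
prefix, and on a typical JVM each node costs an object header plus several
references, which is many times the 2 bytes of the character itself.

```java
// Simplified ternary search tree, NOT Lucene's implementation:
// it only counts how many nodes a set of keys requires.
final class TstNodeCount {
    private static final class Node {
        final char split;   // the character stored at this node
        Node lo, eq, hi;    // smaller / next-character / larger branches
        boolean end;        // true if a key ends at this node
        Node(char split) { this.split = split; }
    }

    private Node root;
    private long nodes;

    void insert(String key) {
        if (key.isEmpty()) return;
        root = insert(root, key, 0);
    }

    private Node insert(Node n, String key, int i) {
        char c = key.charAt(i);
        if (n == null) { n = new Node(c); nodes++; }
        if (c < n.split) n.lo = insert(n.lo, key, i);
        else if (c > n.split) n.hi = insert(n.hi, key, i);
        else if (i + 1 < key.length()) n.eq = insert(n.eq, key, i + 1);
        else n.end = true;
        return n;
    }

    long nodeCount() { return nodes; }

    public static void main(String[] args) {
        TstNodeCount tst = new TstNodeCount();
        tst.insert("digital library information appliances");
        tst.insert("digital libraries");
        System.out.println("nodes allocated: " + tst.nodeCount());
    }
}
```

Long titles that share few prefixes get almost no node sharing, so the node
count stays close to the total number of characters.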

I'll test further tomorrow and report on mem usage for runnable smaller indexes.
I will email you privately for sharing the index to work with.

BR,
Elmer


-----Original Message-----
From: Michael McCandless
Sent: Wednesday, July 06, 2011 8:39 PM
To: java-user@lucene.apache.org
Subject: Re: Autocompletion on large index

Hmm... so I suspect the fst suggest module must first gather up all
titles, then sort them, in RAM, and then build the actual FST.  Maybe
it's this gather + sort that's taking so much RAM?

1.3 M publications times 100 chars times 2 bytes/char = ~248 MB.  So
that shouldn't be it...
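
The estimate can be reproduced with a quick check. This is only a sketch: the
class and method names are made up, and it counts raw UTF-16 character data
only, with no allowance for String or node object overhead.

```java
// Hypothetical helper, not part of Lucene: reproduces the back-of-envelope
// estimate for the raw character data of the titles.
final class TitleRamEstimate {
    // Raw bytes needed for `count` titles of `avgChars` UTF-16 chars each.
    static long rawCharBytes(long count, long avgChars) {
        return count * avgChars * Character.BYTES; // 2 bytes per Java char
    }

    public static void main(String[] args) {
        long bytes = rawCharBytes(1_300_000L, 100L);
        System.out.printf("%d bytes = %.0f MiB%n", bytes, bytes / (1024.0 * 1024.0));
    }
}
```

That is 260,000,000 bytes, which gives the ~248 figure when divided by
1024 * 1024.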

Is this an accessible corpus?  Can I somehow get a copy to play with...?

Are you able to [temporarily, once] build the full FST and other
suggest impls and compare how much RAM is required for building and
then lookups?
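
One rough way to compare the impls is to read the used heap before and after
the build. The sketch below is an assumption about how one might do it, not
an existing Lucene utility: HeapProbe and usedHeapBytes are made-up names,
and the list-filling loop is just a stand-in for the real build call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical probe, not part of Lucene: compares used heap before and
// after building an in-memory structure.
final class HeapProbe {
    // Approximate used heap in bytes; System.gc() is only a hint to the JVM.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Stand-in for the real build step (for example, building a Lookup
        // from a dictionary); here we just fill a list with strings.
        List<String> built = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            built.add("title number " + i);
        }

        long after = usedHeapBytes();
        System.out.println("entries built: " + built.size());
        System.out.println("approx. heap retained by build: " + (after - before) + " bytes");
    }
}
```

The numbers are approximate since garbage collection timing varies, so a
heap profiler is still the better tool for a precise breakdown.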

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 6, 2011 at 1:50 PM, Elmer <evanchaste...@gmail.com> wrote:
Hi Mike,

That's what I thought when I started indexing it. To be clear, it happens at
build time.
I don't know if memory efficiency is better once building has finished.

The titles I index are titles from the dblp computer science bibliography.
They can take up to... say 100 characters.
Examples:
-------
- Auditory stimulus optimization with feedback from fuzzy clustering of
neuronal responses
- Two-objective method for crisp and fuzzy interval comparison in
optimization
- Bound Constrained Smooth Optimization for Solving Variational Inequalities
and Related Problems
- Retrieval of bibliographic records using Apache Lucene
- Digital Library Information Appliances
-------

The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter in
that order.

I also tried to do the same for the author names, and this works without
problems. Actually it builds the tree/fsa/... faster from the dictionary than
from file (the lookup data file that can be stored and loaded through the
.store and .load methods). But the larger set of publication titles is
currently a no-go with 2.5GB of heap space, even with only a main class that
builds the Lookup data.

BR,
Elmer


-----Original Message-----
From: Michael McCandless
Sent: Wednesday, July 06, 2011 6:23 PM
To: java-user@lucene.apache.org
Subject: Re: Autocompletion on large index

You could try storing your autocomplete index in a RAMDirectory?

But: I'm surprised you see the FST suggest impl using up so much RAM;
very low memory usage is one of the strengths of the FST approach.
Can you share the text (titles) you are feeding to the suggest module?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchaste...@gmail.com> wrote:

Hi again.

I have created my own autocompleter based on the spellchecker. This
works well in the sense that it is able to create an autocompletion index
from my 'publication' index. However, integrated in my web application,
each keypress asks the autocompleter to search the index, which is stored on
disk (not in memory), just like the spellchecker does (except that the
spellchecker is not invoked on every keypress).
With Lucene 3.3.0, autocompletion modules are included, which load
their trees/fsa/... in memory. I'd like to use these modules, but the
problem is that they use more than 2.5GB, causing heap space exceptions.
This happens when I try to build a Lookup index (fst, jaspell or tst,
doesn't matter) from my 'publication' index consisting of 1.3M
publications. The field I use for autocompletion holds the titles of the
publications indexed untokenized (but lowercased).

Code:
Lookup autoCompleter = new TSTLookup();
FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
LuceneDictionary dict = new LuceneDictionary(IndexReader.open(dir), "title_suggest");
autoCompleter.build(dict);

Is it possible to have the autocompletion module work in memory on
such a dataset without increasing Java's heap space?
FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, where
my own autocompleter index is stored on disk using about 300MB.

BR,
Elmer


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


