On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Mike,
>
> Yes we have many unique terms due to dirty OCR and 400 languages and probably 
> lots of low doc freq terms as well (although with the ICUTokenizer and 
> ICUFoldingFilter we should get fewer terms due to bad tokenization and 
> normalization.)

OK, this likely explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain amount 
> of space compared to adding entries to a list for an existing term?

Exactly.  There's a highish "startup cost" for each term, but then
appending docs/positions to that term is more efficient, especially for
higher frequency terms.  In the limit, a single unique term across
all docs will have very high RAM efficiency.
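
To make the tradeoff concrete, here is a rough toy model in Python.  It is
only a sketch: the two byte costs are made-up illustrative numbers, not
measurements of Lucene's indexing buffer, and the function names are
hypothetical.  It just shows how a fixed per-term cost dominates when most
terms are unique and fades when postings pile up on a few terms.

```python
# Toy model only: both byte costs are assumed, illustrative values,
# not figures taken from Lucene's in-memory indexing buffer.
PER_TERM_OVERHEAD_BYTES = 100  # hypothetical one-time cost of a new unique term
PER_POSTING_BYTES = 2          # hypothetical cost of appending one doc/position


def buffer_bytes(unique_terms: int, total_postings: int) -> int:
    """Estimated RAM needed to buffer the given terms and postings."""
    return (unique_terms * PER_TERM_OVERHEAD_BYTES
            + total_postings * PER_POSTING_BYTES)


def efficiency(unique_terms: int, total_postings: int) -> float:
    """Fraction of buffered RAM holding postings rather than per-term overhead."""
    total = buffer_bytes(unique_terms, total_postings)
    return (total_postings * PER_POSTING_BYTES) / total


if __name__ == "__main__":
    postings = 1_000_000
    # Same number of postings, spread over more and more unique terms.
    for terms in (1, 1_000, 100_000, 1_000_000):
        print(f"{terms:>9} unique terms: efficiency {efficiency(terms, postings):.3f}")
```

With the postings count held fixed, the efficiency figure drops steadily as
the number of unique terms grows, which is the pattern you'd expect from
dirty OCR across many languages.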

> Does turning on IndexWriter's infoStream have a significant impact on memory 
> use or indexing speed?

I don't believe so....

Mike
