On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Mike,
>
> Yes we have many unique terms due to dirty OCR and 400 languages and probably
> lots of low doc freq terms as well (although with the ICUTokenizer and
> ICUFoldingFilter we should get fewer terms due to bad tokenization and
> normalization.)
OK likely this explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain amount
> of space compared to adding entries to a list for an existing term?

Exactly. There's a highish "startup cost" for each term... but then appending docs/positions to that term is more efficient, especially for higher frequency terms. In the limit, a single unique term across all docs will have very high RAM efficiency...

> Does turning on IndexWriters infostream have a significant impact on memory
> use or indexing speed?

I don't believe so...

Mike
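
A rough illustration of the "startup cost" point above, as a toy model rather than Lucene's actual memory accounting. The two byte costs and the function name ram_efficiency are made-up assumptions, chosen only to show the shape of the effect: with the same number of postings, more unique terms means a smaller share of the buffer goes to the postings themselves.

```python
# Toy model only: both byte costs are hypothetical, not Lucene measurements.
PER_TERM_OVERHEAD = 60   # assumed fixed bytes to start a new unique term
PER_POSTING_COST = 2     # assumed bytes to append one doc/position entry


def ram_efficiency(unique_terms: int, total_postings: int) -> float:
    """Fraction of buffer RAM spent on postings rather than per-term overhead."""
    postings_bytes = total_postings * PER_POSTING_COST
    overhead_bytes = unique_terms * PER_TERM_OVERHEAD
    return postings_bytes / (postings_bytes + overhead_bytes)


# Same number of postings, very different numbers of unique terms.
many_rare_terms = ram_efficiency(unique_terms=1_000_000, total_postings=1_000_000)
one_common_term = ram_efficiency(unique_terms=1, total_postings=1_000_000)

print(f"every term unique: {many_rare_terms:.3f}")
print(f"one term for all:  {one_common_term:.5f}")
```

Under these assumed costs, the all-unique case spends most of the buffer on per-term overhead, while the single-term case approaches an efficiency of 1, which is the "in the limit" behaviour described above.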