[pylucene-dev] Large index files: Sort leads to "GC Warning: Repeated allocation of very large block"

Marc Weeber Tue, 18 Dec 2007 14:34:33 -0800

Hi all,

I am using the following setup:
- Debian etch linux
- PyLucene GCC, latest from the GCC trunk
- gcc 4.2.1 with -DLARGE_CONFIG added to the source
- a large index of 17 GB, 50M documents

In this index, I want to look for the cooccurrence of two words. For this, I use a BooleanQuery:


q = PyLucene.BooleanQuery()
q.add(PyLucene.TermQuery(PyLucene.Term('profile', 'umls/C0086418')),
      PyLucene.BooleanClause.Occur.MUST)
q.add(PyLucene.TermQuery(PyLucene.Term('profile', 'umls/C0003062')),
      PyLucene.BooleanClause.Occur.MUST)


In this case, the two terms co-occur in about 30,000 documents.
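
To make the question concrete, here is a minimal pure-Python sketch of what the two MUST clauses compute conceptually: the intersection of two sorted document-ID lists. The function name and the document IDs are made up for illustration, and this is not a claim about how Lucene implements it internally.

```python
def cooccurrence_count(postings_a, postings_b):
    """Count doc IDs present in both sorted lists (a merge-style intersection)."""
    i = j = count = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # Document contains both terms.
            count += 1
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return count


# Hypothetical sorted doc-ID lists for the two 'profile' terms.
docs_c0086418 = [2, 5, 9, 14, 21]
docs_c0003062 = [5, 9, 10, 21, 30]

print(cooccurrence_count(docs_c0086418, docs_c0003062))
```

Counting this way needs no ordering of the hits by any other field, which is why a sort should not be required just to get the cooccurrence number.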

This all works fine if I do a plain search, which uses about 120 MB of memory. However, if I sort on another field using PyLucene.Sort('date', False), I get the "GC Warning: Repeated allocation of very large block" message, and the process uses about 500 MB of memory.

Interestingly, if I use a query term that does not occur in the index (so the cooccurrence is 0), it still uses about 500 MB of memory. Also, before I compiled with -DLARGE_CONFIG, memory use was lower, but the warning was still there.
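
One possible explanation for the zero-hit case, stated as an assumption and not verified here: if sorting builds a per-document cache of the 'date' field for the whole index, its size would depend on the number of documents, not on the number of hits. A rough estimate, assuming one 4-byte entry per document (the 4-byte figure and the function name are guesses for illustration):

```python
def sort_cache_bytes(num_docs, bytes_per_doc=4):
    """Estimated size of a cache holding one fixed-size entry per document."""
    return num_docs * bytes_per_doc


num_docs = 50 * 1000 * 1000          # 50M documents, as in the index above
estimate = sort_cache_bytes(num_docs)

# Report in decimal megabytes (1 MB = 10**6 bytes).
print("%.0f MB" % (estimate / 1e6))
```

A single array of that size would be in the same range as the extra memory I see, and would also be the kind of very large block the GC warning mentions, but again this is only a guess.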

Is there a way to a) be more economical with memory, or b) get the cooccurrence info in another, more memory-efficient way (and without warnings)?


thanks in advance for any insights from all of you,

best,

Marc

_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
