Hi all,
I am using the following setup:
- Debian Etch Linux
- PyLucene GCC, latest from the GCC trunk
- gcc 4.2.1 with -DLARGE_CONFIG added to the source
- a large index: 17 GB, 50M documents
In this index, I want to look for the cooccurrence of two words, i.e.
the documents that contain both. For this, I use a BooleanQuery:
    q = PyLucene.BooleanQuery()
    q.add(PyLucene.TermQuery(PyLucene.Term('profile', 'umls/C0086418')),
          PyLucene.BooleanClause.Occur.MUST)
    q.add(PyLucene.TermQuery(PyLucene.Term('profile', 'umls/C0003062')),
          PyLucene.BooleanClause.Occur.MUST)
In this case, the two terms co-occur in about 30,000 documents.
A plain search works fine and uses about 120 MB of memory.
However, if I sort on another field using PyLucene.Sort('date',
False), I get the warning "GC Warning: Repeated allocation of very
large block", and the process uses about 500 MB of memory.
Interestingly, if I use a query term that does not occur in the index
(so the cooccurrence is 0), it still uses about 500 MB of memory. Also,
before I compiled with -DLARGE_CONFIG, memory use was lower, but the
warning was still there.
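A rough arithmetic sketch of why the sorted search might cost the same
whether or not anything matches. It ASSUMES that sorting on 'date'
builds an array with one slot per document in the whole index (not one
per hit) and that each slot takes 4 bytes. Neither assumption is
confirmed here, and the distinct date values themselves would come on
top of this figure.

```python
# Back-of-envelope estimate only. The per-document array and the 4-byte
# slot size are ASSUMPTIONS, not measured behaviour of this PyLucene build.

NUM_DOCS = 50 * 1000 * 1000   # index size given in this post
NUM_HITS = 30 * 1000          # documents matching both terms
BYTES_PER_SLOT = 4            # assumed: one 32-bit slot per document


def cache_bytes(num_entries, bytes_per_slot=BYTES_PER_SLOT):
    """Size in bytes of a flat array with one slot per entry."""
    return num_entries * bytes_per_slot


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024.0 * 1024.0)


per_index = cache_bytes(NUM_DOCS)  # if every document in the index gets a slot
per_hits = cache_bytes(NUM_HITS)   # if only the matching documents needed one

print("one slot per document: %.1f MiB" % to_mib(per_index))
print("one slot per hit:      %.3f MiB" % to_mib(per_hits))
```

If the assumption holds, the cost depends on the size of the index and
not on the number of hits, which would fit the observation that a query
with zero matches uses the same amount of memory.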
Is there a way to a) be more economical with memory when sorting, or
b) get the cooccurrence info in another, more memory-efficient way
(and without the warnings)?
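To make (b) concrete: what is needed is only a count, with no scoring
and no sorting. Below is a toy sketch in plain Python that counts the
overlap of two sorted lists of document ids while holding one id per
list at a time. The function name and the sample ids are made up for
illustration. How to obtain such lists from PyLucene is not shown;
Lucene's IndexReader.termDocs(Term) may be a candidate, but that would
need checking against the PyLucene version in use.

```python
def count_cooccurrence(postings_a, postings_b):
    """Count the doc ids that appear in both inputs.

    Both inputs must yield doc ids in strictly increasing order.
    Only the current id from each side is kept in memory.
    """
    it_a, it_b = iter(postings_a), iter(postings_b)
    count = 0
    try:
        a = next(it_a)
        b = next(it_b)
        while True:
            if a == b:
                # Same document on both sides: one cooccurrence.
                count += 1
                a = next(it_a)
                b = next(it_b)
            elif a < b:
                a = next(it_a)   # advance the side that is behind
            else:
                b = next(it_b)
    except StopIteration:
        # One side is exhausted, so no further matches are possible.
        return count


# Hypothetical doc ids for two terms; ids 3 and 7 are shared.
print(count_cooccurrence([1, 3, 5, 7, 9], [3, 4, 7, 10]))  # prints 2
```

Because the inputs can be generators, neither list has to be
materialised in full.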
Thanks in advance for any insights,
best,
Marc
_______________________________________________
pylucene-dev mailing list
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev