Hey, Herb. There is a memory leak in the string-array handling in PyLucene 2.4; in your case it is triggered by iterating over tfvP.getTerms(). The fix made it into 2.9. More history here: http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/200907.mbox/%3calpine.osx.2.01.0907301553230.5...@yuzu%3e
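
For the archives, here is the problem distilled to a minimal sketch (the index path and heap size below are placeholders, and initVM follows the 2.x signature). As I understand the JCC history, on 2.4 each getTerms() call materializes a java String[] whose contents are never released, so a loop like this grows the heap with every document until the GC overhead limit is hit:

    import lucene
    lucene.initVM(lucene.CLASSPATH, maxheap='2048m')

    reader = lucene.IndexReader.open('/path/to/index')  # placeholder path
    for docId in xrange(reader.maxDoc()):
        if reader.isDeleted(docId):
            continue
        for tfv in reader.getTermFreqVectors(docId) or []:
            tfvP = lucene.TermFreqVector.cast_(tfv)
            # On 2.4 the String[] built by getTerms() leaks; on 2.9
            # the same loop runs with a flat heap.
            terms = tfvP.getTerms()
            freqs = tfvP.getTermFrequencies()
    reader.close()
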
On Apr 14, 2010, at 10:21 AM, Herbert Roitblat wrote:

> Hi, folks.
> I am using PyLucene and doing a lot of get-tokens operations. lucene.py
> reports version 2.4.0. It is rPath Linux with 8GB of memory. Python is 2.4.
>
> The system indexes 116,000 documents just fine.
>
> Maxheap is '2048m', 64-bit environment.
>
> Then I need to get the tokens from these documents, and near the end I run
> into:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> The heap is apparently filling up with each document retrieved and never
> getting cleared. I was expecting that it would give me the information for
> one document, then clear that and give me the info for another, etc. I've
> looked at it with jhat.
>
> I have tried deleting the Python objects that receive any information from
> Lucene--no effect.
> I have tried reusing the Python objects that receive any information from
> Lucene--no effect.
> I have tried running the Python garbage collector (it slowed the program
> slightly, but generally no effect).
>
> Is there anything else I can do to get the tokens for a document and make
> sure that this does not fill up the heap? I need to be able to run a million
> or more documents through this and get their tokens.
>
> Here is a code snippet.
>
>     reader = self.index.getReader()
>     lReader = reader.get()
>     searcher = self.index.getSearcher()
>     lSearcher = searcher.get()
>     query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
>     hits = list(lSearcher.search(query))
>     if hits:
>         hit = lucene.Hit.cast_(hits[0])
>         tfvs = lReader.getTermFreqVectors(hit.id)
>         if tfvs is not None:  # tfvs is None if the vector is not stored
>             # There's one TermFreqVector for each field that has one.
>             for tfv in tfvs:
>                 tfvP = lucene.TermFreqVector.cast_(tfv)
>                 if returnAllFields or tfvP.field in termFields:
>                     # add only the fields that were asked for
>                     tFields[tfvP.field] = dict([(t, f) for (t, f) in
>                         zip(tfvP.getTerms(), tfvP.getTermFrequencies())
>                         if f >= minFreq])
>     else:
>         # This shouldn't happen, but we just log the error and march on
>         self.log.error("Unable to fetch doc %s from index" % (uid))
>
>     lReader.close()
>     lSearcher.close()
>
> lReader is really:
>
>     lucene.IndexReader.open(self._store)
>
> I've tried the Lucene list, but no one there has yet come up with a solution.
> If filling the heap is a Lucene problem (is it a bug?), I need to look for a
> way to circumvent that bug.
>
> Thanks,
>
> Herb
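
To the last question: until you can move to 2.9, the only reliable mitigation I know of is to keep the JVM short-lived, so the leaked strings are reclaimed by the OS when the process exits. A rough sketch, where extract_tokens.py is a hypothetical worker script that runs your getTermFreqVectors() loop over the UIDs listed in the file it is handed (subprocess ships with Python 2.4):

    import os
    import subprocess
    import tempfile

    CHUNK = 5000  # a guess; size it so one child stays under the 2048m heap

    def run_in_batches(uids):
        for start in xrange(0, len(uids), CHUNK):
            fd, path = tempfile.mkstemp()
            try:
                os.write(fd, '\n'.join(uids[start:start + CHUNK]))
                os.close(fd)
                # Each child calls initVM() itself, processes one chunk,
                # writes its results to disk and exits; the leaked heap
                # goes away with the process.
                rc = subprocess.call(['python', 'extract_tokens.py', path])
                if rc != 0:
                    raise RuntimeError('worker failed at offset %d' % start)
            finally:
                os.remove(path)

It is clumsy, but it bounds the leak to one chunk's worth of documents per JVM.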