Hey, Herb. There is a memory leak in the string-array handling in PyLucene 2.4; in your case it is triggered by iterating over tfvP.getTerms(). The fix made it into 2.9. More history here: http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/200907.mbox/%3calpine.osx.2.01.0907301553230.5...@yuzu%3e
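
For the archives, here is the problem distilled to a minimal sketch (the index path and heap size below are placeholders, and initVM follows the 2.x signature). As I understand the JCC history, on 2.4 each getTerms() call materializes a java String[] whose contents are never released, so a loop like this grows the heap with every document until the GC overhead limit is hit:

    import lucene
    lucene.initVM(lucene.CLASSPATH, maxheap='2048m')

    reader = lucene.IndexReader.open('/path/to/index')  # placeholder path
    for docId in xrange(reader.maxDoc()):
        if reader.isDeleted(docId):
            continue
        for tfv in reader.getTermFreqVectors(docId) or []:
            tfvP = lucene.TermFreqVector.cast_(tfv)
            # On 2.4 the String[] built by getTerms() leaks; on 2.9
            # the same loop runs with a flat heap.
            terms = tfvP.getTerms()
            freqs = tfvP.getTermFrequencies()
    reader.close()
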
On Apr 14, 2010, at 10:21 AM, Herbert Roitblat wrote:

> Hi, folks.
> I am using PyLucene and doing a lot of get-tokens operations. lucene.py
> reports version 2.4.0. It is rPath Linux with 8GB of memory. Python is 2.4.
>
> The system indexes 116,000 documents just fine.
>
> Maxheap is '2048m', 64-bit environment.
>
> Then I need to get the tokens from these documents, and near the end I run
> into:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> The heap is apparently filling up with each document retrieved and never
> getting cleared. I was expecting that it would give me the information for
> one document, then clear that and give me the info for another, etc. I've
> looked at it with jhat.
>
> I have tried deleting the Python objects that receive any information from
> Lucene--no effect.
> I have tried reusing the Python objects that receive any information from
> Lucene--no effect.
> I have tried running the Python garbage collector (it slowed the program
> slightly, but generally no effect).
>
> Is there anything else I can do to get the tokens for a document and make
> sure that this does not fill up the heap? I need to be able to run a million
> or more documents through this and get their tokens.
>
> Here is a code snippet.
>
>     reader = self.index.getReader()
>     lReader = reader.get()
>     searcher = self.index.getSearcher()
>     lSearcher = searcher.get()
>     query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
>     hits = list(lSearcher.search(query))
>     if hits:
>         hit = lucene.Hit.cast_(hits[0])
>         tfvs = lReader.getTermFreqVectors(hit.id)
>         if tfvs is not None:  # tfvs is None if the vector is not stored
>             # There's one TermFreqVector for each field that has one.
>             for tfv in tfvs:
>                 tfvP = lucene.TermFreqVector.cast_(tfv)
>                 if returnAllFields or tfvP.field in termFields:
>                     # add only the fields that were asked for
>                     tFields[tfvP.field] = dict([(t, f) for (t, f) in
>                         zip(tfvP.getTerms(), tfvP.getTermFrequencies())
>                         if f >= minFreq])
>     else:
>         # This shouldn't happen, but we just log the error and march on
>         self.log.error("Unable to fetch doc %s from index" % (uid))
>
>     lReader.close()
>     lSearcher.close()
>
> lReader is really:
>
>     lucene.IndexReader.open(self._store)
>
> I've tried the Lucene list, but no one there has yet come up with a solution.
> If filling the heap is a Lucene problem (is it a bug?), I need to look for a
> way to circumvent that bug.
>
> Thanks,
>
> Herb
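
To the last question: until you can move to 2.9, the only reliable mitigation I know of is to keep the JVM short-lived, so the leaked strings are reclaimed by the OS when the process exits. A rough sketch, where extract_tokens.py is a hypothetical worker script that runs your getTermFreqVectors() loop over the UIDs listed in the file it is handed (subprocess ships with Python 2.4):

    import os
    import subprocess
    import tempfile

    CHUNK = 5000  # a guess; size it so one child stays under the 2048m heap

    def run_in_batches(uids):
        for start in xrange(0, len(uids), CHUNK):
            fd, path = tempfile.mkstemp()
            try:
                os.write(fd, '\n'.join(uids[start:start + CHUNK]))
                os.close(fd)
                # Each child calls initVM() itself, processes one chunk,
                # writes its results to disk and exits; the leaked heap
                # goes away with the process.
                rc = subprocess.call(['python', 'extract_tokens.py', path])
                if rc != 0:
                    raise RuntimeError('worker failed at offset %d' % start)
            finally:
                os.remove(path)

It is clumsy, but it bounds the leak to one chunk's worth of documents per JVM.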