Re: [pylucene-dev] TermDocs.read() method

Martin Bachwerk Tue, 09 Sep 2008 10:04:34 -0700

The index is about 500 MB (328419 documents).
maxheap is set to 512m.


Martin

On Sep 9, 2008, at 9:52, Martin Bachwerk<[EMAIL PROTECTED]> wrote:
Oh, sure.

I've been iterating over all terms in one document and counting the
total number of occurrences of each term across all documents:

tf = ireader.getTermFreqVector(docID, 'content')

1. no leak but 5-10x slower
for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
    docs = ireader.termDocs(Term('content', word))
    while docs.next():
        totalHits += docs.freq()

2. quick, but fills up memory
for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
    td = ireader.termDocs(Term('content', word))
    ptd = PythonTermDocs(td)
    values = ptd.read(docNum)
    totalHits = sum(values[len(values)/2:])
    ptd.close()
    td.close()
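
For readers following along, here is a minimal sketch of the counting loop from variant 1, with the total reset per term and the iterator closed even if an error occurs. FakeTermDocs is a hypothetical stand-in written for this example, not a PyLucene class; it only mimics the next(), freq() and close() calls used above, so the sketch runs without an index.

```python
class FakeTermDocs(object):
    """Hypothetical stand-in for a TermDocs-like iterator (not PyLucene)."""

    def __init__(self, postings):
        # postings: list of (doc_id, freq) pairs for a single term
        self._postings = list(postings)
        self._pos = -1

    def next(self):
        # Advance to the next posting; return False once exhausted.
        self._pos += 1
        return self._pos < len(self._postings)

    def freq(self):
        # Frequency of the term in the current document.
        return self._postings[self._pos][1]

    def close(self):
        self._postings = []


def count_total_hits(term_docs):
    """Sum the term frequency over every document the iterator yields."""
    total = 0
    try:
        while term_docs.next():
            total += term_docs.freq()
    finally:
        # Release the iterator whether or not the loop finished cleanly.
        term_docs.close()
    return total


# Illustrative data: one term occurring in three documents.
total_hits = count_total_hits(FakeTermDocs([(3, 2), (7, 1), (9, 4)]))
```

With a real index, the object returned by ireader.termDocs(Term('content', word)) would take the place of FakeTermDocs.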
How big is your index?
How much Java memory did you give your process (via initVM)?

Andi..
Hope this helps,

Martin
On Sep 9, 2008, at 3:39, Martin Bachwerk<[EMAIL PROTECTED]> wrote:
Hey again,
I'm honestly not very good with memory allocation and the like, but when using this .read() method instead of iterating over all termDocs with next(), I get a huge memory leak: memory use just goes up and up and never comes down. I've tried using del on the values list, calling td.close() and running gc.collect() at times, but nothing seems to make any difference.
Can you please send a small piece of code that I can run to reproduce this leak?
Thanks !

Andi..
I'm running Python 2.4 at the moment and can't change to 2.5 yet for various reasons, so I would really appreciate some help here. I will test it on a 2.5 machine though, just to see whether it's the same there or better.
Thanks!
Martin
On Mon, 8 Sep 2008, Martin Bachwerk wrote:
I've been trying to use the read() method on TermDocs as described for PyLucene (with an int to specify the number of documents to read in).
However, I've been getting an error that suggests the call is actually trying to run the Java API version of the method (with 2 arrays as arguments and an integer n as return value). This actually works too, but only as far as the integer; I can't find a way to fill the two arrays. :(

Error trace:
docs, freqs = td.read(10)
InvalidArgsError: (<type 'TermDocs'>, 'read', (10,))

Could someone please help! I'm using PyLucene 2.3.1.
The docs are out of date here, sorry.
In the new PyLucene (the one built with JCC, the one you're running), the docs should say that a PythonTermDocs instance should be wrapped around the TermDocs instance as follows (also see the SpecialsFilter.py sample):
values = PythonTermDocs(td).read(10)
docs = values[:len(values)/2]
freqs = values[len(values)/2:]
Yes, this is quite ugly and I intend to change the way arrays are handled in JCC before I release version 2.0 so that this kind of kludge is no longer necessary.
Andi..
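
The slicing described above can be wrapped in a small helper. This is only a sketch, assuming the list returned by read() is laid out as in Andi's snippet (document numbers in the first half, frequencies in the second half). The name split_docs_freqs is made up for this example, and // is used so the split behaves the same on Python 2 and Python 3.

```python
def split_docs_freqs(values):
    """Split a flat [doc..., freq...] list into (docs, freqs).

    Assumes the first half holds document numbers and the second half
    holds the matching frequencies, as in the snippet above.
    """
    if len(values) % 2 != 0:
        raise ValueError("expected an even-length list of docs then freqs")
    half = len(values) // 2
    return values[:half], values[half:]


# Illustrative data standing in for the list returned by read().
docs, freqs = split_docs_freqs([3, 7, 9, 2, 1, 4])
pairs = list(zip(docs, freqs))
```

Summing freqs then gives the total number of occurrences across the documents that were read.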
Thanks,

Martin



_______________________________________________
pylucene-dev mailing list
pylucene-dev@osafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
