Hello again,

the index is indeed kind of large.. even though I have Field.Store.NO set for the actual content.. (OK, the documents are 2-3k on average, but it could still be smaller..)
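For reference, the content field is added roughly like this (a sketch, not my exact indexing code; it assumes the Lucene 2.3 Field API):

from lucene import Document, Field

text = '...'  # the raw document text, 2-3k on average

doc = Document()
# indexed with term vectors (needed for getTermFreqVector below),
# but Store.NO, so the raw text itself should not end up in the index
doc.add(Field('content', text, Field.Store.NO,
              Field.Index.TOKENIZED, Field.TermVector.YES))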

The memory use just grows and grows.. it doesn't reach a critical level, but it ate up 800 MB of the 1024 MB I have within some 15 minutes.. after that it stayed stable. I guess this would be acceptable, but I don't quite understand why it happens..

The arrays are pretty much dependent on the term (i.e. word).. for words like "is" they're around the size of the number of documents.. for rare words they can be just one to three entries long..

I don't have Java code to test all this, sorry.

Martin

On Tue, 9 Sep 2008, Martin Bachwerk wrote:

The index is about 500 MB (328419 documents)..
maxheap is set to 512m..
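The heap is passed in via initVM, roughly like this (a sketch, assuming the JCC-based initVM signature):

from lucene import initVM, CLASSPATH

# give the embedded JVM a 512 MB maximum heap
initVM(CLASSPATH, maxheap='512m')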

When you say 'leak', what exactly happens ? An out-of-memory error ?
Memory use just grows ? Beyond what you expect ?
How big are the arrays that come back from the PythonTermDocs ?

I notice that your index is big for the number of docs it contains. Are you storing the document contents in the index as well ? (That would explain it, and it's a matter of choice, not a bug.)

Does the equivalent Java code leak as well ?

I'm asking all these questions since reproducing this is going to take more work than just running a piece of code. First I have to package your code snippets into a coherent program and then I have to work with a big index.

Andi..


Martin


On Sep 9, 2008, at 9:52, Martin Bachwerk <[EMAIL PROTECTED]> wrote:

Oh, sure.

I've been iterating over all the terms in one document.. and counting the
total number of occurrences of each term across all documents:

tf = ireader.getTermFreqVector(docID, 'content')

1. No leak, but 5-10x slower:
for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
    docs = ireader.termDocs(Term('content', word))
    while docs.next():
        totalHits += docs.freq()

2. Quick, but fills up memory:
for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
    td = ireader.termDocs(Term('content', word))
    ptd = PythonTermDocs(td)
    values = ptd.read(docNum)
    # second half of values holds the frequencies
    totalHits += sum(values[len(values)/2:])
    ptd.close()
    td.close()

How big is your index ?
How much Java memory did you give your process (via initVM) ?

Andi..



Hope this helps,

Martin

On Sep 9, 2008, at 3:39, Martin Bachwerk <[EMAIL PROTECTED]> wrote:

Hey again,

I'm honestly not very good with memory allocation and such, but when I use this .read() method instead of iterating over all termDocs with next(), I get a huge memory leak.. memory just goes up and up and never comes down. I've tried using del on the values list, calling td.close(), and running gc.collect() from time to time, but nothing seems to make any difference.
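Roughly what the attempted cleanup looked like (a sketch, with ireader, tf and docNum as in the snippets above; not my exact code):

import gc
from lucene import Term, PythonTermDocs

totalHits = 0
for word in tf.getTerms():
    td = ireader.termDocs(Term('content', word))
    ptd = PythonTermDocs(td)
    values = ptd.read(docNum)
    totalHits += sum(values[len(values)/2:])
    ptd.close()
    td.close()
    del values    # explicitly drop the Python list
    gc.collect()  # force a collection pass -- neither of these helped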

Can you please send a small piece of code that I can run to reproduce this leak ?

Thanks !

Andi..



I'm running Python 2.4 at the moment and can't switch to 2.5 yet for various reasons, so I would really appreciate some help here. I will test it on a 2.5 machine though, just to see whether it behaves the same there or better.

Thanks!
Martin

On Mon, 8 Sep 2008, Martin Bachwerk wrote:

I've been trying to use the read() method on TermDocs as described for PyLucene (with an int specifying the number of documents to read). However, I've been getting an error that suggests the call is actually trying to run the Java API version of the method (with two arrays as arguments and an integer n as the return value). That version actually works too, but only as far as the integer return value; I can't find a way to fill the two arrays.. :(

Error trace:
docs, freqs = td.read(10)
InvalidArgsError: (<type 'TermDocs'>, 'read', (10,))

Could someone please help ? I'm using PyLucene 2.3.1.

The docs are out of date here, sorry.

In the new PyLucene (the one built with JCC, the one you're running), the docs should say that a PythonTermDocs instance should be wrapped around the TermDocs instance as follows (see also the SpecialsFilter.py sample):

values = PythonTermDocs(td).read(10)
docs = values[:len(values)/2]
freqs = values[len(values)/2:]
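which you can then walk as pairs, e.g.:

for doc, freq in zip(docs, freqs):
    print doc, freq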

Yes, this is quite ugly and I intend to change the way arrays are handled in JCC before I release version 2.0 so that this kind of kludge is no longer necessary.

Andi..


Thanks,

Martin



_______________________________________________
pylucene-dev mailing list
pylucene-dev@osafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
