On Thu, 17 Jan 2008, Brian Merrell wrote:
> The docs are pretty standard English language documents. I can try to find
> some decent spam (I hesitate to give out the actual documents for privacy
> reasons) but I imagine anything you have would reproduce the problem. I am
> using the basic MeetLucene Indexer.py with the following Analyzer:
It would be helpful if all I had to do to reproduce the problem was to
type in a one-liner in a shell. With what you sent me, I still have to do
work to figure out how to include your code into the indexer.py program.
It's not rocket science but since you've already done it, it's less work for
you to send this via email than for me to reconstruct it.
Maybe my assumptions are wrong and it's more work for you too?
Now, that being said, looking at your code, it's quite possible that the
leak is with the BrianFilter instances. You can verify this by checking the
number of PythonTokenFilter instances left in the env after each document is
indexed. I'm assuming, possibly wrongly, that BrianAnalyzer's tokenStream()
method is called once for every document. If that's indeed the case, you are
going to leak BrianFilter instances and it's going to be necessary to
finalize() these instances manually.
1. to verify how many PythonTokenFilter instances you have in env:
   print env._dumpRefs('class org.osafoundation.lucene.analysis.PythonTokenFilter', 0)
where env is the result of the call to initVM() and can also be obtained
by calling getVMEnv()
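To illustrate the kind of live-instance count _dumpRefs reports, here is a small pure-Python sketch (no PyLucene needed; TrackedFilter is a made-up stand-in class): instances stay in the count until every reference to them is dropped, which is exactly the leak symptom to look for after each document.

```python
import gc
import weakref

class TrackedFilter(object):
    """Hypothetical stand-in for PythonTokenFilter: counts live instances."""
    _refs = []

    def __init__(self):
        # Weak references let us count instances without keeping them alive.
        TrackedFilter._refs.append(weakref.ref(self))

    @classmethod
    def live_count(cls):
        return sum(1 for r in cls._refs if r() is not None)

filters = [TrackedFilter() for _ in range(3)]
print(TrackedFilter.live_count())   # 3 -- like a growing _dumpRefs count

del filters
gc.collect()
print(TrackedFilter.live_count())   # 0 -- references released, no leak
```

If the count printed after each indexed document keeps climbing, the filters are being leaked.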
2. to call finalize() on your BrianFilter instances, you need to first keep
track of them. Then, after each document is indexed, call finalize() on
the accumulated objects. Below is a modification of BrianAnalyzer to
support this (assuming one single thread of execution per analyzer)
class BrianAnalyzer(PythonAnalyzer):

    def __init__(self):
        super(BrianAnalyzer, self).__init__()
        self._filters = []

    def tokenStream(self, fieldName, reader):
        filter = BrianFilter(LowerCaseFilter(StandardFilter(StandardTokenizer(reader))))
        # remember the filter so it can be finalized later
        self._filters.append(filter)
        return filter

    def finalizeFilters(self):
        for filter in self._filters:
            filter.finalize()
        # drop our references so the objects can be reclaimed
        del self._filters[:]
After each document is indexed, call brianAnalyzer.finalizeFilters().
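To make the bookkeeping concrete, here is a minimal pure-Python sketch of the same pattern with stand-in classes (MockFilter and MockAnalyzer are hypothetical names, no Lucene required): the analyzer records every filter it hands out, and finalizeFilters() releases the whole batch once the document is done.

```python
class MockFilter(object):
    """Hypothetical stand-in for BrianFilter; finalize() marks it released."""
    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True

class MockAnalyzer(object):
    """Mirrors the BrianAnalyzer bookkeeping above, minus Lucene."""
    def __init__(self):
        self._filters = []

    def tokenStream(self, fieldName, reader):
        f = MockFilter()
        self._filters.append(f)   # remember it so it can be finalized later
        return f

    def finalizeFilters(self):
        for f in self._filters:
            f.finalize()
        del self._filters[:]      # drop our references to the batch

analyzer = MockAnalyzer()
made = [analyzer.tokenStream('contents', None) for _ in range(3)]
analyzer.finalizeFilters()        # call this after each document is indexed
print(all(f.finalized for f in made))   # True
```

The same call placed at the end of the per-document indexing loop keeps the filter count flat instead of growing with every document.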
Andi..
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev