Dear all,

I have been using PyLucene for some time now (really loving it, actually), and I have now encountered an intriguing situation. I have a large index, 17 GB, 50M documents. I want to look for cooccurrences of terms in a certain field (a boolean query) and rank the results on another field (date). In some situations this works like a charm. Sure, the sort needs lots of memory (around 500M), but once that's up and running, sorted queries are really fast. In other situations, however, memory use explodes. This occurs for a range of Java and Lucene versions.

Let me show the Python code (should be readable by all of you Java people):

===================
stable memory use code
===================

import lucene

# initialize things
storeDir = 'LuceneData/PubmedSentenceIndex/'
store = lucene.FSDirectory.getDirectory(storeDir, False)
searcher = lucene.IndexSearcher(store)
sortObject = lucene.Sort('date', False)

# six example pairs of query terms for cooccurrences
concepts = [
              ["umls/C0086418", "umls/C0003062"],
              ["umls/C0086418", "umls/C0870071"],
              ["umls/C0870071", "umls/C0003062"],
              ["umls/C0870071", "umls/C0449445"],
              ["umls/C0086418", "umls/C0449445"],
              ["umls/C0003062", "umls/C0449445"],
      ]

for cui1, cui2 in concepts:
      q = lucene.BooleanQuery()
      q.add(lucene.TermQuery(lucene.Term('profile', cui1)), lucene.BooleanClause.Occur.MUST)
      q.add(lucene.TermQuery(lucene.Term('profile', cui2)), lucene.BooleanClause.Occur.MUST)

      hits = searcher.search(q, sortObject)

=======

In the code above, the major objects are created before entering the loop. Inside the loop, the query is built and the searcher executes it with the sort object. After the first search, memory use is about 500M, and it remains stable during all subsequent iterations.

However, if I create the searcher object EACH time in the loop (for instance, just before the actual search is done), each search adds 300M to the memory usage. It seems that garbage collection does not really work. I tried invoking both the Python (gc.collect()) and the Java (lucene.System.gc()) garbage collectors, but no luck. Adding sleep times, for instance, does not help either.
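
To make the difference concrete, here is a minimal sketch of both variants in one function. The names run_queries and reuse_searcher are hypothetical, introduced only for this illustration, and the lucene module is passed in as an argument so the sketch does not depend on a particular PyLucene installation. The Lucene calls themselves are the same ones used in the code above.

```python
def run_queries(lucene, store_dir, concepts, reuse_searcher=True):
    """Run one date-sorted boolean query per pair of terms.

    `lucene` is the imported PyLucene module (passed in as an argument,
    which is an assumption made for this sketch only).
    """
    store = lucene.FSDirectory.getDirectory(store_dir, False)
    sort_object = lucene.Sort('date', False)
    must = lucene.BooleanClause.Occur.MUST

    # Stable variant: one searcher created before the loop.
    shared = lucene.IndexSearcher(store) if reuse_searcher else None

    results = []
    for cui1, cui2 in concepts:
        # Growing variant: a fresh searcher just before each search.
        searcher = shared if reuse_searcher else lucene.IndexSearcher(store)

        q = lucene.BooleanQuery()
        q.add(lucene.TermQuery(lucene.Term('profile', cui1)), must)
        q.add(lucene.TermQuery(lucene.Term('profile', cui2)), must)

        results.append(searcher.search(q, sort_object))
    return results
```

Calling run_queries(lucene, storeDir, concepts, reuse_searcher=False) corresponds to the case where memory grows by about 300M per search; reuse_searcher=True corresponds to the stable code shown earlier.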

Do any of you have an inkling of what is going on here?

Thanks in advance for any information,

best,

Marc






