On Thu, 4 Jan 2007, Terry Jones wrote:

I have a fairly simple question about trying to write a long-running
Berkeley DB-based reader/writer class. I've written something that works,
but it's extremely slow (0.4 seconds each time I add a tiny document).

In summary, I want to have a class that looks something like this:

   class Indexer:
       def __init__(self): pass
       def addDoc(self, doc): pass
       def search(self, query): pass
       def close(self): pass

A program will create an instance i of this class, and then I want to
periodically call i.addDoc() and i.search(), in no particular order. A
search should of course find anything previously added by addDoc. Before
shutting down I'll call the close method.
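To pin down that intended contract, here is a toy stand-in in plain Python
(an in-memory list is just a placeholder for the real index; no PyLucene
involved):

```python
class Indexer:
    def __init__(self):
        self._docs = []          # placeholder for the real index

    def addDoc(self, doc):
        self._docs.append(doc)

    def search(self, term):
        # A search must see everything previously added by addDoc.
        return [d for d in self._docs if term in d]

    def close(self):
        self._docs = []

i = Indexer()
i.addDoc("hello world")
i.addDoc("goodbye world")
print(i.search("hello"))         # prints ['hello world']
i.close()
```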

I'm using the Berkeley DB variant of PyLucene because I want transactions.

Without going into too much detail, here's what I currently do.

1) __init__ creates two new DB instances, and calls db.open on them. These
   are stored as self.file1 and self.file2. This is done inside a transaction.

2) addDoc makes a new Document (doc) and calls doc.add to add two fields
   (with just a few chars of data in them). Then it does

       txn = None
       try:
           txn = self.env.txn_begin(None)
           directory = DbDirectory(txn, self.file1, self.file2, 0)
           writer = IndexWriter(directory, self.analyzer, self.createIndex)
           writer.addDocument(doc)
           writer.close()
       except:
           if txn is not None:
               txn.abort()
           raise
       else:
           txn.commit()
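As an aside, the txn_begin / abort / commit bookkeeping above can be written
once as a context manager. A sketch, assuming only that env.txn_begin(None)
returns an object with abort() and commit() methods (as bsddb's does):

```python
from contextlib import contextmanager

@contextmanager
def transaction(env):
    # Same pattern as the explicit block above: commit on success,
    # abort (and re-raise) on any exception in the with-body.
    txn = env.txn_begin(None)
    try:
        yield txn
    except:
        txn.abort()
        raise
    else:
        txn.commit()
```

With that in place, the body of addDoc shrinks to a with-block
(`with transaction(self.env) as txn:` followed by the DbDirectory and
IndexWriter calls), and the commit/abort handling can't be forgotten.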


This addDoc function is very slow. Sorry I can't report exactly what is
slow: the profiling data from profile or hotshot doesn't show me, and I
haven't yet timed the individual PyLucene calls.

I don't see a way around this. I want to use transactions, the DbDirectory
needs to be passed an open transaction, and the IndexWriter must be passed
the newly created directory. So it doesn't look like I can keep a single
writer in self. I could open a transaction in __init__, but I'd still need
to commit it at some point and open another, so that doesn't seem to help
either.

I'm wondering if I am doing something wrong here.

If not, is the slowness due to using the Berkeley directory? I.e., would
the problem go away if I used a normal FSDirectory or a RAMDirectory?


While there is a certain overhead in transactions and in opening and closing an index for every addition, I did notice a fair amount of thrashing in the Lucene directory I/O. I got things to be considerably faster by batching all updates in a RAMDirectory first, then adding the RAMDirectory contents to the DbDirectory via the addIndexes API.

For an example, see the code around line 485 at:
http://svn.osafoundation.org/chandler/trunk/chandler/repository/persistence/FileContainer.py
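If it helps to see the shape of that strategy without PyLucene, here is a
toy sketch: the buffer list stands in for the RAMDirectory, flush() stands
in for the single addIndexes/commit step, and the class name and threshold
are made up for illustration:

```python
class BatchedIndexer:
    """Sketch of the batching strategy: buffer additions cheaply in
    memory, then merge them into the main store in one expensive
    flush instead of paying one commit per document."""

    def __init__(self, flush_every=100):
        self.flush_every = flush_every
        self.buffer = []       # plays the role of the RAMDirectory
        self.main = []         # plays the role of the DbDirectory index
        self.flushes = 0       # how many expensive commits were paid for

    def addDoc(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.buffer:
            # One commit per batch (cf. addIndexes), not one per document.
            self.main.extend(self.buffer)
            self.buffer = []
            self.flushes += 1

    def search(self, term):
        # Flush first so a search sees everything previously added.
        self.flush()
        return [d for d in self.main if term in d]

    def close(self):
        self.flush()
```

The trade-off is the usual one: buffered documents are not durable until
the next flush, so the threshold bounds how much work a crash can lose.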

Andi..
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
