The indexing process works like this:

1. Open a document and parse its contents.
2. Look up records in the first database based on the contents of the document, updating records where appropriate and inserting new ones.
3. Transform the document based on what was obtained from the first database.
4. Create a filesystem structure in the form of a folder and two or more files.
5. Look up some more records in a second database, updating and inserting as necessary.
For architectural reasons the above steps must be performed in that order, which means the operations can't be separated or queued up in the way that you suggested: each step depends on the previous one. But by having multiple threads and using synchronisation around the database operations, you can ensure that multiple database operations are always pending. Each thread will be at a different stage in the pipeline, but a few will always be ready to perform a database operation, so the idea is to maximise throughput. As you said, the more rows per transaction, the more rows per second.

Which brings us back to the original issue: why can't I have multiple threads all using the same connection within a single transaction? Of course I know the simple answer, which is that the current API does not support this. But I'm wondering why, and whether there are any other ways to achieve the desired performance.

Emerson

On 12/28/06, Roger Binns <[EMAIL PROTECTED]> wrote:
Emerson Clarke wrote:
> The idea is that because I am accessing two databases, and doing
> several file system operations per document, there should be a large
> gain by using many threads. There is no actual indexing process, the
> whole structure is the index, but if anything the database operations
> take the most time. The filesystem operations have a very small
> amount of overhead.

That is all unclear from your original description. Aren't you trying to "index" several million documents, and doesn't the process of indexing consist of two parts?

1: Open the document, parse it in various ways, build index data, close it
2: Add a row to a SQLite database

My point was that #1 is way more work than #2, so you can run #1's in multiple threads/processes and do #2 in a single thread, using a queue/pipe object for communication. On the other hand, if #1 is way less work than #2, then you will be bound by the speed at which you decide to make transactions in SQLite. A 7200 rpm disk limits you to 60 transactions a second. The more rows per transaction, the more rows per second.

Roger

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------
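[Editorial note: the queue-based design Roger describes, with parsing in worker threads and all SQLite writes batched through one dedicated thread, can be sketched as below. The schema, batch size, and the stand-in "parse" are illustrative assumptions, not part of the original discussion.]

```python
# Sketch of Roger's suggestion: run the expensive parse step (#1) in
# many threads, and do all SQLite work (#2) in a single thread that
# batches many rows per transaction. Schema and batch size are
# illustrative assumptions.
import queue
import sqlite3
import threading

DONE = object()  # sentinel telling the writer thread to finish


def parse_worker(paths, rows):
    # Step #1: open/parse/build index data; safe to run in many threads
    # because it never touches the database connection.
    for path in paths:
        rows.put((path, len(path)))  # stand-in for real parsed data


def db_writer(rows, batch_size=100):
    # Step #2: the only code that touches SQLite, so no locking is
    # needed around the connection.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE idx(path TEXT, size INTEGER)")
    pending = 0
    while True:
        item = rows.get()
        if item is DONE:
            break
        db.execute("INSERT INTO idx VALUES (?, ?)", item)
        pending += 1
        if pending >= batch_size:
            db.commit()  # one disk sync covers batch_size rows
            pending = 0
    db.commit()  # flush any partial final batch
    return db
```

Because each commit costs roughly one disk rotation's worth of waiting, grouping `batch_size` inserts per transaction is what lifts the rows-per-second figure Roger mentions.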