Emerson Clarke wrote:

The indexing process works like this.

1.) Open a document and parse its contents.
2.) Look up records in the first database based on the contents of the
document, updating records where appropriate and inserting new ones.
3.) Transform the document based on what was obtained from the
first database.
4.) Create a filesystem structure in the form of a folder and two or more files.
5.) Look up some more records in a second database, updating and
inserting as necessary.

For architectural reasons the above steps must be performed in that
order.  Each step depends on the previous one, so the operations can't
be separated or queued up in the way that you suggested.
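
A rough sketch of one pass through that pipeline in C against the
sqlite3 API (the Document and Record types and the helper functions
are hypothetical stand-ins for the real routines, not actual code):

    #include <sqlite3.h>

    typedef struct Document Document;   /* opaque; whatever the parser produces */
    typedef struct Record   Record;

    /* Hypothetical helpers standing in for the real implementation. */
    Document *parse_document(const char *path);
    Record   *lookup_or_insert(sqlite3 *db, Document *doc);
    void      transform_document(Document *doc, Record *rec);
    void      write_output_files(Document *doc);
    void      update_second_db(sqlite3 *db, Document *doc);
    void      free_document(Document *doc);

    /* One pass through the pipeline for a single document. */
    void index_document(const char *path, sqlite3 *db1, sqlite3 *db2)
    {
        Document *doc = parse_document(path);        /* step 1 */
        Record   *rec = lookup_or_insert(db1, doc);  /* step 2 */
        transform_document(doc, rec);                /* step 3 */
        write_output_files(doc);                     /* step 4 */
        update_second_db(db2, doc);                  /* step 5 */
        free_document(doc);
    }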

But by having multiple threads and using synchronisation around the
database operations, you can ensure that multiple database operations
are always pending.

Each thread will be at a different stage in the pipeline, but a few
will always be ready to perform a database operation, so the idea is
to maximise throughput.
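
For example, with POSIX threads each worker can run the whole pipeline
and take a mutex only around the database steps, so the parsing,
transformation and filesystem work overlap across threads.  A sketch,
reusing the hypothetical helpers from the sketch above (next_document()
is an assumed source of work):

    #include <pthread.h>
    #include <sqlite3.h>

    const char *next_document(void);    /* hypothetical work feed */

    static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
    static sqlite3 *db1, *db2;  /* opened once (e.g. sqlite3_open) before
                                   the workers start, shared by all of them */

    void *worker(void *unused)
    {
        const char *path;
        while ((path = next_document()) != NULL) {
            Document *doc = parse_document(path);     /* step 1: no lock */

            pthread_mutex_lock(&db_lock);             /* step 2: serialised */
            Record *rec = lookup_or_insert(db1, doc);
            pthread_mutex_unlock(&db_lock);

            transform_document(doc, rec);             /* step 3: no lock */
            write_output_files(doc);                  /* step 4: no lock */

            pthread_mutex_lock(&db_lock);             /* step 5: serialised */
            update_second_db(db2, doc);
            pthread_mutex_unlock(&db_lock);

            free_document(doc);
        }
        return NULL;
    }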

As you said, the more rows per transaction, the more rows per second.
Which brings us back to the original issue: why can't I have multiple
threads all using the same connection within a single transaction?

Of course I know the simple answer, which is that the current API does
not support this.  But I'm wondering why, and whether there are any
other ways to achieve the desired performance.

Emerson


On 12/28/06, Roger Binns <[EMAIL PROTECTED]> wrote:

Emerson Clarke wrote:
> The idea is that because I am accessing two databases, and doing
> several file system operations per document, there should be a large
> gain by using many threads.  There is no actual indexing process, the
> whole structure is the index, but if anything the database operations
> take the most time.  The filesystem operations have a very small
> amount of overhead.

That is all unclear from your original description.  Aren't you trying
to "index" several million documents and doesn't the process of indexing
consist of two parts?

1: Open the document, parse it in various ways, build index data, close it
2: Add a row to a SQLite database

My point was that #1 is way more work than #2, so you can run #1's in
multiple threads/processes and do #2 in a single thread using a
queue/pipe object for communication.
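
In C that split might look something like the sketch below, where any
number of parser threads feed one writer thread that owns the
connection (queue_push()/queue_pop() stand for a blocking
producer/consumer queue, and the other helpers are again hypothetical):

    #include <pthread.h>
    #include <sqlite3.h>

    typedef struct IndexData IndexData;            /* result of step #1 */

    /* Hypothetical blocking queue; pop returns NULL at end of input. */
    void       queue_push(IndexData *row);
    IndexData *queue_pop(void);

    IndexData  *build_index_data(const char *path); /* open/parse/index/close */
    void        insert_row(sqlite3 *db, IndexData *row);
    const char *next_document(void);

    void *parser_thread(void *unused)              /* #1: run many of these */
    {
        const char *path;
        while ((path = next_document()) != NULL)
            queue_push(build_index_data(path));
        return NULL;
    }

    void *writer_thread(void *arg)                 /* #2: run exactly one */
    {
        sqlite3 *db = (sqlite3 *)arg;              /* only this thread uses it */
        IndexData *row;
        /* One big transaction for simplicity; periodic commits also work. */
        sqlite3_exec(db, "BEGIN", 0, 0, 0);
        while ((row = queue_pop()) != NULL)
            insert_row(db, row);
        sqlite3_exec(db, "COMMIT", 0, 0, 0);
        return NULL;
    }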

On the other hand, if #1 is way less work than #2 then you will be bound
by the speed at which you decide to make transactions in SQLite.  A 7200
rpm disk limits you to 60 transactions a second.  The more rows per
transaction, the more rows per second.
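
In other words, batching is what buys throughput: group many inserts
between BEGIN and COMMIT so each disk sync covers many rows.  A sketch
(table and column names invented, batch size arbitrary):

    #include <sqlite3.h>

    /* Insert ndocs rows, committing every 1000 so one disk sync
       covers 1000 rows instead of one. */
    void bulk_insert(sqlite3 *db, char **paths, int ndocs)
    {
        sqlite3_stmt *stmt;
        sqlite3_prepare_v2(db, "INSERT INTO docs(path) VALUES(?)",
                           -1, &stmt, 0);
        sqlite3_exec(db, "BEGIN", 0, 0, 0);
        for (int i = 0; i < ndocs; i++) {
            sqlite3_bind_text(stmt, 1, paths[i], -1, SQLITE_STATIC);
            sqlite3_step(stmt);
            sqlite3_reset(stmt);
            if (i % 1000 == 999) {              /* commit a full batch */
                sqlite3_exec(db, "COMMIT", 0, 0, 0);
                sqlite3_exec(db, "BEGIN", 0, 0, 0);
            }
        }
        sqlite3_exec(db, "COMMIT", 0, 0, 0);    /* commit the remainder */
        sqlite3_finalize(stmt);
    }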

Roger


Sorry if this has been posted before, but our solution to this was data normalisation.  That is, we store only the filename plus an integer ID identifying the file's directory, along with a lookup table mapping IDs to directory names.  This helped us greatly, as well as reducing the amount of data stored.
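
Concretely, the normalised layout might look something like this
(table and column names invented for illustration):

    #include <sqlite3.h>

    /* Directory names stored once; files reference them by integer id. */
    void create_schema(sqlite3 *db)
    {
        sqlite3_exec(db,
            "CREATE TABLE directories ("
            "  id   INTEGER PRIMARY KEY,"
            "  path TEXT UNIQUE);"
            "CREATE TABLE files ("
            "  id     INTEGER PRIMARY KEY,"
            "  dir_id INTEGER REFERENCES directories(id),"
            "  name   TEXT);",
            0, 0, 0);
    }

Rebuilding a full path is then a join: SELECT d.path || '/' || f.name
FROM files f JOIN directories d ON f.dir_id = d.id.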
