Emerson Clarke wrote:
| I have deliberately tried to avoid giving too much detail on the
| architecture of the index since that was not the point and I didn't
| want to end up debating it.

I don't want to debate your index architecture either :-).  Quite
simply, several times a month people post to this list wanting SQLite
changed to match how they want to structure things.  People on the list
explore with the poster how the items of data are related, and can
suggest an alternative way of doing things.  Usually the poster finds
that simpler than what they had first thought of and goes away happy.

Is this your question: I want SQLite to work differently than it
currently does so that it matches how I want to do things?

If that is the case, then the answer is that you can go ahead and
rewrite as much of SQLite as you want in order to do that.  The code is
public domain, so there are no legal or technical hindrances standing
in your way.  This thread may as well end at that.

On the other hand, if you do want to work within the constraints of
SQLite then there are quite a few things that can be suggested.  But
that is only possible if more is known about the relationships of the data.

| I did make an attempt to explain that A and B could not be done at the
| same time in a previous message, but perhaps it's been lost in the
| conversation.  The process involves several stages, some of which are
| database operations and some of which are file operations, and the
| operations are not separable.  They must be done in sequential order.

I was trying to establish what has to be serialized.  In particular,
the question was whether A and B have any relationship to each other.
If they do, then they would have to be processed serially and I don't
see the relevance of threading etc.

If they can be processed at the same time, then some sort of
partitioning can happen.  In theory breaking the dataset into 10
partitions can give 10 times the performance, but in practice there
will need to be some coordination in order to make it look like there
is one database, not multiple pieces.
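
As a rough illustration, here is a minimal sketch in C of that kind of
partitioning.  The file names and the modulo routing are my own
assumptions for illustration, not anything SQLite prescribes:

  #include <stdio.h>
  #include <sqlite3.h>

  #define NPART 10

  static sqlite3 *aDb[NPART];

  /* open one database file per partition */
  static int open_partitions(void){
    char zName[32];
    int i;
    for(i=0; i<NPART; i++){
      snprintf(zName, sizeof(zName), "index-part-%d.db", i);
      if( sqlite3_open(zName, &aDb[i])!=SQLITE_OK ) return 1;
    }
    return 0;
  }

  /* route each document to exactly one partition; each partition can
  ** then be written by its own thread with no lock contention against
  ** the others */
  static sqlite3 *db_for_doc(sqlite3_int64 docid){
    return aDb[(int)(docid % NPART)];
  }

A query layer then has to merge results across the partitions to keep
up the one-database illusion.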

| The database operations, though very small, still consume the most time
| and are the most sensitive to how the synchronisation takes place and
| where the transactions are placed.

Have you considered just using plain DB/dbm/gdbm and then importing the
data on demand into SQLite?
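
For instance, here is a minimal sketch of the gdbm side (the key name
and record layout are made up for illustration).  The point where
gdbm_fetch returns is where you would INSERT the record into SQLite, on
demand:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <gdbm.h>

  int main(void){
    GDBM_FILE dbf = gdbm_open("docs.gdbm", 0, GDBM_WRCREAT, 0644, 0);
    if( !dbf ) return 1;

    /* store: one cheap hash write per document, no SQL or journal
    ** overhead */
    datum key = { "doc:42", (int)strlen("doc:42") };
    datum val = { "title|mtime|terms", (int)strlen("title|mtime|terms") };
    gdbm_store(dbf, key, val, GDBM_REPLACE);

    /* fetch on demand - this is where you would INSERT into SQLite */
    datum got = gdbm_fetch(dbf, key);
    if( got.dptr ){
      printf("%.*s\n", got.dsize, got.dptr);
      free(got.dptr);          /* gdbm_fetch mallocs the result */
    }
    gdbm_close(dbf);
    return 0;
  }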

Also a lot of the synchronisation is because SQLite makes damn sure it
doesn't lose your data.  If your documents are permanent (ie you can
access them later if need be), then you can loosen the constraints on
SQLite.  For example you could run with pragma synchronous=off and then
do a checkpoint every 100,000 documents where you close the database,
copy it to a permanent file, sync, and start again.  You could also use
a ram disk and copy to permanent storage as your checkpoint.
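
A minimal sketch of that checkpoint idea, assuming a working file
called work.db (all file names here are hypothetical):

  #include <stdio.h>
  #include <unistd.h>
  #include <sqlite3.h>

  /* crude byte-for-byte copy of the working database file */
  static int copy_file(const char *zSrc, const char *zDst){
    FILE *in = fopen(zSrc, "rb");
    FILE *out = fopen(zDst, "wb");
    char buf[8192];
    size_t n;
    if( !in || !out ){
      if( in ) fclose(in);
      if( out ) fclose(out);
      return 1;
    }
    while( (n = fread(buf, 1, sizeof(buf), in))>0 ) fwrite(buf, 1, n, out);
    fclose(in);
    fclose(out);
    return 0;
  }

  /* close, snapshot, sync, reopen - run this every ~100,000 documents */
  static void checkpoint(sqlite3 **pDb, int generation){
    char zPerm[64];
    sqlite3_close(*pDb);                 /* flush everything out */
    snprintf(zPerm, sizeof(zPerm), "index-ckpt-%d.db", generation);
    copy_file("work.db", zPerm);         /* keep a durable copy */
    sync();                              /* push the copy to disk */
    sqlite3_open("work.db", pDb);        /* carry on where we were */
    /* fast but unsafe between checkpoints */
    sqlite3_exec(*pDb, "PRAGMA synchronous=OFF", 0, 0, 0);
  }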

| I don't think custom functions are
| appropriate for what I'm doing and I'm not sure how virtual tables
| would be either; I rather suspect that would be a very complicated
| approach.

You can (ab)use custom functions and virtual tables to help behind the
scenes.  For example they can be used to make data sets that are
partitioned appear to be a single whole.  Another example is if you have
your database in two pieces - one that is read only with "old" data and
a new one with updates.  That can again appear to the rest of the code
as one database.  Finally you can also make the functions and virtual
tables have side effects even on what appear to be read only queries.
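
To make the last point concrete, here is a sketch of a scalar function
with a side effect.  The function name and the counter are invented for
illustration; a query that looks like a plain SELECT bumps the counter
as rows flow past:

  #include <sqlite3.h>

  static int nAccess = 0;      /* the hidden side effect */

  /* pass the value straight through, but count that it was looked at */
  static void countedFunc(sqlite3_context *ctx, int argc,
                          sqlite3_value **argv){
    nAccess++;
    sqlite3_result_value(ctx, argv[0]);
  }

  /* after opening the database:
  **   sqlite3_create_function(db, "counted", 1, SQLITE_UTF8, 0,
  **                           countedFunc, 0, 0);
  ** then a "read only" query such as
  **   SELECT counted(body) FROM documents;
  ** returns the same rows as SELECT body, yet nAccess has changed. */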

| The schema is extremely simple, and there is barely any logic to the
| indexing process at all.

Maybe not indexing the documents at all would work?  If you use a
virtual table, you can make it grovel through the documents on demand.
You can even build indices (in the SQL sense) in your own format, with
your own performance characteristics, and use those for the virtual
table.
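
To give a feel for the shape of it, here is a skeletal read-only
virtual table in C that grovels over a hard-coded document list; a real
module would walk your files instead.  Everything beyond the
sqlite3_module API itself is invented for illustration:

  #include <string.h>
  #include <sqlite3.h>

  /* stand-in "documents"; a real module would scan files on demand */
  static const char *aDoc[] = { "first doc", "second doc", "third doc" };
  #define NDOC 3

  typedef struct { sqlite3_vtab base; } DocTab;
  typedef struct { sqlite3_vtab_cursor base; int iRow; } DocCur;

  static int docConnect(sqlite3 *db, void *pAux, int argc,
                        const char *const *argv, sqlite3_vtab **ppVtab,
                        char **pzErr){
    DocTab *p = sqlite3_malloc(sizeof(*p));
    if( !p ) return SQLITE_NOMEM;
    memset(p, 0, sizeof(*p));
    *ppVtab = &p->base;
    return sqlite3_declare_vtab(db,
        "CREATE TABLE x(docid INTEGER, body TEXT)");
  }
  static int docDisconnect(sqlite3_vtab *pVtab){
    sqlite3_free(pVtab); return SQLITE_OK;
  }
  static int docBestIndex(sqlite3_vtab *pVtab, sqlite3_index_info *pInfo){
    return SQLITE_OK;          /* full scan only, no index smarts */
  }
  static int docOpen(sqlite3_vtab *pVtab, sqlite3_vtab_cursor **ppCur){
    DocCur *c = sqlite3_malloc(sizeof(*c));
    if( !c ) return SQLITE_NOMEM;
    memset(c, 0, sizeof(*c));
    *ppCur = &c->base;
    return SQLITE_OK;
  }
  static int docClose(sqlite3_vtab_cursor *pCur){
    sqlite3_free(pCur); return SQLITE_OK;
  }
  static int docFilter(sqlite3_vtab_cursor *pCur, int idxNum,
                       const char *idxStr, int argc, sqlite3_value **argv){
    ((DocCur*)pCur)->iRow = 0;   /* restart the scan */
    return SQLITE_OK;
  }
  static int docNext(sqlite3_vtab_cursor *pCur){
    ((DocCur*)pCur)->iRow++; return SQLITE_OK;
  }
  static int docEof(sqlite3_vtab_cursor *pCur){
    return ((DocCur*)pCur)->iRow >= NDOC;
  }
  static int docColumn(sqlite3_vtab_cursor *pCur, sqlite3_context *ctx,
                       int i){
    DocCur *c = (DocCur*)pCur;
    if( i==0 ) sqlite3_result_int(ctx, c->iRow);
    else sqlite3_result_text(ctx, aDoc[c->iRow], -1, SQLITE_STATIC);
    return SQLITE_OK;
  }
  static int docRowid(sqlite3_vtab_cursor *pCur, sqlite3_int64 *pRowid){
    *pRowid = ((DocCur*)pCur)->iRow; return SQLITE_OK;
  }

  static sqlite3_module docModule = {
    1,              /* iVersion */
    docConnect,     /* xCreate */
    docConnect,     /* xConnect */
    docBestIndex, docDisconnect, docDisconnect,
    docOpen, docClose, docFilter, docNext, docEof,
    docColumn, docRowid,
    0, 0, 0, 0, 0, 0, 0   /* no update/transaction/rename support */
  };

  /* register and use:
  **   sqlite3_create_module(db, "docs", &docModule, 0);
  **   CREATE VIRTUAL TABLE documents USING docs;
  **   SELECT body FROM documents;   -- scans the list on demand */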

| Unfortunately I cannot do this with sqlite at the moment...

Correct.  SQLite errs on the side of being a library with no
controller, working with multiple processes, and only having the lowest
common denominator of operating system locking functionality available.
There are techniques that can be used to improve concurrency.  DRH has
a policy of only using those that are at least 17 years old, as
otherwise there are likely to be patent implications.  See this page
for example:

  http://www.sqlite.org/cvstrac/wiki?p=BlueSky

In summary, you can do one or more of the following:

- Use some other database
- Rewrite SQLite bits yourself
- Use some sort of partitioning mechanism
- ... which can be hidden using custom functions and virtual tables
- Use a different storage mechanism (eg db/gdbm) with SQLite giving you
  a front end (virtual tables)
- Relax synchronisation and use a checkpointing mechanism

Roger