Hi,

I will start out by stating that I am not deeply familiar with the 
sqlite & WAL file layout, but I will try anyway, from my current 
understanding:

The general problem that needs to be solved is to allow SQLITE to be 
constantly used over time (with no 'idle' time where no sql operations 
are done, and it is not inside any transaction).

The current WAL design does not allow this, since if there are all the 
time open read transactions and/or write transactions, the WAL will 
continue to grow forever.
I copy/paste below a tiny C program that re-creates this with write 
transactions.

Remember also that in the WAL file we 'gather' up a lot of work. If we 
do it in the background, this work-load can be smoothed up to run in 
parallel to the regular system work, but if we must make the SQL 'idle' 
(close all transactions) in order to execute the checkpoint, on a heavy 
load system this can mean halting the system for a long period of time 
(in our case, typically for 30 seconds!).

This needs to be solved by two "features" to be added to WAL:

1) 'incremental' checkpoint.

2) WAL file recycling (this item can also be solved by two WAL files, 
but I think its better to make the WAL file format a little bit more 
complex than start having to handle in the code management of multiple 
files).

Incremental checkpointing means that checkpointing can be done up until 
the first open transaction (mxFrame). This means both copying the WAL 
pages up-until mxFrame of the first open transaction to the DB file, and 
then MARKING those frames as 'DONE' (I will talk later on how to do the 
'DONE' marking).

This means that from that point onwards - accessing those pages will be 
to the DB file, not the WAL file.

WAL recycling will be done by writing pages to the beginning of the WAL 
file when a certain amount of pages from the beginning of the WAL file 
are 'checkpointed' (marked as DONE). This can also happen in the middle 
of a transaction.

Example:

Legend:
H - header.
1, 2, 3.. - page of transaction 1, 2, 3..
C - commit marker.
BOF1: beginning of WAL-1. EOF1: end of WAL-1
BOF2: beginning of WAL-2. EOF2: end of WAL-2
P: 64K of padding (junk data)

WAL file with transactions 1 2 and 3 committed, and transaction 4 open:
H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4
   +-- BOF1                     +--- EOF1

We continue to add to transaction 4:
H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
   +-- BOF1                         +--- EOF1

In the meantime, we checkpointed transactions 1 and 2, because there is 
a read transaction working on transaction 3:
H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
                   +-- BOF1         +--- EOF1
                   +----- checkpointed up to here

No we decided we want to recycle. Since there is no read transaction 
open on transaction 1 and 2 (cannot be, since if you need a page from 
transaction 1, you will find it in the DB), we can reuse them:
H 4 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
   +-- BOF2        +-- BOF1         +--- EOF1
   +-- EOF2        +---- checkpointed up to here
   +------ the 6th page of transaction 4

And now we close transaction 4:
H 4 4C 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
   | +-- EOF2       +-- BOF1         +--- EOF1
   +--- BOF2        +---- checkpointed up to here

Lets open transaction 5, and write faster than the background checkpointing:

H 4 4C 5  5 5 5 5  3 3 3 3C 4 4 4 4 4
   |             |  +-- BOF1         +--- EOF1
   +--- BOF2     |  +---- checkpointed up to here
                 +-- EOF2

So we need more place for 5, we will write it at the end:
H 4 4C  5 5 5 5 5  3 3 3 3C 4 4 4 4 4 5 5 5 5C
   |             |  |                | |     +-- EOF3
   |             |  |                | +--- BOF3
   |             |  +-- BOF1         +--- EOF1
   +--- BOF2     |  +---- checkpointed up to here
                 +-- EOF2

Ok. This is the most complicated 'mess up' of this WAL file possible. It 
cannot get 'messier' than this, since we recyclye ONLY when there is a 
minimum of N (configurable) amount of checkpointed pages at the 
beginning of the WAL file AND if there are no 'gaps' in the not-yet 
checkpointed pages.
The first condition is mainly for efficiency or large workloads (since 
recycling has its costs). This second condition is to prevent the WAL 
files layout to have multiple recycling points in it (prevents it from 
getting 'messed up'). A wal file can only have at MAX one recycle point 
at one point in file, and MAX of 2 empty gaps.
If we continue the example above, this can happen when we checkpoint 
transactions 3 and 4 but not transaction 5. So we have a gap at the 
beginning (end of transaction 4) and in the middle (transaction 3 and 
beginning of transaction 4)

H 4 4C  5 5 5 5 5  3 3 3 3C 4 4 4 4 4 5 5 5 5C
         +-- BOF1|                     +- BOF2|
                 +--- EOF1                    +--- EOF2


How to mark as DONE:
--------------------
We actually don't mark a frame as DONE.
We give each frame a sequence number, and we write on each commit frame 
the current BOF1.
We can figure out which frame is good and which is junk using the 
following algorithm:
1. scan the WAL file and build the index. some of the frames in the 
index are invalid.
2. search for the commit frame with largest sequence number, and get 
the BOF1 from it.
3. pass on all frames from BOF1 to last commit frame and make sure it's 
valid (salt-1, salt-2 and the checksum are valid).
4. In case one of the frames is not valid, find the commit frame 
previous to the one chosen in step 2, and do steps 3 and 4 again.
5. remove all invalid frames from the index.

First example:
     0 0 0  0 0 0 0  0 0 1 1  1 1 1
     1 2 3  4 5 6 7  8 9 0 1  2 3 4
H P 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4
     +-- BOF1                     +--- EOF1

1. we build index file for the above WAL.
2. Last commit frame is frame 11, so we check for the BOF1 and find it 
is frame 01.
3. we check that each frame between frame 01 and frame 11 is valid

Second example:

     1   1 1 2 2    0 0 1 1  1 1 1 1 1   2 2 2 2
     7   8 9 0 1    8 9 0 1  2 3 4 5 6   2 3 4 5
H P 4C  5 5 5 5 P  3 3 3 3C 4 4 4 4 4 P 5 5 5 5C
     |         |    |                |   |     +-- EOF3
     |         |    |                |   +--- BOF3
     |         |    +-- BOF1         +--- EOF1
     +--- BOF2 |    +---- checkpointed up to here
               +-- EOF2

1. we build index file for the above WAL.
2. Last commit frame is frame 25, so we check for the BOF1 and find it 
is frame 08.
3. we check that each frame between frame 08 and frame 25 is valid.
4. lets say that we tried to write the HD frames 20, 21, 24 and 25, and 
in the middle of this the system crashed. the OS and HD can write frames 
24 and 25 before frames 20-21, so it is possible to frames 24-25 to be 
good, and frames 20-21 to be corrupted.
5. once we know frames 20-21 are corrupted, we search for previus commit 
frame, which is frame 17, and do the check again (this time is should pass).


How to avoid writing sectors near 'critical' sectors:
-----------------------------------------------------
HD's and OS's work with basic 'sector' units that start from 512, but 
can in many cases even reach up to 32K!
'critical sectors' in a WAL file are sectors that if lost can lead to DB 
corruption. For example, a WAL frame of a checkpoint that stopped in the 
middle of transfer to the DB, and then that WAL frame is lost (due to 
corrupt), then next time the system is up - the DB may be corrupt.
So pages of such a WAL frame I call 'critical'.

In the current 'append only' WAL implementation, such corruption cannot 
happen. But, once we add recycling feature, this could happen, by 
writing to a adjacent sector in the FILE (just before or just after) - 
which if the system is crashes in the middle - could corrupt that 
logical sector (since the HD and OS have a minimum sector size which 
they write to).

To avoid this, we need to keep a 'safe distance' from 'critical sectors' 
in the WAL. How we do this: whenever we recycle, we add at the end of 
the file PAD ('PAD' is 64K of pages - with no content - just used as 
padding). This PAD at EOF will be used later on when BOF3 needs to 
append at end of file.
At the beginning of the file, in order not to overwrite the header H, it 
will skip PAD bytes and start only there writing the BOF2.
EOF2 will stop PAD bytes before BOF1, and then skip to BOF3 (which 
already from before had PAD bytes distance from EOF1).

So the real layout of the previous example will be more like this:
H P 4C  5 5 5 5 P  3 3 3 3C 4 4 4 4 4 P 5 5 5 5C
     |         |    |                |   |     +-- EOF3
     |         |    |                |   +--- BOF3
     |         |    +-- BOF1         +--- EOF1
     +--- BOF2 |    +---- checkpointed up to here
               +-- EOF2

This allows us to write to the WAL file without getting 'too close' to 
'critical sectors'.

I can continue deeper to the headers structures (some fields needs to be 
added) and to what is the structure of the index file, but I want to 
hear your opinion about all the above first, and if this sounds 
reasonable, we can continue work on it.

Yoni


Example program (WAL file grow limitlessly):
----------------
/* to compile: gcc -g test.c PATH_TO_SQLITE/libsqlite3.a -lrt */
#include "sqlite3.h"
#include "stdio.h"
#include "stdlib.h"
#include "fcntl.h"

static void sql_exec(sqlite3 *conn, char *query)
{
     char *err;
     if (sqlite3_exec(conn, query, NULL, 0, &err))
     {
         printf("sqlite: failed exec '%s'. err: %s\n", query, err);
        exit(1);
     }
}

static sqlite3 *sql_open_conn(void)
{
     sqlite3 *conn;
     if (sqlite3_open_v2("test.db", &conn, SQLITE_OPEN_READWRITE, NULL))
     {
        printf("sqlite3_open_v2 failed\n");
        exit(1);
     }
     return conn;
}

static int do_checkpoint()
{
     sqlite3 *conn;
     while (1)
     {
        sleep(1);
        printf("calling wal checkpoint\n");
        fflush(0);
        conn = sql_open_conn();
        sql_exec(conn, "PRAGMA wal_checkpoint");
        sqlite3_close(conn);
     }
}

int main(int argc, char **argv)
{
     sqlite3 *conn = NULL;
     char *err_msg = NULL;
     pthread_t thread;
     int fd, i = 0;
     unlink("test.db");
     unlink("test.db-wal");
     unlink("test.db-shm");
     if ((fd = open("test.db", O_CREAT|O_RDWR, 0666)<0))
     {
         printf("could not open test.db\n");
         exit(1);
     }
     close(fd);
     conn = sql_open_conn();
     sql_exec(conn, "PRAGMA journal_mode=WAL");
     sql_exec(conn, "PRAGMA wal_autocheckpoint=-1");
     sql_exec(conn, "create table tbl1 (one varchar(20), two varchar(20))");
     if (pthread_create(&thread, NULL, do_checkpoint, NULL))
     {
         printf("could not start thread\n");
        exit(1);
     }
     sql_exec(conn, "BEGIN TRANSACTION");
     while (1)
     {
         if (!(i++%10000))
        {
            sql_exec(conn, "END TRANSACTION");
            sql_exec(conn, "BEGIN TRANSACTION");
        }
        sql_exec(conn, "INSERT INTO tbl1 values('aaaaaaaaaaaaaaaaaaa', "
            "'bbbbbbbbbbbbbbbbbbb')");
     }
     sql_exec(conn, "END TRANSACTION");
     sql_exec(conn, "PRAGMA wal_checkpoint");
     sqlite3_close(conn);
     return 0;
}

On 29/11/2010 5:13 PM, Pavel Ivanov wrote:
>> Well, I love sqlite, and I want to continue using it (small, fast,
>> reliable ...).
>> I think it is better to solve such problems inside sqlite
>
> It's impossible. Just try to design the solution you want. Think of
> how SQLite should behave to make you happy, think of it with all
> details and don't forget that SQLite is a library living inside
> several processes simultaneously, sometimes even on different
> computers (although it's discouraged), those processes can crash
> unpredictably and you have to manage all crashes predictably.
> So if you are able to come up with some solution and post all
> technical details here I think SQLite developers will be happy to
> implement it.
>
>
> Pavel
>
> On Sun, Nov 28, 2010 at 3:19 AM, Yoni Londner<yonih...@gmail.com>  wrote:
>> Hi,
>>
>>   >  For a large scale system you have a third choice: use some other RDBMS
>>   >  which is implemented in one process and has a much better control over
>>   >  its data and much better communication between readers and writers.
>> Well, I love sqlite, and I want to continue using it (small, fast,
>> reliable ...).
>> I think it is better to solve such problems inside sqlite (Not to
>> mention you don't need large scale system to reproduce this issue, 50
>> lines of C code is enough).
>>
>> Yoni.
>> _______________________________________________
>> sqlite-users mailing list
>> sqlite-users@sqlite.org
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to