You seem to use such term as "background checkpointing". What's that?
Who runs this background process and what happens when it crashes
(when all other readers/writers are still working)?


Pavel

On Thu, Dec 2, 2010 at 11:04 AM, Yoni Londner <yonih...@gmail.com> wrote:
> Hi,
>
> I will start out by stating that I am not deeply familiar with the
> sqlite & WAL file layout, but I will try anyway, from my current
> understanding:
>
> The general problem that needs to be solved is to allow SQLITE to be
> constantly used over time (with no 'idle' time where no sql operations
> are done, and it is not inside any transaction).
>
> The current WAL design does not allow this, since if there are all the
> time open read transactions and/or write transactions, the WAL will
> continue to grow forever.
> I copy/paste below a tiny C program that re-creates this with write
> transactions.
>
> Remember also that in the WAL file we 'gather' up a lot of work. If we
> do it in the background, this work-load can be smoothed up to run in
> parallel to the regular system work, but if we must make the SQL 'idle'
> (close all transactions) in order to execute the checkpoint, on a heavy
> load system this can mean halting the system for a long period of time
> (in our case, typically for 30 seconds!).
>
> This needs to be solved by two "features" to be added to WAL:
>
> 1) 'incremental' checkpoint.
>
> 2) WAL file recycling (this item can also be solved by two WAL files,
> but I think its better to make the WAL file format a little bit more
> complex than start having to handle in the code management of multiple
> files).
>
> Incremental checkpointing means that checkpointing can be done up until
> the first open transaction (mxFrame). This means both copying the WAL
> pages up-until mxFrame of the first open transaction to the DB file, and
> then MARKING those frames as 'DONE' (I will talk later on how to do the
> 'DONE' marking).
>
> This means that from that point onwards - accessing those pages will be
> to the DB file, not the WAL file.
>
> WAL recycling will be done by writing pages to the beginning of the WAL
> file when a certain amount of pages from the beginning of the WAL file
> are 'checkpointed' (marked as DONE). This can also happen in the middle
> of a transaction.
>
> Example:
>
> Legend:
> H - header.
> 1, 2, 3.. - page of transaction 1, 2, 3..
> C - commit marker.
> BOF1: beginning of WAL-1. EOF1: end of WAL-1
> BOF2: beginning of WAL-2. EOF2: end of WAL-2
> P: 64K of padding (junk data)
>
> WAL file with transactions 1 2 and 3 committed, and transaction 4 open:
> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4
>   +-- BOF1                     +--- EOF1
>
> We continue to add to transaction 4:
> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
>   +-- BOF1                         +--- EOF1
>
> In the meantime, we checkpointed transactions 1 and 2, because there is
> a read transaction working on transaction 3:
> H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
>                   +-- BOF1         +--- EOF1
>                   +----- checkpointed up to here
>
> No we decided we want to recycle. Since there is no read transaction
> open on transaction 1 and 2 (cannot be, since if you need a page from
> transaction 1, you will find it in the DB), we can reuse them:
> H 4 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
>   +-- BOF2        +-- BOF1         +--- EOF1
>   +-- EOF2        +---- checkpointed up to here
>   +------ the 6th page of transaction 4
>
> And now we close transaction 4:
> H 4 4C 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4
>   | +-- EOF2       +-- BOF1         +--- EOF1
>   +--- BOF2        +---- checkpointed up to here
>
> Lets open transaction 5, and write faster than the background checkpointing:
>
> H 4 4C 5  5 5 5 5  3 3 3 3C 4 4 4 4 4
>   |             |  +-- BOF1         +--- EOF1
>   +--- BOF2     |  +---- checkpointed up to here
>                 +-- EOF2
>
> So we need more place for 5, we will write it at the end:
> H 4 4C  5 5 5 5 5  3 3 3 3C 4 4 4 4 4 5 5 5 5C
>   |             |  |                | |     +-- EOF3
>   |             |  |                | +--- BOF3
>   |             |  +-- BOF1         +--- EOF1
>   +--- BOF2     |  +---- checkpointed up to here
>                 +-- EOF2
>
> Ok. This is the most complicated 'mess up' of this WAL file possible. It
> cannot get 'messier' than this, since we recyclye ONLY when there is a
> minimum of N (configurable) amount of checkpointed pages at the
> beginning of the WAL file AND if there are no 'gaps' in the not-yet
> checkpointed pages.
> The first condition is mainly for efficiency or large workloads (since
> recycling has its costs). This second condition is to prevent the WAL
> files layout to have multiple recycling points in it (prevents it from
> getting 'messed up'). A wal file can only have at MAX one recycle point
> at one point in file, and MAX of 2 empty gaps.
> If we continue the example above, this can happen when we checkpoint
> transactions 3 and 4 but not transaction 5. So we have a gap at the
> beginning (end of transaction 4) and in the middle (transaction 3 and
> beginning of transaction 4)
>
> H 4 4C  5 5 5 5 5  3 3 3 3C 4 4 4 4 4 5 5 5 5C
>         +-- BOF1|                     +- BOF2|
>                 +--- EOF1                    +--- EOF2
>
>
> How to mark as DONE:
> --------------------
> We actually don't mark a frame as DONE.
> We give each frame a sequence number, and we write on each commit frame
> the current BOF1.
> We can figure out which frame is good and which is junk using the
> following algorithm:
> 1. scan the WAL file and build the index. some of the frames in the
> index are invalid.
> 2. search for the commit frame with largest sequence number, and get
> the BOF1 from it.
> 3. pass on all frames from BOF1 to last commit frame and make sure it's
> valid (salt-1, salt-2 and the checksum are valid).
> 4. In case one of the frames is not valid, find the commit frame
> previous to the one chosen in step 2, and do steps 3 and 4 again.
> 5. remove all invalid frames from the index.
>
> First example:
>     0 0 0  0 0 0 0  0 0 1 1  1 1 1
>     1 2 3  4 5 6 7  8 9 0 1  2 3 4
> H P 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4
>     +-- BOF1                     +--- EOF1
>
> 1. we build index file for the above WAL.
> 2. Last commit frame is frame 11, so we check for the BOF1 and find it
> is frame 01.
> 3. we check that each frame between frame 01 and frame 11 is valid
>
> Second example:
>
>     1   1 1 2 2    0 0 1 1  1 1 1 1 1   2 2 2 2
>     7   8 9 0 1    8 9 0 1  2 3 4 5 6   2 3 4 5
> H P 4C  5 5 5 5 P  3 3 3 3C 4 4 4 4 4 P 5 5 5 5C
>     |         |    |                |   |     +-- EOF3
>     |         |    |                |   +--- BOF3
>     |         |    +-- BOF1         +--- EOF1
>     +--- BOF2 |    +---- checkpointed up to here
>               +-- EOF2
>
> 1. we build index file for the above WAL.
> 2. Last commit frame is frame 25, so we check for the BOF1 and find it
> is frame 08.
> 3. we check that each frame between frame 08 and frame 25 is valid.
> 4. lets say that we tried to write the HD frames 20, 21, 24 and 25, and
> in the middle of this the system crashed. the OS and HD can write frames
> 24 and 25 before frames 20-21, so it is possible to frames 24-25 to be
> good, and frames 20-21 to be corrupted.
> 5. once we know frames 20-21 are corrupted, we search for previus commit
> frame, which is frame 17, and do the check again (this time is should pass).
>
>
> How to avoid writing sectors near 'critical' sectors:
> -----------------------------------------------------
> HD's and OS's work with basic 'sector' units that start from 512, but
> can in many cases even reach up to 32K!
> 'critical sectors' in a WAL file are sectors that if lost can lead to DB
> corruption. For example, a WAL frame of a checkpoint that stopped in the
> middle of transfer to the DB, and then that WAL frame is lost (due to
> corrupt), then next time the system is up - the DB may be corrupt.
> So pages of such a WAL frame I call 'critical'.
>
> In the current 'append only' WAL implementation, such corruption cannot
> happen. But, once we add recycling feature, this could happen, by
> writing to a adjacent sector in the FILE (just before or just after) -
> which if the system is crashes in the middle - could corrupt that
> logical sector (since the HD and OS have a minimum sector size which
> they write to).
>
> To avoid this, we need to keep a 'safe distance' from 'critical sectors'
> in the WAL. How we do this: whenever we recycle, we add at the end of
> the file PAD ('PAD' is 64K of pages - with no content - just used as
> padding). This PAD at EOF will be used later on when BOF3 needs to
> append at end of file.
> At the beginning of the file, in order not to overwrite the header H, it
> will skip PAD bytes and start only there writing the BOF2.
> EOF2 will stop PAD bytes before BOF1, and then skip to BOF3 (which
> already from before had PAD bytes distance from EOF1).
>
> So the real layout of the previous example will be more like this:
> H P 4C  5 5 5 5 P  3 3 3 3C 4 4 4 4 4 P 5 5 5 5C
>     |         |    |                |   |     +-- EOF3
>     |         |    |                |   +--- BOF3
>     |         |    +-- BOF1         +--- EOF1
>     +--- BOF2 |    +---- checkpointed up to here
>               +-- EOF2
>
> This allows us to write to the WAL file without getting 'too close' to
> 'critical sectors'.
>
> I can continue deeper to the headers structures (some fields needs to be
> added) and to what is the structure of the index file, but I want to
> hear your opinion about all the above first, and if this sounds
> reasonable, we can continue work on it.
>
> Yoni
>
>
> Example program (WAL file grow limitlessly):
> ----------------
> /* to compile: gcc -g test.c PATH_TO_SQLITE/libsqlite3.a -lrt */
> #include "sqlite3.h"
> #include "stdio.h"
> #include "stdlib.h"
> #include "fcntl.h"
>
> static void sql_exec(sqlite3 *conn, char *query)
> {
>     char *err;
>     if (sqlite3_exec(conn, query, NULL, 0, &err))
>     {
>         printf("sqlite: failed exec '%s'. err: %s\n", query, err);
>        exit(1);
>     }
> }
>
> static sqlite3 *sql_open_conn(void)
> {
>     sqlite3 *conn;
>     if (sqlite3_open_v2("test.db", &conn, SQLITE_OPEN_READWRITE, NULL))
>     {
>        printf("sqlite3_open_v2 failed\n");
>        exit(1);
>     }
>     return conn;
> }
>
> static int do_checkpoint()
> {
>     sqlite3 *conn;
>     while (1)
>     {
>        sleep(1);
>        printf("calling wal checkpoint\n");
>        fflush(0);
>        conn = sql_open_conn();
>        sql_exec(conn, "PRAGMA wal_checkpoint");
>        sqlite3_close(conn);
>     }
> }
>
> int main(int argc, char **argv)
> {
>     sqlite3 *conn = NULL;
>     char *err_msg = NULL;
>     pthread_t thread;
>     int fd, i = 0;
>     unlink("test.db");
>     unlink("test.db-wal");
>     unlink("test.db-shm");
>     if ((fd = open("test.db", O_CREAT|O_RDWR, 0666)<0))
>     {
>         printf("could not open test.db\n");
>         exit(1);
>     }
>     close(fd);
>     conn = sql_open_conn();
>     sql_exec(conn, "PRAGMA journal_mode=WAL");
>     sql_exec(conn, "PRAGMA wal_autocheckpoint=-1");
>     sql_exec(conn, "create table tbl1 (one varchar(20), two varchar(20))");
>     if (pthread_create(&thread, NULL, do_checkpoint, NULL))
>     {
>         printf("could not start thread\n");
>        exit(1);
>     }
>     sql_exec(conn, "BEGIN TRANSACTION");
>     while (1)
>     {
>         if (!(i++%10000))
>        {
>            sql_exec(conn, "END TRANSACTION");
>            sql_exec(conn, "BEGIN TRANSACTION");
>        }
>        sql_exec(conn, "INSERT INTO tbl1 values('aaaaaaaaaaaaaaaaaaa', "
>            "'bbbbbbbbbbbbbbbbbbb')");
>     }
>     sql_exec(conn, "END TRANSACTION");
>     sql_exec(conn, "PRAGMA wal_checkpoint");
>     sqlite3_close(conn);
>     return 0;
> }
>
> On 29/11/2010 5:13 PM, Pavel Ivanov wrote:
>>> Well, I love sqlite, and I want to continue using it (small, fast,
>>> reliable ...).
>>> I think it is better to solve such problems inside sqlite
>>
>> It's impossible. Just try to design the solution you want. Think of
>> how SQLite should behave to make you happy, think of it with all
>> details and don't forget that SQLite is a library living inside
>> several processes simultaneously, sometimes even on different
>> computers (although it's discouraged), those processes can crash
>> unpredictably and you have to manage all crashes predictably.
>> So if you are able to come up with some solution and post all
>> technical details here I think SQLite developers will be happy to
>> implement it.
>>
>>
>> Pavel
>>
>> On Sun, Nov 28, 2010 at 3:19 AM, Yoni Londner<yonih...@gmail.com>  wrote:
>>> Hi,
>>>
>>>   >  For a large scale system you have a third choice: use some other RDBMS
>>>   >  which is implemented in one process and has a much better control over
>>>   >  its data and much better communication between readers and writers.
>>> Well, I love sqlite, and I want to continue using it (small, fast,
>>> reliable ...).
>>> I think it is better to solve such problems inside sqlite (Not to
>>> mention you don't need large scale system to reproduce this issue, 50
>>> lines of C code is enough).
>>>
>>> Yoni.
>>> _______________________________________________
>>> sqlite-users mailing list
>>> sqlite-users@sqlite.org
>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>
>> _______________________________________________
>> sqlite-users mailing list
>> sqlite-users@sqlite.org
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to