You seem to use such term as "background checkpointing". What's that? Who runs this background process and what happens when it crashes (when all other readers/writers are still working)?
Pavel On Thu, Dec 2, 2010 at 11:04 AM, Yoni Londner <yonih...@gmail.com> wrote: > Hi, > > I will start out by stating that I am not deeply familiar with the > sqlite & WAL file layout, but I will try anyway, from my current > understanding: > > The general problem that needs to be solved is to allow SQLITE to be > constantly used over time (with no 'idle' time where no sql operations > are done, and it is not inside any transaction). > > The current WAL design does not allow this, since if there are all the > time open read transactions and/or write transactions, the WAL will > continue to grow forever. > I copy/paste below a tiny C program that re-creates this with write > transactions. > > Remember also that in the WAL file we 'gather' up a lot of work. If we > do it in the background, this work-load can be smoothed up to run in > parallel to the regular system work, but if we must make the SQL 'idle' > (close all transactions) in order to execute the checkpoint, on a heavy > load system this can mean halting the system for a long period of time > (in our case, typically for 30 seconds!). > > This needs to be solved by two "features" to be added to WAL: > > 1) 'incremental' checkpoint. > > 2) WAL file recycling (this item can also be solved by two WAL files, > but I think its better to make the WAL file format a little bit more > complex than start having to handle in the code management of multiple > files). > > Incremental checkpointing means that checkpointing can be done up until > the first open transaction (mxFrame). This means both copying the WAL > pages up-until mxFrame of the first open transaction to the DB file, and > then MARKING those frames as 'DONE' (I will talk later on how to do the > 'DONE' marking). > > This means that from that point onwards - accessing those pages will be > to the DB file, not the WAL file. > > WAL recycling will be done by writing pages to the beginning of the WAL > file when a certain amount of pages from the beginning of the WAL file > are 'checkpointed' (marked as DONE). This can also happen in the middle > of a transaction. > > Example: > > Legend: > H - header. > 1, 2, 3.. - page of transaction 1, 2, 3.. > C - commit marker. > BOF1: beginning of WAL-1. EOF1: end of WAL-1 > BOF2: beginning of WAL-2. EOF2: end of WAL-2 > P: 64K of padding (junk data) > > WAL file with transactions 1 2 and 3 committed, and transaction 4 open: > H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 > +-- BOF1 +--- EOF1 > > We continue to add to transaction 4: > H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4 > +-- BOF1 +--- EOF1 > > In the meantime, we checkpointed transactions 1 and 2, because there is > a read transaction working on transaction 3: > H 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4 > +-- BOF1 +--- EOF1 > +----- checkpointed up to here > > No we decided we want to recycle. Since there is no read transaction > open on transaction 1 and 2 (cannot be, since if you need a page from > transaction 1, you will find it in the DB), we can reuse them: > H 4 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4 > +-- BOF2 +-- BOF1 +--- EOF1 > +-- EOF2 +---- checkpointed up to here > +------ the 6th page of transaction 4 > > And now we close transaction 4: > H 4 4C 1C 2 2 2 2C 3 3 3 3C 4 4 4 4 4 > | +-- EOF2 +-- BOF1 +--- EOF1 > +--- BOF2 +---- checkpointed up to here > > Lets open transaction 5, and write faster than the background checkpointing: > > H 4 4C 5 5 5 5 5 3 3 3 3C 4 4 4 4 4 > | | +-- BOF1 +--- EOF1 > +--- BOF2 | +---- checkpointed up to here > +-- EOF2 > > So we need more place for 5, we will write it at the end: > H 4 4C 5 5 5 5 5 3 3 3 3C 4 4 4 4 4 5 5 5 5C > | | | | | +-- EOF3 > | | | | +--- BOF3 > | | +-- BOF1 +--- EOF1 > +--- BOF2 | +---- checkpointed up to here > +-- EOF2 > > Ok. This is the most complicated 'mess up' of this WAL file possible. It > cannot get 'messier' than this, since we recyclye ONLY when there is a > minimum of N (configurable) amount of checkpointed pages at the > beginning of the WAL file AND if there are no 'gaps' in the not-yet > checkpointed pages. > The first condition is mainly for efficiency or large workloads (since > recycling has its costs). This second condition is to prevent the WAL > files layout to have multiple recycling points in it (prevents it from > getting 'messed up'). A wal file can only have at MAX one recycle point > at one point in file, and MAX of 2 empty gaps. > If we continue the example above, this can happen when we checkpoint > transactions 3 and 4 but not transaction 5. So we have a gap at the > beginning (end of transaction 4) and in the middle (transaction 3 and > beginning of transaction 4) > > H 4 4C 5 5 5 5 5 3 3 3 3C 4 4 4 4 4 5 5 5 5C > +-- BOF1| +- BOF2| > +--- EOF1 +--- EOF2 > > > How to mark as DONE: > -------------------- > We actually don't mark a frame as DONE. > We give each frame a sequence number, and we write on each commit frame > the current BOF1. > We can figure out which frame is good and which is junk using the > following algorithm: > 1. scan the WAL file and build the index. some of the frames in the > index are invalid. > 2. search for the commit frame with largest sequence number, and get > the BOF1 from it. > 3. pass on all frames from BOF1 to last commit frame and make sure it's > valid (salt-1, salt-2 and the checksum are valid). > 4. In case one of the frames is not valid, find the commit frame > previous to the one chosen in step 2, and do steps 3 and 4 again. > 5. remove all invalid frames from the index. > > First example: > 0 0 0 0 0 0 0 0 0 1 1 1 1 1 > 1 2 3 4 5 6 7 8 9 0 1 2 3 4 > H P 1 1 1C 2 2 2 2C 3 3 3 3C 4 4 4 > +-- BOF1 +--- EOF1 > > 1. we build index file for the above WAL. > 2. Last commit frame is frame 11, so we check for the BOF1 and find it > is frame 01. > 3. we check that each frame between frame 01 and frame 11 is valid > > Second example: > > 1 1 1 2 2 0 0 1 1 1 1 1 1 1 2 2 2 2 > 7 8 9 0 1 8 9 0 1 2 3 4 5 6 2 3 4 5 > H P 4C 5 5 5 5 P 3 3 3 3C 4 4 4 4 4 P 5 5 5 5C > | | | | | +-- EOF3 > | | | | +--- BOF3 > | | +-- BOF1 +--- EOF1 > +--- BOF2 | +---- checkpointed up to here > +-- EOF2 > > 1. we build index file for the above WAL. > 2. Last commit frame is frame 25, so we check for the BOF1 and find it > is frame 08. > 3. we check that each frame between frame 08 and frame 25 is valid. > 4. lets say that we tried to write the HD frames 20, 21, 24 and 25, and > in the middle of this the system crashed. the OS and HD can write frames > 24 and 25 before frames 20-21, so it is possible to frames 24-25 to be > good, and frames 20-21 to be corrupted. > 5. once we know frames 20-21 are corrupted, we search for previus commit > frame, which is frame 17, and do the check again (this time is should pass). > > > How to avoid writing sectors near 'critical' sectors: > ----------------------------------------------------- > HD's and OS's work with basic 'sector' units that start from 512, but > can in many cases even reach up to 32K! > 'critical sectors' in a WAL file are sectors that if lost can lead to DB > corruption. For example, a WAL frame of a checkpoint that stopped in the > middle of transfer to the DB, and then that WAL frame is lost (due to > corrupt), then next time the system is up - the DB may be corrupt. > So pages of such a WAL frame I call 'critical'. > > In the current 'append only' WAL implementation, such corruption cannot > happen. But, once we add recycling feature, this could happen, by > writing to a adjacent sector in the FILE (just before or just after) - > which if the system is crashes in the middle - could corrupt that > logical sector (since the HD and OS have a minimum sector size which > they write to). > > To avoid this, we need to keep a 'safe distance' from 'critical sectors' > in the WAL. How we do this: whenever we recycle, we add at the end of > the file PAD ('PAD' is 64K of pages - with no content - just used as > padding). This PAD at EOF will be used later on when BOF3 needs to > append at end of file. > At the beginning of the file, in order not to overwrite the header H, it > will skip PAD bytes and start only there writing the BOF2. > EOF2 will stop PAD bytes before BOF1, and then skip to BOF3 (which > already from before had PAD bytes distance from EOF1). > > So the real layout of the previous example will be more like this: > H P 4C 5 5 5 5 P 3 3 3 3C 4 4 4 4 4 P 5 5 5 5C > | | | | | +-- EOF3 > | | | | +--- BOF3 > | | +-- BOF1 +--- EOF1 > +--- BOF2 | +---- checkpointed up to here > +-- EOF2 > > This allows us to write to the WAL file without getting 'too close' to > 'critical sectors'. > > I can continue deeper to the headers structures (some fields needs to be > added) and to what is the structure of the index file, but I want to > hear your opinion about all the above first, and if this sounds > reasonable, we can continue work on it. > > Yoni > > > Example program (WAL file grow limitlessly): > ---------------- > /* to compile: gcc -g test.c PATH_TO_SQLITE/libsqlite3.a -lrt */ > #include "sqlite3.h" > #include "stdio.h" > #include "stdlib.h" > #include "fcntl.h" > > static void sql_exec(sqlite3 *conn, char *query) > { > char *err; > if (sqlite3_exec(conn, query, NULL, 0, &err)) > { > printf("sqlite: failed exec '%s'. err: %s\n", query, err); > exit(1); > } > } > > static sqlite3 *sql_open_conn(void) > { > sqlite3 *conn; > if (sqlite3_open_v2("test.db", &conn, SQLITE_OPEN_READWRITE, NULL)) > { > printf("sqlite3_open_v2 failed\n"); > exit(1); > } > return conn; > } > > static int do_checkpoint() > { > sqlite3 *conn; > while (1) > { > sleep(1); > printf("calling wal checkpoint\n"); > fflush(0); > conn = sql_open_conn(); > sql_exec(conn, "PRAGMA wal_checkpoint"); > sqlite3_close(conn); > } > } > > int main(int argc, char **argv) > { > sqlite3 *conn = NULL; > char *err_msg = NULL; > pthread_t thread; > int fd, i = 0; > unlink("test.db"); > unlink("test.db-wal"); > unlink("test.db-shm"); > if ((fd = open("test.db", O_CREAT|O_RDWR, 0666)<0)) > { > printf("could not open test.db\n"); > exit(1); > } > close(fd); > conn = sql_open_conn(); > sql_exec(conn, "PRAGMA journal_mode=WAL"); > sql_exec(conn, "PRAGMA wal_autocheckpoint=-1"); > sql_exec(conn, "create table tbl1 (one varchar(20), two varchar(20))"); > if (pthread_create(&thread, NULL, do_checkpoint, NULL)) > { > printf("could not start thread\n"); > exit(1); > } > sql_exec(conn, "BEGIN TRANSACTION"); > while (1) > { > if (!(i++%10000)) > { > sql_exec(conn, "END TRANSACTION"); > sql_exec(conn, "BEGIN TRANSACTION"); > } > sql_exec(conn, "INSERT INTO tbl1 values('aaaaaaaaaaaaaaaaaaa', " > "'bbbbbbbbbbbbbbbbbbb')"); > } > sql_exec(conn, "END TRANSACTION"); > sql_exec(conn, "PRAGMA wal_checkpoint"); > sqlite3_close(conn); > return 0; > } > > On 29/11/2010 5:13 PM, Pavel Ivanov wrote: >>> Well, I love sqlite, and I want to continue using it (small, fast, >>> reliable ...). >>> I think it is better to solve such problems inside sqlite >> >> It's impossible. Just try to design the solution you want. Think of >> how SQLite should behave to make you happy, think of it with all >> details and don't forget that SQLite is a library living inside >> several processes simultaneously, sometimes even on different >> computers (although it's discouraged), those processes can crash >> unpredictably and you have to manage all crashes predictably. >> So if you are able to come up with some solution and post all >> technical details here I think SQLite developers will be happy to >> implement it. >> >> >> Pavel >> >> On Sun, Nov 28, 2010 at 3:19 AM, Yoni Londner<yonih...@gmail.com> wrote: >>> Hi, >>> >>> > For a large scale system you have a third choice: use some other RDBMS >>> > which is implemented in one process and has a much better control over >>> > its data and much better communication between readers and writers. >>> Well, I love sqlite, and I want to continue using it (small, fast, >>> reliable ...). >>> I think it is better to solve such problems inside sqlite (Not to >>> mention you don't need large scale system to reproduce this issue, 50 >>> lines of C code is enough). >>> >>> Yoni. >>> _______________________________________________ >>> sqlite-users mailing list >>> sqlite-users@sqlite.org >>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users >>> >> _______________________________________________ >> sqlite-users mailing list >> sqlite-users@sqlite.org >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users