I am debugging db corruption. After I get some corrupted db, I found that they
all corrupted by writing null data.
So, I decide to add some check and dump call stackin the source code in order
to find out who corrupts the db.
Here is the code I added in the source code.
int sqlite3CheckNullData(const unsigned char* data, const int length)
{
const size_t* s = (const size_t*)data;
const unsigned char* d = (const unsigned char*)data;
int n = length/sizeof(size_t);
int i;
for (i = 0; i n; i++) {
if (s[i]!=0) {
return 0;
}
}
for (i = i*sizeof(size_t); ilength; i++) {
if (d[i]!=0) {
return 0;
}
}
return 1;
}
static int unixWrite(
sqlite3_file *id,
const void *pBuf,
int amt,
sqlite3_int64 offset
){
unixFile *pFile = (unixFile*)id;
if (amt0sqlite3CheckNullData(pBuf, amt)) {
SQLITE_KNOWN_ERROR(SQLITE_CORRUPT, "writing null data into %s from %d length
%d", unixGetFilename(pFile-zPath), offset, amt);
}
...
}
The code is simple. I check the data whether is all null in
[sqlite3CheckNullData], and add a macro [SQLITE_KNOWN_ERROR], which is defined
as [sqlite_log], to throw this error outside SQLite. Outside SQLite, I dump the
call stack of all thread, and I got this:
0x195774000 + 113628 objc_msgSend (in libobjc.dylib) + 28
0x1000f8000 + 7781724 _ZL9LogSQLitePviPKc,WCDataBase.mm,line 81
0x1000f8000 + 2836888 sqlite3_vlog,printf.c,line 1023
0x1000f8000 + 2778664 sqlite3KnownError,main.c,line 3192
0x1000f8000 + 2554560 unixWrite,os_unix.c,line 3335
0x1000f8000 + 2821984 sqlite3WalCheckpoint,wal.c,line 1798
0x1000f8000 + 2819864 sqlite3WalClose,wal.c,line 1914
0x1000f8000 + 2529964 sqlite3PagerClose,pager.c,line 3995
0x1000f8000 + 2574152 sqlite3BtreeClose,btree.c,line 2516
0x1000f8000 + 277 sqlite3LeaveMutexAndCloseZombie,main.c,line
10834297741736
0x1000f8000 + 2774220 sqlite3Close,main.c,line 1026
This is the only thread operating database. All other call stack of threads
make no sense.
You can see the SQLite checkpointing. That is the reason why my database
corrupt. And I have no idea how this happened even I checking the source code.
Here is some of my conclusion:
1. This checking null data also work for writing into WAL file, but there is no
report that WAL is been written by null data.
2.Some rogue file descriptor may write the null data into WAL file. But, I have
several db with the same problem. It?s a rare event that the rogue writter only
write the null data into the WAL, not all other db files or normal files.
3. I guess it could be a problem of operating system. I work on iOS, but I have
no any further idea.
4. It would happened in normal knee. But it could easily happen when the disk
free space is low. I also haveno any further idea about this.
So, this is my confusion:
1. Does anyone have any idea about this?
2. What can I do to reserve this type of corruption?
Note that if a page of sqlite_master is been rewritten by null data, the
[.dump] shell command will not work to repair the database.