Hi Maurice,
If you're running into corruption both in ext3 metadata and in MySQL
data, it is certainly not the fault of MySQL, as you're likely aware.
I am hoping they are not related. The problems with MySQL surfaced
almost immediately after upgrading to 5.0.x.
[details deleted]
You can see that there are in fact many bits flipped in each. I
would suspect higher-level corruption than a lone flipped bit.
I initially thought this as well, but the explanation on the ext3
mailing list is that it really is just a lone flipped bit in both
instances. The other differences are due to fsck rounding the size
up to a block boundary when it guesses what the correct size is.
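For what it's worth, here is a quick sanity check (a small Python
sketch) of that explanation against the two i_size values quoted
further down in this thread. The 4096-byte block size is an
assumption on my part; it isn't stated in the original report.

BLOCK = 4096  # assumed ext3 block size, not taken from the report

cases = [
    # (reported i_size, fsck's "should be" value)
    (18014398562775391, 53297152),   # filesystem one, inode 16257874
    (35184386120704,    14032896),   # filesystem two, inode 2121855
]

def round_up(n, block=BLOCK):
    # fsck guesses the size by rounding up to a whole block
    return (n + block - 1) // block * block

for bad, should_be in cases:
    for bit in range(bad.bit_length()):
        candidate = bad ^ (1 << bit)   # clear one set bit
        if candidate < bad and round_up(candidate) == should_be:
            print("bit %d flipped: %d pads out to %d"
                  % (bit, candidate, should_be))

Running it reports a single flipped bit in each case (bit 54 for the
first inode, bit 45 for the second), with the rest of the difference
accounted for by the block-boundary rounding.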
Do note that data on e.g. the PCI bus is not protected by any sort
of checksum. I've seen this cause corruption problems with PCI
risers and RAID cards. Are you using a PCI riser card? Note that
LSI does *not* certify their cards for use on risers if you are
custom-building a machine.
Yes, there is a riser card. Wouldn't this imply that LSI is saying
you can't use their cards in a 1U or 2U box?
It's kind of scary that there is no end-to-end parity implemented
anywhere along the whole data path to prevent this. It sort of
defeats the point of RAID 6 and ECC.
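Until there is real end-to-end protection somewhere in that path,
about the best you can do from userspace is a crude check along the
lines of the sketch below: record a SHA-256 digest per file and
re-verify later (for files that aren't supposed to change) to catch
silent flips that RAID 6 and ECC never see. The path is hypothetical
and the choice of hash is arbitrary.

import hashlib
import os

def digest(path, bufsize=1 << 20):
    # hash a file's contents in 1 MiB chunks
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    # record a digest for every file under root
    manifest = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            manifest[path] = digest(path)
    return manifest

def verify(manifest):
    # re-hash and report anything that changed underneath us
    for path, old in sorted(manifest.items()):
        if digest(path) != old:
            print("MISMATCH: %s" % path)

# e.g. m = build_manifest('/srv/archive'); later on: verify(m)

Block-checksumming filesystems do essentially this at the block
level during scrubs; doing it after the fact at file granularity at
least tells you when a flip has happened.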
How did you determine this was the cause?
You mean a Serial Attached SCSI (SAS) controller, I assume?
No, it's SATA to SCSI.
Is this a custom-built machine or a vendor-integrated one?
It is custom-built.
Maurice Volaski wrote:
In using drbd 8.0.5 recently, I have come across at least two
instances where a bit on disk apparently flipped spontaneously in
the ext3 metadata on volumes running on top of drbd.
Also, I have been seeing regular corruption of a MySQL database,
which runs on top of drbd, and when I reported this as a bug (since
I had also recently upgraded MySQL versions), they questioned
whether drbd could be responsible!
All the volumes have been fscked recently and there were no
reported errors. And, of course, there have been no errors reported
from the underlying hardware.
I have since upgraded to 8.0.6, but it's too early to say whether
there is a change.
I'm also seeing the backup server complain of files not comparing,
though this may be a separate problem on the backup server.
The ext3 bit flipping:
At 12:00 PM -0400 9/11/07, [EMAIL PROTECTED] wrote:
I have come across two files, essentially untouched in years, on two
different ext3 filesystems on the same server (Gentoo AMD64 with
kernel 2.6.22 and fsck version 1.40.2 currently), spontaneously
becoming enormously large:
Filesystem one
Inode 16257874, i_size is 18014398562775391, should be 53297152
Filesystem two
Inode 2121855, i_size is 35184386120704, should be 14032896.
Both were discovered during an ordinary backup operation (via EMC
Insignia's Retrospect Linux client).
The backup runs daily, and so one day one file must have grown
spontaneously to this size, and then on another day it happened to
the second file, which is on a second filesystem. The backup attempt
generated repeated errors:
EXT3-fs warning (device dm-2): ext3_block_to_path: block > big
Both filesystems are running on different logical volumes, but
underlying those are drbd network RAID devices, and underlying those
is a RAID 6-based SATA disk array.
The response to the bug report regarding the MySQL data corruption,
which blames drbd!
http://bugs.mysql.com/?id=31038
Updated by: Heikki Tuuri
Reported by: Maurice Volaski
Category: Server: InnoDB
Severity: S2 (Serious)
Status: Open
Version: 5.0.48
OS: Linux
OS Details: Gentoo
Tags: database page corruption locking up corrupt doublewrite
[17 Sep 18:49] Heikki Tuuri
Maurice, my first guess is to suspect the RAID-1 driver.
My initial report of mysql data corruption:
A 64-bit Gentoo Linux box had just been upgraded from MySQL 4.1
to 5.0.44 fresh (by dumping in 4.1 and restoring in 5.0.44), and
almost immediately after that, during which time the database was
not used, a crash occurred during a scripted mysqldump. So I
restored, and days later, it happened again. The crash details seem
to suggest that some other aspect of the operating system, even the
memory or disk, is flipping a bit. Or could I be running into a bug
in this version of MySQL?
Here's the output of the crash
-----------------------------------
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
070827 3:10:04 InnoDB: Page dump in ascii and hex (16384 bytes):
len 16384; hex
[dump itself deleted for brevity]
InnoDB: End of page dump
070827 3:10:04 InnoDB: Page checksum
646563254, prior-to-4.0.14-form checksum 2415947328
InnoDB: stored checksum 4187530870, prior-to-4.0.14-form
stored checksum 2415947328
InnoDB: Page lsn 0 4409041, low 4 bytes of lsn at page end 4409041
InnoDB: Page number (if stored to page already) 533,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 35
InnoDB: (index PRIMARY of table elegance/image)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: about forcing recovery.
InnoDB: Ending processing because of a corrupt database page.
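For reference, the stored fields that page dump reports (the
checksum, the old-format checksum in the trailer, the LSN, and the
page number) can be pulled straight out of a raw 16 KB page with
something like the sketch below. The offsets follow the standard
InnoDB FIL header/trailer layout; the file name and page number are
only examples, and recomputing the checksums themselves is better
left to the innochecksum utility.

import struct
import sys

PAGE_SIZE = 16384  # matches the 16384-byte page dump above

def inspect(path, page_no):
    # read one page out of the tablespace file
    with open(path, 'rb') as f:
        f.seek(page_no * PAGE_SIZE)
        page = f.read(PAGE_SIZE)

    new_ck = struct.unpack('>I', page[0:4])[0]     # FIL_PAGE_SPACE_OR_CHKSUM
    page_num = struct.unpack('>I', page[4:8])[0]   # FIL_PAGE_OFFSET
    lsn_hi, lsn_lo = struct.unpack('>II', page[16:24])    # FIL_PAGE_LSN
    old_ck, end_lsn_lo = struct.unpack('>II', page[-8:])  # page trailer

    print("stored checksum %d" % new_ck)
    print("prior-to-4.0.14-form stored checksum %d" % old_ck)
    print("page lsn %d %d, low 4 bytes of lsn at page end %d"
          % (lsn_hi, lsn_lo, end_lsn_lo))
    print("page number (if stored to page already) %d" % page_num)

if __name__ == '__main__':
    # e.g. python inspect_page.py ibdata1 533
    inspect(sys.argv[1], int(sys.argv[2]))

A byte-for-byte diff of the corrupt page against the same page from
a recent backup (or the doublewrite buffer copy) would show whether
this one, too, comes down to a handful of flipped bits.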
--
high performance mysql consulting
www.provenscaling.com
--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University