Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems
In using drbd 8.0.5 recently, I have come across at least two instances where a bit on disk apparently flipped spontaneously in the ext3 metadata on volumes running on top of drbd. I have also been seeing regular corruption of a MySQL database, which runs on top of drbd, and when I reported this as a bug (since I had also recently upgraded MySQL versions), they asked whether drbd could be responsible!

All the volumes have been fscked recently with no reported errors, and, of course, no errors have been reported from the underlying hardware. I have since upgraded to 8.0.6, but it's too early to say whether that makes a difference. I'm also seeing the backup server complain of files not comparing, though this may be a separate problem on the backup server.

The ext3 bit flipping:

At 12:00 PM -0400 9/11/07, [EMAIL PROTECTED] wrote:

I have come across two files, essentially untouched in years, on two different ext3 filesystems on the same server (Gentoo AMD 64-bit, currently with kernel 2.6.22 and fsck version 1.40.2), spontaneously becoming supremely large:

Filesystem one: Inode 16257874, i_size is 18014398562775391, should be 53297152
Filesystem two: Inode 2121855, i_size is 35184386120704, should be 14032896

Both were discovered during an ordinary backup operation (via EMC Insignia's Retrospect Linux client). The backup runs daily, so one day one file must have grown spontaneously to this size, and then on another day it happened to the second file, which is on a second filesystem. The backup attempt generated repeated errors:

EXT3-fs warning (device dm-2): ext3_block_to_path: block > big

The two filesystems are on different logical volumes, but underlying them are drbd network RAID devices, and underlying those is a RAID 6-based SATA disk array.

The response to my bug report about the MySQL data corruption, which blames drbd!

http://bugs.mysql.com/?id=31038
Updated by: Heikki Tuuri
Reported by: Maurice Volaski
Category: Server: InnoDB
Severity: S2 (Serious)
Status: Open
Version: 5.0.48
OS: Linux
OS Details: Gentoo
Tags: database page corruption locking up corrupt doublewrite

[17 Sep 18:49] Heikki Tuuri
Maurice, my first guess is to suspect the RAID-1 driver.

My initial report of the MySQL data corruption:

A 64-bit Gentoo Linux box had just been upgraded from MySQL 4.1 to 5.0.44 fresh (by dumping in 4.1 and restoring in 5.0.44), and almost immediately after that, during which time the database was not used, a crash occurred during a scripted mysqldump. So I restored, and days later it happened again. The crash details seem to point at some other aspect of the system, even the memory or disk flipping a bit. Or could I be running into a bug in this version of MySQL?

Here's the output of the crash:

[InnoDB crash output trimmed here; it appears in full in the original report at the end of this digest.]
--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
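The "supremely large" sizes above are consistent with a single flipped high bit. Here is a minimal Python sketch of the check; the sizes come straight from the fsck output above, fsck's "should be" value is treated as only a block-rounded guess, and the 4096-byte block size is an assumption:

cases = [
    (18014398562775391, 53297152),  # filesystem one, inode 16257874
    (35184386120704, 14032896),     # filesystem two, inode 2121855
]

BLOCK = 4096  # assumed ext3 block size on these volumes

for corrupted, guess in cases:
    top_bit = corrupted.bit_length() - 1       # the suspiciously high bit
    recovered = corrupted & ~(1 << top_bit)    # clear that one bit
    delta = guess - recovered
    verdict = "within one block" if abs(delta) <= BLOCK else "NOT a lone bit flip"
    print(f"bit {top_bit} set: recovered size {recovered}, "
          f"fsck guess {guess}, delta {delta} bytes ({verdict})")

Run against the two inodes, this reports bits 54 and 45 set, with the recovered sizes landing within one block of fsck's guesses, which is what the lone-bit-flip explanation predicts.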
Re: Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems
> On Sep 17, 2007 13:31 -0400, Maurice Volaski wrote:
> > In using drbd 8.0.5 recently, I have come across at least two instances where a bit on disk apparently flipped spontaneously in the ext3 metadata on volumes running on top of drbd. Also, I have been seeing regular corruption of a mysql database, which runs on top of drbd, and when I reported this as a bug since I also recently upgraded mysql versions, they question whether drbd could be responsible!
>
> Seems unlikely - more likely to be RAM or similar (would include cable for PATA/SCSI, but that is less likely an issue for SATA).

Shouldn't that trip the ECC and produce machine check exceptions, even unrecoverable ones?

The disks are part of a hardware RAID with a SATA II cableless backplane and a SATA-to-SCSI controller, so there is a SCSI cable and a SCSI HBA (LSI Logic).

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
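If flaky ECC RAM is on the suspect list, the kernel's corrected/uncorrected error counters are worth checking before swapping hardware. A small sketch, assuming a Linux kernel with an EDAC driver loaded for the memory controller (the sysfs paths vary by kernel version and chipset):

import glob

for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc[0-9]*")):
    try:
        with open(mc + "/ce_count") as f:
            ce = f.read().strip()   # corrected (ECC-fixed) errors
        with open(mc + "/ue_count") as f:
            ue = f.read().strip()   # uncorrected errors
        print(f"{mc}: corrected={ce} uncorrected={ue}")
    except OSError:
        print(f"{mc}: counters not exposed by this driver")

A nonzero corrected count means ECC is actively papering over failing DIMMs; a nonzero uncorrected count usually coincides with machine check events in the logs.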
Re: Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems
> Hi Maurice,
>
> If you're running into corruption both in ext3 metadata and in MySQL data, it is certainly not the fault of MySQL, as you're likely aware.

I am hoping they are not related. The problems with MySQL surfaced almost immediately after upgrading to 5.0.x.

[details deleted]

> You can see that there are in fact many bits flipped in each. I would suspect higher-level corruption.

I initially thought this as well, but the explanation on the ext3 mailing list is that it really is just a lone flipped bit in both instances. The other differences are due to fsck padding out to the block size when it guesses what the correct size is.

> Do note that data on e.g. the PCI bus is not protected by any sort of checksum. I've seen this cause corruption problems with PCI risers and RAID cards. Are you using a PCI riser card? Note that LSI does *not* certify their cards to be used on risers if you are custom building a machine.

Yes, there is a riser card. Wouldn't this imply that LSI is saying you can't use a 1U or a 2U box? It's kind of scary that there is no end-to-end parity implemented anywhere along the whole data path to prevent this. It sort of defeats the point of RAID 6 and ECC. How did you determine this was the cause?

> Do you mean a Serial Attached SCSI, aka SAS, controller, I assume?

No, it's SATA to SCSI.

> Is this a custom-built machine or a vendor-integrated one?

It is custom-built.

Maurice Volaski wrote:
[original report and bug-report text quoted in full - trimmed]
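Since nothing on that path (PCI bus, riser, cables, controller) checksums the data end to end, the only way to catch silent flips is to add a checksum at the top of the stack yourself. A minimal sketch of a file-level approach, using only Python's standard library (the walrus syntax needs Python 3.8+; the root path is a placeholder):

import hashlib, os, sys

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):   # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

root = sys.argv[1] if len(sys.argv) > 1 else "."
for dirpath, _, names in os.walk(root):
    for name in sorted(names):
        p = os.path.join(dirpath, name)
        if os.path.isfile(p):             # skip sockets, broken links, etc.
            print(f"{sha256_of(p)}  {p}")

Save the output, then diff it against a later run: any line that changes for a file you know was not modified is silent corruption, regardless of what RAID and ECC claim.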
Re: Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems
I guess I will watch it closely for now and, if it trips up again, fail over to the drbd peer and see what happens there. I suppose I could even detach the local disks and have it run using the peer over the wire. That should eliminate the local I/O subsystem.

> > It's kind of scary that there is no end-to-end parity implemented anywhere along the whole data path to prevent this. It sort of defeats the point of RAID 6 and ECC.
>
> I agree, it's pretty damn scary. You can read about the story and the ensuing discussion here:

I wonder if drbd could help out with that.

> Interesting. I hadn't heard of such a thing until I just looked it up. But in any case, that adds yet another variable (and a fairly uncommon one) to the mix.

It's this one: http://www.acnc.com/02_01_jetstor_sata_416s.html. I thought units like it were very popular.

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
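One way to act on the "fail over and compare" idea without trusting the filesystem layer: hash the backing device in fixed-size regions on both nodes and diff the two outputs; a mismatching region localizes the corruption to an offset. A rough sketch (the device path is only an example; run it on a quiesced or read-only volume, or the hashes will differ for legitimate reasons):

import hashlib, sys

CHUNK = 64 * 1024 * 1024   # hash in 64 MiB regions

dev = sys.argv[1]          # e.g. /dev/drbd0 -- path is an example
with open(dev, "rb") as f:
    offset = 0
    while data := f.read(CHUNK):
        print(f"{offset:#014x}  {hashlib.sha256(data).hexdigest()}")
        offset += len(data)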
Re: Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems
I failed over the server and ran a short backup, and there were no "didn't compare" errors, whereas on the first server they show up pretty reliably. I guess this confirms that some hardware on the first server is flipping bits. Essentially, users could have any number of munged files (most files are binary) since the problem surfaced a few weeks ago, and there'd be no way to know.

Unfortunately, the secondary server was off for a short time at one point, so even if the munging were taking place in the I/O subsystem and not in RAM, it is possible that some blocks got copied badly to the secondary server. Anyway, it seems the problem is definitely hardware and not due to ext3, drbd, or MySQL!

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
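To find which files were munged after the fact, about the best one can do is compare the live tree byte-for-byte against a restore of a known-good backup. A small sketch (both paths are placeholders; filecmp with shallow=False compares contents rather than just stat metadata):

import filecmp, os, sys

live, restored = sys.argv[1], sys.argv[2]   # placeholder paths

for dirpath, _, names in os.walk(live):
    for name in names:
        a = os.path.join(dirpath, name)
        b = os.path.join(restored, os.path.relpath(a, live))
        if not os.path.isfile(b):
            print("missing in restore:", a)
        elif not filecmp.cmp(a, b, shallow=False):   # byte-for-byte compare
            print("differs:", a)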
Re: The current version is 5.0.48, no?
Thank you for this info, but it just seems to make a simple question a matter of confusion. It tells us that MySQL is being marketed in two editions, but nowhere does it say whether the current release of each is matched bugfix for bugfix, with the version difference being just arithmetic. Since the community's 5.0.45 came out a few months ago and the enterprise's 5.0.48 came out just a few weeks ago, and from the look of the release notes, I am inclined to believe the community version is indeed out of date.

> In the last episode (Sep 13), Maurice Volaski said:
> > I just learned that the current version of MySQL is 5.0.48, described here
> > http://dev.mysql.com/doc/refman/5.0/en/releasenotes-es-5-0.html
> > and available from
> > http://download.dorsalsource.org/files/b/5/165/mysql-5.0.48.tar.gz
>
> The current MySQL Enterprise version is 5.0.48. The current MySQL Community version is 5.0.45.
>
> Enterprise release notes: http://dev.mysql.com/doc/refman/5.0/en/releasenotes-es-5-0.html
> Community release notes: http://dev.mysql.com/doc/refman/5.0/en/releasenotes-cs.html
> Comparison: http://www.mysql.com/products/which-edition.html

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
Re: Database page corruption on disk occurring during mysqldump on a fresh database
It certainly seems that 5.0.44 and 5.0.45 are unstable. I have logged this as bug http://bugs.mysql.com/bug.php?id=31008

> [original report and InnoDB crash output quoted in full - trimmed; see the original post at the end of this digest]

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
The current version is 5.0.48, no?
I just learned that the current version of MySQL is 5.0.48, described here
http://dev.mysql.com/doc/refman/5.0/en/releasenotes-es-5-0.html
and available from
http://download.dorsalsource.org/files/b/5/165/mysql-5.0.48.tar.gz

When I search this list, I see no mention of it or the previous release, 5.0.46. Is there some reason we shouldn't be on 5.0.48?

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
Is bad hardware confusing MySQL and InnoDB?
Some processes on a server (64-bit Gentoo Linux with MySQL 5.0.44), seemingly related to I/O on LVM volumes, hung, and it was necessary to force a reboot. The MySQL data was not on an LVM volume, though it still may have been affected, since over time more and more processes became unresponsive. While fsck recovered the journal and detected no problems on any volume, at least one database was not spared:

070911 23:40:34 InnoDB: Page checksum 3958948568, prior-to-4.0.14-form checksum 2746081740
InnoDB: stored checksum 2722580120, prior-to-4.0.14-form stored checksum 2746081740
InnoDB: Page lsn 0 491535, low 4 bytes of lsn at page end 491535
InnoDB: Page number (if stored to page already) 199,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 17
InnoDB: Also the page in the doublewrite buffer is corrupt.
InnoDB: Cannot continue operation.

Is it wrong to expect InnoDB to have avoided this, or does it suggest that it couldn't have, i.e., a hardware defect?

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
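For context on why "also the page in the doublewrite buffer is corrupt" is fatal: on recovery, InnoDB can repair a torn tablespace page from its doublewrite copy, but only if at least one of the two copies is intact. A simplified Python sketch of that decision, not the actual InnoDB code (real InnoDB also validates the page checksums, not just the LSN words; offsets follow the standard 16 KiB page layout):

def page_ok(page: bytes) -> bool:
    # Torn-page test: the low 32 bits of the LSN are stored both in the
    # page header (bytes 20..24) and in the last 4 bytes of the page; a
    # torn write leaves the two values disagreeing.
    return page[20:24] == page[-4:]

def recover(tablespace_copy: bytes, doublewrite_copy: bytes) -> bytes:
    if page_ok(tablespace_copy):
        return tablespace_copy      # normal case: page on disk is intact
    if page_ok(doublewrite_copy):
        return doublewrite_copy     # torn write: the doublewrite copy saves us
    # Both copies bad, as in the log above. A torn write should only ever
    # damage one copy, so this points below InnoDB (RAM, controller, bus).
    raise RuntimeError("both copies corrupt")

So the doublewrite buffer protects against torn writes, not against hardware that flips bits on the way to both copies, which is why the log above leans toward a hardware defect.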
Re: Database page corruption on disk occurring during mysqldump on a fresh database
Thank you for your replies. I attempted to restore again, and most oddly, mysql complained that it couldn't restore to a particular table because it wasn't in the database, which, of course, it had to be, because the restore itself had just recreated it. So I blew away the entire mysql directory on disk, updated to 5.0.45, and then it did not complain when I restored. So far, it has not since.

> Hi,
> This might be happening due to two reasons:
> 1. The system date might not be correct.
> 2. Something wrong with the log position (incorrect log position).
> Regards,
> Krishna Chandra Prajapati

> The checksum errors might be due to various reasons. We had a similar issue where we restored the database multiple times and replaced the RAM sticks; nothing helped. Finally we drilled the issue down to the chassis. I recommend testing the restore on a different machine to rule out any hardware issue.
> --
> Thanks, Alex
> http://alexlurthu.wordpress.com

> On 8/31/07, Maurice Volaski [EMAIL PROTECTED] wrote:
> > A 64-bit Gentoo Linux box had just been upgraded from MySQL 4.1 to 5.0.44 fresh (by dumping in 4.1 and restoring in 5.0.44), and almost immediately after that, during which time the database was not used, a crash occurred during a scripted mysqldump. So I restored, and days later it happened again. The crash details seem to point at some other aspect of the system, even the memory or disk flipping a bit. Or could I be running into a bug in this version of MySQL?

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
Database page corruption on disk occurring during mysqldump on a fresh database
A 64-bit Gentoo Linux box had just been upgraded from MySQL 4.1 to 5.0.44 fresh (by dumping in 4.1 and restoring in 5.0.44), and almost immediately after that, during which time the database was not used, a crash occurred during a scripted mysqldump. So I restored, and days later it happened again. The crash details seem to point at some other aspect of the system, even the memory or disk flipping a bit. Or could I be running into a bug in this version of MySQL?

Here's the output of the crash:

InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
070827 3:10:04 InnoDB: Page dump in ascii and hex (16384 bytes):
len 16384; hex [dump itself deleted for brevity]; InnoDB: End of page dump
070827 3:10:04 InnoDB: Page checksum 646563254, prior-to-4.0.14-form checksum 2415947328
InnoDB: stored checksum 4187530870, prior-to-4.0.14-form stored checksum 2415947328
InnoDB: Page lsn 0 4409041, low 4 bytes of lsn at page end 4409041
InnoDB: Page number (if stored to page already) 533,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 35
InnoDB: (index PRIMARY of table elegance/image)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: about forcing recovery.
InnoDB: Ending processing because of a corrupt database page.

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
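For anyone wanting to inspect a suspect page directly, the fields the crash log prints can be pulled out of the tablespace file with a few byte reads. A minimal sketch under the standard 16 KiB InnoDB page layout (checksum field at offset 0, page number at 4, LSN at 16, and a trailer holding the old-style checksum plus the low 4 bytes of the LSN in the last 8 bytes); the file name and page number on the command line are examples:

import sys

PAGE = 16384  # default InnoDB page size

def be(b: bytes) -> int:
    return int.from_bytes(b, "big")   # InnoDB stores integers big-endian

path, page_no = sys.argv[1], int(sys.argv[2])   # e.g. ibdata1 533
with open(path, "rb") as f:
    f.seek(page_no * PAGE)
    p = f.read(PAGE)

print("stored checksum      ", be(p[0:4]))                    # header checksum field
print("page number          ", be(p[4:8]))
print("page lsn             ", be(p[16:20]), be(p[20:24]))
print("old-form checksum    ", be(p[-8:-4]))                  # trailer checksum
print("low 4 bytes of lsn   ", be(p[-4:]))                    # trailer LSN word

If the header LSN's low word and the trailer word disagree, the write was torn; if they agree but the checksums don't verify, bits were changed in place, which is the bit-flip scenario discussed throughout this thread.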