Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems

2007-09-17 Thread Maurice Volaski
In using drbd 8.0.5 recently, I have come across at least two 
instances where a bit on disk apparently flipped spontaneously in the 
ext3 metadata on volumes running on top of drbd.


Also, I have been seeing regular corruption of a mysql database, 
which runs on top of drbd, and when I reported this as a bug (since 
I had also recently upgraded mysql versions), they questioned whether 
drbd could be responsible!


All the volumes have been fscked recently and there were no reported 
errors. And, of course, there have been no errors reported from the 
underlying hardware.


I have since upgraded to 8.0.6, but it's too early to say whether 
there is a change.


I'm also seeing the backup server complain of files not comparing, 
though this may be a separate problem on the backup server.




The ext3 bit flipping:
At 12:00 PM -0400 9/11/07, [EMAIL PROTECTED] wrote:

I have come across two files, essentially untouched in years, on two
different ext3 filesystems on the same server, Gentoo AMD 64-bit with
kernel 2.6.22 and fsck version 1.40.2 currently, spontaneously
becoming supremely large:

Filesystem one
Inode 16257874, i_size is 18014398562775391, should be 53297152

Filesystem two
Inode 2121855, i_size is 35184386120704, should be 14032896.
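
The inflated sizes above turn out to be consistent with a lone flipped
high bit, as the ext3 list later diagnosed, with the remaining low-bit
differences coming from fsck padding the size out when it guesses the
correct value. A minimal Python sketch, using the two value pairs from
the fsck output above, shows this:

# Values copied from the fsck output above.
cases = [
    (18014398562775391, 53297152),  # filesystem one, inode 16257874
    (35184386120704, 14032896),     # filesystem two, inode 2121855
]

for reported, expected in cases:
    diff = reported ^ expected  # bit positions where the two sizes disagree
    print("highest differing bit: 2^%d, differing bits in total: %d"
          % (diff.bit_length() - 1, bin(diff).count("1")))

The highest differing bits land at 2^54 and 2^45 respectively: one
stray high-order bit in each i_size.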

Both were discovered during an ordinary backup operation (via EMC
Insignia's Retrospect Linux client).

The backup runs daily, and so one day, one file must have grown
spontaneously to this size and then on another day, it happened to
the second file, which is on a second filesystem. The backup attempt
generated repeated errors:

EXT3-fs warning (device dm-2): ext3_block_to_path: block > big
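
ext3 prints this warning when a file's logical block index exceeds
what the direct/indirect/double/triple-indirect mapping can address,
and an i_size in the petabyte range guarantees that. A rough capacity
figure, as a sketch assuming 4 KiB blocks (the real block size is
whatever tune2fs -l reports):

# ext3 block-mapping capacity with 4 KiB blocks (an assumption; check
# tune2fs -l for the actual block size). Each indirect block holds
# block_size / 4 four-byte block numbers.
block_size = 4096
per_indirect = block_size // 4  # 1024 entries per indirect block
max_blocks = 12 + per_indirect + per_indirect**2 + per_indirect**3
print("max addressable file size: %d bytes (~%.1f TB)"
      % (max_blocks * block_size, max_blocks * block_size / 1e12))

That tops out around 4.4 TB, so an 18-petabyte i_size trips the
"block > big" check as soon as the backup reads past the mappable range.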

Both filesystems are running on different logical volumes, but
underlying those are drbd network RAID devices, and underlying those
is a RAID 6-based SATA disk array.




The reply to the bug report regarding the mysql data corruption, 
which blames drbd!

http://bugs.mysql.com/?id=31038

 Updated by:  Heikki Tuuri
 Reported by: Maurice Volaski
 Category:    Server: InnoDB
 Severity:    S2 (Serious)
 Status:      Open
 Version:     5.0.48
 OS:          Linux
 OS Details:  Gentoo
 Tags:        database page corruption locking up corrupt doublewrite

[17 Sep 18:49] Heikki Tuuri

Maurice, my first guess is to suspect the RAID-1 driver.



My initial report of mysql data corruption:
A 64-bit Gentoo Linux box had just been upgraded from MySQL 4.1 to 
5.0.44 fresh (by dumping in 4.1 and restoring in 5.0.44) and almost 
immediately after that, during which time the database was not used, 
a crash occurred during a scripted mysqldump. So I restored and days 
later, it happened again. The crash details seem to be trying to 
suggest some other aspect of the operating system, even the memory 
or disk is flipping a bit. Or could I be running into a bug in this 
version of MySQL?


Here's the output of the crash
---
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
070827  3:10:04  InnoDB: Page dump in ascii and hex (16384 bytes):
 len 16384; hex

[dump itself deleted for brevity]

;InnoDB: End of page dump
070827  3:10:04  InnoDB: Page checksum 646563254, prior-to-4.0.14-form checksum 2415947328
InnoDB: stored checksum 4187530870, prior-to-4.0.14-form stored checksum 2415947328

InnoDB: Page lsn 0 4409041, low 4 bytes of lsn at page end 4409041
InnoDB: Page number (if stored to page already) 533,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 35
InnoDB: (index PRIMARY of table elegance/image)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html

InnoDB: about forcing recovery.

InnoDB: Ending processing because of a corrupt database page.
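
For what it's worth, every field InnoDB prints here (stored checksum,
page number, LSN, and the LSN's low four bytes repeated in the page
trailer) lives at a fixed offset in the page, so a suspect page can be
inspected directly on disk. A minimal sketch, assuming the classic
uncompressed 16 KiB page layout of this era; the tablespace path and
page number below are illustrations taken from this error, not known
paths:

import struct

PAGE_SIZE = 16 * 1024  # classic InnoDB page size, per the dump above

def inspect_page(page):
    # All fields are big-endian; offsets are the usual FIL header/trailer.
    stored_checksum = struct.unpack_from(">I", page, 0)[0]  # FIL_PAGE_SPACE_OR_CHKSUM
    page_number = struct.unpack_from(">I", page, 4)[0]      # FIL_PAGE_OFFSET
    lsn_hi, lsn_lo = struct.unpack_from(">II", page, 16)    # FIL_PAGE_LSN as two words
    trailer_lsn_lo = struct.unpack_from(">I", page, PAGE_SIZE - 4)[0]
    print("page %d: stored checksum %d, lsn %d %d, "
          "low 4 bytes of lsn at page end %d"
          % (page_number, stored_checksum, lsn_hi, lsn_lo, trailer_lsn_lo))
    # InnoDB's cheapest sanity check: the LSN's low 32 bits must match at
    # the head and the tail of the page, otherwise the write was torn.
    if lsn_lo != trailer_lsn_lo:
        print("  head/tail lsn mismatch -> torn or corrupted write")

with open("/var/lib/mysql/ibdata1", "rb") as f:  # path is an assumption
    f.seek(533 * PAGE_SIZE)                      # page 533, from the error above
    inspect_page(f.read(PAGE_SIZE))

Notably, in the dump above the head and tail LSNs agree (4409041 both
times), so this was not a simple torn write; the checksum mismatch
points at corruption inside the page body, consistent with a flipped
bit somewhere in the I/O path.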


--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




Re: Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems

2007-09-17 Thread Maurice Volaski

On Sep 17, 2007  13:31 -0400, Maurice Volaski wrote:

 In using drbd 8.0.5 recently, I have come across at least two
 instances where a bit on disk apparently flipped spontaneously in the
 ext3 metadata on volumes running on top of drbd.

 Also, I have been seeing regular corruption of a mysql database,
 which runs on top of drbd, and when I reported this as a bug since I
 also recently upgraded mysql versions, they question whether drbd
 could be responsible!


Seems unlikely - more likely to be RAM or similar (would include cable
for PATA/SCSI but that is less likely an issue for SATA).



Shouldn't that trip the ECC and produce machine check exceptions, and 
unrecoverable ones at that?


The disks are part of hardware RAID with a SATA II cableless 
backplane and SATA-SCSI controller, so there is a SCSI cable and SCSI 
HBA (LSI Logic).

--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




Re: Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems

2007-09-17 Thread Maurice Volaski

Hi Maurice,

If you're running into corruption both in ext3 metadata and in MySQL 
data, it is certainly not the fault of MySQL, as you're likely aware.


I am hoping they are not related. The problems with MySQL surfaced 
almost immediately after upgrading to 5.0.x.




[details deleted]

You can see that there are in fact many bits flipped in each.  I 
would suspect higher-level corruption than a single flipped bit.


I initially thought this as well, but the explanation on the ext3 
mailing list is that it really is just a lone flipped bit in both 
instances. The other differences are due to fsck padding out the 
block when it guesses what the correct size is.


Do note that data on e.g. the PCI bus is not protected by any sort 
of checksum.  I've seen this cause corruption problems with PCI 
risers and RAID cards.  Are you using a PCI riser card?  Note that 
LSI does *not* certify their cards to be used on risers if you are 
custom building a machine.




Yes, there is a riser card. Wouldn't this imply that LSI is saying 
you can't use a 1U or a 2U box?


It's kind of scary there is no end-to-end parity implemented 
somewhere along the whole data path to prevent this. It sort of 
defeats the point of RAID 6 and ECC.
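
Absent end-to-end parity in the hardware, integrity checking can be
bolted on at the application level by recording a digest per file and
comparing successive runs. A minimal sketch of the idea (the tree root
and manifest filename are examples, not anything from this thread):

import hashlib
import json
import os

def checksum_tree(root):
    # Walk the tree and record an MD5 digest per file; a silent bit flip
    # shows up as a digest change between two runs over unchanged files.
    manifest = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path] = digest.hexdigest()
    return manifest

if __name__ == "__main__":
    with open("manifest.json", "w") as out:
        json.dump(checksum_tree("/data"), out)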


How did you determine this was the cause?



You mean a Serially-Attached SCSI, aka SAS, controller, I assume?


No, it's SATA to SCSI.


Is this a custom-built machine or a vendor-integrated one?


It is custom-built.




Maurice Volaski wrote:
[original report quoted in full; trimmed here, as it duplicates the first message in this thread]

Re: Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems

2007-09-17 Thread Maurice Volaski
I guess I will watch it closely for now, and if it trips up again, 
fail over to the drbd peer and see what happens there. I suppose I 
could even detach the local disks and have it run using the peer 
over the wire. That should eliminate the local I/O subsystem.



It's kind of scary there is no end-to-end parity implemented 
somewhere along the whole data path to prevent this. It sort of 
defeats the point of RAID 6 and ECC.


I agree, it's pretty damn scary.  You can read about the story and 
the ensuing discussion here:


I wonder if drbd could help out with that.


Interesting.  I hadn't heard of such a thing until I just looked it 
up.  But in any case that adds yet another variable (and a fairly 
uncommon one) to the mix.




It's this one: http://www.acnc.com/02_01_jetstor_sata_416s.html. I 
thought units like it were very popular.

--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




Re: Could drbd randomly flip bits? Was: Database page corruption on disk occurring during mysqldump on a fresh database and Was: Spontaneous development of supremely large files on different ext3 filesystems

2007-09-17 Thread Maurice Volaski
I failed over the server and ran a short backup, and there were no 
"didn't compare" errors, whereas on the first server they show up 
pretty reliably. I guess this confirms some hardware on the first 
server is flipping bits. Essentially, users could have any number of 
munged files (most files are binary) since the problem surfaced a few 
weeks ago, and there'd be no way to know. Unfortunately, the 
secondary server was off for a short time at one point, so even if 
the munging were taking place in the I/O subsystem and not in RAM, it 
is possible that some blocks got copied badly to the secondary server.
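
One way to enumerate the munged files after the fact would be to take
checksum manifests on both nodes (as sketched earlier in the thread)
and diff them; files whose digests disagree between the primary and
the drbd peer are the candidates. A sketch, with the manifest
filenames as placeholders:

import json

def diff_manifests(primary_path, peer_path):
    # Report files whose digests differ between the two nodes.
    with open(primary_path) as a, open(peer_path) as b:
        primary, peer = json.load(a), json.load(b)
    for path in sorted(set(primary) & set(peer)):
        if primary[path] != peer[path]:
            print("differs:", path)
    for path in sorted(set(primary) ^ set(peer)):
        print("only on one node:", path)

diff_manifests("manifest-primary.json", "manifest-peer.json")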


Anyway, it seems the problem is definitely hardware and not due to 
ext3, drbd, or mysql!

--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




Re: The current version is 5.0.48, no?

2007-09-14 Thread Maurice Volaski
Thank you for this info, but it just seems to make a simple question 
a matter of confusion.


It tells us that MySQL is being marketed under two editions, but nowhere
does it say that the current release of each is matched bugfix for bugfix,
with the version difference being just arithmetic.

Community's 5.0.45 came out a few months ago and enterprise's 5.0.48
came out just a few weeks ago, so from the look of the release 
notes, I am inclined to believe that the community version is indeed out of date.




In the last episode (Sep 13), Maurice Volaski said:

 I just learned that the current version of MySQL is 5.0.48, described here
 http://dev.mysql.com/doc/refman/5.0/en/releasenotes-es-5-0.html and
 available from
 http://download.dorsalsource.org/files/b/5/165/mysql-5.0.48.tar.gz


The current Mysql Enterprise version is 5.0.48.  The current Mysql
Community version is 5.0.45.

Enterprise release notes:
http://dev.mysql.com/doc/refman/5.0/en/releasenotes-es-5-0.html

Community release notes:
http://dev.mysql.com/doc/refman/5.0/en/releasenotes-cs.html

Comparison:
http://www.mysql.com/products/which-edition.html


--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




Re: Database page corruption on disk occurring during mysqldump on a fresh database

2007-09-13 Thread Maurice Volaski
It certainly seems that 5.0.44 and 5.0.45 are unstable. I have logged 
this as bug http://bugs.mysql.com/bug.php?id=31008


A 64-bit Gentoo Linux box had just been upgraded from MySQL 4.1 to 
5.0.44 fresh (by dumping in 4.1 and restoring in 5.0.44) and almost 
immediately after that, during which time the database was not used, 
a crash occurred during a scripted mysqldump. So I restored and days 
later, it happened again. The crash details seem to be trying to 
suggest some other aspect of the operating system, even the memory 
or disk is flipping a bit. Or could I be running into a bug in this 
version of MySQL?


Here's the output of the crash
---
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
070827  3:10:04  InnoDB: Page dump in ascii and hex (16384 bytes):
 len 16384; hex

[dump itself deleted for brevity]

;InnoDB: End of page dump
070827  3:10:04  InnoDB: Page checksum 646563254, 
prior-to-4.0.14-form checksum 2415947328
InnoDB: stored checksum 4187530870, prior-to-4.0.14-form stored 
checksum 2415947328

InnoDB: Page lsn 0 4409041, low 4 bytes of lsn at page end 4409041
InnoDB: Page number (if stored to page already) 533,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 35
InnoDB: (index PRIMARY of table elegance/image)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also InnoDB: 
http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html

InnoDB: about forcing recovery.
InnoDB: Ending processing because of a corrupt database page.


--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




The current version is 5.0.48, no?

2007-09-13 Thread Maurice Volaski
I just learned that the current version of MySQL is 5.0.48, described 
here http://dev.mysql.com/doc/refman/5.0/en/releasenotes-es-5-0.html 
and available from 
http://download.dorsalsource.org/files/b/5/165/mysql-5.0.48.tar.gz


When I search this list, I see no mention of it or the previous 
release, 5.0.46. Is there some reason we shouldn't be running 5.0.48?

--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




Is bad hardware confusing MySQL and InnoDB?

2007-09-12 Thread Maurice Volaski
Some processes on a server (64-bit Gentoo Linux with MySQL 5.0.44) 
that seemed to be related to I/O on LVM volumes hung, and it was 
necessary to force a reboot. The mysql data was not on an LVM volume, 
though it still may have been affected since, over time, more and 
more processes became unresponsive. While fsck recovered the journal 
and detected no problems on any volume, at least one database was not 
spared:


070911 23:40:34  InnoDB: Page checksum 3958948568, 
prior-to-4.0.14-form checksum 2746081740
InnoDB: stored checksum 2722580120, prior-to-4.0.14-form stored 
checksum 2746081740

InnoDB: Page lsn 0 491535, low 4 bytes of lsn at page end 491535
InnoDB: Page number (if stored to page already) 199,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 17
InnoDB: Also the page in the doublewrite buffer is corrupt.
InnoDB: Cannot continue operation.

Is it wrong to expect InnoDB to have avoided this, or does it suggest 
that it couldn't have, i.e., that this is a hardware defect?

--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




Re: Database page corruption on disk occurring during mysqldump on a fresh database

2007-09-10 Thread Maurice Volaski
Thank you for your replies. I attempted to restore again and, most 
oddly, mysql complained that it couldn't restore to a particular 
table because it wasn't in the database, which, of course, it had to 
be, because the restore itself had just recreated it. So I blew away 
the entire mysql directory on the disk, updated to 5.0.45, and then 
it did not complain when I restored that time. It has not complained 
since.




Hi,
This might be happening due to two reasons:
1. The system date might not be correct.
2. Something wrong with the log position (incorrect log position).

Regards,
Krishna Chandra Prajapati



The checksum errors might be due to various reasons. We had a similar 
issue where we restored the database multiple times and replaced the 
RAM sticks; nothing helped. Finally we drilled the issue down to the 
chassis. I recommend testing the restore on a different machine to 
rule out any hardware issue.


--
Thanks
Alex
http://alexlurthu.wordpress.com



On 8/31/07, Maurice Volaski [EMAIL PROTECTED] wrote:


A 64-bit Gentoo Linux box had just been upgraded from MySQL 4.1 to
5.0.44 fresh (by dumping in 4.1 and restoring in 5.0.44) and almost
immediately after that, during which time the database was not used,
a crash occurred during a scripted mysqldump. So I restored and days
later, it happened again. The crash details seem to be trying to
suggest some other aspect of the operating system, even the memory or
disk is flipping a bit. Or could I be running into a bug in this
version of MySQL?




--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




Database page corruption on disk occurring during mysqldump on a fresh database

2007-08-31 Thread Maurice Volaski
A 64-bit Gentoo Linux box had just been upgraded from MySQL 4.1 to 
5.0.44 fresh (by dumping in 4.1 and restoring in 5.0.44) and almost 
immediately after that, during which time the database was not used, 
a crash occurred during a scripted mysqldump. So I restored and days 
later, it happened again. The crash details seem to be trying to 
suggest some other aspect of the operating system, even the memory or 
disk is flipping a bit. Or could I be running into a bug in this 
version of MySQL?


Here's the output of the crash
---
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
070827  3:10:04  InnoDB: Page dump in ascii and hex (16384 bytes):
 len 16384; hex

[dump itself deleted for brevity]

 ;InnoDB: End of page dump
070827  3:10:04  InnoDB: Page checksum 646563254, 
prior-to-4.0.14-form checksum 2415947328
InnoDB: stored checksum 4187530870, prior-to-4.0.14-form stored 
checksum 2415947328

InnoDB: Page lsn 0 4409041, low 4 bytes of lsn at page end 4409041
InnoDB: Page number (if stored to page already) 533,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 35
InnoDB: (index PRIMARY of table elegance/image)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 533.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also InnoDB: 
http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html

InnoDB: about forcing recovery.
InnoDB: Ending processing because of a corrupt database page.

--

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
