Re: Odd data corruption problem with LVM/ReiserFS
On Tue, 22 Feb 2005 20:49:00 +0100, Marc A. Lehmann <[EMAIL PROTECTED]> wrote:
> > > A reboot fixes this for both ext3 and reiserfs (i.e. the error is gone).
> >
> > Well, it didn't fix it for me. The fs was trashed for good. The major
> > question for me is now usability of md/dm for any purpose with 2.6.x.
> > For me this is a showstopper for any kind of 2.6 production use.
>
> Well, I do use reiserfs->aes-loop->lvm/dm->md5/raid5, and it never failed
> for me, except once, and the error is likely to be outside reiserfs, and
> possibly outside lvm.

Marc, what about you, were you using dm-snapshot when you experienced
temporary corruption?

Alex
Re: Odd data corruption problem with LVM/ReiserFS
I found out some interesting things tonight. I removed my /var and /home
snapshots, and all the corruption, with the exception of files I had changed
while /var and /home were in their corrupted state, had disappeared! I
overwrote several files on /var that were corrupt with clean copies from my
backups, and verified that they were OK. I then created a new /var snapshot
and mounted it, only to find out that the files on that snapshot were still
corrupt, but the files under the real /var were still in good shape. I
umounted, lvremoved, lvcreated, and mounted the /var snapshot, and saw the
same results. Even after removing the snapshot, rebooting, and recreating the
snapshot I saw the same thing (real /var had the correct file, snapshot /var
had the corrupt file).

Do you think my volume group has simply become corrupt and will need to be
recreated, or do you guys think this is a bug in dm-snapshot? If so, please
let me know what I can do to help you guys debug this.

Thanks,

Alex

On Mon, 21 Feb 2005 15:18:52 +, Alasdair G Kergon <[EMAIL PROTECTED]> wrote:
> On Sun, Feb 20, 2005 at 11:25:37PM -0600, Alex Adriaanse wrote:
> > This morning was the first time my backup script took
> > a snapshot since upgrading to 2.6.10-ac12 (yesterday I had taken a few
> > snapshots myself for testing purposes, this seemed to work fine).
>
> a) Activating a snapshot requires a lot of memory;
>
> b) If a snapshot can't get the memory it needs you have to back it
> out manually (using dmsetup - some combination of resume, remove &
> possibly reload) to avoid locking up the volume - what you have to do
> depends how far it got before it failed;
>
> c) You should be OK once a snapshot is active and its origin has
> successfully had a block written to it.
>
> Work is underway to address the various problems with snapshot activation
> - we think we understand them all - but until the fixes have worked their
> way through, unless you've enough memory in the machine it's best to avoid
> them.
>
> Suggestions:
>   Only do one snapshot+backup at once;
>   Make sure logging in as root and using dmsetup does not depend on access
>   to anything in /var or /home (similar to the case of hard NFS mounts with
>   the server down) so you can still log in;
>
> BTW Also never snapshot the root filesystem unless you've mounted it noatime
> or disabled hotplug etc. - e.g. the machine can lock up attempting to
> update the atime on /sbin/hotplug while writes to the filesystem are blocked
>
> Alasdair
> --
> [EMAIL PROTECTED]
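For reference, a minimal sketch of the origin-versus-snapshot comparison
described above might look like the following. The volume group (vg0), volume
(var), snapshot name and size, mount point, and test file are all placeholder
assumptions, not the names from the actual report.

#!/bin/sh
# Sketch: snapshot /dev/vg0/var, mount it read-only, and compare one
# known-good file on the origin against the same file on the snapshot.
# All names below are examples only.
set -e

lvcreate --snapshot --size 512M --name varsnap /dev/vg0/var
mkdir -p /mnt/varsnap
mount -o ro /dev/vg0/varsnap /mnt/varsnap

# On a healthy snapshot these two checksums should match.
md5sum /var/lib/dpkg/status /mnt/varsnap/lib/dpkg/status

umount /mnt/varsnap
lvremove -f /dev/vg0/varsnap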
Re: Odd data corruption problem with LVM/ReiserFS
Alasdair,

Thanks for the tips. Do you think it's possible DM's snapshots could've
caused this corruption, or do you think the problem lies elsewhere?

Alex

On Mon, 21 Feb 2005 15:18:52 +, Alasdair G Kergon <[EMAIL PROTECTED]> wrote:
> On Sun, Feb 20, 2005 at 11:25:37PM -0600, Alex Adriaanse wrote:
> > This morning was the first time my backup script took
> > a snapshot since upgrading to 2.6.10-ac12 (yesterday I had taken a few
> > snapshots myself for testing purposes, this seemed to work fine).
>
> a) Activating a snapshot requires a lot of memory;
>
> b) If a snapshot can't get the memory it needs you have to back it
> out manually (using dmsetup - some combination of resume, remove &
> possibly reload) to avoid locking up the volume - what you have to do
> depends how far it got before it failed;
>
> c) You should be OK once a snapshot is active and its origin has
> successfully had a block written to it.
>
> Work is underway to address the various problems with snapshot activation
> - we think we understand them all - but until the fixes have worked their
> way through, unless you've enough memory in the machine it's best to avoid
> them.
>
> Suggestions:
>   Only do one snapshot+backup at once;
>   Make sure logging in as root and using dmsetup does not depend on access
>   to anything in /var or /home (similar to the case of hard NFS mounts with
>   the server down) so you can still log in;
>
> BTW Also never snapshot the root filesystem unless you've mounted it noatime
> or disabled hotplug etc. - e.g. the machine can lock up attempting to
> update the atime on /sbin/hotplug while writes to the filesystem are blocked
>
> Alasdair
> --
> [EMAIL PROTECTED]
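As a rough illustration of the manual back-out in point (b) above, the steps
might look roughly like this, assuming the half-activated snapshot appears as
a device-mapper device named vg0-varsnap; the device names are placeholders,
and which steps are actually needed depends on how far activation got, exactly
as Alasdair says.

# Inspect the state device-mapper is in (example device names).
dmsetup info vg0-varsnap
dmsetup status vg0-varsnap

# If the origin volume was left suspended, resume it so writes can proceed.
dmsetup resume vg0-var

# Remove the stuck snapshot mapping, and its exception store if one exists.
dmsetup remove vg0-varsnap
dmsetup remove vg0-varsnap-cow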
Re: Odd data corruption problem with LVM/ReiserFS
On Mon, 21 Feb 2005 12:37:53 +0100, Andreas Steinmetz <[EMAIL PROTECTED]> wrote:
> Alex Adriaanse wrote:
> > As far as I can tell all the directories are still intact, but there
> > was a good number of files that had been corrupted. Those files
> > looked like they had some chunks removed, and some had a bunch of NUL
> > characters (in blocks of 4096 characters). Some files even had chunks
> > of other files inside of them!
>
> I can second that. I had the same experience this weekend on a
> md/dm/reiserfs setup. The funny thing is that e.g. find reports I/O
> errors but if you then run tar on the tree you eventually get the
> correct data from tar. Then run find again and you'll again get I/O errors.

The weird thing is I did not see any I/O errors in my logs, and running find
on /var worked without a problem.

By the way, did you take any DM snapshots when you experienced that
corruption?
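A small sketch of the find / tar / find check Andreas describes, with /var
used purely as an example path; it only records whether the I/O errors show
up on the first walk, during the tar read, and again on the second walk.

#!/bin/sh
# Walk the tree, read all of it with tar, then walk it again, saving the
# errors from each pass so they can be compared afterwards.
TREE=/var    # example path only

find "$TREE" > /dev/null 2> find-pass1.err
tar cf - "$TREE" 2> tar-pass.err | cat > /dev/null
find "$TREE" > /dev/null 2> find-pass2.err

wc -l find-pass1.err tar-pass.err find-pass2.err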
Odd data corruption problem with LVM/ReiserFS
As of this morning I've experienced a very odd data corruption problem on my
server. Let me post some background information first.

For the past few years I've been running this server under Linux 2.4.x and
Debian Woody. It has two software RAID 1 partitions: one for the ReiserFS
root filesystem (md0), and one for LVM running on top of RAID 1 (md1). Under
LVM I have three logical volumes, one each for /usr, /var, and /home, all
running ReiserFS. During daily backups I'd create a snapshot of /var and
/home and back that up. I haven't experienced any problems with this, other
than occasional power outages that might've corrupted some log files by
adding a bunch of NULs to them, but that has never caused problems for me.

A few weeks ago I decided to upgrade to Debian Sarge. This was a fairly
smooth process, and I haven't seen any problems with it (and I don't think it
is related to the problem described below). Last week I also upgraded from
the 2.4.22 kernel to 2.6.10-ac12. This has been a pretty smooth ride too
(until this morning). One exception is that I might not have had swap turned
on due to device name changes, so yesterday I saw some big processes getting
killed due to out-of-memory conditions (my server has 256MB non-ECC RAM,
normally with 512MB of swap). That swap issue was not fixed until this
afternoon, after the crash/corruption. Yesterday afternoon I also updated the
metadata of my LVM volume group from version 1 to version 2. Before that I
had temporarily stopped taking snapshots after upgrading to 2.6.10-ac12,
since it didn't like taking snapshots inside LVM1 volume groups. This morning
was the first time my backup script took a snapshot since upgrading to
2.6.10-ac12 (yesterday I had taken a few snapshots myself for testing
purposes, and that seemed to work fine).

This morning, when I tried to log in after the backup process (which takes
snapshots) had started, I couldn't get in. SSH would just hang after I sent
my username. After a while I gave up waiting and tried to reboot the server
by attaching a keyboard and hitting Ctrl-Alt-Del, which started the shutdown
process. I can't fully remember whether that successfully rebooted the
server, but I believe I ended up having to press the reset button because the
shutdown process hung at some point. The server came back up, but some
processes wouldn't start due to corrupted files on the /var partition.

I checked the logs and saw a bunch of the messages below. On a side note,
when my backup script isn't able to mount a snapshot, it removes it, waits a
minute, then tries creating/mounting the snapshot again, supposedly up to 10
times (a rough sketch of this retry loop follows the log below). These
messages, spaced one minute apart, occurred far more than 10 times, so that
might be a bug in my script. The retry logic exists because of occasional
problems I had with older kernels, which sometimes failed to mount the
snapshot but succeeded when trying again later.

These are the messages I saw:
Feb 20 09:59:16 homer kernel: lvcreate: page allocation failure. order:0, mode:0xd0
Feb 20 09:59:16 homer kernel: [__alloc_pages+440/864] __alloc_pages+0x1b8/0x360
Feb 20 09:59:16 homer kernel: [alloc_pl+51/96] alloc_pl+0x33/0x60
Feb 20 09:59:16 homer kernel: [client_alloc_pages+28/96] client_alloc_pages+0x1c/0x60
Feb 20 09:59:16 homer kernel: [vmalloc+32/48] vmalloc+0x20/0x30
Feb 20 09:59:16 homer kernel: [kcopyd_client_create+104/192] kcopyd_client_create+0x68/0xc0
Feb 20 09:59:16 homer kernel: [dm_create_persistent+199/320] dm_create_persistent+0xc7/0x140
Feb 20 09:59:16 homer kernel: [snapshot_ctr+680/880] snapshot_ctr+0x2a8/0x370
Feb 20 09:59:16 homer kernel: [dm_table_add_target+262/432] dm_table_add_target+0x106/0x1b0
Feb 20 09:59:16 homer kernel: [populate_table+130/224] populate_table+0x82/0xe0
Feb 20 09:59:16 homer kernel: [table_load+103/368] table_load+0x67/0x170
Feb 20 09:59:16 homer kernel: [ctl_ioctl+241/336] ctl_ioctl+0xf1/0x150
Feb 20 09:59:16 homer kernel: [table_load+0/368] table_load+0x0/0x170
Feb 20 09:59:16 homer kernel: [sys_ioctl+173/528] sys_ioctl+0xad/0x210
Feb 20 09:59:16 homer kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Feb 20 09:59:16 homer kernel: device-mapper: error adding target to table
Feb 20 09:59:16 homer kernel: lvremove: page allocation failure. order:0, mode:0xd0
Feb 20 09:59:16 homer kernel: [__alloc_pages+440/864] __alloc_pages+0x1b8/0x360
Feb 20 09:59:16 homer kernel: [alloc_pl+51/96] alloc_pl+0x33/0x60
Feb 20 09:59:16 homer kernel: [client_alloc_pages+28/96] client_alloc_pages+0x1c/0x60
Feb 20 09:59:16 homer kernel: [vmalloc+32/48] vmalloc+0x20/0x30
Feb 20 09:59:16 homer kernel: [kcopyd_client_create+104/192] kcopyd_client_create+0x68/0xc0
Feb 20 09:59:16 homer kernel: [dm_create_persistent+199/320] dm_create_persistent+0xc7/0x140
Feb 20 09:59:16 homer kernel: [snapshot_ctr+680/880] snapshot_ctr+0x2a8/0x370
Feb 20 09:59:16 homer kernel: [dm_table_add_target+
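The retry loop described above (create a snapshot, try to mount it, on
failure remove it, wait a minute, and try again, up to 10 times) would look
roughly like the sketch below. The volume group (vg0), volume (var), snapshot
name and size, and mount point are placeholder assumptions, not taken from
the actual backup script.

#!/bin/sh
# Sketch of the backup script's snapshot retry loop, with example names.
attempt=1
while [ "$attempt" -le 10 ]; do
    if lvcreate --snapshot --size 512M --name varsnap /dev/vg0/var &&
       mount -o ro /dev/vg0/varsnap /mnt/varsnap; then
        echo "snapshot mounted on attempt $attempt"
        break
    fi
    # Back out whatever half-succeeded, wait a minute, and retry.
    umount /mnt/varsnap 2> /dev/null
    lvremove -f /dev/vg0/varsnap 2> /dev/null
    sleep 60
    attempt=$((attempt + 1))
done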