Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Tue, 2007-05-29 at 13:28 +1000, David Chinner wrote:
> On Mon, May 28, 2007 at 05:45:27PM -0500, Alberto Alonso wrote:
> > On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote:
> > > I consider the possibility of serving out bad data (i.e. after
> > > a remount to readonly) to be the worst possible disruption of
> > > service that can happen ;)
> >
> > I guess it does depend on the nature of the failure. A write failure
> > on block 2000 does not imply corruption of the other 2TB of data.
>
> The rest might not be corrupted, but if block 2000 is an index of
> some sort (i.e. metadata), you could reference any of that 2TB
> incorrectly and get the wrong data, write to the wrong spot on disk,
> etc.

Forgive my ignorance, but if block 2000 is an index, then to access the
data it references you would go through block 2000, which would return
an error without continuing on to access any data pointed to by it.
Isn't that how things work?

> > > > I personally have found the XFS file system to be great for
> > > > my needs (except issues with NFS interaction, where the bug report
> > > > never got answered), but that doesn't mean it can not be improved.
> > >
> > > Got a pointer?
> >
> > I can't seem to find it. I'm pretty sure I used bugzilla to report
> > it. I did find the kernel dump file though, so here it is:
> >
> > Oct 3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns:
> > vp/0xd1e69c80, invp/0xc989e380
>
> Oh, I haven't seen any of those problems for quite some time.
>
> > = /proc/kmsg started.
> > Oct 3 15:51:23 localhost kernel:
> > Inspecting /boot/System.map-2.6.8-2-686-smp
>
> Oh, well, yes, kernels that old did have that problem. It got fixed
> some time around 2.6.12 or 2.6.13 IIRC.

Time for a kernel upgrade then :-)

Thanks for all your enlightenment, I think I am learning quite a few
things.
Alberto

-
To unsubscribe from this list: send the line "unsubscribe linux-raid"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 05:45:27PM -0500, Alberto Alonso wrote:
> On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote:
> > On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
> > > I think his point was that going into a read only mode causes a
> > > less catastrophic situation (ie. a web server can still serve
> > > pages).
> >
> > Sure - but once you've detected one corruption or had metadata
> > I/O errors, can you trust the rest of the filesystem?
> >
> > > I think that is a valid point, rather than shutting down
> > > the file system completely, an automatic switch to where the least
> > > disruption of service can occur is always desired.
> >
> > I consider the possibility of serving out bad data (i.e. after
> > a remount to readonly) to be the worst possible disruption of
> > service that can happen ;)
>
> I guess it does depend on the nature of the failure. A write failure
> on block 2000 does not imply corruption of the other 2TB of data.

The rest might not be corrupted, but if block 2000 is an index of
some sort (i.e. metadata), you could reference any of that 2TB
incorrectly and get the wrong data, write to the wrong spot on disk,
etc.

> > > I personally have found the XFS file system to be great for
> > > my needs (except issues with NFS interaction, where the bug report
> > > never got answered), but that doesn't mean it can not be improved.
> >
> > Got a pointer?
>
> I can't seem to find it. I'm pretty sure I used bugzilla to report
> it. I did find the kernel dump file though, so here it is:
>
> Oct 3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns:
> vp/0xd1e69c80, invp/0xc989e380

Oh, I haven't seen any of those problems for quite some time.

> = /proc/kmsg started.
> Oct 3 15:51:23 localhost kernel:
> Inspecting /boot/System.map-2.6.8-2-686-smp

Oh, well, yes, kernels that old did have that problem. It got fixed
some time around 2.6.12 or 2.6.13 IIRC.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 05:30:52PM +0200, Pallai Roland wrote:
> On Monday 28 May 2007 14:53:55 Pallai Roland wrote:
> > On Friday 25 May 2007 02:05:47 David Chinner wrote:
> > > "-o ro,norecovery" will allow you to mount the filesystem and get any
> > > uncorrupted data off it.
> > >
> > > You still may get shutdowns if you trip across corrupted metadata in
> > > the filesystem, though.
> >
> > This filesystem is completely dead.
> > [...]
>
> I tried to make an md patch to stop writes if a raid5 array has got 2+
> failed drives, but I found it's already done, oops. :) handle_stripe5()
> quietly ignores writes in this case; I tried it and it works.

Hmmm - it clears the uptodate bit on the bio, which is supposed to make
the bio return EIO. That looks to be doing the right thing...

> There's another layer I used on this box between md and xfs: loop-aes. I

Oh, that's a kind of important thing to forget to mention...

> have used it for years and it has been rock stable, but now it's my first
> suspect, because I found a bug in it today:
> I assembled my array from n-1 disks, then I failed a second disk for a
> test, and I found /dev/loop1 still provides *random* data where /dev/md1
> serves nothing; it's definitely a loop-aes bug.
> It's not an explanation for my screwed-up file system, but for me it's
> enough to drop loop-aes. Eh.

If you can get random data back instead of an error from the block
device, then I'm not surprised your filesystem is toast.

If it's one sector in a larger block that is corrupted, then the only
thing that will protect you from this sort of corruption causing
problems is metadata checksums (yet another thing on my list of stuff
to do).

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 01:17:31PM +0200, Pallai Roland wrote:
> On Monday 28 May 2007 04:17:18 David Chinner wrote:
> > Hmmm. A quick look at the linux code makes me think that background
> > writeback on linux has never been able to cause a shutdown in this
> > case. However, the same error on Irix will definitely cause a
> > shutdown, though
>
> I hope Linux will follow Irix, that's a consistent standpoint.

I raised a bug for this yesterday when writing that reply. It won't get
forgotten now.

> David, do you have a plan to implement your "reporting raid5 block layer"
> idea? No one else seems to care about this silent data loss on temporarily
> (cable, power) failed raid5 arrays as far as I can see, so I really hope
> you do at least!

Yeah, I'd love to get something like this happening, but given it's
about half way down my list of "stuff to do when I have some spare time"
I'd say it will be about 2015 before I get to it.

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote:
> On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
> > I think his point was that going into a read only mode causes a
> > less catastrophic situation (ie. a web server can still serve
> > pages).
>
> Sure - but once you've detected one corruption or had metadata
> I/O errors, can you trust the rest of the filesystem?
>
> > I think that is a valid point, rather than shutting down
> > the file system completely, an automatic switch to where the least
> > disruption of service can occur is always desired.
>
> I consider the possibility of serving out bad data (i.e. after
> a remount to readonly) to be the worst possible disruption of
> service that can happen ;)

I guess it does depend on the nature of the failure. A write failure
on block 2000 does not imply corruption of the other 2TB of data.

I wish I knew more about the internals of file systems; unfortunately,
since I don't, I was just commenting on features that would be nice,
though maybe there is no way to implement them. I figured that a dynamic
table of bad blocks could be kept: if an attempt to access one of those
blocks is made (read or write), an I/O error is returned; if the block
is not on the list, the access is processed. This would help a server
with large file systems continue operations for most users.

> > I personally have found the XFS file system to be great for
> > my needs (except issues with NFS interaction, where the bug report
> > never got answered), but that doesn't mean it can not be improved.
>
> Got a pointer?

I can't seem to find it. I'm pretty sure I used bugzilla to report
it. I did find the kernel dump file though, so here it is:

Oct 3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns: vp/0xd1e69c80, invp/0xc989e380
Oct 3 15:34:07 localhost kernel: [ cut here ]
Oct 3 15:34:07 localhost kernel: kernel BUG at fs/xfs/support/debug.c:106!
Oct 3 15:34:07 localhost kernel: invalid operand: [#1]
Oct 3 15:34:07 localhost kernel: PREEMPT SMP
Oct 3 15:34:07 localhost kernel: Modules linked in: af_packet iptable_filter ip_tables nfsd exportfs lockd sunrpc ipv6 xfs capability commoncap ext3 jbd mbcache aic7xxx i2c_dev tsdev floppy mousedev parport_pc parport psmouse evdev pcspkr hw_random shpchp pciehp pci_hotplug intel_agp intel_mch_agp agpgart uhci_hcd usbcore piix ide_core e1000 cfi_cmdset_0001 cfi_util mtdpart mtdcore jedec_probe gen_probe chipreg dm_mod w83781d i2c_sensor i2c_i801 i2c_core raid5 xor genrtc sd_mod aic79xx scsi_mod raid1 md unix font vesafb cfbcopyarea cfbimgblt cfbfillrect
Oct 3 15:34:07 localhost kernel: CPU: 0
Oct 3 15:34:07 localhost kernel: EIP: 0060:[__crc_pm_idle+3334982/5290900] Not tainted
Oct 3 15:34:07 localhost kernel: EFLAGS: 00010246 (2.6.8-2-686-smp)
Oct 3 15:34:07 localhost kernel: EIP is at cmn_err+0xc5/0xe0 [xfs]
Oct 3 15:34:07 localhost kernel: eax: ebx: f602c000 ecx: c02dcfbc edx: c02dcfbc
Oct 3 15:34:07 localhost kernel: esi: f8c40e28 edi: f8c56a3e ebp: 0293 esp: f602da08
Oct 3 15:34:07 localhost kernel: ds: 007b es: 007b ss: 0068
Oct 3 15:34:07 localhost kernel: Process nfsd (pid: 2740, threadinfo=f602c000 task=f71a7210)
Oct 3 15:34:07 localhost kernel: Stack: f8c40e28 f8c40def f8c56a00 f602c000 074aa1aa f8c41700 ea2f0a40
Oct 3 15:34:07 localhost kernel: f8c0a745 f8c41700 d1e69c80 c989e380 f7d4cc00 c2934754 074aa1aa
Oct 3 15:34:07 localhost kernel: f6555624 074aa1aa f7d4cc00 c017d6bd f6555620
Oct 3 15:34:07 localhost kernel: Call Trace:
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3123398/5290900] xfs_iget_core+0x565/0x6b0 [xfs]
Oct 3 15:34:07 localhost kernel: [iget_locked+189/256] iget_locked+0xbd/0x100
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3124083/5290900] xfs_iget+0x162/0x1a0 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3252484/5290900] xfs_vget+0x63/0x100 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3331204/5290900] vfs_vget+0x43/0x50 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3329570/5290900] linvfs_get_dentry+0x51/0x90 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+1536451/5290900] find_exported_dentry+0x42/0x830 [exportfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3174595/5290900] xlog_write+0x102/0x580 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3170617/5290900] xlog_assign_tail_lsn+0x18/0x90 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs]
Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3174595/5290900] xlog_write+0x102/0x580 [xfs]
Oct 3 15:34:07 localhost kernel: [alloc_skb+71/240] alloc_skb+0x47/0xf0
Oct 3 15:34:07 l
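[Editor's note: Alberto's "dynamic table of bad blocks" idea above can be sketched in a few lines. This is a hypothetical illustration only - the class and method names are invented, and neither XFS nor md actually works this way; it just shows the proposed policy of failing only accesses to known-bad blocks while the rest of the device stays usable.]

```python
# Hypothetical sketch of a dynamic bad-block table: accesses to listed
# blocks fail with EIO, everything else is served normally. Names are
# invented for illustration; this is not real kernel behaviour.
import errno


class BadBlockFilter:
    def __init__(self, backing):
        self.backing = backing      # dict of block -> bytes, stands in for a device
        self.bad_blocks = set()     # the dynamic bad-block table

    def mark_bad(self, block):
        """Record a block that failed a read or write."""
        self.bad_blocks.add(block)

    def read(self, block):
        if block in self.bad_blocks:
            # Fail only this access; the rest of the device stays usable.
            raise OSError(errno.EIO, "I/O error", f"block {block}")
        return self.backing[block]

    def write(self, block, data):
        if block in self.bad_blocks:
            raise OSError(errno.EIO, "I/O error", f"block {block}")
        self.backing[block] = data


dev = BadBlockFilter({n: b"\x00" * 512 for n in range(4096)})
dev.mark_bad(2000)                  # the failed write from the discussion
try:
    dev.read(2000)
except OSError as e:
    print("EIO on bad block:", e.errno == errno.EIO)
print("other blocks still readable:", len(dev.read(1999)) == 512)
```

David's caveat still applies, of course: if the bad block is metadata, everything it indexes becomes unreachable or suspect, so a table like this cannot be sufficient on its own.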
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Monday 28 May 2007 14:53:55 Pallai Roland wrote:
> On Friday 25 May 2007 02:05:47 David Chinner wrote:
> > "-o ro,norecovery" will allow you to mount the filesystem and get any
> > uncorrupted data off it.
> >
> > You still may get shutdowns if you trip across corrupted metadata in
> > the filesystem, though.
>
> This filesystem is completely dead.
> [...]

I tried to make an md patch to stop writes if a raid5 array has got 2+
failed drives, but I found it's already done, oops. :) handle_stripe5()
quietly ignores writes in this case; I tried it and it works. So how did
I lose my file system? My first guess about partially succeeded writes
wasn't right: there were no real writes to the disks after the second
disk had been kicked, so from this point of view the scenario is the
same as a simple power loss. Am I thinking right?

There's another layer I used on this box between md and xfs: loop-aes. I
have used it for years and it has been rock stable, but now it's my
first suspect, because I found a bug in it today: I assembled my array
from n-1 disks, then I failed a second disk for a test, and I found
/dev/loop1 still provides *random* data where /dev/md1 serves nothing.
It's definitely a loop-aes bug:

/dev/loop1: [0700]:180907 (/dev/md1) encryption=AES128 multi-key-v3

hq:~# dd if=/dev/md1 bs=1k count=128 skip=128 >/dev/null
dd: reading `/dev/md1': Input/output error
0+0 records in
0+0 records out

hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum
128+0 records in
128+0 records out
131072 bytes (131 kB) copied, 0.027775 seconds, 4.7 MB/s
e2548a924a0e835bb45fb50058acba98 - (!!!)

hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum
128+0 records in
128+0 records out
131072 bytes (131 kB) copied, 0.030311 seconds, 4.3 MB/s
c6a23412fb75eb5a7eb1d6a7813eb86b - (!!!)

It's not an explanation for my screwed-up file system, but for me it's
enough to drop loop-aes. Eh.
--
d
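[Editor's note: the dd + md5sum check above - reading the same range twice and comparing digests - generalises to a small consistency probe. A sketch follows; the function name is invented, and a healthy-but-degraded device should return either stable data or an I/O error, never two different answers.]

```python
# Sketch of a repeated-read consistency check: read the same range twice
# and compare digests. A device that returns *different* data on
# identical reads (as /dev/loop1 did here, instead of an I/O error)
# cannot be trusted. The function name is invented for illustration.
import hashlib
import io


def stable_read(dev, offset, length):
    """Return (is_stable, first_digest) for two reads of the same range."""
    digests = []
    for _ in range(2):
        dev.seek(offset)
        digests.append(hashlib.md5(dev.read(length)).hexdigest())
    return digests[0] == digests[1], digests[0]


# A well-behaved "device": repeated reads agree.
good = io.BytesIO(bytes(256 * 1024))
ok, _ = stable_read(good, 128 * 1024, 128 * 1024)
print("stable:", ok)
```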
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Friday 25 May 2007 02:05:47 David Chinner wrote:
> "-o ro,norecovery" will allow you to mount the filesystem and get any
> uncorrupted data off it.
>
> You still may get shutdowns if you trip across corrupted metadata in
> the filesystem, though.

This filesystem is completely dead.

hq:~# mount -o ro,norecovery /dev/loop1 /mnt/r5
May 28 13:41:50 hq kernel: Mounting filesystem "loop1" in no-recovery mode. Filesystem will be inconsistent.
May 28 13:41:50 hq kernel: XFS: failed to read root inode

hq:~# xfs_db /dev/loop1
xfs_db: cannot read root inode (22)
xfs_db: cannot read realtime bitmap inode (22)
Segmentation fault

hq:~# strace xfs_db /dev/loop1
_llseek(4, 0, [0], SEEK_SET) = 0
read(4, "XFSB\0\0\20\0\0\0\0\0\6\374\253\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
pread(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 480141901312) = 512
pread(4, "\30G$L\203\33OE [EMAIL PROTECTED]"\324\2074DY\323\6"..., 8192, 131072) = 8192
write(2, "xfs_db: cannot read root inode ("..., 36xfs_db: cannot read root inode (22)
) = 36
pread(4, "\30G$L\203\33OE [EMAIL PROTECTED]"\324\2074DY\323\6"..., 8192, 131072) = 8192
write(2, "xfs_db: cannot read realtime bit"..., 47xfs_db: cannot read realtime bitmap inode (22)
) = 47
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++

Browsing with hexdump -C, it looks like part of a PDF file sits at
128Kb, in place of the root inode. :(

--
d
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Monday 28 May 2007 04:17:18 David Chinner wrote:
> On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote:
> > On Monday 28 May 2007 02:30:11 David Chinner wrote:
> > > On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> > > > ...and I've spammed such messages. This "internal error" isn't a
> > > > good reason to shut down the file system?
> > >
> > > Actually, that error does shut the filesystem down in most cases.
> > > When you see that output, the function is returning -EFSCORRUPTED.
> > > You've got a corrupted freespace btree.
> > >
> > > The reason why you get spammed is that this is happening during
> > > background writeback, and there is no one to return the -EFSCORRUPTED
> > > error to. The background writeback path doesn't specifically detect
> > > shut down filesystems or trigger shutdowns on errors because that
> > > happens in different layers, so you just end up with failed data
> > > writes. These errors will occur on the next foreground data or
> > > metadata allocation and that will shut the filesystem down at that
> > > point.
> > >
> > > I'm not sure that we should be ignoring EFSCORRUPTED errors here;
> > > maybe in this case we should be shutting down the filesystem. That
> > > would certainly cut down on the spamming and would not appear to
> > > change any other behaviour.
> >
> > If I remember correctly, my file system wasn't shut down at all; it
> > was "writeable" for the whole night and yafc slowly "wrote" files to
> > it. Maybe all write operations had failed, but yafc doesn't warn.
>
> So you never created new files or directories, unlinked files or
> directories, did synchronous writes, etc? Just had slowly growing files?

I just overwrote badly downloaded files.

> > Spamming is just annoying when we need to find out what went wrong (my
> > kernel.log is 300Mb), but for data security it's important to react to
> > the EFSCORRUPTED error in any case, I think. Please consider this.
>
> The filesystem has responded correctly to the corruption in terms of
> data security (i.e. failed the data write and warned noisily about
> it), but it probably hasn't done everything it should.
>
> Hmmm. A quick look at the linux code makes me think that background
> writeback on linux has never been able to cause a shutdown in this
> case. However, the same error on Irix will definitely cause a
> shutdown, though

I hope Linux will follow Irix, that's a consistent standpoint.

David, do you have a plan to implement your "reporting raid5 block
layer" idea? No one else seems to care about this silent data loss on
temporarily (cable, power) failed raid5 arrays as far as I can see, so I
really hope you do at least!

--
d
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote:
> On Monday 28 May 2007 02:30:11 David Chinner wrote:
> > On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> > > ...and I've spammed such messages. This "internal error" isn't a
> > > good reason to shut down the file system?
> >
> > Actually, that error does shut the filesystem down in most cases. When
> > you see that output, the function is returning -EFSCORRUPTED. You've
> > got a corrupted freespace btree.
> >
> > The reason why you get spammed is that this is happening during
> > background writeback, and there is no one to return the -EFSCORRUPTED
> > error to. The background writeback path doesn't specifically detect
> > shut down filesystems or trigger shutdowns on errors because that
> > happens in different layers, so you just end up with failed data
> > writes. These errors will occur on the next foreground data or
> > metadata allocation and that will shut the filesystem down at that
> > point.
> >
> > I'm not sure that we should be ignoring EFSCORRUPTED errors here;
> > maybe in this case we should be shutting down the filesystem. That
> > would certainly cut down on the spamming and would not appear to
> > change any other behaviour.
>
> If I remember correctly, my file system wasn't shut down at all; it
> was "writeable" for the whole night and yafc slowly "wrote" files to it.
> Maybe all write operations had failed, but yafc doesn't warn.

So you never created new files or directories, unlinked files or
directories, did synchronous writes, etc? Just had slowly growing files?

> Spamming is just annoying when we need to find out what went wrong (my
> kernel.log is 300Mb), but for data security it's important to react to
> the EFSCORRUPTED error in any case, I think. Please consider this.

The filesystem has responded correctly to the corruption in terms of
data security (i.e. failed the data write and warned noisily about
it), but it probably hasn't done everything it should.

Hmmm. A quick look at the linux code makes me think that background
writeback on linux has never been able to cause a shutdown in this
case. However, the same error on Irix will definitely cause a
shutdown, though

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Monday 28 May 2007 02:30:11 David Chinner wrote:
> On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> > On Friday 25 May 2007 06:55:00 David Chinner wrote:
> > > Oh, did you look at your logs and find that XFS had spammed them
> > > about writes that were failing?
> >
> > The first message after the incident:
> >
> > May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error
> > xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xf8ac14f8
> > May 24 01:53:50 hq kernel: xfs_btree_check_sblock+0x4f/0xc2 [xfs] xfs_alloc_lookup+0x34e/0x47b [xfs]
> > May 24 01:53:50 hq kernel: xfs_alloc_lookup+0x34e/0x47b [xfs] kmem_zone_zalloc+0x1b/0x43 [xfs]
> > May 24 01:53:50 hq kernel: xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] xfs_alloc_vextent+0x3bd/0x53b [xfs]
> > May 24 01:53:50 hq kernel: xfs_bmapi+0x1ac4/0x23cd [xfs] xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
> > May 24 01:53:50 hq kernel: xlog_dealloc_log+0x49/0xea [xfs] xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
> > May 24 01:53:50 hq kernel: xfs_iomap+0x60e/0x82d [xfs] __wake_up_common+0x39/0x59
> > May 24 01:53:50 hq kernel: xfs_map_blocks+0x39/0x6c [xfs] xfs_page_state_convert+0x644/0xf9c [xfs]
> > May 24 01:53:50 hq kernel: schedule+0x5d1/0xf4d xfs_vm_writepage+0x0/0xe0 [xfs]
> > May 24 01:53:50 hq kernel: xfs_vm_writepage+0x57/0xe0 [xfs] mpage_writepages+0x1fb/0x3bb
> > May 24 01:53:50 hq kernel: mpage_writepages+0x133/0x3bb xfs_vm_writepage+0x0/0xe0 [xfs]
> > May 24 01:53:50 hq kernel: do_writepages+0x35/0x3b __writeback_single_inode+0x88/0x387
> > May 24 01:53:50 hq kernel: sync_sb_inodes+0x1b4/0x2a8 writeback_inodes+0x63/0xdc
> > May 24 01:53:50 hq kernel: background_writeout+0x66/0x9f pdflush+0x0/0x1ad
> > May 24 01:53:50 hq kernel: pdflush+0xef/0x1ad background_writeout+0x0/0x9f
> > May 24 01:53:50 hq kernel: kthread+0xc2/0xc6 kthread+0x0/0xc6
> > May 24 01:53:50 hq kernel: kernel_thread_helper+0x5/0xb
> >
> > ...and I've spammed such messages. This "internal error" isn't a good
> > reason to shut down the file system?
>
> Actually, that error does shut the filesystem down in most cases. When
> you see that output, the function is returning -EFSCORRUPTED. You've
> got a corrupted freespace btree.
>
> The reason why you get spammed is that this is happening during
> background writeback, and there is no one to return the -EFSCORRUPTED
> error to. The background writeback path doesn't specifically detect
> shut down filesystems or trigger shutdowns on errors because that
> happens in different layers, so you just end up with failed data
> writes. These errors will occur on the next foreground data or metadata
> allocation and that will shut the filesystem down at that point.
>
> I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe
> in this case we should be shutting down the filesystem. That would
> certainly cut down on the spamming and would not appear to change any
> other behaviour.

If I remember correctly, my file system wasn't shut down at all; it was
"writeable" for the whole night and yafc slowly "wrote" files to it.
Maybe all write operations had failed, but yafc doesn't warn.

Spamming is just annoying when we need to find out what went wrong (my
kernel.log is 300Mb), but for data security it's important to react to
the EFSCORRUPTED error in any case, I think. Please consider this.

> > I think if there's a sign of a corrupted file system, the first thing
> > we should do is to stop writes (or the entire FS) and let the admin
> > examine the situation.
>
> Yes, that's *exactly* what a shutdown does. In this case, your writes
> are being stopped - hence the error messages - but the filesystem has
> not yet been shut down.

All writes that touched the corrupted freespace btree were stopped, but
a few operations were still executed (on the corrupted FS), right?
Ignoring EFSCORRUPTED isn't a good idea in this case.
--
d
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> On Friday 25 May 2007 06:55:00 David Chinner wrote:
> > Oh, did you look at your logs and find that XFS had spammed them
> > about writes that were failing?
>
> The first message after the incident:
>
> May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error
> xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xf8ac14f8
> May 24 01:53:50 hq kernel: xfs_btree_check_sblock+0x4f/0xc2 [xfs] xfs_alloc_lookup+0x34e/0x47b [xfs]
> May 24 01:53:50 hq kernel: xfs_alloc_lookup+0x34e/0x47b [xfs] kmem_zone_zalloc+0x1b/0x43 [xfs]
> May 24 01:53:50 hq kernel: xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] xfs_alloc_vextent+0x3bd/0x53b [xfs]
> May 24 01:53:50 hq kernel: xfs_bmapi+0x1ac4/0x23cd [xfs] xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
> May 24 01:53:50 hq kernel: xlog_dealloc_log+0x49/0xea [xfs] xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
> May 24 01:53:50 hq kernel: xfs_iomap+0x60e/0x82d [xfs] __wake_up_common+0x39/0x59
> May 24 01:53:50 hq kernel: xfs_map_blocks+0x39/0x6c [xfs] xfs_page_state_convert+0x644/0xf9c [xfs]
> May 24 01:53:50 hq kernel: schedule+0x5d1/0xf4d xfs_vm_writepage+0x0/0xe0 [xfs]
> May 24 01:53:50 hq kernel: xfs_vm_writepage+0x57/0xe0 [xfs] mpage_writepages+0x1fb/0x3bb
> May 24 01:53:50 hq kernel: mpage_writepages+0x133/0x3bb xfs_vm_writepage+0x0/0xe0 [xfs]
> May 24 01:53:50 hq kernel: do_writepages+0x35/0x3b __writeback_single_inode+0x88/0x387
> May 24 01:53:50 hq kernel: sync_sb_inodes+0x1b4/0x2a8 writeback_inodes+0x63/0xdc
> May 24 01:53:50 hq kernel: background_writeout+0x66/0x9f pdflush+0x0/0x1ad
> May 24 01:53:50 hq kernel: pdflush+0xef/0x1ad background_writeout+0x0/0x9f
> May 24 01:53:50 hq kernel: kthread+0xc2/0xc6 kthread+0x0/0xc6
> May 24 01:53:50 hq kernel: kernel_thread_helper+0x5/0xb
>
> ...and I've spammed such messages. This "internal error" isn't a good
> reason to shut down the file system?
Actually, that error does shut the filesystem down in most cases. When
you see that output, the function is returning -EFSCORRUPTED. You've got
a corrupted freespace btree.

The reason why you get spammed is that this is happening during
background writeback, and there is no one to return the -EFSCORRUPTED
error to. The background writeback path doesn't specifically detect shut
down filesystems or trigger shutdowns on errors because that happens in
different layers, so you just end up with failed data writes. These
errors will occur on the next foreground data or metadata allocation and
that will shut the filesystem down at that point.

I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe
in this case we should be shutting down the filesystem. That would
certainly cut down on the spamming and would not appear to change any
other behaviour.

> I think if there's a sign of a corrupted file system, the first thing
> we should do is to stop writes (or the entire FS) and let the admin
> examine the situation.

Yes, that's *exactly* what a shutdown does. In this case, your writes
are being stopped - hence the error messages - but the filesystem has
not yet been shut down.

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group
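[Editor's note: the trade-off described above - background writeback has no caller to hand -EFSCORRUPTED to, so it logs and carries on, versus tripping the shutdown flag on the first corruption error - can be modelled in a few lines. All names here are invented; this is a toy model of the policy under discussion, not the actual XFS code path.]

```python
# Toy model: today background writeback just logs each corruption hit
# (hence the 300Mb of log spam); the proposed behaviour shuts the
# filesystem down on the first -EFSCORRUPTED, so the message appears once.
class Filesystem:
    def __init__(self, shutdown_on_corruption):
        self.shutdown_on_corruption = shutdown_on_corruption
        self.is_shut_down = False
        self.log = []

    def background_writeback(self, rounds):
        for _ in range(rounds):
            if self.is_shut_down:
                return                      # shut-down fs: no more writeback, no more spam
            # Writeback trips over the corrupted freespace btree:
            self.log.append("XFS internal error xfs_btree_check_sblock")
            if self.shutdown_on_corruption:
                self.is_shut_down = True    # proposed: one message, then stop


current = Filesystem(shutdown_on_corruption=False)
current.background_writeback(rounds=1000)
proposed = Filesystem(shutdown_on_corruption=True)
proposed.background_writeback(rounds=1000)
print(len(current.log), len(proposed.log))  # the spam collapses to a single message
```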
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Friday 25 May 2007 06:55:00 David Chinner wrote:
> Oh, did you look at your logs and find that XFS had spammed them
> about writes that were failing?

The first message after the incident:

May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xf8ac14f8
May 24 01:53:50 hq kernel: xfs_btree_check_sblock+0x4f/0xc2 [xfs] xfs_alloc_lookup+0x34e/0x47b [xfs]
May 24 01:53:50 hq kernel: xfs_alloc_lookup+0x34e/0x47b [xfs] kmem_zone_zalloc+0x1b/0x43 [xfs]
May 24 01:53:50 hq kernel: xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] xfs_alloc_vextent+0x3bd/0x53b [xfs]
May 24 01:53:50 hq kernel: xfs_bmapi+0x1ac4/0x23cd [xfs] xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
May 24 01:53:50 hq kernel: xlog_dealloc_log+0x49/0xea [xfs] xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
May 24 01:53:50 hq kernel: xfs_iomap+0x60e/0x82d [xfs] __wake_up_common+0x39/0x59
May 24 01:53:50 hq kernel: xfs_map_blocks+0x39/0x6c [xfs] xfs_page_state_convert+0x644/0xf9c [xfs]
May 24 01:53:50 hq kernel: schedule+0x5d1/0xf4d xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel: xfs_vm_writepage+0x57/0xe0 [xfs] mpage_writepages+0x1fb/0x3bb
May 24 01:53:50 hq kernel: mpage_writepages+0x133/0x3bb xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel: do_writepages+0x35/0x3b __writeback_single_inode+0x88/0x387
May 24 01:53:50 hq kernel: sync_sb_inodes+0x1b4/0x2a8 writeback_inodes+0x63/0xdc
May 24 01:53:50 hq kernel: background_writeout+0x66/0x9f pdflush+0x0/0x1ad
May 24 01:53:50 hq kernel: pdflush+0xef/0x1ad background_writeout+0x0/0x9f
May 24 01:53:50 hq kernel: kthread+0xc2/0xc6 kthread+0x0/0xc6
May 24 01:53:50 hq kernel: kernel_thread_helper+0x5/0xb

...and I've spammed such messages. This "internal error" isn't a good
reason to shut down the file system? I think if there's a sign of a
corrupted file system, the first thing we should do is to stop writes
(or the entire FS) and let the admin examine the situation.
I'm not talking about my case, where the md raid5 was braindead; I'm
talking about general situations.

--
d
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Friday 25 May 2007 03:35:48 Pallai Roland wrote: > On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote: > > On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote: > > > On Thu, 24 May 2007, Pallai Roland wrote: > > > >It's a good question too, but I think the md layer could > > > >save dumb filesystems like XFS if denies writes after 2 disks are > > > > failed, and > > > >I cannot see a good reason why it's not behave this way. > > > > How is *any* filesystem supposed to know that the underlying block > > device has gone bad if it is not returning errors? > > It is returning errors, I think so. If I try to write raid5 with 2 > failed disks with dd, I've got errors on the missing chunks. > The difference between ext3 and XFS is that ext3 will remount to > read-only on the first write error but the XFS won't, XFS only fails > only the current operation, IMHO. The method of ext3 isn't perfect, but > in practice, it's working well. Sorry, I was wrong: md really isn't returning error! It's madness, IMHO. The reason why ext3 safer on raid5 in practice is that ext3 remounts to read-only on read errors too and when a raid5 array got 2 failed drives and there's some read, the error= behavior of ext3 will be activated and stops further writes. You're right, it's not a good solution and there should be read operations to prevent data loss in this case on ext3 too. Raid5 *must deny all writes* when 2 disks failed: I still can't see a good reason why not, and the current method is braindead! > > I did mention this exact scenario in the filesystems workshop back > > in february - we'd *really* like to know if a RAID block device has gone > > into degraded mode (i.e. lost a disk) so we can throttle new writes > > until the rebuil dhas been completed. Stopping writes completely on a > > fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6) > > would also be possible if only we could get the information out > > of the block layer. 
Yes, that sounds good, but I think we need a quick fix now; this is a real
problem and can easily lead to mass data loss.

--
 d
- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
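Until md itself denies writes on a double failure, an admin can at least detect the condition from /proc/mdstat, where failed members are flagged with (F). A minimal sketch; the md0 snapshot below is hypothetical, modeled on the 7-drive array from this thread, and on a live box you would read /proc/mdstat itself:

```shell
#!/bin/sh
# Sketch: warn when a raid5 array has lost more than one member, by
# counting the (F) failure flags in a /proc/mdstat-style snapshot.
# The sample is hard-coded here for illustration only.
mdstat_sample='md0 : active raid5 hdg1[0] hde1[1](F) hdc1[2](F) hdb1[3] hda1[4] hdf1[5] hdd1[6]
      1953519872 blocks level 5, 64k chunk, algorithm 2 [7/5] [U__UUUU]'

failed=$(printf '%s\n' "$mdstat_sample" | grep -o '(F)' | wc -l | tr -d ' ')
if [ "$failed" -gt 1 ]; then
    # raid5 survives exactly one lost member; two or more means any
    # further writes are scribbling on an unrecoverable array.
    echo "md0: $failed failed members - array is beyond parity protection"
fi
```

Run from cron or an mdadm --monitor hook, a check like this could have stopped the overnight download in this thread much earlier.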
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
> > > The difference between ext3 and XFS is that ext3 will remount
> > > read-only on the first write error but XFS won't; XFS only fails
> > > the current operation, IMHO. The ext3 method isn't perfect, but in
> > > practice it works well.
> >
> > XFS will shut down the filesystem if metadata corruption would occur
> > due to a failed write. We don't immediately fail the filesystem on
> > data write errors because on large systems you can get *transient*
> > I/O errors (e.g. FC path failover) and so retrying failed data
> > writes is useful for preventing unnecessary shutdowns of the
> > filesystem.
> >
> > Different design criteria, different solutions...
>
> I think his point was that going into a read-only mode causes a
> less catastrophic situation (i.e. a web server can still serve
> pages).

Sure - but once you've detected one corruption or had metadata I/O
errors, can you trust the rest of the filesystem?

> I think that is a valid point; rather than shutting down the file
> system completely, an automatic switch to whatever mode causes the
> least disruption of service is always desired.

I consider the possibility of serving out bad data (i.e. after
a remount to read-only) to be the worst possible disruption of
service that can happen ;)

> Maybe the automatic failure mode could be something that is
> configurable via the mount options.

If only it were that simple. Have you looked to see how many hooks
there are in XFS to shut down without causing further damage?

% grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l
116

Changing the way we handle shutdowns would take a lot of time, effort
and testing. When can I expect a patch? ;)

> I personally have found the XFS file system to be great for
> my needs (except issues with NFS interaction, where the bug report
> never got answered), but that doesn't mean it can not be improved.

Got a pointer?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
> > The difference between ext3 and XFS is that ext3 will remount
> > read-only on the first write error but XFS won't; XFS only fails
> > the current operation, IMHO. The ext3 method isn't perfect, but in
> > practice it works well.
>
> XFS will shut down the filesystem if metadata corruption would occur
> due to a failed write. We don't immediately fail the filesystem on
> data write errors because on large systems you can get *transient*
> I/O errors (e.g. FC path failover) and so retrying failed data
> writes is useful for preventing unnecessary shutdowns of the
> filesystem.
>
> Different design criteria, different solutions...

I think his point was that going into a read-only mode causes a
less catastrophic situation (i.e. a web server can still serve pages).
I think that is a valid point; rather than shutting down the file
system completely, an automatic switch to whatever mode causes the
least disruption of service is always desired.

Maybe the automatic failure mode could be something that is
configurable via the mount options.

I personally have found the XFS file system to be great for my needs
(except issues with NFS interaction, where the bug report never got
answered), but that doesn't mean it can not be improved.

Just my 2 cents,

Alberto

> Cheers,
>
> Dave.

--
Alberto Alonso                    Global Gate Systems LLC.
(512) 351-7233                    http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 03:35:48AM +0200, Pallai Roland wrote:
> On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> > > > It's a good question too, but I think the md layer could save
> > > > dumb filesystems like XFS if it denied writes after 2 disks
> > > > have failed, and I cannot see a good reason why it doesn't
> > > > behave this way.
> >
> > How is *any* filesystem supposed to know that the underlying block
> > device has gone bad if it is not returning errors?
>
> It is returning errors, I believe. If I try to write to a raid5 with
> 2 failed disks with dd, I get errors on the missing chunks.

Oh, did you look at your logs and find that XFS had spammed them about
writes that were failing?

> The difference between ext3 and XFS is that ext3 will remount
> read-only on the first write error but XFS won't; XFS only fails the
> current operation, IMHO. The ext3 method isn't perfect, but in
> practice it works well.

XFS will shut down the filesystem if metadata corruption would occur
due to a failed write. We don't immediately fail the filesystem on
data write errors because on large systems you can get *transient*
I/O errors (e.g. FC path failover) and so retrying failed data
writes is useful for preventing unnecessary shutdowns of the
filesystem.

Different design criteria, different solutions...

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
> > On Thu, 24 May 2007, Pallai Roland wrote:
> > > I'm wondering why md raid5 accepts writes after 2 disks have
> > > failed. I have an array built from 7 drives; the filesystem is
> > > XFS. Yesterday, an IDE cable failed (my friend kicked it out of
> > > the box on the floor :) and 2 disks were kicked out, but my
> > > download (yafc) didn't stop; it kept writing to the file system
> > > the whole night!
> > > Now I've changed the cable and tried to reassemble the array
> > > (mdadm -f --run); the event counter increased from 4908158 up to
> > > 4929612 on the failed disks, but I cannot mount the file system
> > > and 'xfs_repair -n' shows a lot of errors there. This is
> > > explainable by the partially succeeded writes. Ext3 and JFS have
> > > an "errors=" mount option to switch the filesystem read-only on
> > > any error, but XFS hasn't: why?
>
> "-o ro,norecovery" will allow you to mount the filesystem and get any
> uncorrupted data off it.
>
> You still may get shutdowns if you trip across corrupted metadata in
> the filesystem, though.

Thanks, I'll try it.

> > > It's a good question too, but I think the md layer could save
> > > dumb filesystems like XFS if it denied writes after 2 disks have
> > > failed, and I cannot see a good reason why it doesn't behave this
> > > way.
>
> How is *any* filesystem supposed to know that the underlying block
> device has gone bad if it is not returning errors?

It is returning errors, I believe. If I try to write to a raid5 with 2
failed disks with dd, I get errors on the missing chunks.
The difference between ext3 and XFS is that ext3 will remount read-only
on the first write error but XFS won't; XFS only fails the current
operation, IMHO. The ext3 method isn't perfect, but in practice it
works well.
> I did mention this exact scenario in the filesystems workshop back
> in February - we'd *really* like to know if a RAID block device has
> gone into degraded mode (i.e. lost a disk) so we can throttle new
> writes until the rebuild has been completed. Stopping writes
> completely on a fatal error (like 2 lost disks in RAID5, and 3 lost
> disks in RAID6) would also be possible if only we could get the
> information out of the block layer.

It would be nice, but as I mentioned above, ext3 does it well enough in
practice now.

> > > Do you have a better idea how I can avoid such filesystem
> > > corruption in the future? No, I don't want to use ext3 on this
> > > box. :)
>
> Well, the problem is a bug in MD - it should have detected
> drives going away and stopped access to the device until it was
> repaired. You would have had the same problem with ext3, or JFS,
> or reiser or any other filesystem, too.
>
> > > my mount error:
> > > XFS: Log inconsistent (didn't find previous header)
> > > XFS: failed to find log head
> > > XFS: log mount/recovery failed: error 5
> > > XFS: log mount failed
>
> Your MD device is still hosed - error 5 = EIO; the md device is
> reporting errors back to the filesystem now. You need to fix that
> before trying to recover any data...

I'll play with it tomorrow. Thanks for your help.

--
 d
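For reference, the ext3 behavior being relied on above is selected with the errors= mount option (the thread writes "error="; errors= is the actual name). A hypothetical /etc/fstab entry, with the device and mount point as examples only:

```
# /etc/fstab - errors= picks what ext3 does on a metadata error:
# continue, remount-ro, or panic. remount-ro is what stops further
# writes in the two-failed-disks scenario discussed here.
/dev/md0   /data   ext3   defaults,errors=remount-ro   0   2
```

The same policy can also be stored in the superblock with `tune2fs -e remount-ro /dev/md0`, so it applies regardless of mount options.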
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
> Including XFS mailing list on this one.

Thanks Justin.

> On Thu, 24 May 2007, Pallai Roland wrote:
> >
> > Hi,
> >
> > I'm wondering why md raid5 accepts writes after 2 disks have
> > failed. I have an array built from 7 drives; the filesystem is XFS.
> > Yesterday, an IDE cable failed (my friend kicked it out of the box
> > on the floor :) and 2 disks were kicked out, but my download (yafc)
> > didn't stop; it kept writing to the file system the whole night!
> > Now I've changed the cable and tried to reassemble the array
> > (mdadm -f --run); the event counter increased from 4908158 up to
> > 4929612 on the failed disks, but I cannot mount the file system and
> > 'xfs_repair -n' shows a lot of errors there. This is explainable by
> > the partially succeeded writes. Ext3 and JFS have an "errors="
> > mount option to switch the filesystem read-only on any error, but
> > XFS hasn't: why?

"-o ro,norecovery" will allow you to mount the filesystem and get any
uncorrupted data off it.

You still may get shutdowns if you trip across corrupted metadata in
the filesystem, though.

> > It's a good question too, but I think the md layer could save dumb
> > filesystems like XFS if it denied writes after 2 disks have failed,
> > and I cannot see a good reason why it doesn't behave this way.

How is *any* filesystem supposed to know that the underlying block
device has gone bad if it is not returning errors?

I did mention this exact scenario in the filesystems workshop back
in February - we'd *really* like to know if a RAID block device has
gone into degraded mode (i.e. lost a disk) so we can throttle new
writes until the rebuild has been completed. Stopping writes completely
on a fatal error (like 2 lost disks in RAID5, and 3 lost disks in
RAID6) would also be possible if only we could get the information out
of the block layer.

> > Do you have a better idea how I can avoid such filesystem
> > corruption in the future?
> > No, I don't want to use ext3 on this box. :)

Well, the problem is a bug in MD - it should have detected
drives going away and stopped access to the device until it was
repaired. You would have had the same problem with ext3, or JFS,
or reiser or any other filesystem, too.

> > my mount error:
> > XFS: Log inconsistent (didn't find previous header)
> > XFS: failed to find log head
> > XFS: log mount/recovery failed: error 5
> > XFS: log mount failed

Your MD device is still hosed - error 5 = EIO; the md device is
reporting errors back to the filesystem now. You need to fix that
before trying to recover any data...

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group
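Before forcing an array like this back together, it helps to see how far the kicked members fell behind. A sketch, assuming each member's `mdadm --examine` output carries an `Events :` line; the sample below is hard-coded with the counters from this thread rather than read from real devices:

```shell
#!/bin/sh
# Sketch: compare per-member event counters to spot members that fell
# behind the rest of the array. Device names and output layout are
# illustrative; collect the real lines with `mdadm --examine` per member.
examine_sample='/dev/hda1:  Events : 4929612
/dev/hde1:  Events : 4908158
/dev/hdc1:  Events : 4908158'

printf '%s\n' "$examine_sample" | awk '
    { dev[NR] = $1; ev[NR] = $NF + 0; if (ev[NR] > max) max = ev[NR] }
    END { for (i = 1; i <= NR; i++)
              if (ev[i] < max)
                  print dev[i], "behind by", max - ev[i], "events" }'
```

Members that are tens of thousands of events behind (21454 here) carry stale data, which is why an assemble with --force followed by a read-write mount can scramble the filesystem further; mounting "-o ro,norecovery" first, as suggested above, is the safer order.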
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
Including XFS mailing list on this one.

On Thu, 24 May 2007, Pallai Roland wrote:

> Hi,
>
> I'm wondering why md raid5 accepts writes after 2 disks have failed.
> I have an array built from 7 drives; the filesystem is XFS.
> Yesterday, an IDE cable failed (my friend kicked it out of the box on
> the floor :) and 2 disks were kicked out, but my download (yafc)
> didn't stop; it kept writing to the file system the whole night!
> Now I've changed the cable and tried to reassemble the array
> (mdadm -f --run); the event counter increased from 4908158 up to
> 4929612 on the failed disks, but I cannot mount the file system and
> 'xfs_repair -n' shows a lot of errors there. This is explainable by
> the partially succeeded writes. Ext3 and JFS have an "errors=" mount
> option to switch the filesystem read-only on any error, but XFS
> hasn't: why? It's a good question too, but I think the md layer could
> save dumb filesystems like XFS if it denied writes after 2 disks have
> failed, and I cannot see a good reason why it doesn't behave this
> way.
> Do you have a better idea how I can avoid such filesystem corruption
> in the future? No, I don't want to use ext3 on this box. :)
>
> my mount error:
> XFS: Log inconsistent (didn't find previous header)
> XFS: failed to find log head
> XFS: log mount/recovery failed: error 5
> XFS: log mount failed
>
> --
>  d
raid5: I lost a XFS file system due to a minor IDE cable problem
Hi,

I'm wondering why md raid5 accepts writes after 2 disks have failed. I
have an array built from 7 drives; the filesystem is XFS. Yesterday, an
IDE cable failed (my friend kicked it out of the box on the floor :)
and 2 disks were kicked out, but my download (yafc) didn't stop; it
kept writing to the file system the whole night!
Now I've changed the cable and tried to reassemble the array
(mdadm -f --run); the event counter increased from 4908158 up to
4929612 on the failed disks, but I cannot mount the file system and
'xfs_repair -n' shows a lot of errors there. This is explainable by the
partially succeeded writes. Ext3 and JFS have an "errors=" mount option
to switch the filesystem read-only on any error, but XFS hasn't: why?
It's a good question too, but I think the md layer could save dumb
filesystems like XFS if it denied writes after 2 disks have failed, and
I cannot see a good reason why it doesn't behave this way.
Do you have a better idea how I can avoid such filesystem corruption in
the future? No, I don't want to use ext3 on this box. :)

my mount error:
XFS: Log inconsistent (didn't find previous header)
XFS: failed to find log head
XFS: log mount/recovery failed: error 5
XFS: log mount failed

--
 d