Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> On Friday 25 May 2007 06:55:00 David Chinner wrote:
> > Oh, did you look at your logs and find that XFS had spammed them about
> > writes that were failing?
>
> The first message after the incident:
>
> May 24 01:53:50 hq kernel: Filesystem loop1: XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xf8ac14f8
> May 24 01:53:50 hq kernel: f8adae69 xfs_btree_check_sblock+0x4f/0xc2 [xfs] f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]
> May 24 01:53:50 hq kernel: f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs] f8b1a9c7 kmem_zone_zalloc+0x1b/0x43 [xfs]
> May 24 01:53:50 hq kernel: f8abe645 xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] f8ac0647 xfs_alloc_vextent+0x3bd/0x53b [xfs]
> May 24 01:53:50 hq kernel: f8ad2f7e xfs_bmapi+0x1ac4/0x23cd [xfs] f8acab97 xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
> May 24 01:53:50 hq kernel: f8b1 xlog_dealloc_log+0x49/0xea [xfs] f8afdaee xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
> May 24 01:53:50 hq kernel: f8afc3ae xfs_iomap+0x60e/0x82d [xfs] c0113bc8 __wake_up_common+0x39/0x59
> May 24 01:53:50 hq kernel: f8b1ae11 xfs_map_blocks+0x39/0x6c [xfs] f8b1bd7b xfs_page_state_convert+0x644/0xf9c [xfs]
> May 24 01:53:50 hq kernel: c036f384 schedule+0x5d1/0xf4d f8b1c780 xfs_vm_writepage+0x0/0xe0 [xfs]
> May 24 01:53:50 hq kernel: f8b1c7d7 xfs_vm_writepage+0x57/0xe0 [xfs] c01830e8 mpage_writepages+0x1fb/0x3bb
> May 24 01:53:50 hq kernel: c0183020 mpage_writepages+0x133/0x3bb f8b1c780 xfs_vm_writepage+0x0/0xe0 [xfs]
> May 24 01:53:50 hq kernel: c0147bb3 do_writepages+0x35/0x3b c018135c __writeback_single_inode+0x88/0x387
> May 24 01:53:50 hq kernel: c01819b7 sync_sb_inodes+0x1b4/0x2a8 c0181c63 writeback_inodes+0x63/0xdc
> May 24 01:53:50 hq kernel: c0147943 background_writeout+0x66/0x9f c01482b3 pdflush+0x0/0x1ad
> May 24 01:53:50 hq kernel: c01483a2 pdflush+0xef/0x1ad c01478dd background_writeout+0x0/0x9f
> May 24 01:53:50 hq kernel: c012d10b kthread+0xc2/0xc6 c012d049 kthread+0x0/0xc6
> May 24 01:53:50 hq kernel: c0100dd5 kernel_thread_helper+0x5/0xb
>
> ...and I've been spammed with such messages. This internal error isn't a
> good reason to shut down the file system?

Actually, that error does shut the filesystem down in most cases. When you see that output, the function is returning -EFSCORRUPTED. You've got a corrupted freespace btree.

The reason why you get spammed is that this is happening during background writeback, and there is no one to return the -EFSCORRUPTED error to. The background writeback path doesn't specifically detect shut-down filesystems or trigger shutdowns on errors (that happens in different layers), so you just end up with failed data writes. These errors will occur on the next foreground data or metadata allocation, and that will shut the filesystem down at that point.

I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in this case we should be shutting down the filesystem. That would certainly cut down on the spamming, and it would not appear to change any other behaviour.

> I think if there's a sign of a corrupted file system, the first thing we
> should do is to stop writes (or the entire FS) and let the admin examine
> the situation.

Yes, that's *exactly* what a shutdown does. In this case, your writes are being stopped - hence the error messages - but the filesystem has not yet been shut down.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
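Concretely, the change being mooted above is small: the writeback path already sees the error, it just never escalates it. A minimal sketch of what it could look like, assuming 2.6-era XFS names (xfs_force_shutdown() and SHUTDOWN_CORRUPT_INCORE are real; the call site, argument list and locals are illustrative only, not a tested patch):

	/*
	 * Illustrative only: somewhere in the writepage path, e.g. around
	 * the xfs_map_blocks() call visible in the trace above.  'mp' (the
	 * struct xfs_mount) is assumed to be reachable from the inode.
	 */
	error = xfs_map_blocks(inode, offset, len, &iomap, flags);
	if (error) {
		if (error == EFSCORRUPTED && !XFS_FORCED_SHUTDOWN(mp))
			/* corruption found during background writeback:
			 * shut the filesystem down instead of just
			 * spamming the log with failed writes */
			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
		goto error;
	}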
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Thanks everyone for your input. There were some very valuable observations in the various emails. I will try to pull most of it together and bring out what seem to be the important points.

1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUPP.

   This is certainly a very attractive position - it makes the interface cleaner and makes life easier for filesystems and other clients of the block interface. Currently filesystems handle -EOPNOTSUPP by:
     a/ resubmitting the request without the BARRIER (after waiting for earlier requests to complete), and
     b/ possibly printing an error message to the kernel logs.

   The block layer can do both of these just as easily, and it does make sense to do it there.

   md/dm modules could keep count of requests, as has been suggested (though that would be a fairly big change for raid0, as it currently doesn't know when a request completes - bi_end_io goes directly to the filesystem). However, I think the idea of a zero-length BIO_RW_BARRIER would be a good option: raid0 could send one of these down each device, and when they all return, the barrier request can be sent to its target device(s).

   I think this is a worthy goal that we should work towards.

2/ Maybe barriers provide stronger semantics than are required.

   All write requests are synchronised around a barrier write. This is often more than is required and apparently can cause a measurable slowdown.

   Also, the FUA for the actual commit write might not be needed. It is important for consistency that the preceding writes are in safe storage before the commit write, but it is not so important that the commit write is immediately safe on storage. That isn't needed until a 'sync' or 'fsync' or similar.

   One possible alternative is:
     - writes can overtake barriers, but barriers cannot overtake writes;
     - flush before the barrier, not after.

   This is considerably weaker, and hence cheaper. But I think it is enough for all filesystems (providing it is still an option to call blkdev_issue_flush on 'fsync').

   Another alternative would be to tag each bio as being in a particular barrier-group. Then bios in different groups could overtake each other in either direction, but a BARRIER request must be totally ordered w.r.t. other requests in the same barrier group. This would require an extra bio field and would give the filesystem more appearance of control. I'm not yet sure how much it would really help... It would allow us to set FUA on all bios with a non-zero barrier-group. That would mean we don't have to flush the entire cache, just those blocks that are critical, but I'm still not sure it's a good idea.

   Of course, these weaker rules would only apply inside the elevator. Once the request goes to the device, we need to work with what the device provides, which probably means total ordering around the barrier.

   I think this requires more discussion before a way forward is clear.

3/ Do we need explicit control of the 'ordered' mode?

   Consider a SCSI device that has NV RAM cache. mode_sense reports that write-back is enabled, so _FUA or _FLUSH will be used. But as it is *NV* RAM, QUEUE_ORDERED_DRAIN is really the best mode. It seems there is no way to query this information. Using _FLUSH causes the NVRAM to be flushed to media, which is a terrible performance problem. Setting SYNC_NV doesn't work on the particular device in question. We currently tell customers to mount with -o nobarriers, but that really feels like the wrong solution. We should be telling the SCSI device not to flush.
   An advantage of 'nobarriers' is that it can go in /etc/fstab. Where would you record that a SCSI drive should be set to QUEUE_ORDERED_DRAIN?

I think the implementation priorities here are:

1/ implement a zero-length BIO_RW_BARRIER option.
2/ Use it (or otherwise) to make all dm and md modules handle barriers (and loop?).
3/ Devise and implement appropriate fall-backs within the block layer so that -EOPNOTSUPP is never returned.
4/ Remove unneeded cruft from filesystems (and elsewhere).

Comments?

Thanks,
NeilBrown
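To make item 1/ slightly more concrete, here is a rough sketch of what issuing a zero-length barrier might look like from a caller's side on a 2.6 kernel. The helper name is invented, and whether the block layer actually accepts a data-less WRITE_BARRIER is precisely what item 1/ would have to implement; the bio fields and the completion idiom are the existing ones:

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/completion.h>

	static int empty_barrier_endio(struct bio *bio, unsigned int done, int err)
	{
		if (bio->bi_size)
			return 1;		/* not finished yet */
		complete((struct completion *)bio->bi_private);
		return 0;
	}

	/* hypothetical helper: an ordering point with no data payload */
	static int issue_empty_barrier(struct block_device *bdev)
	{
		DECLARE_COMPLETION_ONSTACK(done);
		struct bio *bio = bio_alloc(GFP_NOIO, 0);
		int ret = 0;

		bio->bi_bdev = bdev;
		bio->bi_sector = 0;		/* no data to address */
		bio->bi_size = 0;
		bio->bi_end_io = empty_barrier_endio;
		bio->bi_private = &done;

		submit_bio(WRITE_BARRIER, bio);
		wait_for_completion(&done);

		if (!bio_flagged(bio, BIO_UPTODATE))
			ret = -EOPNOTSUPP;	/* or -EIO, depending on how the
						   lower layers report failure */
		bio_put(bio);
		return ret;
	}

With something like this, raid0 could send one down each member device and, once they have all completed, forward the original barrier request to its target device(s), as described under point 1/ above.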
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Friday May 25, [EMAIL PROTECTED] wrote:
> 2007/5/25, Neil Brown [EMAIL PROTECTED]:
> > - Are there other bits that we could handle better? BIO_RW_FAILFAST?
> >   BIO_RW_SYNC? What exactly do they mean?
>
> BIO_RW_FAILFAST: means the low-level driver shouldn't do much (or any)
> error recovery. Mainly used by multipath targets to avoid long SCSI
> recovery. This should just be propagated when passing requests on.

Is it "much" or "no"? Would it be reasonable to use this for reads from a non-degraded raid1? What about writes?

What I would really like is some clarification on what sort of errors get retried, how often, and how much timeout there is... And does the 'error' code returned in ->bi_end_io allow us to differentiate media errors from other errors yet?

> BIO_RW_SYNC: means this is a bio of a synchronous request. I don't know
> whether there are more uses to it, but this at least causes queues to be
> flushed immediately instead of waiting for more requests for a short
> time. Should also just be passed on. Otherwise performance gets poor,
> since something above will rather wait for the current request/bio to
> complete instead of sending more.

Yes, this one is pretty straightforward... I mentioned it more as a reminder to myself that I really should support it in raid5 :-(

NeilBrown
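For the "should just be propagated" part, the mechanics in a stacking driver are trivial - the open questions above are only about policy. A sketch, with an invented helper name but the real 2.6-era bit names from include/linux/bio.h:

	#include <linux/bio.h>

	/* copy request hints from the upper bio onto the bio sent to a
	 * member device by a stacking driver (md/dm/loop) */
	static void propagate_bio_hints(struct bio *dst, const struct bio *src)
	{
		/* FAILFAST: don't spend ages in low-level (e.g. SCSI) error
		 * recovery; the layer above has its own redundancy/retry policy */
		if (src->bi_rw & (1 << BIO_RW_FAILFAST))
			dst->bi_rw |= (1 << BIO_RW_FAILFAST);

		/* SYNC: someone is waiting on this request, so unplug the
		 * lower queue promptly instead of batching */
		if (src->bi_rw & (1 << BIO_RW_SYNC))
			dst->bi_rw |= (1 << BIO_RW_SYNC);
	}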
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote:
> On Monday 28 May 2007 02:30:11 David Chinner wrote:
> > On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> > > ...and I've been spammed with such messages. This internal error
> > > isn't a good reason to shut down the file system?
> >
> > Actually, that error does shut the filesystem down in most cases. When
> > you see that output, the function is returning -EFSCORRUPTED. You've
> > got a corrupted freespace btree.
> >
> > The reason why you get spammed is that this is happening during
> > background writeback, and there is no one to return the -EFSCORRUPTED
> > error to. The background writeback path doesn't specifically detect
> > shut-down filesystems or trigger shutdowns on errors (that happens in
> > different layers), so you just end up with failed data writes. These
> > errors will occur on the next foreground data or metadata allocation,
> > and that will shut the filesystem down at that point.
> >
> > I'm not sure that we should be ignoring EFSCORRUPTED errors here;
> > maybe in this case we should be shutting down the filesystem. That
> > would certainly cut down on the spamming, and it would not appear to
> > change any other behaviour.
>
> If I remember correctly, my file system wasn't shut down at all; it was
> writeable for the whole night, and yafc slowly wrote files to it. Maybe
> all write operations had failed, but yafc doesn't warn.

So you never created new files or directories, unlinked files or directories, did synchronous writes, etc? Just had slowly growing files?

> Spamming is just annoying when we need to find out what went wrong (my
> kernel.log is 300MB), but for data security it's important to react to
> the EFSCORRUPTED error in any case, I think. Please consider this.

The filesystem has responded correctly to the corruption in terms of data security (i.e. failed the data write and warned noisily about it), but it probably hasn't done everything it should. Hmmm - a quick look at the Linux code makes me think that background writeback on Linux has never been able to cause a shutdown in this case. However, the same error on Irix will definitely cause a shutdown.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
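On the "yafc doesn't warn" point: once the filesystem has failed the write, the last line of defence is the application checking its own I/O results. The following is a userspace aside rather than anything XFS- or yafc-specific, but it shows where such a failure would normally have surfaced:

	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* write a buffer and force it to stable storage, reporting failures */
	int store_block(int fd, const void *buf, size_t len)
	{
		ssize_t n = write(fd, buf, len);
		if (n < 0 || (size_t)n != len) {
			fprintf(stderr, "write failed: %s\n",
				n < 0 ? strerror(errno) : "short write");
			return -1;
		}
		/* errors from delayed writeback (typically EIO once the
		 * filesystem has failed or shut down) tend to be reported
		 * here or at close(), not at write() time */
		if (fsync(fd) < 0) {
			fprintf(stderr, "fsync failed: %s\n", strerror(errno));
			return -1;
		}
		return 0;
	}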
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote:
> Thanks everyone for your input. There were some very valuable
> observations in the various emails. I will try to pull most of it
> together and bring out what seem to be the important points.
>
> 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUPP.

Sounds good to me, but how do we test to see if the underlying device supports barriers? Do we just assume that they do, and only change behaviour if -o nobarrier is specified in the mount options?

> 2/ Maybe barriers provide stronger semantics than are required.
>
>    All write requests are synchronised around a barrier write. This is
>    often more than is required and apparently can cause a measurable
>    slowdown.
>
>    Also, the FUA for the actual commit write might not be needed. It is
>    important for consistency that the preceding writes are in safe
>    storage before the commit write, but it is not so important that the
>    commit write is immediately safe on storage. That isn't needed until
>    a 'sync' or 'fsync' or similar.

The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk, and the current barrier behaviour gives us that.

>    One possible alternative is:
>      - writes can overtake barriers, but barriers cannot overtake writes;

No, that breaks the above usage of a barrier.

>      - flush before the barrier, not after.
>
>    This is considerably weaker, and hence cheaper. But I think it is
>    enough for all filesystems (providing it is still an option to call
>    blkdev_issue_flush on 'fsync').

No, not enough for XFS.

>    Another alternative would be to tag each bio as being in a particular
>    barrier-group. Then bios in different groups could overtake each
>    other in either direction, but a BARRIER request must be totally
>    ordered w.r.t. other requests in the same barrier group. This would
>    require an extra bio field and would give the filesystem more
>    appearance of control. I'm not yet sure how much it would really
>    help...

And that assumes the filesystem is tracking exact dependencies between I/Os. Such a mechanism would probably require filesystems to be redesigned to use this, but I can see how it would be useful for doing things like ensuring ordering between just an inode and its data writes. What would the overhead be of having to support several hundred thousand different barrier groups (i.e. one per dirty inode in a system)?

> I think the implementation priorities here are:

Depending on the answer to my first question:

0/ implement a specific test for filesystems to run at mount time to determine if barriers are supported or not.

> 1/ implement a zero-length BIO_RW_BARRIER option.
> 2/ Use it (or otherwise) to make all dm and md modules handle barriers
>    (and loop?).
> 3/ Devise and implement appropriate fall-backs within the block layer
>    so that -EOPNOTSUPP is never returned.
> 4/ Remove unneeded cruft from filesystems (and elsewhere).

Sounds like a good start. ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
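On item 0/: one way to probe for barrier support at mount time is simply to rewrite a block that is already on disk (the superblock, say) as a barrier write and see whether the block layer rejects it. Roughly, and with an invented function name - this shows the shape of such a test rather than real XFS code:

	#include <linux/buffer_head.h>
	#include <linux/fs.h>

	/*
	 * 'bh' must hold data that is already on disk (e.g. read with
	 * sb_bread()), so rewriting it in place is harmless.
	 */
	static int probe_barrier_support(struct buffer_head *bh)
	{
		lock_buffer(bh);
		get_bh(bh);			/* keep our ref across end_io */
		bh->b_end_io = end_buffer_write_sync;
		submit_bh(WRITE_BARRIER, bh);
		wait_on_buffer(bh);

		if (!buffer_uptodate(bh))
			return -EOPNOTSUPP;	/* treat any failure as "no barriers" */
		return 0;
	}

A filesystem could run something like this once at mount time and fall back to its -o nobarrier behaviour (plus blkdev_issue_flush on fsync) if it fails, rather than assuming barrier support and reacting to -EOPNOTSUPP later.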