[RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
This mail is about an issue that has been of concern to me for quite a while and I think it is (well past) time to air it more widely and try to come to a resolution.

The issue is how write barriers (the block-device kind, not the memory-barrier kind) should be handled by the various layers.

The following is my understanding, which could well be wrong in various specifics. Corrections and other comments are more than welcome.

What are barriers?
==
Barriers (as generated by requests with BIO_RW_BARRIER) are intended to ensure that the data in the barrier request is not visible until all writes submitted earlier are safe on the media, and that the data is safe on the media before any subsequently submitted requests are visible on the device.

This is achieved by tagging requests in the elevator (or any other request queue) so that no re-ordering is performed around a BIO_RW_BARRIER request, and by sending appropriate commands to the device so that any write-behind caching is defeated by the barrier request.

Alongside BIO_RW_BARRIER there is blkdev_issue_flush, which calls q->issue_flush_fn. This can be used to achieve similar effects.

There is no guarantee that a device can support BIO_RW_BARRIER - it is always possible that a request will fail with EOPNOTSUPP. Conversely, blkdev_issue_flush must be supported on any device that uses write-behind caching (if it cannot be supported, then write-behind caching should be turned off, at least by default).

We can think of there being three types of devices:

1/ SAFE. With a SAFE device, there is no write-behind cache, or if there is it is non-volatile. Once a write completes it is completely safe. Such a device does not require barriers or ->issue_flush_fn, and can respond to them either with a no-op or with -EOPNOTSUPP (the former is preferred).

2/ FLUSHABLE. A FLUSHABLE device may have a volatile write-behind cache. This cache can be flushed with a call to blkdev_issue_flush. It may not support barrier requests.

3/ BARRIER. A BARRIER device supports both blkdev_issue_flush and BIO_RW_BARRIER. Either may be used to synchronise any write-behind cache to non-volatile storage (media).

Handling of SAFE and FLUSHABLE devices is essentially the same, and also works on a BARRIER device. A BARRIER device has the option of more efficient handling.

How does a filesystem use this?
===
A filesystem will often have a concept of a 'commit' block which makes an assertion about the correctness of other blocks in the filesystem. In the most gross sense, this could be the writing of the superblock of an ext2 filesystem, with the dirty bit clear. This write commits all other writes to the filesystem that precede it. More subtle/useful is the commit block in a journal as with ext3 and others. This write commits some number of preceding writes in the journal or elsewhere.

The filesystem will want to ensure that all preceding writes are safe before writing the barrier block. There are two ways to achieve this.

1/ Issue all 'preceding writes', wait for them to complete (bi_end_io called), call blkdev_issue_flush, issue the commit write, wait for it to complete, call blkdev_issue_flush a second time. (This is needed for FLUSHABLE.)

2/ Set the BIO_RW_BARRIER bit in the write request for the commit block. (This is more efficient on BARRIER.)

The second, while much easier, can fail. So a filesystem should be prepared to deal with that failure by falling back to the first option. Thus the general sequence might be:

 a/ issue all preceding writes.
 b/ issue the commit write with BIO_RW_BARRIER
 c/ wait for the commit to complete.
    If it was successful - done.
    If it failed other than with EOPNOTSUPP, abort.
    Else continue.
 d/ wait for all 'preceding writes' to complete
 e/ call blkdev_issue_flush
 f/ issue the commit write without BIO_RW_BARRIER
 g/ wait for the commit write to complete
    If it failed, abort.
 h/ call blkdev_issue_flush

DONE

Steps b and c can be left out if it is known that the device does not support barriers. The only way to discover this is to try it and see if it fails. (A rough sketch of this whole sequence as code appears at the end of this mail.)

I don't think any filesystem follows all these steps.

ext3 has the right structure, but it doesn't include steps e and h.

reiserfs is similar. It does have a call to blkdev_issue_flush, but that is only on the fsync path, so it isn't really protecting general journal commits.

XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f' depending on whether it thinks the device handles barriers, and finally 'g'.

I haven't looked at other filesystems.

So for devices that support BIO_RW_BARRIER, and for devices that don't need any flush, they work OK, but for devices that need flushing and don't support BIO_RW_BARRIER, none of them work. This should be easy to fix.
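To make the a-h sequence above concrete, here is a rough, hypothetical sketch against a 2.6.2x-era block API. issue_preceding_writes(), wait_on_preceding_writes() and write_commit_block() are made-up filesystem helpers; blkdev_issue_flush() and WRITE_BARRIER (WRITE plus the BIO_RW_BARRIER bit) are the real interfaces of that era, though exact signatures vary between kernel versions:

/*
 * Sketch only.  The three *_preceding_* / write_commit_block() helpers
 * are hypothetical stand-ins for filesystem-specific code.
 */
static int commit_with_fallback(struct block_device *bdev)
{
	int err;

	issue_preceding_writes();                  /* step a */

	err = write_commit_block(WRITE_BARRIER);   /* steps b + c */
	if (err != -EOPNOTSUPP)
		return err;                        /* done, or hard failure */

	/* Fallback path for FLUSHABLE devices: */
	wait_on_preceding_writes();                /* step d */
	err = blkdev_issue_flush(bdev, NULL);      /* step e */
	if (err)
		return err;
	err = write_commit_block(WRITE);           /* steps f + g */
	if (err)
		return err;
	return blkdev_issue_flush(bdev, NULL);     /* step h */
}

Note that a SAFE device which returns -EOPNOTSUPP for barriers simply falls through to the fallback path, where the flushes are no-ops.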
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
> The difference between ext3 and XFS is that ext3 will remount to read-only on the first write error but XFS won't; XFS fails only the current operation, IMHO. The method of ext3 isn't perfect, but in practice, it's working well.

XFS will shut down the filesystem if metadata corruption would occur due to a failed write. We don't immediately fail the filesystem on data write errors because on large systems you can get *transient* I/O errors (e.g. FC path failover), and so retrying failed data writes is useful for preventing unnecessary shutdowns of the filesystem.

Different design criteria, different solutions...

> I think his point was that going into a read only mode causes a less catastrophic situation (ie. a web server can still serve pages).

Sure - but once you've detected one corruption or had metadata I/O errors, can you trust the rest of the filesystem?

> I think that is a valid point, rather than shutting down the file system completely, an automatic switch to where the least disruption of service can occur is always desired.

I consider the possibility of serving out bad data (i.e. after a remount to read-only) to be the worst possible disruption of service that can happen ;)

> Maybe the automatic failure mode could be something that is configurable via the mount options.

If only it were that simple. Have you looked to see how many hooks there are in XFS to shut down without causing further damage?

% grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l
116

Changing the way we handle shutdowns would take a lot of time, effort and testing. When can I expect a patch? ;)

> I personally have found the XFS file system to be great for my needs (except issues with NFS interaction, where the bug report never got answered), but that doesn't mean it can not be improved.

Got a pointer?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote:
> We can think of there being three types of devices:
>
> 1/ SAFE. With a SAFE device, there is no write-behind cache, or if there is it is non-volatile. Once a write completes it is completely safe. Such a device does not require barriers or ->issue_flush_fn, and can respond to them either with a no-op or with -EOPNOTSUPP (the former is preferred).
>
> 2/ FLUSHABLE. A FLUSHABLE device may have a volatile write-behind cache. This cache can be flushed with a call to blkdev_issue_flush. It may not support barrier requests.

So returns -EOPNOTSUPP to any barrier request?

> 3/ BARRIER. A BARRIER device supports both blkdev_issue_flush and BIO_RW_BARRIER. Either may be used to synchronise any write-behind cache to non-volatile storage (media).
>
> Handling of SAFE and FLUSHABLE devices is essentially the same, and also works on a BARRIER device. A BARRIER device has the option of more efficient handling.
>
> How does a filesystem use this?
> ===
> The filesystem will want to ensure that all preceding writes are safe before writing the barrier block. There are two ways to achieve this.

Three, actually.

> 1/ Issue all 'preceding writes', wait for them to complete (bi_end_io called), call blkdev_issue_flush, issue the commit write, wait for it to complete, call blkdev_issue_flush a second time. (This is needed for FLUSHABLE.)

*nod*

> 2/ Set the BIO_RW_BARRIER bit in the write request for the commit block. (This is more efficient on BARRIER.)

*nod*

3/ Use a SAFE device.

> The second, while much easier, can fail.

So we do a test I/O to see if the device supports them before enabling that mode. But, as we've recently discovered, this is not sufficient to detect *correctly functioning* barrier support.

> So a filesystem should be prepared to deal with that failure by falling back to the first option.

I don't buy that argument.

> Thus the general sequence might be:
>
>  a/ issue all preceding writes.
>  b/ issue the commit write with BIO_RW_BARRIER

At this point, the filesystem has done everything it needs to ensure that the block layer has been informed of the I/O ordering requirements. Why should the filesystem now have to detect block layer breakage, and then use a different block layer API to issue the same I/O under the same constraints?

>  c/ wait for the commit to complete.
>     If it was successful - done.
>     If it failed other than with EOPNOTSUPP, abort.
>     Else continue.
>  d/ wait for all 'preceding writes' to complete
>  e/ call blkdev_issue_flush
>  f/ issue the commit write without BIO_RW_BARRIER
>  g/ wait for the commit write to complete
>     If it failed, abort.
>  h/ call blkdev_issue_flush
>
> DONE
>
> Steps b and c can be left out if it is known that the device does not support barriers. The only way to discover this is to try it and see if it fails.

That's a very linear, single-threaded way of looking at it... ;)

> I don't think any filesystem follows all these steps.
>
> ext3 has the right structure, but it doesn't include steps e and h.
>
> reiserfs is similar. It does have a call to blkdev_issue_flush, but that is only on the fsync path, so it isn't really protecting general journal commits.
>
> XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f' depending on whether it thinks the device handles barriers, and finally 'g'.

That's right, except for the 'g' (or 'c') bit - commit writes are async and nothing waits for them - the I/O completion wakes anything waiting on its completion. (Yes, all XFS barrier I/Os are issued async, which is why having to handle an -EOPNOTSUPP error is a real pain. The fix I currently have is to reissue the I/O from the completion handler, which is ugly, ugly, ugly.)

> So for devices that support BIO_RW_BARRIER, and for devices that don't need any flush, they work OK, but for devices that need flushing and don't support BIO_RW_BARRIER, none of them work. This should be easy to fix.

Right - XFS as it stands was designed to work on SAFE devices, and we've modified it to work on BARRIER devices. We don't support FLUSHABLE devices at all.

But if the filesystem supports BARRIER devices, I don't see any reason why a filesystem needs to be modified to support FLUSHABLE devices - the key point being that by the time the filesystem has issued the commit write it has already waited for all its dependent I/O, and so all the block device needs to do is issue flushes either side of the commit write.

> HOW DO MD or DM USE THIS
>
> 1/ striping devices. This includes md/raid0 md/linear dm-linear dm-stripe and probably others. These devices can easily support blkdev_issue_flush by simply calling blkdev_issue_flush on all component devices.
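For illustration only, the reissue-from-completion workaround described above might look something like the sketch below, using the pre-2.6.24 three-argument bi_end_io signature. queue_resubmit_work() and complete_commit_write() are made-up helpers; part of what makes this ugly is that the completion handler can run in interrupt context, so the actual resubmission has to be punted to process context:

static int commit_bio_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
	if (bio->bi_size)
		return 1;	/* partial completion, not done yet */

	if (err == -EOPNOTSUPP) {
		/*
		 * The barrier was rejected somewhere down the stack.
		 * Strip the flag and resubmit - after resetting bi_sector,
		 * bi_idx etc. - from process context.
		 */
		bio->bi_rw &= ~(1 << BIO_RW_BARRIER);
		queue_resubmit_work(bio);	/* hypothetical helper */
		return 0;
	}

	complete_commit_write(bio, err);	/* hypothetical helper */
	return 0;
}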
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, May 25 2007, David Chinner wrote:
> > The second, while much easier, can fail.
>
> So we do a test I/O to see if the device supports them before enabling that mode. But, as we've recently discovered, this is not sufficient to detect *correctly functioning* barrier support.

Right, those are two different things. But paranoia aside, will this ever be a real life problem? I've always been of the opinion to just nicely ignore them. We can't easily detect it and tell the user his hw is crap.

> > So a filesystem should be prepared to deal with that failure by falling back to the first option.
>
> I don't buy that argument.

The problem with Neil's reasoning there is that blkdev_issue_flush() may use the same method as the barrier to ensure data is on platter. A barrier write will include a flush, but it may also use the FUA bit to ensure data is on platter. So the only situation where a fallback from a barrier to a flush would be valid is if the device lied and told you it could do FUA but it could not, and that is the reason why the barrier write failed. If that is the case, the block layer should stop using FUA and fall back to flush-write-flush. And if it does that, then there's never a valid reason to switch from using barrier writes to blkdev_issue_flush(), since both methods would either both work or both fail.

> > Thus the general sequence might be:
> >
> >  a/ issue all preceding writes.
> >  b/ issue the commit write with BIO_RW_BARRIER
>
> At this point, the filesystem has done everything it needs to ensure that the block layer has been informed of the I/O ordering requirements. Why should the filesystem now have to detect block layer breakage, and then use a different block layer API to issue the same I/O under the same constraints?

It's not block layer breakage, it's a device issue.

> > 2/ Mirror devices. This includes md/raid1 and dm-raid1.
> > ...
> > Hopefully this is unlikely to happen. What device would work correctly with barriers once, and then not the next time? The answer is md/raid1. If you remove a failed device and add a new device that doesn't support barriers, md/raid1 will notice and stop supporting barriers.
>
> In case you hadn't already guessed, I don't like this behaviour at all. It makes async I/O completion of barrier I/O an ugly, messy business, and every place you do sync I/O completion you need to put special error handling.

That's unfortunately very true. It's an artifact of the sometimes problematic device capability discovery.

> If this happens to md/raid1, then why can't it simply do a blkdev_issue_flush, write, blkdev_issue_flush sequence to the device that doesn't support barriers, so that the md device *never changes behaviour*? Next time the filesystem is mounted, it will turn off barriers because they won't be supported.

Because if it doesn't support barriers, blkdev_issue_flush() wouldn't work either. At least that is the case for SATA/IDE; SCSI is somewhat different (and has somewhat other issues).

> > - Should the various filesystems be fixed as suggested above? Is someone willing to do that?
>
> Alternate viewpoint - should the block layer be fixed so that the filesystems only need to use one barrier API that provides static behaviour for the life of the mount?

blkdev_issue_flush() isn't part of the barrier API, and using it as a work-around for a device that has barrier issues is wrong for the reasons listed above.

The DRAIN_FUA -> DRAIN_FLUSH automatic downgrade I mentioned above should be added, in which case blkdev_issue_flush() would never be needed (unless you want to do a data-less barrier, and we should probably add that specific functionality with an empty bio instead of providing an alternate way of doing that).

--
Jens Axboe
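For reference, this is roughly how a low-level driver declares its ordered-mode capability with the blk_queue_ordered() interface of this era. A sketch under assumptions: my_prepare_flush() and the claims_fua flag are made up, and the DRAIN_FUA -> DRAIN_FLUSH downgrade itself would live inside the block layer, not in the driver:

static void my_prepare_flush(struct request_queue *q, struct request *rq)
{
	/* fill in a hardware cache-flush command for rq (driver-specific) */
}

static int my_init_queue(struct request_queue *q, int claims_fua)
{
	/*
	 * DRAIN: the queue is drained around a barrier; the barrier write
	 * itself is made durable either by the FUA bit or by a post-flush
	 * built via my_prepare_flush().
	 */
	return blk_queue_ordered(q,
				 claims_fua ? QUEUE_ORDERED_DRAIN_FUA
					    : QUEUE_ORDERED_DRAIN_FLUSH,
				 my_prepare_flush);
}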
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007/5/25, Neil Brown [EMAIL PROTECTED]:
> HOW DO MD or DM USE THIS
>
> 1/ striping devices. This includes md/raid0 md/linear dm-linear dm-stripe and probably others. These devices can easily support blkdev_issue_flush by simply calling blkdev_issue_flush on all component devices.

This ensures that all of the previous requests have been processed, but does this guarantee they were successful? This might be too paranoid, but if I understood the concept correctly, the success of a barrier request should indicate success of all previous requests between this barrier and the last one.

> These devices would find it very hard to support BIO_RW_BARRIER. Doing this would require keeping track of all in-flight requests (which some, possibly all, of the above don't) and then: when a BIO_RW_BARRIER request arrives, wait for all pending writes to complete, call blkdev_issue_flush on all devices, issue the barrier write to the target device(s) as BIO_RW_BARRIER, and if that is -EOPNOTSUPP, re-issue, wait, flush.

I guess just keeping a count of submitted requests and errors since the last barrier might be enough. As long as all of the underlying devices at least support a flush, the dm device could pretend to support BIO_RW_BARRIER.

> dm-linear and dm-stripe simply pass the BIO_RW_BARRIER flag down, which means data may not be flushed correctly: the commit block might be written to one device before a preceding block is written to another device.

Hm, even worse: one barrier request might accidentally end up on a device that does support barriers while another device in the map doesn't. Would any layer/fs above care to issue a flush call?

> I think the best approach for this class of devices is to return -EOPNOTSUPP. If the filesystem does the wait (which they all do already) and the blkdev_issue_flush (which is easy to add), they don't need to support BIO_RW_BARRIER.

Without any additional code these really should report -EOPNOTSUPP. If disaster strikes there is no way to make assumptions about the real state on disk.

> 2/ Mirror devices. This includes md/raid1 and dm-raid1. These devices can trivially implement blkdev_issue_flush much like the striping devices, and can support BIO_RW_BARRIER to some extent. md/raid1 currently tries. I'm not sure about dm-raid1.

I fear this is more broken than with linear and stripe. There is no code to check the features of the underlying devices, and the incoming request itself isn't sent forward - only privately built ones (which do not have the barrier flag)...

> 3/ Multipath devices

Requests are sent to the same device, but over different paths. So at least with them the chance of one path supporting barriers but not another seems small (as long as the paths do not use completely different transport layers). But passing on a request with the barrier flag also doesn't seem to be a good idea, since previous requests can arrive at the device later.

IMHO the best way to handle barriers for dm would be to add the sequence described to the generic mapping layer of dm (before calling the target's mapping function); a rough sketch follows at the end of this mail. There is already some sort of counting of in-flight requests (suspend/resume needs that), and I guess the downgrade could also be rather simple: if a flush call to the target (mapped device) fails, report -EOPNOTSUPP and stay that way (until next boot).

> So: some questions to help encourage response:
>
> - Is the approach to barriers taken by md appropriate? Should dm do the same? Who will do that?

If my assumption about barrier semantics is true, then md also has to somehow make sure all previous requests have _successfully_ completed. In the mirror case I guess it is valid to report success if the mirror itself is in a clean state: that is, all previous requests (and the barrier) were successful on at least one mirror half, and this state can be recovered.

Question to dm-devel: What do people there think of the possible generic implementation in dm.c?

> - The comment above blkdev_issue_flush says "Caller must run wait_for_completion() on its own". What does that mean?

I guess this means it initiates a flush but doesn't wait for completion, so the caller must wait for the completion of the separate requests on its own, doesn't it?

> - Are there other bits that we could handle better? BIO_RW_FAILFAST? BIO_RW_SYNC? What exactly do they mean?

BIO_RW_FAILFAST: means the low-level driver shouldn't do much (or any) error recovery. Mainly used by multipath targets to avoid long SCSI recovery. This should just be propagated when passing requests on.

BIO_RW_SYNC: means this is a bio of a synchronous request. I don't know whether there are more uses for it, but it at least causes queues to be flushed immediately instead of waiting a short time for more requests. It should also just be passed on; otherwise performance gets poor, since something above will be waiting for the request to complete.
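A rough, entirely hypothetical sketch of the generic dm.c sequence proposed above - all helper names are made up, but the in-flight accounting is the same mechanism dm_suspend() already relies on:

static int dm_handle_barrier_bio(struct mapped_device *md, struct bio *bio)
{
	int err;

	dm_wait_for_inflight(md);	/* drain, as suspend/resume already does */

	err = flush_all_components(md);	/* blkdev_issue_flush() per target device */
	if (err)
		return err;		/* downgrade: report -EOPNOTSUPP upward */

	bio->bi_rw &= ~(1 << BIO_RW_BARRIER);	/* components see a plain write */
	err = map_and_wait(md, bio);	/* normal mapping path, synchronous */
	if (err)
		return err;

	return flush_all_components(md);	/* make the write itself durable */
}

If flush_all_components() fails with -EOPNOTSUPP, the mapped device would report -EOPNOTSUPP for barriers from then on, giving the "downgrade and stay that way" behaviour described above.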
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Friday 25 May 2007 06:55:00 David Chinner wrote:
> Oh, did you look at your logs and find that XFS had spammed them about writes that were failing?

The first message after the incident:

May 24 01:53:50 hq kernel: Filesystem loop1: XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.  Caller 0xf8ac14f8
May 24 01:53:50 hq kernel:  f8adae69 xfs_btree_check_sblock+0x4f/0xc2 [xfs]  f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]
May 24 01:53:50 hq kernel:  f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]  f8b1a9c7 kmem_zone_zalloc+0x1b/0x43 [xfs]
May 24 01:53:50 hq kernel:  f8abe645 xfs_alloc_ag_vextent+0x24d/0x1110 [xfs]  f8ac0647 xfs_alloc_vextent+0x3bd/0x53b [xfs]
May 24 01:53:50 hq kernel:  f8ad2f7e xfs_bmapi+0x1ac4/0x23cd [xfs]  f8acab97 xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
May 24 01:53:50 hq kernel:  f8b1 xlog_dealloc_log+0x49/0xea [xfs]  f8afdaee xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
May 24 01:53:50 hq kernel:  f8afc3ae xfs_iomap+0x60e/0x82d [xfs]  c0113bc8 __wake_up_common+0x39/0x59
May 24 01:53:50 hq kernel:  f8b1ae11 xfs_map_blocks+0x39/0x6c [xfs]  f8b1bd7b xfs_page_state_convert+0x644/0xf9c [xfs]
May 24 01:53:50 hq kernel:  c036f384 schedule+0x5d1/0xf4d  f8b1c780 xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel:  f8b1c7d7 xfs_vm_writepage+0x57/0xe0 [xfs]  c01830e8 mpage_writepages+0x1fb/0x3bb
May 24 01:53:50 hq kernel:  c0183020 mpage_writepages+0x133/0x3bb  f8b1c780 xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel:  c0147bb3 do_writepages+0x35/0x3b  c018135c __writeback_single_inode+0x88/0x387
May 24 01:53:50 hq kernel:  c01819b7 sync_sb_inodes+0x1b4/0x2a8  c0181c63 writeback_inodes+0x63/0xdc
May 24 01:53:50 hq kernel:  c0147943 background_writeout+0x66/0x9f  c01482b3 pdflush+0x0/0x1ad
May 24 01:53:50 hq kernel:  c01483a2 pdflush+0xef/0x1ad  c01478dd background_writeout+0x0/0x9f
May 24 01:53:50 hq kernel:  c012d10b kthread+0xc2/0xc6  c012d049 kthread+0x0/0xc6
May 24 01:53:50 hq kernel:  c0100dd5 kernel_thread_helper+0x5/0xb

...and my logs are spammed with such messages.

This internal error isn't a good reason to shut down the file system? I think if there's a sign of a corrupted file system, the first thing we should do is to stop writes (or the entire FS) and let the admin examine the situation. I'm not talking about my case where the md raid5 was braindead; I'm talking about general situations.

--
d
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> A barrier write will include a flush, but it may also use the FUA bit to ensure data is on platter. So the only situation where a fallback from a barrier to a flush would be valid is if the device lied and told you it could do FUA but it could not, and that is the reason why the barrier write failed. If that is the case, the block layer should stop using FUA and fall back to flush-write-flush. And if it does that, then there's never a valid reason to switch from using barrier writes to blkdev_issue_flush(), since both methods would either both work or both fail.

IIRC, the FUA bit only forces THAT request to hit the platter before it is completed; it does not flush any previous requests still sitting in the write-back queue. Because all I/O before the barrier must be on the platter as well, setting the FUA bit on the barrier request means you don't have to follow it with a flush, but you still have to precede it with one.

> It's not block layer breakage, it's a device issue.

How is that not block layer breakage? If the device does not support barriers, isn't it the job of the block layer (probably the scheduler) to fall back to flush-write-flush?
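To spell out the two orderings being compared, for a barrier write W with earlier writes A and B still in the device's write-back cache (illustrative only):

	without FUA:  A, B, FLUSH, W, FLUSH
	with FUA:     A, B, FLUSH, W(FUA)

In both cases a flush must precede W so that A and B reach the media first; FUA only replaces the trailing flush, by forcing W itself through the write-back cache.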
Re: [dm-devel] [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Neil Brown wrote:
> There is no guarantee that a device can support BIO_RW_BARRIER - it is always possible that a request will fail with EOPNOTSUPP.

Why is it not the job of the block layer to translate for broken devices and send them a flush/write/flush?

> These devices would find it very hard to support BIO_RW_BARRIER. Doing this would require keeping track of all in-flight requests (which some, possibly all, of the above don't) and then:

The device mapper keeps track of in-flight requests already. When switching tables it has to hold new requests and wait for in-flight requests to complete before switching to the new table. When it gets a barrier request it just needs to do the same thing, only not switch tables.

> I think the best approach for this class of devices is to return -EOPNOTSUPP. If the filesystem does the wait (which they all do already) and the blkdev_issue_flush (which is easy to add), they don't need to support BIO_RW_BARRIER.

Why? The personalities should just pass the BARRIER flag down to each underlying device, and the dm common code should wait for all in-flight I/O to complete before sending the barrier to the personality.

> For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to the controller can be tagged as barriers), SCSI will use the SYNCHRONIZE_CACHE command to flush the cache after the barrier request (a bit like the filesystem calling blkdev_issue_flush, but at

Don't you have to flush the cache BEFORE the barrier to ensure that previous I/O is committed first, THEN the barrier write?
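For what it's worth, the ordered-mode handling in the block layer of this era does appear to do exactly that for a QUEUE_ORDERED_DRAIN_FLUSH device, expanding a barrier into a three-request sequence (a sketch of the idea, not the literal code):

	pre-flush  ->  barrier write  ->  post-flush

That is, the cache is flushed before the barrier write, so that earlier I/O is committed first, and again after it, so that the barrier write itself is durable - the trailing flush being replaced by FUA where the device supports it.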