Re: 2.6.23-rc1: known regressions with patches
Michal Piotrowski wrote:
> Subject         : Oops while modprobing phy fixed module
> References      : http://lkml.org/lkml/2007/7/14/63
> Last known good : ?
> Submitter       : Gabriel C [EMAIL PROTECTED]
> Caused-By       : Tejun Heo [EMAIL PROTECTED]
>                   commit 3007e997de91ec59af39a3f9c91595b31ae6e08b
> Handled-By      : Satyam Sharma [EMAIL PROTECTED]
>                   Tejun Heo [EMAIL PROTECTED]
>                   Vitaly Bordug [EMAIL PROTECTED]
> Patch1          : http://lkml.org/lkml/2007/7/18/506
> Status          : patch available

Patch is in mainline.  Commit a1da4dfe35bc36c3bc9716d995c85b7983c38a76.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] block: cosmetic changes
Cosmetic changes.  This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -443,7 +443,8 @@ static inline struct request *start_orde
 	rq_init(q, rq);
 	if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
 		rq->cmd_flags |= REQ_RW;
-	rq->cmd_flags |= q->ordered & QUEUE_ORDERED_FUA ? REQ_FUA : 0;
+	if (q->ordered & QUEUE_ORDERED_FUA)
+		rq->cmd_flags |= REQ_FUA;
 	rq->elevator_private = NULL;
 	rq->elevator_private2 = NULL;
 	init_request_from_bio(rq, q->orig_bar_rq->bio);
@@ -3167,7 +3168,7 @@ end_io:
 		break;
 	}
 
-	if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
+	if (unlikely(nr_sectors > q->max_hw_sectors)) {
 		printk("bio too big device %s (%u > %u)\n",
 		       bdevname(bio->bi_bdev, b),
 		       bio_sectors(bio),
[PATCH] block: factor out bio_check_eod()
End of device check is done twice in __generic_make_request() and it's
fully inlined each time.  Factor out bio_check_eod().

This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   63 ++++++++++++++++++++++++++---------------------
 1 file changed, 33 insertions(+), 30 deletions(-)

Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -3094,6 +3094,35 @@ static inline int should_fail_request(st
 
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+/*
+ * Check whether this bio extends beyond the end of the device.
+ */
+static int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
+{
+	sector_t maxsector;
+
+	if (!nr_sectors)
+		return 0;
+
+	/* Test device or partition size, when known. */
+	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
+	if (maxsector) {
+		sector_t sector = bio->bi_sector;
+
+		if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
+			/*
+			 * This may well happen - the kernel calls bread()
+			 * without checking the size of the device, e.g., when
+			 * mounting a device.
+			 */
+			handle_bad_sector(bio);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * generic_make_request: hand a buffer to its device driver for I/O
  * @bio: The bio describing the location in memory and on the device.
@@ -3121,27 +3150,14 @@ static inline int should_fail_request(st
 static inline void __generic_make_request(struct bio *bio)
 {
 	request_queue_t *q;
-	sector_t maxsector;
 	sector_t old_sector;
 	int ret, nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
 
 	might_sleep();
 
-	/* Test device or partition size, when known. */
-	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-	if (maxsector) {
-		sector_t sector = bio->bi_sector;
-
-		if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
-			/*
-			 * This may well happen - the kernel calls bread()
-			 * without checking the size of the device, e.g., when
-			 * mounting a device.
-			 */
-			handle_bad_sector(bio);
-			goto end_io;
-		}
-	}
+	if (bio_check_eod(bio, nr_sectors))
+		goto end_io;
 
 	/*
 	 * Resolve the mapping until finished. (drivers are
@@ -3197,21 +3213,8 @@ end_io:
 		old_sector = bio->bi_sector;
 		old_dev = bio->bi_bdev->bd_dev;
 
-		maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-		if (maxsector) {
-			sector_t sector = bio->bi_sector;
-
-			if (maxsector < nr_sectors ||
-			    maxsector - nr_sectors < sector) {
-				/*
-				 * This may well happen - partitions are not
-				 * checked to make sure they are within the size
-				 * of the whole device.
-				 */
-				handle_bad_sector(bio);
-				goto end_io;
-			}
-		}
+		if (bio_check_eod(bio, nr_sectors))
+			goto end_io;
 
 		ret = q->make_request_fn(q, bio);
 	} while (ret);
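For readers who want to poke at the boundary conditions, here is a minimal userspace re-implementation of the factored-out check. This is a sketch, not the kernel function: the struct bio plumbing is dropped and the sector values are passed directly, and `check_eod` is a made-up stand-in name for `bio_check_eod`.

```c
#include <assert.h>

/* Userspace model of the end-of-device check factored out above.
 * The kernel version takes a struct bio; here the sizes are passed
 * directly so the logic can be exercised standalone. */
typedef unsigned long long sector_t;

/*
 * Returns 1 if a request of nr_sectors starting at 'sector' would run
 * past the end of a device holding 'maxsector' total sectors, else 0.
 * A maxsector of 0 means "size unknown" and disables the check, and a
 * zero-length request (e.g. an empty barrier) always passes, matching
 * the kernel code.
 */
int check_eod(sector_t maxsector, sector_t sector, unsigned int nr_sectors)
{
	if (!nr_sectors)
		return 0;		/* zero-length bios always pass */

	if (!maxsector)
		return 0;		/* device size not known */

	if (maxsector < nr_sectors || maxsector - nr_sectors < sector)
		return 1;		/* extends beyond end of device */

	return 0;
}
```

Note the `maxsector - nr_sectors < sector` form: it compares by subtraction so that `sector + nr_sectors` is never computed, which would overflow near the top of the sector_t range.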
Re: [PATCH] block: factor out bio_check_eod()
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> End of device check is done twice in __generic_make_request() and
>> it's fully inlined each time.  Factor out bio_check_eod().
>
> Tejun, yeah I should separate the cleanups and put them in the
> upstream branch.  Will do so and add your signed-off to both of them.

Would they be different from the one I just posted?  No big deal
either way.  I'm just basing the zero-length barrier on top of these
patches.  Oh well, the changes are trivial anyway.

-- 
tejun
Re: [PATCH] block: factor out bio_check_eod()
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> End of device check is done twice in __generic_make_request() and
>>>> it's fully inlined each time.  Factor out bio_check_eod().
>>>
>>> Tejun, yeah I should separate the cleanups and put them in the
>>> upstream branch.  Will do so and add your signed-off to both of
>>> them.
>>
>> Would they be different from the one I just posted?  No big deal
>> either way.  I'm just basing the zero-length barrier on top of these
>> patches.  Oh well, the changes are trivial anyway.
>
> This one ended up being the same, but in the first one you missed
> some of the cleanups.  I ended up splitting the patch some more
> though, see the series:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier

Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.

Thanks.

-- 
tejun
Re: [PATCH] block: factor out bio_check_eod()
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> Jens Axboe wrote:
>>>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>>>> End of device check is done twice in __generic_make_request()
>>>>>> and it's fully inlined each time.  Factor out bio_check_eod().
>>>>>
>>>>> Tejun, yeah I should separate the cleanups and put them in the
>>>>> upstream branch.  Will do so and add your signed-off to both of
>>>>> them.
>>>>
>>>> Would they be different from the one I just posted?  No big deal
>>>> either way.  I'm just basing the zero-length barrier on top of
>>>> these patches.  Oh well, the changes are trivial anyway.
>>>
>>> This one ended up being the same, but in the first one you missed
>>> some of the cleanups.  I ended up splitting the patch some more
>>> though, see the series:
>>>
>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
>>
>> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.
>> Thanks.
>
> 1781c6a39fb6e31836557618c4505f5f7bc61605, no?  Unless you want to
> rewrite it completely :-)

I think I'll start from 662d5c5e and steal most parts from 1781c6a3.
I like stealing, you know. :-)  I think 1781c6a3 also can use
splitting - zero-length barrier implementation and issue_flush
conversion.

Anyways, how do I pull from git.kernel.dk?
git://git.kernel.dk/linux-2.6-block.git gives me "connection reset by
server".

Thanks.

-- 
tejun
Re: [PATCH] block: factor out bio_check_eod()
Jens Axboe wrote:
> somewhat annoying, I'll see if I can prefix it with git-daemon in the
> future.
>
> OK, now skip the /data/git/ stuff and just use
>
> git://git.kernel.dk/linux-2.6-block.git

Alright, it works like a charm now.  Thanks.

-- 
tejun
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[EMAIL PROTECTED] wrote:
> On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
>> All of the high end arrays have non-volatile cache (read: on power
>> loss, it is a promise that it will get all of your data out to
>> permanent storage).  You don't need to ask this kind of array to
>> drain the cache.  In fact, it might just ignore you if you send it
>> that kind of request ;-)
>
> OK, I'll bite - how does the kernel know whether the other end of
> that fiberchannel cable is attached to a DMX-3 or to some no-name
> product that may not have the same assurances?  Is there an "I'm a
> high-end array" bit in the sense data that I'm unaware of?

Well, the array just has to tell the kernel that it doesn't do write
back caching.  The kernel automatically selects ORDERED_DRAIN in such
a case.

-- 
tejun
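The selection Tejun describes can be sketched as a small decision function. This is illustrative only: the mode names mirror the kernel's QUEUE_ORDERED_* constants, but `pick_ordered_mode` and its inputs are invented for the example, assuming the device simply reports its write-cache and FUA capabilities (as SCSI mode sense does).

```c
#include <assert.h>

/* Illustrative ordered-mode selection: if the device reports no
 * write-back cache (e.g. a battery-backed array), draining the host
 * queue is enough; otherwise a cache flush (or FUA writes, when
 * supported) is needed around barriers. */
enum ordered_mode {
	ORDERED_DRAIN,		/* no write-back cache: drain queue only */
	ORDERED_DRAIN_FLUSH,	/* write-back cache: drain + flush cache */
	ORDERED_DRAIN_FUA,	/* write-back cache + FUA writes */
};

enum ordered_mode pick_ordered_mode(int write_cache_enabled, int fua_supported)
{
	if (!write_cache_enabled)
		return ORDERED_DRAIN;

	return fua_supported ? ORDERED_DRAIN_FUA : ORDERED_DRAIN_FLUSH;
}
```

This is why an array that reports write-through (or hides its NV cache behind a write-through report) never gets flush requests from the kernel in the first place.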
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Ric Wheeler wrote:
>>>> Don't those thingies usually have NV cache or backed by battery
>>>> such that ORDERED_DRAIN is enough?
>>>
>>> All of the high end arrays have non-volatile cache (read: on power
>>> loss, it is a promise that it will get all of your data out to
>>> permanent storage).  You don't need to ask this kind of array to
>>> drain the cache.  In fact, it might just ignore you if you send it
>>> that kind of request ;-)
>>>
>>> The size of the NV cache can run from a few gigabytes up to
>>> hundreds of gigabytes, so you really don't want to invoke cache
>>> flushes here if you can avoid it.  For this class of device, you
>>> can get the required in-order completion and data integrity
>>> semantics as long as we send the IOs to the device in the correct
>>> order.
>>
>> Thanks for clarification.  The problem is that the interface between
>> the host and a storage device (ATA or SCSI) is not built to
>> communicate that kind of information (grouped flush, relaxed
>> ordering...).  I think battery-backed ORDERED_DRAIN combined with
>> fine-grained host queue flush would be pretty good.  It doesn't
>> require some fancy new interface which isn't gonna be used widely
>> anyway and can achieve most of the performance gain if the storage
>> plays it smart.
>
> I am not really sure that you need this ORDERED_DRAIN for big
> arrays...

ORDERED_DRAIN is to properly order requests from the host request
queue (elevator/iosched).  We can make it finer-grained but we do need
to put some ordering restrictions.

-- 
tejun
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Hello, Jens.

Jens Axboe wrote:
> On Mon, May 28 2007, Neil Brown wrote:
>> I think the implementation priorities here are:
>>
>> 1/ implement a zero-length BIO_RW_BARRIER option.
>> 2/ Use it (or otherwise) to make all dm and md modules handle
>>    barriers (and loop?).
>> 3/ Devise and implement appropriate fall-backs with-in the block
>>    layer so that -EOPNOTSUP is never returned.
>> 4/ Remove unneeded cruft from filesystems (and elsewhere).
>
> This is the start of 1/ above.  It's very lightly tested, it's
> verified to DTRT here at least and not crash :-)
>
> It gets rid of the ->issue_flush_fn() queue callback, all the driver
> knowledge resides in ->prepare_flush_fn() anyways.
> blkdev_issue_flush() then just reuses the empty-bio approach to queue
> an empty barrier, this should work equally well for stacked and
> non-stacked devices.

While this patch isn't complete yet, it's clearly the right direction
to go.  Finally took a brief look. :-)  I think the sequencing for the
zero-length barrier can be better done by pre-setting QUEUE_ORDSEQ_BAR
in start_ordered() rather than short-circuiting the request after it's
issued.  What do you think?

Thanks.

-- 
tejun
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> On Sat, Jun 02 2007, Tejun Heo wrote:
>> Hello,
>>
>> Jens Axboe wrote:
>>>> Would that be very different from issuing barrier and not waiting
>>>> for its completion?  For ATA and SCSI, we'll have to flush the
>>>> write back cache anyway, so I don't see how we can get a
>>>> performance advantage by implementing a separate WRITE_ORDERED.  I
>>>> think the zero-length barrier (haven't looked at the code yet,
>>>> still recovering from jet lag :-) can serve as a genuine barrier
>>>> without the extra write tho.
>>>
>>> As always, it depends :-)  If you are doing pure flush barriers,
>>> then there's no difference.  Unless you only guarantee ordering wrt
>>> previously submitted requests, in which case you can eliminate the
>>> post flush.  If you are doing ordered tags, then just setting the
>>> ordered bit is enough.  That is different from the barrier in that
>>> we don't need a flush or FUA bit set.
>>
>> Hmmm... I'm feeling dense.  Zero-length barrier also requires only
>> one flush to separate requests before and after it (haven't looked
>> at the code yet, will soon).  Can you enlighten me?
>
> Yeah, that's what the zero-length barrier implementation I posted
> does.  Not sure if you have a question beyond that, if so fire away
> :-)

I thought you were talking about adding BIO_RW_ORDERED instead of
exposing a zero-length BIO_RW_BARRIER.  Sorry about the confusion. :-)

-- 
tejun
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Hello,

Jens Axboe wrote:
>> Would that be very different from issuing barrier and not waiting
>> for its completion?  For ATA and SCSI, we'll have to flush the write
>> back cache anyway, so I don't see how we can get a performance
>> advantage by implementing a separate WRITE_ORDERED.  I think the
>> zero-length barrier (haven't looked at the code yet, still
>> recovering from jet lag :-) can serve as a genuine barrier without
>> the extra write tho.
>
> As always, it depends :-)  If you are doing pure flush barriers, then
> there's no difference.  Unless you only guarantee ordering wrt
> previously submitted requests, in which case you can eliminate the
> post flush.  If you are doing ordered tags, then just setting the
> ordered bit is enough.  That is different from the barrier in that we
> don't need a flush or FUA bit set.

Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
flush to separate requests before and after it (haven't looked at the
code yet, will soon).  Can you enlighten me?

Thanks.

-- 
tejun
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
  http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
> but when you consider the self-contained disk arrays it's an entirely
> different story.  you can easily have a few gig of cache and a
> complete OS pretending to be a single drive as far as you are
> concerned.
>
> and the price of such devices is plummeting (in large part thanks to
> Linux moving into this space), you can now readily buy a 10TB array
> for $10k that looks like a single drive.

Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage
device (ATA or SCSI) is not built to communicate that kind of
information (grouped flush, relaxed ordering...).  I think
battery-backed ORDERED_DRAIN combined with fine-grained host queue
flush would be pretty good.  It doesn't require some fancy new
interface which isn't gonna be used widely anyway and can achieve most
of the performance gain if the storage plays it smart.

Thanks.

-- 
tejun
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> On Thu, May 31 2007, David Chinner wrote:
>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>>> On Thu, May 31 2007, David Chinner wrote:
>>>> IOWs, there are two parts to the problem:
>>>>
>>>>   1 - guaranteeing I/O ordering
>>>>   2 - guaranteeing blocks are on persistent storage.
>>>>
>>>> Right now, a single barrier I/O is used to provide both of these
>>>> guarantees.  In most cases, all we really need to provide is 1);
>>>> the need for 2) is a much rarer condition but still needs to be
>>>> provided.
>>>>
>>>>> if I am understanding it correctly, the big win for barriers is
>>>>> that you do NOT have to stop and wait until the data is on
>>>>> persistent media before you can continue.
>>>>
>>>> Yes, if we define a barrier to only guarantee 1), then yes this
>>>> would be a big win (esp. for XFS).  But that requires all
>>>> filesystems to handle sync writes differently, and sync_blockdev()
>>>> needs to call blkdev_issue_flush() as well.
>>>>
>>>> So, what do we do here?  Do we define a barrier I/O to only
>>>> provide ordering, or do we define it to also provide persistent
>>>> storage writeback?  Whatever we decide, it needs to be documented.
>>>
>>> The block layer already has a notion of the two types of barriers,
>>> with a very small amount of tweaking we could expose that.  There's
>>> absolutely zero reason we can't easily support both types of
>>> barriers.
>>
>> That sounds like a good idea - we can leave the existing
>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
>> behaviour that only guarantees ordering.  The filesystem can then
>> choose which to use where appropriate.
>
> Precisely.  The current definition of barriers are what Chris and I
> came up with many years ago, when solving the problem for reiserfs
> originally.  It is by no means the only feasible approach.
>
> I'll add a WRITE_ORDERED command to the #barrier branch, it already
> contains the empty-bio barrier support I posted yesterday (well a
> slightly modified and cleaned up version).

Would that be very different from issuing a barrier and not waiting
for its completion?  For ATA and SCSI, we'll have to flush the write
back cache anyway, so I don't see how we can get a performance
advantage by implementing a separate WRITE_ORDERED.  I think the
zero-length barrier (haven't looked at the code yet, still recovering
from jet lag :-) can serve as a genuine barrier without the extra
write tho.

Thanks.

-- 
tejun
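The flush-cost argument in the exchange above can be made concrete with a toy model. This is an illustration, not kernel code: for a write-back cache device without FUA (the QUEUE_ORDERED_DRAIN_FLUSH style), a data-carrying barrier needs a flush before the write and another after it, while a zero-length barrier carries no data of its own and so needs only the single flush that separates earlier writes from later ones.

```c
#include <assert.h>

/* Toy cost model: how many cache flushes does a barrier cost on a
 * write-back cache device without FUA?  One pre-flush is always
 * needed to commit the writes preceding the barrier; a barrier that
 * carries data must additionally be made durable itself, which takes
 * a second flush after the write. */
int flushes_needed(unsigned int barrier_nr_sectors)
{
	int flushes = 1;	/* pre-flush: commit earlier writes */

	if (barrier_nr_sectors > 0)
		flushes++;	/* post-flush: commit the barrier write */

	return flushes;
}
```

This is the sense in which a zero-length barrier can "serve as a genuine barrier without the extra write": dropping the write also drops the second flush.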
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Stefan Bader wrote:
> 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
>> Stefan Bader wrote:
>>> Since drive a supports barrier request we don't get -EOPNOTSUPP but
>>> the request with block y might get written before block x since the
>>> disks are independent.  I guess the chances of this are quite low
>>> since at some point a barrier request will also hit drive b but for
>>> the time being it might be better to indicate -EOPNOTSUPP right
>>> from device-mapper.
>>
>> The device mapper needs to ensure that ALL underlying devices get a
>> barrier request when one comes down from above, even if it has to
>> construct zero-length barriers to send to most of them.  And somehow
>> also make sure all of the barriers have been processed before
>> returning the barrier that came in.
>
> Plus it would have to queue all mapping requests until the barrier is
> done (if strictly acting according to barrier.txt).
>
> But I am wondering a bit whether the requirements on barriers are
> really that tight as described in Tejun's document (a barrier request
> is only started if everything before it is safe, the barrier itself
> isn't returned until it is safe, too, and all requests after the
> barrier aren't started before the barrier is done).  Is it really
> necessary to defer any further requests until the barrier has been
> written to safe storage?  Or would it be sufficient to guarantee
> that, if a barrier request returns, everything up to (and including)
> the barrier is on safe storage?

Well, what's described in barrier.txt is the currently implemented
semantics and what filesystems expect, so we can't change it
underneath them, but we definitely can introduce new, more relaxed
variants.  One thing we should bear in mind is that harddisks don't
have humongous caches or a very smart controller / instruction set.
No matter how relaxed an interface the block layer provides, in the
end it just has to issue a whole-sale FLUSH CACHE on the device to
guarantee data ordering on the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue, which can be deeper and more intelligent than the
device queue.

Thanks.

-- 
tejun
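The sequence Phillip and Stefan describe for device-mapper, drain in-flight requests, flush every component device, only then issue the barrier, can be sketched as a simulation. All names below (`dm_handle_barrier`, the stub operations, the string log) are invented for the illustration; real dm would do this asynchronously against struct dm_dev targets.

```c
#include <string.h>

/* Records the order of simulated operations so the sequencing can be
 * checked afterwards. */
char log_buf[256];

void op(const char *what)
{
	strcat(log_buf, what);
	strcat(log_buf, ";");
}

/* Stubs standing in for "wait for in-flight writes to complete",
 * "flush one component device's cache", and "send the barrier write
 * to its target device". */
void wait_for_inflight(void)	{ op("drain"); }
void flush_component(int i)	{ (void)i; op("flush"); }
void issue_barrier_write(void)	{ op("barrier"); }

/* The barrier sequence a stacking driver would need across ndev
 * component devices, per the discussion above. */
void dm_handle_barrier(int ndev)
{
	int i;

	wait_for_inflight();		/* 1. drain pending writes      */
	for (i = 0; i < ndev; i++)
		flush_component(i);	/* 2. flush every component     */
	issue_barrier_write();		/* 3. only then issue barrier   */
}
```

The point of the simulation is the ordering constraint: no component may see the barrier until every component has drained and flushed, which is exactly why zero-length barriers are attractive as the per-component "flush" primitive.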
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Hello,

Neil Brown wrote:
> 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP.
>
> This is certainly a very attractive position - it makes the interface
> cleaner and makes life easier for filesystems and other clients of
> the block interface.  Currently filesystems handle -EOPNOTSUP by
>  a/ resubmitting the request without the BARRIER (after waiting for
>     earlier requests to complete) and
>  b/ possibly printing an error message to the kernel logs.
>
> The block layer can do both of these just as easily and it does make
> sense to do it there.

Yeah, I think doing all of the above in the block layer is the
cleanest way to solve this.  If the write back cache flush doesn't
work, the barrier is bound to fail, but the block layer still can
write the barrier block as requested (without actual barriering),
whine about it to the user, and tell the FS that the barrier failed
but the write itself went through, so that the FS can go on without
caring about it unless it wants to.

> md/dm modules could keep count of requests as has been suggested
> (though that would be a fairly big change for raid0 as it currently
> doesn't know when a request completes - bi_endio goes directly to the
> filesystem).  However I think the idea of a zero-length
> BIO_RW_BARRIER would be a good option.  raid0 could send one of these
> down each device, and when they all return, the barrier request can
> be sent to its target device(s).

Yeap.

> 2/ Maybe barriers provide stronger semantics than are required.
>
> All write requests are synchronised around a barrier write.  This is
> often more than is required and apparently can cause a measurable
> slowdown.  Also the FUA for the actual commit write might not be
> needed.  It is important for consistency that the preceding writes
> are in safe storage before the commit write, but it is not so
> important that the commit write is immediately safe on storage.  That
> isn't needed until a 'sync' or 'fsync' or similar.
>
> One possible alternative is:
>  - writes can overtake barriers, but barrier cannot overtake writes.
>  - flush before the barrier, not after.

I think we can give this property to zero-length barriers.

> This is considerably weaker, and hence cheaper.  But I think it is
> enough for all filesystems (providing it is still an option to call
> blkdev_issue_flush on 'fsync').
>
> Another alternative would be to tag each bio as being in a particular
> barrier-group.  Then bio's in different groups could overtake each
> other in either direction, but a BARRIER request must be totally
> ordered w.r.t. other requests in the barrier group.  This would
> require an extra bio field, and would give the filesystem more
> appearance of control.  I'm not yet sure how much it would really
> help...  It would allow us to set FUA on all bios with a non-zero
> barrier-group.  That would mean we don't have to flush the entire
> cache, just those blocks that are critical.  But I'm still not sure
> it's a good idea.

Barrier code as it currently stands deals with two colors, so there
can be only one outstanding barrier at a given moment.  Expanding it
to deal with multiple colors and then to multiple simultaneous groups
will take some work but is definitely possible.  If FS people can make
good use of it, I think it would be worthwhile.

> Of course, these weaker rules would only apply inside the elevator.
> Once the request goes to the device we need to work with what the
> device provides, which probably means total-ordering around the
> barrier.

Yeah, on the device side, the best we can do most of the time is a
full flush, but as long as the request queue depth is much deeper than
the controller/device one, having multiple barrier groups can be
helpful.  We need more input from FS people, I think.

> 3/ Do we need explicit control of the 'ordered' mode?
>
> Consider a SCSI device that has NV RAM cache.  mode_sense reports
> that write-back is enabled, so _FUA or _FLUSH will be used.  But as
> it is *NV* ram, QUEUE_ORDERED_DRAIN is really the best mode.  But it
> seems there is no way to query this information.
>
>> Using _FLUSH causes the NVRAM to be flushed to media which is a
>> terrible performance problem.

If the NV RAM can be reliably detected using one of the inquiry pages,
the sd driver can switch it to DRAIN automatically.

>> Setting SYNC_NV doesn't work on the particular device in question.
>> We currently tell customers to mount with -o nobarriers, but that
>> really feels like the wrong solution.  We should be telling the scsi
>> device "don't flush".
>
> An advantage of 'nobarriers' is it can go in /etc/fstab.  Where would
> you record that a SCSI drive should be set to QUEUE_ORDERED_DRAIN ??

How about exporting the ordered mode as a sysfs attribute and
configuring it using a udev rule?  It's a device property after all.

Thanks.

-- 
tejun
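To make the sysfs-plus-udev suggestion concrete: if the block layer exported the ordered mode as, say, `/sys/block/sdX/queue/ordered` (a hypothetical attribute; it does not exist in mainline, and the attribute name, value, and model match below are all invented for illustration), a site could pin a known battery-backed array to drain-only ordering with a rule like:

```
# Hypothetical udev rule, assuming a writable queue/ordered sysfs
# attribute existed: force drain-only ordering for a known NV-cache
# array so the kernel never issues cache flushes to it.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*", \
    ENV{ID_MODEL}=="MyNVArray", ATTR{queue/ordered}="drain"
```

The appeal over `-o nobarriers` is that this expresses a property of the device in device-centric configuration, rather than a property of each filesystem mounted on it.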
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Hello, Neil Brown.

Please cc me on blkdev barriers and, if you haven't yet, reading
Documentation/block/barrier.txt can be helpful too.

Neil Brown wrote:
[--snip--]
> 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
>    there is it is non-volatile.  Once a write completes it is
>    completely safe.  Such a device does not require barriers or
>    ->issue_flush_fn, and can respond to them either by a no-op or
>    with -EOPNOTSUPP (the former is preferred).
>
> 2/ FLUSHABLE.  A FLUSHABLE device may have a volatile write-behind
>    cache.  This cache can be flushed with a call to
>    blkdev_issue_flush.  It may not support barrier requests.
>
> 3/ BARRIER.  A BARRIER device supports both blkdev_issue_flush and
>    BIO_RW_BARRIER.  Either may be used to synchronise any
>    write-behind cache to non-volatile storage (media).
>
> Handling of SAFE and FLUSHABLE devices is essentially the same and
> can work on a BARRIER device.  The BARRIER device has the option of
> more efficient handling.

Actually, all of the above three are handled by the blkdev flush code.

> How does a filesystem use this?
> ===============================
[--snip--]
> 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
>    block.  (This is more efficient on BARRIER).

This really should be enough.

> HOW DO MD or DM USE THIS
>
> 1/ striping devices.  This includes md/raid0, md/linear, dm-linear,
>    dm-stripe and probably others.
>
>    These devices can easily support blkdev_issue_flush by simply
>    calling blkdev_issue_flush on all component devices.
>
>    These devices would find it very hard to support BIO_RW_BARRIER.
>    Doing this would require keeping track of all in-flight requests
>    (which some, possibly all, of the above don't) and then:
>      When a BIO_RW_BARRIER request arrives:
>         wait for all pending writes to complete
>         call blkdev_issue_flush on all devices
>         issue the barrier write to the target device(s)
>            as BIO_RW_BARRIER
>         if that is -EOPNOTSUP, re-issue, wait, flush.

Hmm... What do you think about introducing a zero-length
BIO_RW_BARRIER for this case?

> 2/ Mirror devices.  This includes md/raid1 and dm-raid1.
>
>    These devices can trivially implement blkdev_issue_flush much like
>    the striping devices, and can support BIO_RW_BARRIER to some
>    extent.  md/raid1 currently tries.  I'm not sure about dm-raid1.
>
>    md/raid1 determines if the underlying devices can handle
>    BIO_RW_BARRIER.  If any cannot, it rejects such requests
>    (EOPNOTSUP) itself.  If all underlying devices do appear to
>    support barriers, md/raid1 will pass a barrier-write down to all
>    devices.
>
>    The difficulty comes if it fails on one device, but not all
>    devices.  In this case it is not clear what to do.  Failing the
>    request is a lie, because some data has been written (possibly too
>    early).  Succeeding the request (after re-submitting the failed
>    requests) is also a lie as the barrier wasn't really honoured.
>    md/raid1 currently takes the latter approach, but will only do it
>    once - after that it fails all barrier requests.
>
>    Hopefully this is unlikely to happen.  What device would work
>    correctly with barriers once, and then not the next time?  The
>    answer is md/raid1.  If you remove a failed device and add a new
>    device that doesn't support barriers, md/raid1 will notice and
>    stop supporting barriers.  If md/raid1 can change from supporting
>    barrier to not, then maybe some other device could too?
>
>    I'm not sure what to do about this - maybe just ignore it...

That sounds good. :-)

> 3/ Other modules
>
>    Other md and dm modules (raid5, mpath, crypt) do not add anything
>    interesting to the above.  Either handling BIO_RW_BARRIER is
>    trivial, or extremely difficult.
>
> HOW DO LOW LEVEL DEVICES HANDLE THIS
>
> This is part of the picture that I haven't explored greatly.  My
> feeling is that most if not all devices support blkdev_issue_flush
> properly, and support barriers reasonably well providing that the
> hardware does.
>
> There is an exception I recently found though.  For devices that
> don't support QUEUE_ORDERED_TAG (i.e. commands sent to the controller
> can be tagged as barriers), SCSI will use the SYNCHRONIZE_CACHE
> command to flush the cache after the barrier request (a bit like the
> filesystem calling blkdev_issue_flush, but at a lower level).
> However it does this without setting the SYNC_NV bit.  This means
> that a device with a non-volatile cache will be required --
> needlessly -- to flush that cache to media.

Yeah, it probably needs updating, but some devices might react badly
too.

> So: some questions to help encourage response:
>
>  - Is the above substantially correct?  Totally correct?
>  - Should the various filesystems be fixed as suggested above?  Is
>    someone willing to do
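The -EOPNOTSUP fallback described above (re-submit the barrier as a plain write after waiting, then flush) can be sketched as a small simulation. Everything here is invented for the illustration: the device struct, the stub submit/flush helpers, and the error constant (the kernel's EOPNOTSUPP has a different plumbing entirely).

```c
#include <assert.h>

#define SIM_EOPNOTSUPP 95	/* stand-in errno for the simulation */

/* Simulated device: remembers whether it accepts barrier writes and
 * counts the writes and cache flushes it receives. */
struct sim_dev {
	int supports_barrier;
	int writes;
	int flushes;
};

/* A barrier write is rejected whole (nothing written) by a device
 * that can't honour barriers; plain writes always succeed. */
int submit_write(struct sim_dev *d, int barrier)
{
	if (barrier && !d->supports_barrier)
		return -SIM_EOPNOTSUPP;
	d->writes++;
	return 0;
}

void issue_flush(struct sim_dev *d) { d->flushes++; }

/* The fallback path: try the barrier; if it's rejected, (wait for
 * earlier requests to complete - elided here), re-issue as a plain
 * write, then flush to get the durability the barrier promised. */
int barrier_write_with_fallback(struct sim_dev *d)
{
	int ret = submit_write(d, 1);

	if (ret == -SIM_EOPNOTSUPP) {
		/* ... wait for all earlier writes to complete ... */
		ret = submit_write(d, 0);
		if (ret == 0)
			issue_flush(d);
	}
	return ret;
}
```

This is exactly the a/ and b/ behaviour Neil notes filesystems implement today, and which he (and Tejun, above) argue belongs in the block layer instead, so clients never see -EOPNOTSUP at all.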