Re: 2.6.23-rc1: known regressions with patches

2007-07-24 Thread Tejun Heo
Michal Piotrowski wrote:
> Subject : Oops while modprobing phy fixed module
> References  : http://lkml.org/lkml/2007/7/14/63
> Last known good : ?
> Submitter   : Gabriel C [EMAIL PROTECTED]
> Caused-By   : Tejun Heo [EMAIL PROTECTED]
>   commit 3007e997de91ec59af39a3f9c91595b31ae6e08b
> Handled-By  : Satyam Sharma [EMAIL PROTECTED]
>   Tejun Heo [EMAIL PROTECTED]
>   Vitaly Bordug [EMAIL PROTECTED]
> Patch1  : http://lkml.org/lkml/2007/7/18/506
> Status  : patch available

Patch is in mainline.  Commit a1da4dfe35bc36c3bc9716d995c85b7983c38a76.

Thanks.

-- 
tejun


[PATCH] block: cosmetic changes

2007-07-18 Thread Tejun Heo
Cosmetic changes.  This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: work/block/ll_rw_blk.c
===
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -443,7 +443,8 @@ static inline struct request *start_orde
 	rq_init(q, rq);
 	if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
 		rq->cmd_flags |= REQ_RW;
-	rq->cmd_flags |= q->ordered & QUEUE_ORDERED_FUA ? REQ_FUA : 0;
+	if (q->ordered & QUEUE_ORDERED_FUA)
+		rq->cmd_flags |= REQ_FUA;
 	rq->elevator_private = NULL;
 	rq->elevator_private2 = NULL;
 	init_request_from_bio(rq, q->orig_bar_rq->bio);
@@ -3167,7 +3168,7 @@ end_io:
 			break;
 		}
 
-		if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
+		if (unlikely(nr_sectors > q->max_hw_sectors)) {
 			printk("bio too big device %s (%u > %u)\n",
 				bdevname(bio->bi_bdev, b),
 				bio_sectors(bio),


[PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
End of device check is done twice in __generic_make_request() and it's
fully inlined each time.  Factor out bio_check_eod().

This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   63 +++++++++++++++++++++++++++++++++------------------------------
 1 file changed, 33 insertions(+), 30 deletions(-)

Index: work/block/ll_rw_blk.c
===
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -3094,6 +3094,35 @@ static inline int should_fail_request(st
 
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+/*
+ * Check whether this bio extends beyond the end of the device.
+ */
+static int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
+{
+	sector_t maxsector;
+
+	if (!nr_sectors)
+		return 0;
+
+	/* Test device or partition size, when known. */
+	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
+	if (maxsector) {
+		sector_t sector = bio->bi_sector;
+
+		if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
+			/*
+			 * This may well happen - the kernel calls bread()
+			 * without checking the size of the device, e.g., when
+			 * mounting a device.
+			 */
+			handle_bad_sector(bio);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * generic_make_request: hand a buffer to its device driver for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -3121,27 +3150,14 @@ static inline int should_fail_request(st
 static inline void __generic_make_request(struct bio *bio)
 {
 	request_queue_t *q;
-	sector_t maxsector;
 	sector_t old_sector;
 	int ret, nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
 
 	might_sleep();
-	/* Test device or partition size, when known. */
-	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-	if (maxsector) {
-		sector_t sector = bio->bi_sector;
-
-		if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
-			/*
-			 * This may well happen - the kernel calls bread()
-			 * without checking the size of the device, e.g., when
-			 * mounting a device.
-			 */
-			handle_bad_sector(bio);
-			goto end_io;
-		}
-	}
+
+	if (bio_check_eod(bio, nr_sectors))
+		goto end_io;
 
/*
 * Resolve the mapping until finished. (drivers are
@@ -3197,21 +3213,8 @@ end_io:
 		old_sector = bio->bi_sector;
 		old_dev = bio->bi_bdev->bd_dev;
 
-		maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-		if (maxsector) {
-			sector_t sector = bio->bi_sector;
-
-			if (maxsector < nr_sectors ||
-			    maxsector - nr_sectors < sector) {
-				/*
-				 * This may well happen - partitions are not
-				 * checked to make sure they are within the size
-				 * of the whole device.
-				 */
-				handle_bad_sector(bio);
-				goto end_io;
-			}
-		}
+		if (bio_check_eod(bio, nr_sectors))
+			goto end_io;
 
 		ret = q->make_request_fn(q, bio);
 	} while (ret);


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> End of device check is done twice in __generic_make_request() and it's
>> fully inlined each time.  Factor out bio_check_eod().
> 
> Tejun, yeah I should separate the cleanups and put them in the upstream
> branch. Will do so and add your signed-off to both of them.
> 

Would they be different from the one I just posted?  No big deal either
way.  I'm just basing the zero-length barrier on top of these patches.
Oh well, the changes are trivial anyway.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> End of device check is done twice in __generic_make_request() and it's
>>>> fully inlined each time.  Factor out bio_check_eod().
>>> Tejun, yeah I should separate the cleanups and put them in the upstream
>>> branch. Will do so and add your signed-off to both of them.
>>
>> Would they be different from the one I just posted?  No big deal either
>> way.  I'm just basing the zero-length barrier on top of these patches.
>> Oh well, the changes are trivial anyway.
> 
> This one ended up being the same, but in the first one you missed some
> of the cleanups. I ended up splitting the patch some more though, see
> the series:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier

Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> Jens Axboe wrote:
>>>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>>>> End of device check is done twice in __generic_make_request() and it's
>>>>>> fully inlined each time.  Factor out bio_check_eod().
>>>>> Tejun, yeah I should separate the cleanups and put them in the upstream
>>>>> branch. Will do so and add your signed-off to both of them.
>>>>
>>>> Would they be different from the one I just posted?  No big deal either
>>>> way.  I'm just basing the zero-length barrier on top of these patches.
>>>> Oh well, the changes are trivial anyway.
>>> This one ended up being the same, but in the first one you missed some
>>> of the cleanups. I ended up splitting the patch some more though, see
>>> the series:
>>>
>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
>> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.
> 
> 1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to rewrite
> it completely :-)

I think I'll start from 662d5c5e and steal most parts from 1781c6a3.  I
like stealing, you know. :-)  I think 1781c6a3 can also use splitting into
the zero-length barrier implementation and the issue_flush conversion.

Anyways, how do I pull from git.kernel.dk?
git://git.kernel.dk/linux-2.6-block.git gives me connection reset by server.

Thanks.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
> somewhat annoying, I'll see if I can prefix it with git-daemon in the
> future.
> 
> OK, now skip the /data/git/ stuff and just use
> 
> git://git.kernel.dk/linux-2.6-block.git

Alright, it works like a charm now.  Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
> On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> 
>> All of the high end arrays have non-volatile cache (read, on power loss,
>> it is a promise that it will get all of your data out to permanent
>> storage). You don't need to ask this kind of array to drain the cache.
>> In fact, it might just ignore you if you send it that kind of request ;-)
> 
> OK, I'll bite - how does the kernel know whether the other end of that
> fiberchannel cable is attached to a DMX-3 or to some no-name product that
> may not have the same assurances?  Is there a "I'm a high-end array" bit
> in the sense data that I'm unaware of?

Well, the array just has to tell the kernel that it doesn't do write-back
caching.  The kernel automatically selects ORDERED_DRAIN in that case.
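
For reference, a rough sketch of that selection as it could look in the
SCSI disk driver (treat the exact fields and call site as an approximation
of the idea, not the verbatim sd code):

/*
 * Sketch only: pick the barrier ordering mode from what the device
 * reports.  WCE/DPOFUA come from the caching mode page.
 */
static void sd_pick_ordered_mode(struct scsi_disk *sdkp)
{
	unsigned ordered;

	if (sdkp->WCE)		/* volatile write-back cache: flushes needed */
		ordered = sdkp->DPOFUA ? QUEUE_ORDERED_DRAIN_FUA
				       : QUEUE_ORDERED_DRAIN_FLUSH;
	else			/* write-through: draining the queue is enough */
		ordered = QUEUE_ORDERED_DRAIN;

	blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush);
}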

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
Ric Wheeler wrote:
>> Don't those thingies usually have NV cache or backed by battery such
>> that ORDERED_DRAIN is enough?
> 
> All of the high end arrays have non-volatile cache (read, on power loss,
> it is a promise that it will get all of your data out to permanent
> storage). You don't need to ask this kind of array to drain the cache.
> In fact, it might just ignore you if you send it that kind of request ;-)
> 
> The size of the NV cache can run from a few gigabytes up to hundreds of
> gigabytes, so you really don't want to invoke cache flushes here if you
> can avoid it.
> 
> For this class of device, you can get the required in order completion
> and data integrity semantics as long as we send the IO's to the device
> in the correct order.

Thanks for clarification.

>> The problem is that the interface between the host and a storage device
>> (ATA or SCSI) is not built to communicate that kind of information
>> (grouped flush, relaxed ordering...).  I think battery backed
>> ORDERED_DRAIN combined with fine-grained host queue flush would be
>> pretty good.  It doesn't require some fancy new interface which isn't
>> gonna be used widely anyway and can achieve most of performance gain if
>> the storage plays it smart.
> 
> I am not really sure that you need this ORDERED_DRAIN for big arrays...

ORDERED_DRAIN is there to properly order requests in the host request
queue (elevator/iosched).  We can make it finer grained, but we do need
to impose some ordering restrictions.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-05 Thread Tejun Heo
Hello, Jens.

Jens Axboe wrote:
> On Mon, May 28 2007, Neil Brown wrote:
>> I think the implementation priorities here are:
>>
>> 1/ implement a zero-length BIO_RW_BARRIER option.
>> 2/ Use it (or otherwise) to make all dm and md modules handle
>>    barriers (and loop?).
>> 3/ Devise and implement appropriate fall-backs with-in the block layer
>>    so that -EOPNOTSUP is never returned.
>> 4/ Remove unneeded cruft from filesystems (and elsewhere).
> 
> This is the start of 1/ above. It's very lightly tested, it's verified
> to DTRT here at least and not crash :-)
> 
> It gets rid of the ->issue_flush_fn() queue callback, all the driver
> knowledge resides in ->prepare_flush_fn() anyways. blkdev_issue_flush()
> then just reuses the empty-bio approach to queue an empty barrier, this
> should work equally well for stacked and non-stacked devices.
> 
> While this patch isn't complete yet, it's clearly the right direction to
> go.

Finally took a brief look. :-)  I think the sequencing for a zero-length
barrier can be done better by pre-setting QUEUE_ORDSEQ_BAR in
start_ordered() rather than short-circuiting the request after it's
issued.  What do you think?
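
To make that concrete, a rough sketch of what I mean (an assumption, not
a tested patch): pre-mark the BAR step when the barrier carries no data,
so the sequence goes straight from the pre-flush to the post-flush.

/*
 * Sketch only: a helper start_ordered() could call while setting up
 * q->ordseq for an empty barrier.
 */
static inline void blk_preset_empty_bar(request_queue_t *q)
{
	if (!bio_sectors(q->orig_bar_rq->bio))
		q->ordseq |= QUEUE_ORDSEQ_BAR;	/* nothing to write for the barrier itself */
}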

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-04 Thread Tejun Heo
Jens Axboe wrote:
> On Sat, Jun 02 2007, Tejun Heo wrote:
>> Hello,
>>
>> Jens Axboe wrote:
>>>> Would that be very different from issuing barrier and not waiting for
>>>> its completion?  For ATA and SCSI, we'll have to flush write back cache
>>>> anyway, so I don't see how we can get performance advantage by
>>>> implementing separate WRITE_ORDERED.  I think zero-length barrier
>>>> (haven't looked at the code yet, still recovering from jet lag :-) can
>>>> serve as genuine barrier without the extra write tho.
>>> As always, it depends :-)
>>>
>>> If you are doing pure flush barriers, then there's no difference. Unless
>>> you only guarantee ordering wrt previously submitted requests, in which
>>> case you can eliminate the post flush.
>>>
>>> If you are doing ordered tags, then just setting the ordered bit is
>>> enough. That is different from the barrier in that we don't need a flush
>>> or FUA bit set.
>> Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
>> flush to separate requests before and after it (haven't looked at the
>> code yet, will soon).  Can you enlighten me?
> 
> Yeah, that's what the zero-length barrier implementation I posted does.
> Not sure if you have a question beyond that, if so fire away :-)

I thought you were talking about adding BIO_RW_ORDERED instead of
exposing zero length BIO_RW_BARRIER.  Sorry about the confusion.  :-)

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Tejun Heo
Hello,

Jens Axboe wrote:
>> Would that be very different from issuing barrier and not waiting for
>> its completion?  For ATA and SCSI, we'll have to flush write back cache
>> anyway, so I don't see how we can get performance advantage by
>> implementing separate WRITE_ORDERED.  I think zero-length barrier
>> (haven't looked at the code yet, still recovering from jet lag :-) can
>> serve as genuine barrier without the extra write tho.
> 
> As always, it depends :-)
> 
> If you are doing pure flush barriers, then there's no difference. Unless
> you only guarantee ordering wrt previously submitted requests, in which
> case you can eliminate the post flush.
> 
> If you are doing ordered tags, then just setting the ordered bit is
> enough. That is different from the barrier in that we don't need a flush
> or FUA bit set.

Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
flush to separate requests before and after it (haven't looked at the
code yet, will soon).  Can you enlighten me?

Thanks.

-- 
tejun



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
> but when you consider the self-contained disk arrays it's an entirely
> different story. you can easily have a few gig of cache and a complete
> OS pretending to be a single drive as far as you are concerned.
> 
> and the price of such devices is plummeting (in large part thanks to
> Linux moving into this space), you can now readily buy a 10TB array for
> $10k that looks like a single drive.

Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway and can achieve most of performance gain if
the storage plays it smart.

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Jens Axboe wrote:
> On Thu, May 31 2007, David Chinner wrote:
>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>>> On Thu, May 31 2007, David Chinner wrote:
>>>>>> IOWs, there are two parts to the problem:
>>>>>>
>>>>>>    1 - guaranteeing I/O ordering
>>>>>>    2 - guaranteeing blocks are on persistent storage.
>>>>>>
>>>>>> Right now, a single barrier I/O is used to provide both of these
>>>>>> guarantees. In most cases, all we really need to provide is 1); the
>>>>>> need for 2) is a much rarer condition but still needs to be
>>>>>> provided.
>>>>>>
>>>>> if I am understanding it correctly, the big win for barriers is that you
>>>>> do NOT have to stop and wait until the data is on persistent media before
>>>>> you can continue.
>>>> Yes, if we define a barrier to only guarantee 1), then yes this
>>>> would be a big win (esp. for XFS). But that requires all filesystems
>>>> to handle sync writes differently, and sync_blockdev() needs to
>>>> call blkdev_issue_flush() as well
>>>>
>>>> So, what do we do here? Do we define a barrier I/O to only provide
>>>> ordering, or do we define it to also provide persistent storage
>>>> writeback? Whatever we decide, it needs to be documented
>>> The block layer already has a notion of the two types of barriers, with
>>> a very small amount of tweaking we could expose that. There's absolutely
>>> zero reason we can't easily support both types of barriers.
>> That sounds like a good idea - we can leave the existing
>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
>> behaviour that only guarantees ordering. The filesystem can then
>> choose which to use where appropriate
> 
> Precisely. The current definition of barriers are what Chris and I came
> up with many years ago, when solving the problem for reiserfs
> originally. It is by no means the only feasible approach.
> 
> I'll add a WRITE_ORDERED command to the #barrier branch, it already
> contains the empty-bio barrier support I posted yesterday (well a
> slightly modified and cleaned up version).

Would that be very different from issuing barrier and not waiting for
its completion?  For ATA and SCSI, we'll have to flush write back cache
anyway, so I don't see how we can get performance advantage by
implementing separate WRITE_ORDERED.  I think zero-length barrier
(haven't looked at the code yet, still recovering from jet lag :-) can
serve as genuine barrier without the extra write tho.

Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Stefan Bader wrote:
> 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
>> Stefan Bader wrote:
>>
>>> Since drive a supports barrier request we don't get -EOPNOTSUPP but
>>> the request with block y might get written before block x since the
>>> disks are independent. I guess the chances of this are quite low since
>>> at some point a barrier request will also hit drive b but for the time
>>> being it might be better to indicate -EOPNOTSUPP right from
>>> device-mapper.
>>
>> The device mapper needs to ensure that ALL underlying devices get a
>> barrier request when one comes down from above, even if it has to
>> construct zero length barriers to send to most of them.
>>
> 
> And somehow also make sure all of the barriers have been processed
> before returning the barrier that came in. Plus it would have to queue
> all mapping requests until the barrier is done (if strictly acting
> according to barrier.txt).
> 
> But I am wondering a bit whether the requirements to barriers are
> really that tight as described in Tejun's document (barrier request is
> only started if everything before is safe, the barrier itself isn't
> returned until it is safe, too, and all requests after the barrier
> aren't started before the barrier is done). Is it really necessary to
> defer any further requests until the barrier has been written to save
> storage? Or would it be sufficient to guarantee that, if a barrier
> request returns, everything up to (including the barrier) is on safe
> storage?

Well, what's described in barrier.txt is the currently implemented
semantics and what filesystems expect, so we can't change it underneath
them, but we can definitely introduce new, more relaxed variants.  One
thing we should bear in mind is that hard disks don't have humongous
caches or a very smart controller / instruction set.  No matter how
relaxed an interface the block layer provides, in the end it just has to
issue a whole-sale FLUSH CACHE on the device to guarantee data ordering
on the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue, which can be deeper and more intelligent than the
device queue.

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-28 Thread Tejun Heo
Hello,

Neil Brown wrote:
> 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP.
> 
>    This is certainly a very attractive position - it makes the interface
>    cleaner and makes life easier for filesystems and other clients of
>    the block interface.
>    Currently filesystems handle -EOPNOTSUP by
>     a/ resubmitting the request without the BARRIER (after waiting for
>        earlier requests to complete) and
>     b/ possibly printing an error message to the kernel logs.
> 
>    The block layer can do both of these just as easily and it does make
>    sense to do it there.

Yeah, I think doing all of the above in the block layer is the cleanest
way to solve this.  If write back cache & flush doesn't work, the barrier
is bound to fail, but the block layer can still write the barrier block
as requested (without actually barriering), whine about it to the user,
and tell the FS that the barrier failed but the write itself went
through, so that the FS can go on without caring about it unless it wants to.
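
As a rough sketch of that fallback (an assumption about how it could
look, not an existing patch; the completion-handler signature and the
resubmission details are simplified):

/*
 * Sketch only: when a barrier bio fails with -EOPNOTSUPP, warn once,
 * strip the barrier bit and resubmit the same data as a plain write so
 * the filesystem never sees the error.  In real code the bio would have
 * to be reinitialised before being resubmitted.
 */
static void barrier_fallback(struct bio *bio, int error)
{
	static int warned;
	char b[BDEVNAME_SIZE];

	if (error != -EOPNOTSUPP)
		return;				/* not the case handled here */

	if (!warned++)
		printk(KERN_WARNING "%s: barriers not supported, "
		       "writing without ordering\n", bdevname(bio->bi_bdev, b));

	bio->bi_rw &= ~(1 << BIO_RW_BARRIER);	/* plain write from now on */
	submit_bio(bio->bi_rw, bio);
}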

>    md/dm modules could keep count of requests as has been suggested
>    (though that would be a fairly big change for raid0 as it currently
>    doesn't know when a request completes - bi_endio goes directly to the
>    filesystem).
>    However I think the idea of a zero-length BIO_RW_BARRIER would be a
>    good option.  raid0 could send one of these down each device, and
>    when they all return, the barrier request can be sent to its target
>    device(s).

Yeap.

> 2/ Maybe barriers provide stronger semantics than are required.
> 
>    All write requests are synchronised around a barrier write.  This is
>    often more than is required and apparently can cause a measurable
>    slowdown.
> 
>    Also the FUA for the actual commit write might not be needed.  It is
>    important for consistency that the preceding writes are in safe
>    storage before the commit write, but it is not so important that the
>    commit write is immediately safe on storage.  That isn't needed until
>    a 'sync' or 'fsync' or similar.
> 
>    One possible alternative is:
>      - writes can overtake barriers, but barrier cannot overtake writes.
>      - flush before the barrier, not after.

I think we can give this property to zero length barriers.

>    This is considerably weaker, and hence cheaper. But I think it is
>    enough for all filesystems (providing it is still an option to call
>    blkdev_issue_flush on 'fsync').
> 
>    Another alternative would be to tag each bio as being in a
>    particular barrier-group.  Then bio's in different groups could
>    overtake each other in either direction, but a BARRIER request must
>    be totally ordered w.r.t. other requests in the barrier group.
>    This would require an extra bio field, and would give the filesystem
>    more appearance of control.  I'm not yet sure how much it would
>    really help...
>    It would allow us to set FUA on all bios with a non-zero
>    barrier-group.  That would mean we don't have to flush the entire
>    cache, just those blocks that are critical but I'm still not sure
>    it's a good idea.

Barrier code as it currently stands deals with two colors so there can
be only one outstanding barrier at any given moment.  Expanding it to deal
with multiple colors and then to multiple simultaneous groups will take
some work but is definitely possible.  If FS people can make good use of
it, I think it would be worthwhile.

>    Of course, these weaker rules would only apply inside the elevator.
>    Once the request goes to the device we need to work with what the
>    device provides, which probably means total-ordering around the
>    barrier.

Yeah, on the device side, the best we can do most of the time is a full
flush, but as long as the request queue depth is much deeper than the
controller/device one, having multiple barrier groups can be helpful.
We need more input from FS people, I think.

> 3/ Do we need explicit control of the 'ordered' mode?
> 
>    Consider a SCSI device that has NV RAM cache.  mode_sense reports
>    that write-back is enabled, so _FUA or _FLUSH will be used.
>    But as it is *NV* ram, QUEUE_ORDERED_DRAIN is really the best mode.
>    But it seems there is no way to query this information.
>    Using _FLUSH causes the NVRAM to be flushed to media which is a
>    terrible performance problem.

If the NV RAM can be reliably detected using one of the inquiry pages,
the sd driver can switch to DRAIN automatically.
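
A small sketch of what that could look like (the nv_cache flag and its
detection are hypothetical; nothing in the current sd driver parses such
a page):

/*
 * Sketch only: if a hypothetical sdkp->nv_cache flag were filled in from
 * an inquiry/VPD page, a write-back cache known to be non-volatile could
 * be treated like write-through and get plain DRAIN ordering, i.e. no
 * cache flushes at all.
 */
static void sd_apply_nv_cache(struct scsi_disk *sdkp)
{
	if (sdkp->WCE && sdkp->nv_cache)
		blk_queue_ordered(sdkp->disk->queue, QUEUE_ORDERED_DRAIN, NULL);
}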

>    Setting SYNC_NV doesn't work on the particular device in question.
>    We currently tell customers to mount with -o nobarriers, but that
>    really feels like the wrong solution.  We should be telling the scsi
>    device "don't flush".
>    An advantage of 'nobarriers' is it can go in /etc/fstab.  Where
>    would you record that a SCSI drive should be set to
>    QUEUE_ORDERED_DRAIN ??

How about exporting the ordered mode as a sysfs attribute and configuring
it using a udev rule?  It's a device property after all.

Thanks.

-- 
tejun

Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-26 Thread Tejun Heo
Hello, Neil Brown.

Please cc me on blkdev barriers and, if you haven't yet, reading
Documentation/block/barrier.txt can be helpful too.

Neil Brown wrote:
[--snip--]
> 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
>    there is it is non-volatile.  Once a write completes it is
>    completely safe.  Such a device does not require barriers
>    or ->issue_flush_fn, and can respond to them either by a
>    no-op or with -EOPNOTSUPP (the former is preferred).
> 
> 2/ FLUSHABLE.
>    A FLUSHABLE device may have a volatile write-behind cache.
>    This cache can be flushed with a call to blkdev_issue_flush.
>    It may not support barrier requests.
> 
> 3/ BARRIER.
>    A BARRIER device supports both blkdev_issue_flush and
>    BIO_RW_BARRIER.  Either may be used to synchronise any
>    write-behind cache to non-volatile storage (media).
> 
> Handling of SAFE and FLUSHABLE devices is essentially the same and can
> work on a BARRIER device.  The BARRIER device has the option of more
> efficient handling.

Actually, all three of the above are handled by the blkdev flush code.

> How does a filesystem use this?
> ===
> 
[--snip--]
> 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
>    block.
>    (This is more efficient on BARRIER).

This really should be enough.

> HOW DO MD or DM USE THIS
> 
> 
> 1/ striping devices.
>    This includes md/raid0 md/linear dm-linear dm-stripe and probably
>    others.
> 
>    These devices can easily support blkdev_issue_flush by simply
>    calling blkdev_issue_flush on all component devices.
> 
>    These devices would find it very hard to support BIO_RW_BARRIER.
>    Doing this would require keeping track of all in-flight requests
>    (which some, possibly all, of the above don't) and then:
>      When a BIO_RW_BARRIER request arrives:
>         wait for all pending writes to complete
>         call blkdev_issue_flush on all devices
>         issue the barrier write to the target device(s)
>            as BIO_RW_BARRIER,
>         if that is -EOPNOTSUP, re-issue, wait, flush.

Hmm... What do you think about introducing zero-length BIO_RW_BARRIER
for this case?
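
To make the idea concrete, a rough sketch (helper name and callback
wiring are hypothetical) of how a striping driver could push an empty
barrier down to one component device:

/*
 * Sketch only: submit a zero-length barrier bio to one component.  A
 * striping driver would send one of these to every component, wait for
 * all of them to complete, and only then issue the real barrier write.
 */
static void stripe_send_empty_barrier(struct block_device *bdev,
				      bio_end_io_t *done, void *private)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 0);	/* no data pages */

	bio->bi_bdev = bdev;
	bio->bi_end_io = done;
	bio->bi_private = private;
	submit_bio(WRITE_BARRIER, bio);			/* ordering + flush only */
}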

> 2/ Mirror devices.  This includes md/raid1 and dm-raid1.
> 
>    These devices can trivially implement blkdev_issue_flush much like
>    the striping devices, and can support BIO_RW_BARRIER to some
>    extent.
>    md/raid1 currently tries.  I'm not sure about dm-raid1.
> 
>    md/raid1 determines if the underlying devices can handle
>    BIO_RW_BARRIER.  If any cannot, it rejects such requests (EOPNOTSUP)
>    itself.
>    If all underlying devices do appear to support barriers, md/raid1
>    will pass a barrier-write down to all devices.
>    The difficulty comes if it fails on one device, but not all
>    devices.  In this case it is not clear what to do.  Failing the
>    request is a lie, because some data has been written (possibly too
>    early).  Succeeding the request (after re-submitting the failed
>    requests) is also a lie as the barrier wasn't really honoured.
>    md/raid1 currently takes the latter approach, but will only do it
>    once - after that it fails all barrier requests.
> 
>    Hopefully this is unlikely to happen.  What device would work
>    correctly with barriers once, and then not the next time?
>    The answer is md/raid1.  If you remove a failed device and add a
>    new device that doesn't support barriers, md/raid1 will notice and
>    stop supporting barriers.
>    If md/raid1 can change from supporting barrier to not, then maybe
>    some other device could too?
> 
>    I'm not sure what to do about this - maybe just ignore it...

That sounds good.  :-)

> 3/ Other modules
> 
>    Other md and dm modules (raid5, mpath, crypt) do not add anything
>    interesting to the above.  Either handling BIO_RW_BARRIER is
>    trivial, or extremely difficult.
> 
> HOW DO LOW LEVEL DEVICES HANDLE THIS
> 
> 
> This is part of the picture that I haven't explored greatly.  My
> feeling is that most if not all devices support blkdev_issue_flush
> properly, and support barriers reasonably well providing that the
> hardware does.
> There is an exception I recently found though.
> For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to
> the controller can be tagged as barriers), SCSI will use the
> SYNCHRONIZE_CACHE command to flush the cache after the barrier
> request (a bit like the filesystem calling blkdev_issue_flush, but at
> a lower level). However it does this without setting the SYNC_NV bit.
> This means that a device with a non-volatile cache will be required --
> needlessly -- to flush that cache to media.

Yeah, it probably needs updating but some devices might react badly too.

> So: some questions to help encourage response:
> 
>  - Is the above substantially correct?  Totally correct?
>  - Should the various filesystems be fixed as suggested above?  Is
>    someone willing to do