[RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Neil Brown

This mail is about an issue that has been of concern to me for quite a
while and I think it is (well past) time to air it more widely and try
to come to a resolution.

This issue is how write barriers (the block-device kind, not the
memory-barrier kind) should be handled by the various layers.

The following is my understanding, which could well be wrong in
various specifics.  Corrections and other comments are more than
welcome.



What are barriers?
==================
Barriers (as generated by requests with BIO_RW_BARRIER) are intended
to ensure that the data in the barrier request is not visible until
all writes submitted earlier are safe on the media, and that the data
is safe on the media before any subsequently submitted requests
are visible on the device.

This is achieved by tagging requests in the elevator (or any other
request queue) so that no re-ordering is performed around a
BIO_RW_BARRIER request, and by sending appropriate commands to the
device so that any write-behind caching is defeated by the barrier
request.

Alongside BIO_RW_BARRIER there is blkdev_issue_flush, which calls
q->issue_flush_fn.  This can be used to achieve similar effects.

There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP.

Conversely, blkdev_issue_flush must be supported on any device that
uses write-behind caching (if it cannot be supported, then
write-behind caching should be turned off, at least by default).

We can think of there being three types of devices:
 
1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
  there is it is non-volatile.  Once a write completes it is
  completely safe.  Such a device does not require barriers
  or ->issue_flush_fn, and can respond to them either by a
  no-op or with -EOPNOTSUPP (the former is preferred).

2/ FLUSHABLE.
  A FLUSHABLE device may have a volatile write-behind cache.
  This cache can be flushed with a call to blkdev_issue_flush.
  It may not support barrier requests.

3/ BARRIER.
  A BARRIER device supports both blkdev_issue_flush and
  BIO_RW_BARRIER.  Either may be used to synchronise any
  write-behind cache to non-volatile storage (media).

Handling of SAFE and FLUSHABLE devices is essentially the same and can
work on a BARRIER device.  The BARRIER device has the option of more
efficient handling.
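To make the classification concrete, here is a rough sketch (kernel-flavoured
C, not working kernel code) of how a layer might sort a device into one of the
three classes.  The probe helpers probe_barrier_write() and probe_cache_flush()
are hypothetical; as discussed further down, the only real way to classify a
device is to try an operation and see whether it fails with -EOPNOTSUPP.

enum bdev_class { BDEV_SAFE, BDEV_FLUSHABLE, BDEV_BARRIER };

/* Hypothetical probes: each submits a test request and reports
 * 0 on success or -EOPNOTSUPP if the device rejects it. */
static enum bdev_class classify_bdev(struct block_device *bdev)
{
        if (probe_barrier_write(bdev) == 0)
                return BDEV_BARRIER;            /* BIO_RW_BARRIER accepted */
        if (probe_cache_flush(bdev) == 0)
                return BDEV_FLUSHABLE;          /* blkdev_issue_flush works */
        /* Neither supported: per the rules above the device must not be
         * doing volatile write-behind caching, so treat it as SAFE. */
        return BDEV_SAFE;
}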

How does a filesystem use this?
===============================

A filesystem will often have a concept of a 'commit' block which makes
an assertion about the correctness of other blocks in the filesystem.
In the most gross sense, this could be the writing of the superblock
of an ext2 filesystem, with the dirty bit clear.  This write commits
all other writes to the filesystem that precede it.

More subtle/useful is the commit block in a journal as with ext3 and
others.  This write commits some number of preceding writes in the
journal or elsewhere.

The filesystem will want to ensure that all preceding writes are safe
before writing the barrier block.  There are two ways to achieve this.

1/  Issue all 'preceding writes', wait for them to complete (bi_endio
   called), call blkdev_issue_flush, issue the commit write, wait
   for it to complete, call blkdev_issue_flush a second time.
   (This is needed for FLUSHABLE)

2/ Set the BIO_RW_BARRIER bit in the write request for the commit
block.
   (This is more efficient on BARRIER).

The second, while much easier, can fail.  So a filesystem should be
prepared to deal with that failure by falling back to the first
option.
Thus the general sequence might be:

  a/ issue all preceding writes.
  b/ issue the commit write with BIO_RW_BARRIER
  c/ wait for the commit to complete.
 If it was successful - done.
 If it failed other than with EOPNOTSUPP, abort
 else continue
  d/ wait for all 'preceding writes' to complete
  e/ call blkdev_issue_flush
  f/ issue commit write without BIO_RW_BARRIER
  g/ wait for commit write to complete
   if it failed, abort
  h/ call blkdev_issue
  DONE

Steps b and c can be left out if it is known that the device does not
support barriers.  The only way to discover this is to try and see if
it fails.
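As a sketch only, the a/..h/ sequence above could look like the following.
submit_preceding_writes(), wait_preceding_writes(), submit_commit() and
wait_commit() are made-up helpers standing in for the filesystem's own
journalling code; the blkdev_issue_flush(bdev, &error_sector) signature is
assumed to be the one in current kernels.

static int commit_with_fallback(struct block_device *bdev)
{
        int err;

        submit_preceding_writes();                      /* a */
        submit_commit(bdev, 1 /* BIO_RW_BARRIER */);    /* b */
        err = wait_commit();                            /* c */
        if (err == 0)
                return 0;                               /* barrier worked - done */
        if (err != -EOPNOTSUPP)
                return err;                             /* real failure - abort */

        /* Fall back to the FLUSHABLE sequence. */
        wait_preceding_writes();                        /* d */
        blkdev_issue_flush(bdev, NULL);                 /* e */
        submit_commit(bdev, 0 /* no barrier */);        /* f */
        err = wait_commit();                            /* g */
        if (err)
                return err;                             /* abort */
        return blkdev_issue_flush(bdev, NULL);          /* h */
}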

I don't think any filesystem follows all these steps.

 ext3 has the right structure, but it doesn't include steps e and h.
 reiserfs is similar.  It does have a call to blkdev_issue_flush, but 
  that is only on the fsync path, so it isn't really protecting
  general journal commits.
 XFS - I'm less sure.  I think it does 'a' then 'd', then 'b' or 'f'
   depending on whether it thinks the device handles barriers,
   and finally 'g'.

 I haven't looked at other filesystems.

So for devices that support BIO_RW_BARRIER, and for devices that don't
need any flush, they work OK; but for devices that need flushing but
don't support BIO_RW_BARRIER, none of them work.  This should be easy
to fix.



Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-25 Thread David Chinner
On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
The difference between ext3 and XFS is that ext3 will remount to
   read-only on the first write error but XFS won't; XFS only fails
   the current operation, IMHO. The method of ext3 isn't perfect, but
   in practice, it's working well.
  
  XFS will shut down the filesystem if metadata corruption would occur
  due to a failed write. We don't immediately fail the filesystem on
  data write errors because on large systems you can get *transient*
  I/O errors (e.g. FC path failover) and so retrying failed data
  writes is useful for preventing unnecessary shutdowns of the
  filesystem.
  
  Different design criteria, different solutions...
 
 I think his point was that going into a read only mode causes a
 less catastrophic situation (ie. a web server can still serve
 pages).

Sure - but once you've detected one corruption or had metadata
I/O errors, can you trust the rest of the filesystem?

 I think that is a valid point: rather than shutting down
 the file system completely, an automatic switch to whatever mode causes
 the least disruption of service is always desired.

I consider the possibility of serving out bad data (i.e. after
a remount to read-only) to be the worst possible disruption of
service that can happen ;)

 Maybe the automatic failure mode could be something that is 
 configurable via the mount options.

If only it were that simple. Have you looked to see how many
hooks there are in XFS to shutdown without causing further
damage?

% grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l
116

Changing the way we handle shutdowns would take a lot of time,
effort and testing. When can I expect a patch? ;)

 I personally have found the XFS file system to be great for
 my needs (except issues with NFS interaction, where the bug report
 never got answered), but that doesn't mean it can not be improved.

Got a pointer?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread David Chinner
On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote:
 We can think of there being three types of devices:
  
 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
   there is it is non-volatile.  Once a write completes it is 
   completely safe.  Such a device does not require barriers
   or ->issue_flush_fn, and can respond to them either by a
 no-op or with -EOPNOTSUPP (the former is preferred).
 
 2/ FLUSHABLE.
   A FLUSHABLE device may have a volatile write-behind cache.
   This cache can be flushed with a call to blkdev_issue_flush.
 It may not support barrier requests.

So returns -EOPNOTSUPP to any barrier request?

 3/ BARRIER.
 A BARRIER device supports both blkdev_issue_flush and
   BIO_RW_BARRIER.  Either may be used to synchronise any
 write-behind cache to non-volatile storage (media).
 
 Handling of SAFE and FLUSHABLE devices is essentially the same and can
 work on a BARRIER device.  The BARRIER device has the option of more
 efficient handling.
 
 How does a filesystem use this?
 ===

 
 The filesystem will want to ensure that all preceding writes are safe
 before writing the barrier block.  There are two ways to achieve this.

Three, actually.

 1/  Issue all 'preceding writes', wait for them to complete (bi_endio
called), call blkdev_issue_flush, issue the commit write, wait
for it to complete, call blkdev_issue_flush a second time.
(This is needed for FLUSHABLE)

*nod*

 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
 block.
(This is more efficient on BARRIER).

*nod*

3/ Use a SAFE device.

 The second, while much easier, can fail.

So we do a test I/O to see if the device supports them before
enabling that mode.  But, as we've recently discovered, this is not
sufficient to detect *correctly functioning* barrier support.

 So a filesystem should be
 prepared to deal with that failure by falling back to the first
 option.

I don't buy that argument.

 Thus the general sequence might be:
 
   a/ issue all preceding writes.
   b/ issue the commit write with BIO_RW_BARRIER

At this point, the filesystem has done everything it needs to ensure
that the block layer has been informed of the I/O ordering
requirements. Why should the filesystem now have to detect block
layer breakage, and then use a different block layer API to issue
the same I/O under the same constraints?

   c/ wait for the commit to complete.
  If it was successful - done.
  If it failed other than with EOPNOTSUPP, abort
  else continue
   d/ wait for all 'preceding writes' to complete
   e/ call blkdev_issue_flush
   f/ issue commit write without BIO_RW_BARRIER
   g/ wait for commit write to complete
if it failed, abort
   h/ call blkdev_issue
_flush?

   DONE
 
 steps b and c can be left out if it is known that the device does not
 support barriers.  The only way to discover this is to try and see if it
 fails.

That's a very linear, single-threaded way of looking at it... ;)

 I don't think any filesystem follows all these steps.
 
  ext3 has the right structure, but it doesn't include steps e and h.
  reiserfs is similar.  It does have a call to blkdev_issue_flush, but 
   that is only on the fsync path, so it isn't really protecting
   general journal commits.
  XFS - I'm less sure.  I think it does 'a' then 'd', then 'b' or 'f'
 depending on whether it thinks the device handles barriers,
and finally 'g'.

That's right, except for the g (or c) bit - commit writes are
async and nothing waits for them - the I/O completion wakes anything
waiting on its completion.

(yes, all XFS barrier I/Os are issued async, which is why having to
handle an -EOPNOTSUPP error is a real pain. The fix I currently
have is to reissue the I/O from the completion handler, which is
ugly, ugly, ugly.)
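For illustration, that workaround might look roughly like the sketch below.
This is not the actual XFS code: the end_io prototype is the three-argument
one of kernels in this era, and the re-initialisation a real resubmit would
need is glossed over.

static int commit_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
        if (bio->bi_size)
                return 1;               /* partial completion, wait for the rest */

        if (error == -EOPNOTSUPP && bio_barrier(bio)) {
                /* Device rejected the barrier: strip the flag and reissue
                 * from completion context ("ugly, ugly, ugly"). */
                clear_bit(BIO_RW_BARRIER, &bio->bi_rw);
                submit_bio(WRITE, bio);
                return 0;
        }

        /* ... normal commit-completion processing ... */
        return 0;
}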

 So for devices that support BIO_RW_BARRIER, and for devices that don't
 need any flush, they work OK; but for devices that need flushing but
 don't support BIO_RW_BARRIER, none of them work.  This should be easy
 to fix.

Right - XFS as it stands was designed to work on SAFE devices, and
we've modified it to work on BARRIER devices. We don't support
FLUSHABLE devices at all.

But if the filesystem supports BARRIER devices, I don't see any
reason why a filesystem needs to be modified to support FLUSHABLE
devices - the key point being that by the time the filesystem
has issued the commit write it has already waited for all its
dependent I/O, and so all the block device needs to do is
issue flushes either side of the commit write.

 HOW DO MD or DM USE THIS
 
 
 1/ striping devices.
  This includes md/raid0 md/linear dm-linear dm-stripe and probably
  others. 
 
These devices can easily support blkdev_issue_flush by simply
calling blkdev_issue_flush on all component devices.

Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Jens Axboe
On Fri, May 25 2007, David Chinner wrote:
  The second, while much easier, can fail.
 
 So we do a test I/O to see if the device supports them before
 enabling that mode.  But, as we've recently discovered, this is not
 sufficient to detect *correctly functioning* barrier support.

Right, those are two different things. But paranoia aside, will this
ever be a real life problem? I've always been of the opinion to just
nicely ignore them. We can't easily detect it and tell the user his hw
is crap.

  So a filesystem should be
  prepared to deal with that failure by falling back to the first
  option.
 
 I don't buy that argument.

The problem with Neil's reasoning there is that blkdev_issue_flush() may
use the same method as the barrier to ensure data is on platter.

A barrier write will include a flush, but it may also use the FUA bit to
ensure data is on platter. So the only situation where a fallback from a
barrier to flush would be valid, is if the device lied and told you it
could do FUA but it could not and that is the reason why the barrier
write failed. If that is the case, the block layer should stop using FUA
and fall back to flush-write-flush. And if it does that, then there's
never a valid reason to switch from using barrier writes to
blkdev_issue_flush() since both methods would either both work or both
fail.
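In block-layer terms that downgrade would amount to something like the sketch
below.  QUEUE_ORDERED_DRAIN_FUA / QUEUE_ORDERED_DRAIN_FLUSH and
blk_queue_ordered() are the existing ordered-mode interfaces as I understand
them; the downgrade hook itself is hypothetical and does not exist today.

/* Hypothetical hook, called when a FUA barrier write is rejected. */
static void downgrade_fua_to_flush(struct request_queue *q)
{
        if (q->ordered == QUEUE_ORDERED_DRAIN_FUA)
                /* Stop using FUA; emulate it with an explicit post-flush. */
                blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
                                  q->prepare_flush_fn);
}

With that in place, a barrier write and blkdev_issue_flush() really would
stand or fall together, which is the point being made here.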

  Thus the general sequence might be:
  
a/ issue all preceding writes.
b/ issue the commit write with BIO_RW_BARRIER
 
 At this point, the filesystem has done everything it needs to ensure
 that the block layer has been informed of the I/O ordering
 requirements. Why should the filesystem now have to detect block
 layer breakage, and then use a different block layer API to issue
 the same I/O under the same constraints?

It's not block layer breakage, it's a device issue.

  2/ Mirror devices.  This includes md/raid1 and dm-raid1.
 ..
 Hopefully this is unlikely to happen.  What device would work
 correctly with barriers once, and then not the next time?
 The answer is md/raid1.  If you remove a failed device and add a
 new device that doesn't support barriers, md/raid1 will notice and
 stop supporting barriers.
 
 In case you hadn't already guessed, I don't like this behaviour at
 all.  It makes async I/O completion of barrier I/O an ugly, messy
 business, and every place you do sync I/O completion you need to put
 special error handling.

That's unfortunately very true. It's an artifact of the sometimes
problematic device capability discovery.

 If this happens to md/raid1, then why can't it simply do a
 blkdev_issue_flush, write, blkdev_issue_flush sequence to the device
 that doesn't support barriers and then the md device *never changes
 behaviour*. Next time the filesystem is mounted, it will turn off
 barriers because they won't be supported.

Because if it doesn't support barriers, blkdev_issue_flush() wouldn't
work either. At least that is the case for SATA/IDE, SCSI is somewhat
different (and has somewhat other issues).

   - Should the various filesystems be fixed as suggested above?  Is 
  someone willing to do that?
 
 Alternate viewpoint - should the block layer be fixed so that the
 filesystems only need to use one barrier API that provides static
 behaviour for the life of the mount?

blkdev_issue_flush() isn't part of the barrier API, and using it as a
work-around for a device that has barrier issues is wrong for the
reasons listed above.

The DRAIN_FUA -> DRAIN_FLUSH automatic downgrade I mentioned above
should be added, in which case blkdev_issue_flush() would never be
needed (unless you want to do a data-less barrier, and we should
probably add that specific functionality with an empty bio instead of
providing an alternate way of doing that).
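A data-less barrier built from an empty bio, as suggested above, might look
something like this - purely a sketch of functionality that does not exist
yet at the time of writing, with empty_barrier_end_io() made up for
illustration:

static void issue_empty_barrier(struct block_device *bdev)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 0);     /* no data pages */

        bio->bi_bdev = bdev;
        bio->bi_end_io = empty_barrier_end_io;          /* hypothetical */
        submit_bio(WRITE_BARRIER, bio);                 /* barrier, no payload */
}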

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Stefan Bader

2007/5/25, Neil Brown [EMAIL PROTECTED]:


HOW DO MD or DM USE THIS


1/ striping devices.
 This includes md/raid0 md/linear dm-linear dm-stripe and probably
 others.

   These devices can easily support blkdev_issue_flush by simply
   calling blkdev_issue_flush on all component devices.



This ensures that all of the previous requests have been processed, but
does this guarantee they were successful? This might be too paranoid,
but if I understood the concept correctly the success of a barrier
request should indicate success of all previous requests between this
barrier and the last one.


   These devices would find it very hard to support BIO_RW_BARRIER.
   Doing this would require keeping track of all in-flight requests
   (which some, possibly all, of the above don't) and then:
 When a BIO_RW_BARRIER request arrives:
wait for all pending writes to complete
call blkdev_issue_flush on all devices
issue the barrier write to the target device(s)
   as BIO_RW_BARRIER,
if that is -EOPNOTSUPP, re-issue, wait, flush.



I guess just keeping a count of submitted requests and errors since the
last barrier might be enough. As long as all of the underlying devices
at least support a flush, the dm device could pretend to support
BIO_RW_BARRIER.
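Something like the following sketch, say.  stripe_set, wait_for_inflight(),
errors_since_last_barrier() and submit_and_wait() are all invented names;
only blkdev_issue_flush() is real (signature assumed from current kernels).

static int emulate_barrier(struct stripe_set *ss, struct bio *commit_bio)
{
        int i, err;

        wait_for_inflight(ss);                  /* drain everything since the last barrier */
        if (errors_since_last_barrier(ss))
                return -EIO;                    /* can't claim the barrier succeeded */

        for (i = 0; i < ss->nr_devs; i++) {     /* flush each component's write cache */
                err = blkdev_issue_flush(ss->dev[i], NULL);
                if (err)
                        return err;
        }

        err = submit_and_wait(commit_bio);      /* the commit block itself */
        if (err)
                return err;

        for (i = 0; i < ss->nr_devs; i++)       /* and make the commit itself safe */
                blkdev_issue_flush(ss->dev[i], NULL);
        return 0;
}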



dm-linear and dm-stripe simply pass the BIO_RW_BARRIER flag down,
 which means data may not be flushed correctly:  the commit block
 might be written to one device before a preceding block is
 written to another device.


Hm, even worse: the barrier request might accidentally end up on a
device that does support barriers while another device in the map doesn't.
Would any layer/fs above care to issue a flush call?


   I think the best approach for this class of devices is to return
   -EOPNOTSUPP.  If the filesystem does the wait (which they all do
   already) and the blkdev_issue_flush (which is easy to add), they
   don't need to support BIO_RW_BARRIER.


Without any additional code these really should report -EOPNOTSUPP. If
disaster strikes there is no way to make assumptions on the real state
on disk.


2/ Mirror devices.  This includes md/raid1 and dm-raid1.

   These devices can trivially implement blkdev_issue_flush much like
   the striping devices, and can support BIO_RW_BARRIER to some
   extent.
   md/raid1 currently tries.  I'm not sure about dm-raid1.


I fear this is more broken than with linear and stripe. There is no code
to check the features of the underlying devices, and the request itself
isn't sent forward - privately built requests (which do not have the
barrier flag) are sent instead...

3/ Multipath devices

Requests are sent to the same device but on different paths. So at
least with them the chance of one path supporting barriers but not
another seems small (as long as the paths do not use completely
different transport layers). But passing on a request with the barrier
flag also doesn't seem to be a good idea, since previous requests can
arrive at the device later.

IMHO the best way to handle barriers for dm would be to add the
sequence described to the generic mapping layer of dm (before calling
the target's mapping function). There is already some sort of counting
of in-flight requests (suspend/resume needs that), and I guess the
downgrade could also be rather simple. If a flush call to the target
(mapped device) fails, report -EOPNOTSUPP and stay that way (until the
next boot).
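As a rough sketch of that "downgrade and stay there" idea (mapped_device is
real dm terminology, but the barriers_unsupported flag and the helper
functions are invented here):

static int dm_handle_barrier(struct mapped_device *md, struct bio *bio)
{
        if (md->barriers_unsupported)
                return -EOPNOTSUPP;             /* sticky until reload/reboot */

        dm_wait_for_inflight(md);               /* suspend/resume already counts these */

        if (dm_flush_all_targets(md)) {         /* a target can't flush... */
                md->barriers_unsupported = 1;   /* ...so never pretend again */
                return -EOPNOTSUPP;
        }

        return dm_issue_commit_and_flush(md, bio);
}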


So: some questions to help encourage response:




 - Is the approach to barriers taken by md appropriate?  Should dm
do the same?  Who will do that?


If my assumption about barrier semantics is true, then md also has to
somehow make sure all previous requests have _successfully_ completed.
In the mirror case I guess it is valid to report success if the mirror
itself is in a clean state - that is, all previous requests (and the
barrier) were successful on at least one mirror half and this state
can be recovered.

Question to dm-devel: What do people there think of the possible
generic implementation in dm.c?


 - The comment above blkdev_issue_flush says "Caller must run
   wait_for_completion() on its own".  What does that mean?


Guess this means it initiates a flush but doesn't wait for completion.
So the caller must wait for the completion of the separate requests on
its own, doesn't it?


 - Are there other bits that we could handle better?
BIO_RW_FAILFAST?  BIO_RW_SYNC?  What exactly do they mean?


BIO_RW_FAILFAST: means the low-level driver shouldn't do much (or any)
error recovery. Mainly used by multipath targets to avoid long SCSI
recovery. This should just be propagated when passing requests on.

BIO_RW_SYNC: means this is a bio of a synchronous request. I don't
know whether there are more uses for it, but this at least causes queues
to be flushed immediately instead of waiting for more requests for a
short time. Should also just be passed on. Otherwise performance gets
poor since something above will 

Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-25 Thread Pallai Roland

On Friday 25 May 2007 06:55:00 David Chinner wrote:
 Oh, did you look at your logs and find that XFS had spammed them
 about writes that were failing?

The first message after the incident:

May 24 01:53:50 hq kernel: Filesystem loop1: XFS internal error 
xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.  Caller 
0xf8ac14f8
May 24 01:53:50 hq kernel: f8adae69 xfs_btree_check_sblock+0x4f/0xc2 [xfs]  
f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]
May 24 01:53:50 HF kernel: f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]  
f8b1a9c7 kmem_zone_zalloc+0x1b/0x43 [xfs]
May 24 01:53:50 hq kernel: f8abe645 xfs_alloc_ag_vextent+0x24d/0x1110 [xfs]  
f8ac0647 xfs_alloc_vextent+0x3bd/0x53b [xfs]
May 24 01:53:50 hq kernel: f8ad2f7e xfs_bmapi+0x1ac4/0x23cd [xfs]  f8acab97 
xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
May 24 01:53:50 hq kernel: f8b1 xlog_dealloc_log+0x49/0xea [xfs]  
f8afdaee xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
May 24 01:53:50 hq kernel: f8afc3ae xfs_iomap+0x60e/0x82d [xfs]  c0113bc8 
__wake_up_common+0x39/0x59
May 24 01:53:50 hq kernel: f8b1ae11 xfs_map_blocks+0x39/0x6c [xfs]  
f8b1bd7b xfs_page_state_convert+0x644/0xf9c [xfs]
May 24 01:53:50 hq kernel: c036f384 schedule+0x5d1/0xf4d  f8b1c780 
xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel: f8b1c7d7 xfs_vm_writepage+0x57/0xe0 [xfs]  
c01830e8 mpage_writepages+0x1fb/0x3bb
May 24 01:53:50 hq kernel: c0183020 mpage_writepages+0x133/0x3bb  f8b1c780 
xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel: c0147bb3 do_writepages+0x35/0x3b  c018135c 
__writeback_single_inode+0x88/0x387
May 24 01:53:50 hq kernel: c01819b7 sync_sb_inodes+0x1b4/0x2a8  c0181c63 
writeback_inodes+0x63/0xdc
May 24 01:53:50 hq kernel: c0147943 background_writeout+0x66/0x9f  c01482b3 
pdflush+0x0/0x1ad
May 24 01:53:50 hq kernel: c01483a2 pdflush+0xef/0x1ad  c01478dd 
background_writeout+0x0/0x9f
May 24 01:53:50 hq kernel: c012d10b kthread+0xc2/0xc6  c012d049 
kthread+0x0/0xc6
May 24 01:53:50 hq kernel: c0100dd5 kernel_thread_helper+0x5/0xb

...and the logs are spammed with such messages. Isn't this internal error a
good reason to shut down the file system? I think if there's a sign of a
corrupted file system, the first thing we should do is to stop writes (or
the entire FS) and let the admin examine the situation. I'm not talking
about my case, where the md raid5 was braindead; I'm talking about
general situations.


--
 d



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Phillip Susi

Jens Axboe wrote:

A barrier write will include a flush, but it may also use the FUA bit to
ensure data is on platter. So the only situation where a fallback from a
barrier to flush would be valid, is if the device lied and told you it
could do FUA but it could not and that is the reason why the barrier
write failed. If that is the case, the block layer should stop using FUA
and fall back to flush-write-flush. And if it does that, then there's
never a valid reason to switch from using barrier writes to
blkdev_issue_flush() since both methods would either both work or both
fail.


IIRC, the FUA bit only forces THAT request to hit the platter before it
is completed; it does not flush any previous requests still sitting in
the write-back queue.  Because all I/O before the barrier must be on the
platter as well, setting the FUA bit on the barrier request means you
don't have to follow it with a flush, but you still have to precede it
with a flush.
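Spelled out, the two equivalent orderings are as follows; flush_cache(),
write_fua() and write_plain() are hypothetical helpers for a single device.

static void barrier_using_fua(void)
{
        flush_cache();          /* older writes must reach the platter first */
        write_fua();            /* FUA: this one request bypasses the cache  */
}

static void barrier_without_fua(void)
{
        flush_cache();          /* older writes                                */
        write_plain();          /* the barrier write itself lands in the cache */
        flush_cache();          /* ...so it has to be flushed out afterwards   */
}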



It's not block layer breakage, it's a device issue.


How isn't it block layer breakage?  If the device does not support 
barriers, isn't it the job of the block layer (probably the scheduler)
to fall back to flush-write-flush?





Re: [dm-devel] [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Phillip Susi

Neil Brown wrote:

There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP.


Why is it not the job of the block layer to translate for broken devices 
and send them a flush/write/flush?



   These devices would find it very hard to support BIO_RW_BARRIER.
   Doing this would require keeping track of all in-flight requests
   (which some, possibly all, of the above don't) and then:


The device mapper keeps track of in flight requests already.  When 
switching tables it has to hold new requests and wait for in flight 
requests to complete before switching to the new table.  When it gets a 
barrier request it just needs to do the same thing, only not switch 
tables.



   I think the best approach for this class of devices is to return
   -EOPNOTSUPP.  If the filesystem does the wait (which they all do
   already) and the blkdev_issue_flush (which is easy to add), they
   don't need to support BIO_RW_BARRIER.


Why?  The personalities should just pass the BARRIER flag down to each
underlying device, and the dm common code should wait for all in-flight
I/O to complete before sending the barrier to the personality.



For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to
the controller can be tagged as barriers), SCSI will use the
SYNCHRONIZE_CACHE command to flush the cache after the barrier
request (a bit like the filesystem calling blkdev_issue_flush, but at


Don't you have to flush the cache BEFORE the barrier to ensure that 
previous IO is committed first, THEN the barrier write?


