Re: When does a disk get flagged as bad?

2007-05-31 Thread Neil Brown
On Wednesday May 30, [EMAIL PROTECTED] wrote:
 
 After thinking about your post, I guess I can see some logic behind
 not failing on the read, although I would say that after x amount of
 read failures a drive should be kicked out no matter what.

When md gets a read error, it collects the correct data from elsewhere
and tries to write it to the drive that apparently failed.
If that succeeds, it tries to read it back again.  If that succeeds as
well, it assumes that the problem has been fixed.  Otherwise it fails
the drive.
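
Roughly, that recovery path looks like the following (illustrative pseudo-C
only, not the actual md code; every name below is invented for clarity):

    /* Illustrative pseudo-C -- not the real md/raid1 code paths. */
    static void handle_read_error(struct mirror_set *ms, sector_t sect,
                                  struct disk *bad)
    {
            void *good = reconstruct_from_other_mirrors(ms, sect);

            if (write_sector(bad, sect, good) == 0 &&
                read_back_and_verify(bad, sect, good) == 0)
                    return;         /* drive fixed/remapped the sector; keep it */

            fail_disk(ms, bad);     /* rewrite or re-read failed: kick the drive */
    }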


NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
  IOWs, there are two parts to the problem:
  
  1 - guaranteeing I/O ordering
  2 - guaranteeing blocks are on persistent storage.
  
  Right now, a single barrier I/O is used to provide both of these
  guarantees. In most cases, all we really need to provide is 1); the
  need for 2) is a much rarer condition but still needs to be
  provided.
  
   if I am understanding it correctly, the big win for barriers is that you 
   do NOT have to stop and wait until the data is on persistent media before 
   you can continue.
  
  Yes, if we define a barrier to only guarantee 1), then yes this
  would be a big win (esp. for XFS). But that requires all filesystems
  to handle sync writes differently, and sync_blockdev() needs to
  call blkdev_issue_flush() as well
  
  So, what do we do here? Do we define a barrier I/O to only provide
  ordering, or do we define it to also provide persistent storage
  writeback? Whatever we decide, it needs to be documented
 
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate
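
Roughly, the split being proposed would look like this from a filesystem's
point of view (WRITE_ORDERED is the hypothetical new flag under discussion,
not an existing kernel symbol; WRITE_BARRIER and blkdev_issue_flush() are
the existing pieces mentioned above):

    /* Sketch only -- WRITE_ORDERED is the proposed flag. */
    submit_bio(WRITE_BARRIER, bio);   /* today: ordered against other I/O
                                       * and flushed to stable media      */
    submit_bio(WRITE_ORDERED, bio);   /* proposed: ordering only; the data
                                       * may still sit in the drive cache */

    /* where durability is actually required (fsync, sync_blockdev), an
     * explicit cache flush would still be needed: */
    blkdev_issue_flush(bdev, NULL);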

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, David Chinner wrote:
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
  On Thu, May 31 2007, David Chinner wrote:
   IOWs, there are two parts to the problem:
   
 1 - guaranteeing I/O ordering
 2 - guaranteeing blocks are on persistent storage.
   
   Right now, a single barrier I/O is used to provide both of these
   guarantees. In most cases, all we really need to provide is 1); the
   need for 2) is a much rarer condition but still needs to be
   provided.
   
    if I am understanding it correctly, the big win for barriers is that you
    do NOT have to stop and wait until the data is on persistent media before
    you can continue.
   
   Yes, if we define a barrier to only guarantee 1), then yes this
   would be a big win (esp. for XFS). But that requires all filesystems
   to handle sync writes differently, and sync_blockdev() needs to
   call blkdev_issue_flush() as well
   
   So, what do we do here? Do we define a barrier I/O to only provide
   ordering, or do we define it to also provide persistent storage
   writeback? Whatever we decide, it needs to be documented
  
  The block layer already has a notion of the two types of barriers, with
  a very small amount of tweaking we could expose that. There's absolutely
  zero reason we can't easily support both types of barriers.
 
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate

Precisely. The current definition of barriers is what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

-- 
Jens Axboe



problems with faulty disks and superblocks 1.0, 1.1 and 1.2

2007-05-31 Thread Hubert Verstraete

Hello

I'm having problems with a RAID-1 configuration. I cannot re-add a disk 
that I've failed, because each time I do this, the re-added disk is 
still seen as failed.
After some investigation, I found that this problem only occurs when I 
create the RAID array with superblock 1.0, 1.1 or 1.2.

With the superblock 0.90 I don't encounter this issue.

Here are the commands to easily reproduce the issue

mdadm -C /dev/md_d0 -e 1.0 -l 1 -n 2 -b internal -R /dev/sda /dev/sdb
mdadm /dev/md_d0 -f /dev/sda
mdadm /dev/md_d0 -r /dev/sda
mdadm /dev/md_d0 -a /dev/sda
cat /proc/mdstat

The output of mdstat is:
Personalities : [raid1]
md_d0 : active raid1 sda[0](F) sdb[1]
  104849 blocks super 1.2 [2/1] [_U]
  bitmap: 0/7 pages [0KB], 8KB chunk
unused devices: <none>

I'm wondering if the way I'm failing and re-adding a disk is correct. 
Did I do something wrong?


If I change the superblock to -e 0.90, there's no problem with this 
set of commands.


For now, I have found a work-around with superblock 1.0, which is to zero 
the superblock before re-adding the disk. But I suppose that doing so 
will force a full rebuild of the re-added disk, and I don't want this, 
because I'm using write-intent bitmaps.


I'm using mdadm - v2.5.6 on Debian Etch with kernel 2.6.18-4.

Bug, or a misunderstanding on my part? Any help would be appreciated :)

Thanks
Hubert


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Stefan Bader

2007/5/30, Phillip Susi [EMAIL PROTECTED]:

Stefan Bader wrote:

 Since drive a supports barrier requests we don't get -EOPNOTSUPP, but
 the request with block y might get written before block x since the
 disks are independent. I guess the chances of this are quite low, since
 at some point a barrier request will also hit drive b, but for the time
 being it might be better to indicate -EOPNOTSUPP right from
 device-mapper.

The device mapper needs to ensure that ALL underlying devices get a
barrier request when one comes down from above, even if it has to
construct zero length barriers to send to most of them.



And somehow also make sure all of the barriers have been processed
before returning the barrier that came in. Plus it would have to queue
all mapping requests until the barrier is done (if strictly acting
according to barrier.txt).
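
Something along these lines, as purely illustrative pseudo-code (the helper
names are invented, not real device-mapper functions):

    /* Fan one incoming barrier out to every underlying device and complete
     * it only when all of the per-device barriers have completed. */
    static void dm_forward_barrier(struct dm_table *t, struct bio *barrier)
    {
            struct dm_dev *dev;

            atomic_set(&t->barrier_pending, count_target_devices(t));

            list_for_each_entry(dev, &t->devices, list)
                    submit_zero_length_barrier(dev, barrier_endio, t);

            /* hold back all later mapping requests until barrier_pending
             * drops to zero, then end the original barrier bio */
    }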

But I am wondering a bit whether the requirements on barriers are
really as tight as described in Tejun's document (a barrier request is
only started once everything before it is safe, the barrier itself isn't
returned until it is safe too, and no request after the barrier is
started before the barrier is done). Is it really necessary to
defer any further requests until the barrier has been written to safe
storage? Or would it be sufficient to guarantee that, if a barrier
request returns, everything up to and including the barrier is on safe
storage?

Stefan


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Bill Davidsen

Neil Brown wrote:

On Monday May 28, [EMAIL PROTECTED] wrote:
  

There are two things I'm not sure you covered.

First, disks which don't support flush but do have a cache dirty 
status bit you can poll at times like shutdown. If there are no drivers 
which support these, it can be ignored.



There are really devices like that?  So to implement a flush, you have
to stop sending writes and wait and poll - maybe poll every
millisecond?
  


Yes, there really are (or were). But I don't think that there are 
drivers, so it's not an issue.

That wouldn't be very good for performance ... maybe you just
wouldn't bother with barriers on that sort of device?
  


That is why there are no drivers...

Which reminds me:  What is the best way to turn off barriers?
Several filesystems have -o nobarriers or -o barriers=0,
or the inverse.
  


If they can function usefully without, the admin gets to make that choice.

md/raid currently uses barriers to write metadata, and there is no
way to turn that off.  I'm beginning to wonder if that is best.
  


I don't see how you can have reliable operation without it, particularly 
WRT bitmap.

Maybe barrier support should be a function of the device.  i.e. the
filesystem or whatever always sends barrier requests where it thinks
it is appropriate, and the block device tries to honour them to the
best of its ability, but if you run
   blockdev --enforce-barriers=no /dev/sda
then you lose some reliability guarantees, but gain some throughput (a
bit like the 'async' export option for nfsd).

  
Since this is device-dependent, it really should be in the device 
driver, and requests should have a status of success, failure, or feature 
unavailability.




Second, NAS (including nbd?). Is there enough information to handle this 
really right?



NAS means lots of things, including NFS and CIFS where this doesn't
apply.
  


Well, we're really talking about network attached devices rather than 
network filesystems. I guess people do lump them together.



For 'nbd', it is entirely up to the protocol.  If the protocol allows
a barrier flag to be sent to the server, then barriers should just
work.  If it doesn't, then either the server disables write-back
caching, or flushes every request, or you lose all barrier
guarantees. 
  


That pretty much agrees with what I said above: it's at a level closer to the 
device, and status should come back from the physical I/O request.

For 'iscsi', I guess it works just the same as SCSI...
  


Hopefully.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: RAID SB 1.x autodetection

2007-05-31 Thread Bill Davidsen

Jan Engelhardt wrote:

On May 30 2007 16:35, Bill Davidsen wrote:
  

On 29 May 2007, Jan Engelhardt uttered the following:
  

from your post at
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I
read that autodetecting arrays with a 1.x superblock is currently
impossible. Does it at least work to force the kernel to always assume
a 1.x sb? There are some 'broken' distros out there that still don't
use mdadm in initramfs, and recreating the initramfs each time is a
bit cumbersome...


The kernel build system should be able to do that for you, shouldn't it?

  

That would be an improvement, yes.



Hardly, with all the Fedora specific cruft. Anyway, there was a
simple patch posted in RH bugzilla, so I've gone with that.
  
I'm not sure what Fedora has to do with it; this is generally useful to 
all distributions. What I had in mind was a make target, so that instead 
of install as the target, you could have install_mdadm in the Makefile, 
or mdadm_install to be consistent with modules_install, perhaps.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: very strange (maybe) raid1 testing results

2007-05-31 Thread Bill Davidsen

Neil Brown wrote:

On Wednesday May 30, [EMAIL PROTECTED] wrote:
  

On Wed, 30 May 2007, Jon Nelson wrote:



On Thu, 31 May 2007, Richard Scobie wrote:

  

Jon Nelson wrote:



I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s as
reported by dd. What I don't understand is why just one disk is being used
here, instead of two or more. I tried different versions of metadata, and
using a bitmap makes no difference. I created the array with (allowing for
variations of bitmap and metadata version):
  
This is normal for md RAID1. What you should find is that for 
concurrent reads, each read will be serviced by a different disk, 
until no. of reads = no. of drives.

Alright. To clarify, let's assume some process (like a single-threaded 
webserver) using a raid1 to store content (who knows why, let's just say 
it is), and also assume that the I/O load is 100% reads. Given that the 
server does not fork (or create a thread) for each request, does that 
mean that every single web request is essentially serviced from one 
disk, always? What mechanism determines which disk actually services the 
request?
  

It's probably bad form to reply to one's own posts, but I just found

static int read_balance(conf_t *conf, r1bio_t *r1_bio)

in raid1.c which, if I'm reading the rest of the source correctly, 
basically says pick the disk whose current head position is closest. 
This *could* explain the behavior I was seeing. Is that not correct?



Yes, that is correct.
md/raid1 will send a completely sequential read request to just one
device.  There is not much to be gained by doing anything else.
md/raid10 in 'far' or 'offset' mode lays the data out differently and
will issue read requests to all devices and often get better read
throughput at some cost in write throughput.
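
A stripped-down sketch of that heuristic (the real read_balance() handles
more cases, e.g. degraded arrays and sequential-read detection, so treat
this as illustration only; mirror_is_usable() is an invented helper):

    static int pick_read_disk(conf_t *conf, sector_t sector)
    {
            sector_t best_dist = ~(sector_t)0;
            int best = -1, d;

            for (d = 0; d < conf->raid_disks; d++) {
                    sector_t head = conf->mirrors[d].head_position;
                    sector_t dist = head > sector ? head - sector
                                                  : sector - head;

                    if (mirror_is_usable(conf, d) && dist < best_dist) {
                            best_dist = dist;
                            best = d;
                    }
            }
            return best;    /* a sequential stream keeps landing on one disk */
    }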
  
The whole single-process thing may be a distraction rather than a 
solution, as well. I wrote a small program using pthreads which shared 
reads of a file between N threads in 1k blocks, such that each read was 
preceded by a seek. It *seemed* that these were being combined in the 
block layer before being passed on to the md logic, and treated as a 
single read, as near as I could tell.


I did NOT look at actual disk I/O (didn't care), but rather only at 
the transfer rate from the file to memory, which did not change 
significantly from 1..N threads active, where N was the number of 
mirrors. And RAID-10 did as well with one thread as with several.
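
For reference, the kind of test described would look roughly like the
program below. This is a reconstruction for illustration only, not the
actual program; the block count and the 16-thread limit are placeholders.

    /* rdtest.c -- N threads each seek+read 1k blocks from one file.
     * Usage: ./rdtest <file> <nthreads>   (reconstruction, illustrative) */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLK     1024
    #define NBLOCKS (256 * 1024)            /* placeholder file size */

    struct targ { const char *path; int id, nthreads; };

    static void *reader(void *p)
    {
            struct targ *a = p;
            char buf[BLK];
            long i;
            int fd = open(a->path, O_RDONLY);

            if (fd < 0)
                    return NULL;
            for (i = a->id; i < NBLOCKS; i += a->nthreads) {
                    /* explicit seek before every 1k read, as described */
                    lseek(fd, (off_t)i * BLK, SEEK_SET);
                    read(fd, buf, BLK);
            }
            close(fd);
            return NULL;
    }

    int main(int argc, char **argv)
    {
            int t, n = (argc > 2) ? atoi(argv[2]) : 2;
            pthread_t tid[16];
            struct targ a[16];

            if (argc < 2 || n < 1 || n > 16)
                    return 1;
            for (t = 0; t < n; t++) {
                    a[t] = (struct targ){ argv[1], t, n };
                    pthread_create(&tid[t], NULL, reader, &a[t]);
            }
            for (t = 0; t < n; t++)
                    pthread_join(tid[t], NULL);
            return 0;
    }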


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979




Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, David Chinner wrote:
  

On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:


On Thu, May 31 2007, David Chinner wrote:
  

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


if I am understanding it correctly, the big win for barriers is that you 
do NOT have to stop and wait until the data is on persistent media before 
you can continue.
  

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.
  

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate



Precisely. The current definition of barriers is what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

  
Wait. Do filesystems expect (depend on) anything but ordering now? Does 
md? Having users of barriers as they currently behave suddenly getting 
SYNC behavior where they expect ORDERED is likely to have a negative 
effect on performance. Or do I misread what is actually guaranteed by 
WRITE_BARRIER now, and a flush is currently happening in all cases?


And will this also be available to user space f/s, since I just proposed 
a project which uses one? :-(
I think the goal is good; more choice is almost always better. I 
just want to be sure there won't be big disk performance regressions.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Bill Davidsen wrote:
 Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
   
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 
 On Thu, May 31 2007, David Chinner wrote:
   
 IOWs, there are two parts to the problem:
 
   1 - guaranteeing I/O ordering
   2 - guaranteeing blocks are on persistent storage.
 
 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.
 
 
 if I am understanding it correctly, the big win for barriers is that 
 you do NOT have to stop and wait until the data is on persistent media 
 before you can continue.
   
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well
 
 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
 
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.
   
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 
 Precisely. The current definition of barriers is what Chris and I came
 up with many years ago, when solving the problem for reiserfs
 originally. It is by no means the only feasible approach.
 
 I'll add a WRITE_ORDERED command to the #barrier branch, it already
 contains the empty-bio barrier support I posted yesterday (well a
 slightly modified and cleaned up version).
 
   
 Wait. Do filesystems expect (depend on) anything but ordering now? Does 
 md? Having users of barriers as they currently behave suddenly getting 
 SYNC behavior where they expect ORDERED is likely to have a negative 
 effect on performance. Or do I misread what is actually guaranteed by 
 WRITE_BARRIER now, and a flush is currently happening in all cases?

See the stuff you quote above; it's answered there. It's not a change:
this is how the Linux barrier write has always worked since I first
implemented it. What David and I are talking about is adding a more
relaxed version as well, one that just implies ordering.

 And will this also be available to user space f/s, since I just proposed 
 a project which uses one? :-(

I see several uses for that, so I'd hope so.

 I think the goal is good; more choice is almost always better. I 
 just want to be sure there won't be big disk performance regressions.

We can't get more heavyweight than the current barrier; it's about as
conservative as you can get.

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

David Chinner wrote:
you are understanding barriers to be the same as synchronous writes (and 
therefore the data is on persistent media before the call returns).


No, I'm describing the high level behaviour that is expected by
a filesystem. The reasons for this are below


You say no, but then you go on to contradict yourself below.


Ok, that's my understanding of how *device based barriers* can work,
but there's more to it than that. As far as the filesystem is
concerned the barrier write needs to *behave* exactly like a sync
write because of the guarantees the filesystem has to provide
userspace. Specifically - sync, sync writes and fsync.


There, you just ascribed the synchronous property to barrier requests. 
This is false.  Barriers are about ordering; synchronous writes are 
another thing entirely.  The filesystem is supposed to use barriers to 
maintain ordering for journal data.  If you are trying to handle a 
synchronous write request, that's another flag.



This is the big problem, right? If we use barriers for commit
writes, the filesystem can return to userspace after a sync write or
fsync() and an *ordered barrier device implementation* may not have
written the blocks to persistent media. If we then pull the plug on
the box, we've just lost data that sync or fsync said was
successfully on disk. That's BAD.


That's why, for synchronous writes, you set the flag to mark the request 
as synchronous, which has nothing at all to do with barriers.  You are 
trying to use barriers to solve two different problems.  Use one flag to 
indicate ordering, and another to indicate synchrony.



Right now a barrier write on the last block of the fsync/sync write
is sufficient to prevent that because of the FUA on the barrier
block write. A purely ordered barrier implementation does not
provide this guarantee.


This is a side effect of the implementation of the barrier, not part of 
the semantics of barriers, so you shouldn't rely on this behavior.  You 
don't have to use FUA to handle the barrier request, and if you don't, 
then the request can be completed while the data is still in the write 
cache.  You just have to make sure to flush it before any subsequent 
requests.
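
In other words, a driver has at least two correct ways to complete a
barrier on a write-back-caching disk, and only one of them uses FUA
(a rough sketch of the command sequences, not driver code):

    /*
     * With FUA:     FLUSH CACHE -> WRITE (barrier block, FUA)
     *               completion of the barrier implies it is on media.
     *
     * Without FUA:  FLUSH CACHE -> WRITE (barrier block, cached)
     *                           -> FLUSH CACHE before any later write
     *               completion of the barrier only implies ordering; the
     *               post-flush is what finally makes the block durable.
     */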



IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


Yep... two problems... two flags.


Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


We do the former, or we end up in the same boat as O_DIRECT, where you 
have one flag that means several things, and no way to specify that you only 
need some of those and not the others.





Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order? 
  They need to be two completely different flags which you can choose 
to combine, or use individually.




Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

Jens Axboe wrote:

No, Stefan is right: the barrier is both an ordering and an integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.


I am saying that is the wrong thing to do.  Barriers should be about 
ordering only.  So long as the order in which requests hit the media is 
maintained, the order the requests are completed in can change.  barrier.txt 
bears this out:


Requests in ordered sequence are issued in order, but not required to 
finish in order.  Barrier implementation can handle out-of-order 
completion of ordered sequence.  IOW, the requests MUST be processed in 
order but the hardware/software completion paths are allowed to reorder 
completion notifications - eg. current SCSI midlayer doesn't preserve 
completion order during error handling.





Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
 Jens Axboe wrote:
 No, Stefan is right: the barrier is both an ordering and an integrity
 constraint. If a driver completes a barrier request before that request
 and previously submitted requests are on STABLE storage, then it
 violates that principle. Look at the code and the various ordering
 options.
 
 I am saying that is the wrong thing to do.  Barriers should be about 
 ordering only.  So long as the order in which requests hit the media is 
 maintained, the order the requests are completed in can change.  barrier.txt bears 

But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at an ordinary SATA/IDE drive with write-back caching, you can't
just issue the requests in order and pray that the drive cache will make
it to the platter.

If you don't have write-back caching, or if the cache is battery backed
and thus guaranteed never to be lost, then maintaining order is naturally
enough.

Or if the drive can do ordered queued commands, you can relax the
flushing (again depending on the cache type, you may need to take
different paths).
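
For what it's worth, the block layer of this era already encodes those
cases as distinct ordered modes that a driver selects with
blk_queue_ordered(); a rough summary, assuming the 2.6.2x interface and
the mode names from Documentation/block/barrier.txt:

    blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
            /* write-through or battery-backed cache: draining the
             * queue is enough, no cache flush needed */
    blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, prepare_flush_fn);
            /* write-back cache, no FUA: pre-flush, barrier write,
             * post-flush */
    blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FUA, prepare_flush_fn);
            /* write-back cache with FUA: pre-flush, then FUA write */
    /* QUEUE_ORDERED_TAG* variants cover devices that can use ordered
     * tagged commands instead of draining the queue. */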

 Requests in ordered sequence are issued in order, but not required to 
 finish in order.  Barrier implementation can handle out-of-order 
 completion of ordered sequence.  IOW, the requests MUST be processed in 
 order but the hardware/software completion paths are allowed to reorder 
 completion notifications - eg. current SCSI midlayer doesn't preserve 
 completion order during error handling.

If you carefully re-read that paragraph, you'll see it just tells you that the
software implementation can deal with reordered completions. It doesn't
relax the constraints on ordering and integrity AT ALL.

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order? 
   They need to be two completely different flags which you can choose 
 to combine, or use individually.

If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a real barrier write.

-- 
Jens Axboe



Re: RAID SB 1.x autodetection

2007-05-31 Thread Jan Engelhardt

On May 31 2007 09:00, Bill Davidsen wrote:
  
 
 Hardly, with all the Fedora specific cruft. Anyway, there was a
 simple patch posted in RH bugzilla, so I've gone with that.
 
 I'm not sure what Fedora has to do with it,

I like highly modularized systems. And that requires an initramfs
to load all the required modules.


Jan


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Thu, 31 May 2007, Jens Axboe wrote:


On Thu, May 31 2007, Phillip Susi wrote:

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order?
  They need to be two completely different flags which you can choose
to combine, or use individually.


If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a real barrier write.


true, but a real barrier write could have significant side effects on 
other writes that wouldn't happen with a synchronous write (a sync write 
can have other, unrelated writes re-ordered around it; a barrier write 
can't)


David Lang


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, [EMAIL PROTECTED] wrote:
 On Thu, 31 May 2007, Jens Axboe wrote:
 
 On Thu, May 31 2007, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order?
   They need to be two completely different flags which you can choose
 to combine, or use individually.
 
 If you have a use case for that, we can easily support it as well...
 Depending on the drive capabilities (FUA support or not), it may be
 nearly as slow as a real barrier write.
 
 true, but a real barrier write could have significant side effects on 
 other writes that wouldn't happen with a synchronous write (a sync write 
 can have other, unrelated writes re-ordered around it; a barrier write 
 can't)

That is true; the sync write also has side effects on the drive side,
since it may have a varied cost depending on the workload (e.g. what
already resides in the cache when it is issued), unless FUA is active.
That is also true for the barrier of course, but only for previously
submitted IO, as we don't reorder.

I'm not saying that a SYNC write won't be potentially useful, just that
it's definitely not free, even outside of the write itself.

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order? 

submit_bio(WRITE_SYNC, bio);

Already there, already used by XFS, JFS and direct I/O.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
 IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.

 if I am understanding it correctly, the big win for barriers is that you 
 do NOT have to stop and wait until the data is on persistent media before 
 you can continue.
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well

 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 Precisely. The current definition of barriers is what Chris and I came
 up with many years ago, when solving the problem for reiserfs
 originally. It is by no means the only feasible approach.
 
 I'll add a WRITE_ORDERED command to the #barrier branch, it already
 contains the empty-bio barrier support I posted yesterday (well a
 slightly modified and cleaned up version).

Would that be very different from issuing a barrier and not waiting for
its completion?  For ATA and SCSI, we'll have to flush the write-back cache
anyway, so I don't see how we can get a performance advantage by
implementing a separate WRITE_ORDERED.  I think a zero-length barrier
(haven't looked at the code yet, still recovering from jet lag :-) can
serve as a genuine barrier without the extra write, though.
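
For reference, an empty barrier of that kind is just a bio with no data
pages and the barrier flag set; a minimal sketch along the lines of the
empty-bio patches mentioned above (my_barrier_endio and my_cookie are
hypothetical):

    struct bio *bio = bio_alloc(GFP_KERNEL, 0);     /* no data pages */

    bio->bi_bdev    = bdev;
    bio->bi_end_io  = my_barrier_endio;             /* hypothetical handler */
    bio->bi_private = my_cookie;

    submit_bio(WRITE_BARRIER, bio);         /* pure ordering/flush point,
                                             * no payload to write        */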

Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Stefan Bader wrote:
 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
 Stefan Bader wrote:
 
  Since drive a supports barrier requests we don't get -EOPNOTSUPP, but
  the request with block y might get written before block x since the
  disks are independent. I guess the chances of this are quite low, since
  at some point a barrier request will also hit drive b, but for the time
  being it might be better to indicate -EOPNOTSUPP right from
  device-mapper.

 The device mapper needs to ensure that ALL underlying devices get a
 barrier request when one comes down from above, even if it has to
 construct zero length barriers to send to most of them.

 
 And somehow also make sure all of the barriers have been processed
 before returning the barrier that came in. Plus it would have to queue
 all mapping requests until the barrier is done (if strictly acting
 according to barrier.txt).
 
 But I am wondering a bit whether the requirements on barriers are
 really as tight as described in Tejun's document (a barrier request is
 only started once everything before it is safe, the barrier itself isn't
 returned until it is safe too, and no request after the barrier is
 started before the barrier is done). Is it really necessary to
 defer any further requests until the barrier has been written to safe
 storage? Or would it be sufficient to guarantee that, if a barrier
 request returns, everything up to and including the barrier is on safe
 storage?

Well, what's described in barrier.txt is the currently implemented
semantics and what filesystems expect, so we can't change it underneath
them, but we definitely can introduce new, more relaxed variants. One
thing we should bear in mind is that hard disks don't have humongous
caches or very smart controllers / instruction sets.  No matter how
relaxed an interface the block layer provides, in the end it just has to
issue a wholesale FLUSH CACHE on the device to guarantee data ordering on
the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue, which can be deeper and more intelligent than the
device queue.

Thanks.

-- 
tejun