Re: Raid-5 long write wait while reading

2007-06-02 Thread tj

Bill Davidsen wrote:

tj wrote:

Thomas Jager wrote:

Hi list.

I run a file server on MD raid-5.
If a client reads one big file and at the same time another client
tries to write a file, the writing thread just sits in
uninterruptible sleep until the reader has finished. Only a very small
amount of writes gets through while the reader is still working.

I'm having some trouble pinpointing the problem.
It's not consistent either; sometimes it works as expected and both the
reader and the writer get some transactions. On huge reads I've seen
the writer blocked for 30-40 minutes without any significant writes
happening (maybe a few megabytes, out of several gigs waiting). It
happens with NFS, SMB and FTP, and locally with dd, and it seems to be
connected to raid-5. This does not happen on block devices without
raid-5. I'm also wondering if it could have anything to do with
loop-aes? I use loop-aes on top of the md, but then again I have not
observed this problem on loop devices with a disk backend. I do know
that loop-aes degrades performance, but I didn't think it would do
something like this.


I've seen this problem in 2.6.16-2.6.21

All disks in the array are connected to a controller with a SiI 3114
chip.


I just noticed something else. A couple of slow readers were running
on my raid-5 array. Then I started a copy from another local disk to
the array and got the extremely long wait again. I noticed something in
iostat:


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.90    0.00   48.05   31.93    0.00   16.12

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdg               0.80        25.55         0.00        128          0
sdh             154.89       632.34         0.00       3168          0
sdi               0.20        12.77         0.00         64          0
sdj               0.40        25.55         0.00        128          0
sdk               0.40        25.55         0.00        128          0
sdl               0.80        25.55         0.00        128          0
sdm               0.80        25.55         0.00        128          0
sdn               0.60        23.95         0.00        120          0
md0             199.20       796.81         0.00       3992          0

All disks are members of the same raid array (md0). One of the disks
has a ton of transactions compared to the other disks. Read operations,
as far as I can tell. Why? Might this be connected with my problem?

Two thoughts on that: if you are doing a lot of directory operations,
it's possible that the inodes being used most are all in one chunk.

Hi thanks for the reply.

It's not directory operations AFAIK. Reading a few files (3 in this 
case) and writing one.


The other possibility is that these are journal writes and reflect
updates to the atime. The way to see if this is in some way related
is to mount (remount) with noatime: "mount -o remount,noatime /dev/md0
/wherever" and retest. If this is journal activity you can do several
things to reduce the problem, which I'll go into (a) if it seems to be
the problem, and (b) if someone else doesn't point you to an existing
document or old post on the topic. Oh, you could also try mounting the
filesystem as ext2, assuming that it's ext3 now. I wouldn't run that
way, but it's useful as a diagnostic tool.
I don't use ext3, I use ReiserFS. (It seemed like a good idea at the
time.) It's mounted with -o noatime.
I've done some more testing and it seems like it might be connected to
mount --bind. If I write to a bind mount I get the slow writes, but if
I write directly to the real mount I don't. It might just be a random
occurrence, as the problem has always been inconsistent. Thoughts?
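For reference, one way to compare the bind-mount and direct-mount cases
side by side is to run a reader and a writer at the same time and watch
iostat; a rough sketch, where the mount points, the bind target and the
file names are only placeholders:

    # large sequential read from the array
    dd if=/mnt/md0/bigfile of=/dev/null bs=1M &

    # competing write through the real mount point
    dd if=/dev/zero of=/mnt/md0/writetest bs=1M count=2048 conv=fsync &

    # repeat the write through a bind mount of the same tree
    mount --bind /mnt/md0 /mnt/bindtest
    dd if=/dev/zero of=/mnt/bindtest/writetest bs=1M count=2048 conv=fsync &

    # watch per-device throughput (kB/s) every 5 seconds
    iostat -k 5

If the write only stalls in the bind-mount run, that would narrow it
down considerably.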



RE: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Guy Watkins
} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of Jens Axboe
} Sent: Saturday, June 02, 2007 10:35 AM
} To: Tejun Heo
} Cc: David Chinner; [EMAIL PROTECTED]; Phillip Susi; Neil Brown; linux-
} [EMAIL PROTECTED]; [EMAIL PROTECTED]; dm-
} [EMAIL PROTECTED]; linux-raid@vger.kernel.org; Stefan Bader; Andreas Dilger
} Subject: Re: [RFD] BIO_RW_BARRIER - what it means for devices,
} filesystems, and dm/md.
} 
} On Sat, Jun 02 2007, Tejun Heo wrote:
} > Hello,
} >
} > Jens Axboe wrote:
} > >> Would that be very different from issuing barrier and not waiting for
} > >> its completion?  For ATA and SCSI, we'll have to flush write back
} cache
} > >> anyway, so I don't see how we can get performance advantage by
} > >> implementing separate WRITE_ORDERED.  I think zero-length barrier
} > >> (haven't looked at the code yet, still recovering from jet lag :-)
} can
} > >> serve as genuine barrier without the extra write tho.
} > >
} > > As always, it depends :-)
} > >
} > > If you are doing pure flush barriers, then there's no difference.
} Unless
} > > you only guarantee ordering wrt previously submitted requests, in
} which
} > > case you can eliminate the post flush.
} > >
} > > If you are doing ordered tags, then just setting the ordered bit is
} > > enough. That is different from the barrier in that we don't need a
} flush
} > > of FUA bit set.
} >
} > Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
} > flush to separate requests before and after it (haven't looked at the
} > code yet, will soon).  Can you enlighten me?
} 
} Yeah, that's what the zero-length barrier implementation I posted does.
} Not sure if you have a question beyond that, if so fire away :-)
} 
} --
} Jens Axboe

I must admit I have only read some of the barrier related posts, so this
issue may have been covered.  If so, sorry.

What I have read seems to be related to a single disk.  What if a logical
disk is used (md, LVM, ...)?  If a barrier is issued to a logical disk and
that driver issues barriers to all related devices (logical or physical),
all the devices MUST honor the barrier together.  If one device crosses the
barrier before another reaches the barrier, corruption should be assumed.
It seems to me that each block device that represents two or more other
devices must do a flush at a barrier so that all devices cross the barrier
at the same time.

Guy



Re: RAID 6 grow problem

2007-06-02 Thread Iain Rauch
>>> raid6 reshape wasn't added until 2.6.21.  Before that only raid5 was
>>> supported.
>>> You also need to ensure that CONFIG_MD_RAID5_RESHAPE=y.
>> 
>> I don't see that in the config. Should I add it? Then reboot?
> 
> You reported that you were running a 2.6.20 kernel, which doesn't
> support raid6 reshape.
> You need to compile a 2.6.21 kernel (or
>apt-get install linux-image-2.6.21-1-amd64
> or whatever) and ensure that CONFIG_MD_RAID5_RESHAPE=y is in the
> .config before compiling.

There only seems to be version 2.6.20; does this matter a lot? Also, how do I
specify what is in the config when using apt-get install?

>> I used apt-get install mdadm to first install it, which gave me 2.5.x, then I
>> downloaded the new source and typed make, then make install. Now mdadm -V
>> shows "mdadm - v2.6.2 - 21st May 2007".
>> Is there any way to check it is installed correctly?
> 
> The "mdadm -V" check is sufficient.

Are you sure? At first I just did the make/make install and mdadm -V
did tell me v2.6.2, but I don't believe it was installed properly because it
didn't recognise my array, nor did it make a config file, and cat
/proc/mdstat said no such file or directory.


Iain




Re: RAID 6 grow problem

2007-06-02 Thread Neil Brown
On Saturday June 2, [EMAIL PROTECTED] wrote:
> > raid6 reshape wasn't added until 2.6.21.  Before that only raid5 was
> > supported.
> > You also need to ensure that CONFIG_MD_RAID5_RESHAPE=y.
> 
> I don't see that in the config. Should I add it? Then reboot?

You reported that you were running a 2.6.20 kernel, which doesn't
support raid6 reshape.
You need to compile a 2.6.21 kernel (or 
   apt-get install linux-image-2.6.21-1-amd64
or whatever) and ensure that CONFIG_MD_RAID5_RESHAPE=y is in the
.config before compiling.

> 
> I used apt-get install mdadm to first install it, which gave me 2.5.x, then I
> downloaded the new source and typed make, then make install. Now mdadm -V
> shows "mdadm - v2.6.2 - 21st May 2007".
> Is there any way to check it is installed correctly?

The "mdadm -V" check is sufficient.

NeilBrown


Re: RAID 6 grow problem

2007-06-02 Thread Iain Rauch
> raid6 reshape wasn't added until 2.6.21.  Before that only raid5 was
> supported.
> You also need to ensure that CONFIG_MD_RAID5_RESHAPE=y.

I don't see that in the config. Should I add it? Then reboot?

I used apt-get install mdadm to first install it, which gave me 2.5.x, then I
downloaded the new source and typed make, then make install. Now mdadm -V
shows "mdadm - v2.6.2 - 21st May 2007".
Is there any way to check it is installed correctly?


Iain




Re: RAID 6 grow problem

2007-06-02 Thread Neil Brown
On Saturday June 2, [EMAIL PROTECTED] wrote:
> > For the critical section part, it may be your syntax..
> > 
> > When I had the problem, Neil showed me the path! :)
> I don't think it is incorrect; before, I thought it was supposed to specify
> an actual file, so I 'touch'ed one and it says the file exists.
> 
> > For your issue, do you have raid5/6 GROW support enabled in the kernel?
> > Also, when I grew mine I never used the --backup-file option.
> I don't know; how would I find this out? uname -r gives me 2.6.20-15-server.

raid6 reshape wasn't added until 2.6.21.  Before that only raid5 was
supported.
You also need to ensure that CONFIG_MD_RAID5_RESHAPE=y.

NeilBrown


Re: RAID 6 grow problem

2007-06-02 Thread Justin Piszcz

CONFIG_MD_RAID5_RESHAPE=y

Check for this option.
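For a packaged kernel this can usually be checked without rebuilding
anything; a sketch (the config file path depends on the distribution, and
/proc/config.gz only exists if the kernel was built with
CONFIG_IKCONFIG_PROC):

    grep CONFIG_MD_RAID5_RESHAPE /boot/config-$(uname -r)
    # or, if the running kernel exports its configuration:
    zgrep CONFIG_MD_RAID5_RESHAPE /proc/config.gz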

On Sat, 2 Jun 2007, Justin Piszcz wrote:




On Sat, 2 Jun 2007, Iain Rauch wrote:


For the critical section part, it may be your syntax..

When I had the problem, Neil showed me the path! :)

I don't think it is incorrect; before, I thought it was supposed to specify
an actual file, so I 'touch'ed one and it says the file exists.


For your issue, do you have raid5/6 GROW support enabled in the kernel?
Also, when I grew mine I never used the --backup-file option.
I don't know; how would I find this out? uname -r gives me
2.6.20-15-server.



Iain




Find the .config for your kernel and see if the raid5 grow support is 
enabled.





Re: RAID 6 grow problem

2007-06-02 Thread Justin Piszcz



On Sat, 2 Jun 2007, Iain Rauch wrote:


For the critical section part, it may be your syntax..

When I had the problem, Neil showed me the path! :)

I don't think it is incorrect; before, I thought it was supposed to specify
an actual file, so I 'touch'ed one and it says the file exists.


For your issue, do you have raid5/6 GROW support enabled in the kernel?
Also, when I grew mine I never used the --backup-file option.

I don't know; how would I find this out? uname -r gives me 2.6.20-15-server.


Iain




Find the .config for your kernel and see if the raid5 grow support is 
enabled.



Re: RAID 6 grow problem

2007-06-02 Thread Iain Rauch
> For the critical section part, it may be your syntax..
> 
> When I had the problem, Neil showed me the path! :)
I don't think it is incorrect; before, I thought it was supposed to specify
an actual file, so I 'touch'ed one and it says the file exists.

> For your issue, do you have raid5/6 GROW support enabled in the kernel?
> Also, when I grew mine I never used the --backup-file option.
I don't know; how would I find this out? uname -r gives me 2.6.20-15-server.


Iain




Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Bill Davidsen

Jens Axboe wrote:

On Fri, Jun 01 2007, Bill Davidsen wrote:

Jens Axboe wrote:


On Thu, May 31 2007, Bill Davidsen wrote:

Jens Axboe wrote:

On Thu, May 31 2007, David Chinner wrote:


On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:


On Thu, May 31 2007, David Chinner wrote:


IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


if I am understanding it correctly, the big win for barriers is that 
you do NOT have to stop and wait until the data is on persistant 
media before you can continue.


Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.


That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


Precisely. The current definition of barriers are what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).


Wait. Do filesystems expect (depend on) anything but ordering now? Does 
md? Having users of barriers as they currently behave suddenly getting 
SYNC behavior where they expect ORDERED is likely to have a negative 
effect on performance. Or do I misread what is actually guaranteed by 
WRITE_BARRIER now, and a flush is currently happening in all cases?


See the above stuff you quote, it's answered there. It's not a change,
this is how the Linux barrier write has always worked since I first
implemented it. What David and I are talking about is adding a more
relaxed version as well, that just implies ordering.
I was reading the documentation in block/biodoc.txt, which seems to just 
say ordered:


   1.2.1 I/O Barriers

   There is a way to enforce strict ordering for i/os through barriers.
   All requests before a barrier point must be serviced before the barrier
   request and any other requests arriving after the barrier will not be
   serviced until after the barrier has completed. This is useful for
   higher level control on write ordering, e.g flushing a log of committed
   updates to disk before the corresponding updates themselves.

   A flag in the bio structure, BIO_BARRIER is used to identify a barrier
   i/o. The generic i/o scheduler would make sure that it places the
   barrier request and all other requests coming after it after all the
   previous requests in the queue. Barriers may be implemented in
   different ways depending on the driver. A SCSI driver for example could
   make use of ordered tags to preserve the necessary ordering with a
   lower impact on throughput. For IDE this might be two sync cache flush:
   a pre and post flush when encountering a barrier write.

The "flush" comment is associated with IDE, so it wasn't clear that the 
device cache is always cleared to force the data to the platter.



The above should mention that the ordered tag comment for SCSI assumes
that the drive uses write through caching. If it does, then an ordered
tag is enough. If it doesn't, then you need a bit more than that (a post
flush, after the ordered tag has completed).


Thanks, got it.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: multiple xor_block() functions

2007-06-02 Thread Jouni Malinen
On Sat, Jun 02, 2007 at 08:57:46PM +0200, Adrian Bunk wrote:
> include/linux/raid/xor.h:extern void xor_block(unsigned int count, unsigned 
> int bytes, void **ptr);
> drivers/md/xor.c:xor_block(unsigned int count, unsigned int bytes, void **ptr)
> drivers/md/xor.c:EXPORT_SYMBOL(xor_block);
> 
> and
> 
> net/ieee80211/ieee80211_crypt_ccmp.c:static inline void xor_block(u8 * b, u8 
> * a, size_t len)
> 
> 
> At least one of them has to be renamed.

Why? Not that I would really mind renaming one of these, but I don't see
a good reason for it. ieee80211_crypt_ccmp.c should not include
linux/raid/xor.h and the xor_block() in CCMP code is a static inline
function that should not show up outside the scope of this file. Do we
have some magic that makes exported symbols pollute name space for
inlined helper functions?

-- 
Jouni Malinen                        PGP id EFC895FA


multiple xor_block() functions

2007-06-02 Thread Adrian Bunk
include/linux/raid/xor.h:extern void xor_block(unsigned int count, unsigned int 
bytes, void **ptr);
drivers/md/xor.c:xor_block(unsigned int count, unsigned int bytes, void **ptr)
drivers/md/xor.c:EXPORT_SYMBOL(xor_block);

and

net/ieee80211/ieee80211_crypt_ccmp.c:static inline void xor_block(u8 * b, u8 * 
a, size_t len)


At least one of them has to be renamed.


cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: RAID 6 grow problem

2007-06-02 Thread Justin Piszcz



On Sat, 2 Jun 2007, Iain Rauch wrote:


Hello, when I run:

mdadm /dev/md1 --grow --raid-devices 16 --backup-file=/md1backup

I get:

mdadm: Need to backup 1792K of critical section..
mdadm: Cannot set device size/shape for /dev/md1: Invalid argument

Any help?


Iain





Yup, I got the same thing. First, what is your chunk size?

mdadm -D /dev/md1

Justin.


RAID 6 grow problem

2007-06-02 Thread Iain Rauch
Hello, when I run:

mdadm /dev/md1 --grow --raid-devices 16 --backup-file=/md1backup

I get:

mdadm: Need to backup 1792K of critical section..
mdadm: Cannot set device size/shape for /dev/md1: Invalid argument

Any help?


Iain




Customize the error emails of `mdadm --monitor`

2007-06-02 Thread Peter Rabbitson

Hi,

Is there a way to list the _number_ in addition to the name of a
problematic component? The kernel trend of moving all block devices into
the sdX namespace, combined with dynamic name allocation, renders
messages like "/dev/sdc1 has problems" meaningless. It would make remote
server support so much easier, by allowing the administrator to label
drive trays Component0, Component1, Component2, etc., and be sure that
the local tech support person will not pull the wrong drive from the
system.
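Until something like that exists, one workaround is to let the monitor
call a small script and include the array's device table in the mail, so
the RaidDevice (slot) numbers can be matched against labelled trays. A
rough, untested sketch; the script path and the "root" recipient are
placeholders, and it relies on the PROGRAM hook in mdadm.conf, which
mdadm --monitor calls with the event name, the array, and (for
per-device events) the component device:

    #!/bin/sh
    # hypothetical /usr/local/sbin/raid-event, enabled with a line like
    #   PROGRAM /usr/local/sbin/raid-event
    # in mdadm.conf
    EVENT="$1" ARRAY="$2" COMPONENT="$3"

    {
        echo "Event:     $EVENT"
        echo "Array:     $ARRAY"
        echo "Component: ${COMPONENT:-n/a}"
        echo
        # full detail output, including the Number/RaidDevice/State table
        mdadm --detail "$ARRAY"
    } | mail -s "md event: $EVENT $ARRAY ${COMPONENT:-}" root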


Thanks

Peter


Re: When does a disk get flagged as bad?

2007-06-02 Thread Alberto Alonso
So, what kind of error is:

end_request: I/O error, dev sdb, sector 42644555
end_request: I/O error, dev sdb, sector 124365763
...

I am still trying to figure out why that just makes one of my servers
unresponsive.

Is there a way to have the md code kick that drive out of the array? 
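(Assuming the failing member is, say, /dev/sdb1 in /dev/md0, a drive can
also be kicked out by hand with mdadm's manage mode; a sketch, with the
device names as placeholders:)

    mdadm --manage /dev/md0 --fail /dev/sdb1
    mdadm --manage /dev/md0 --remove /dev/sdb1
    cat /proc/mdstat    # the array should now show up as degraded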

The datacenter people are starting to get impatient having to reboot 
it every other day.

Thanks,

Alberto

On Thu, 2007-05-31 at 16:10 +1000, Neil Brown wrote:
> On Wednesday May 30, [EMAIL PROTECTED] wrote:
> > 
> > After thinking about your post, I guess I can see some logic behind
> > not failing on the read, although I would say that after x amount of
> > read failures a drive should be kicked out no matter what.
> 
> When md gets a read error, it collects the correct data from elsewhere
> and tries to write it to the drive that apparently failed.
> If that succeeds, it tries to read it back again.  If that succeeds as
> well, it assumes that the problem has been fixed.  Otherwise it fails
> the drive.
> 
> 
> NeilBrown
-- 
Alberto Alonso                 Global Gate Systems LLC.
(512) 351-7233                 http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Jens Axboe
On Fri, Jun 01 2007, Bill Davidsen wrote:
> Jens Axboe wrote:
> >On Thu, May 31 2007, Bill Davidsen wrote:
> >  
> >>Jens Axboe wrote:
> >>
> >>>On Thu, May 31 2007, David Chinner wrote:
> >>> 
> >>>  
> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>    
> 
> >On Thu, May 31 2007, David Chinner wrote:
> > 
> >  
> >>IOWs, there are two parts to the problem:
> >>
> >>1 - guaranteeing I/O ordering
> >>2 - guaranteeing blocks are on persistent storage.
> >>
> >>Right now, a single barrier I/O is used to provide both of these
> >>guarantees. In most cases, all we really need to provide is 1); the
> >>need for 2) is a much rarer condition but still needs to be
> >>provided.
> >>
> >>   
> >>
> >>>if I am understanding it correctly, the big win for barriers is that 
> >>>you do NOT have to stop and wait until the data is on persistant 
> >>>media before you can continue.
> >>> 
> >>>  
> >>Yes, if we define a barrier to only guarantee 1), then yes this
> >>would be a big win (esp. for XFS). But that requires all filesystems
> >>to handle sync writes differently, and sync_blockdev() needs to
> >>call blkdev_issue_flush() as well
> >>
> >>So, what do we do here? Do we define a barrier I/O to only provide
> >>ordering, or do we define it to also provide persistent storage
> >>writeback? Whatever we decide, it needs to be documented
> >>   
> >>
> >The block layer already has a notion of the two types of barriers, with
> >a very small amount of tweaking we could expose that. There's 
> >absolutely
> >zero reason we can't easily support both types of barriers.
> > 
> >  
> That sounds like a good idea - we can leave the existing
> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> behaviour that only guarantees ordering. The filesystem can then
> choose which to use where appropriate
>    
> 
> >>>Precisely. The current definition of barriers are what Chris and I came
> >>>up with many years ago, when solving the problem for reiserfs
> >>>originally. It is by no means the only feasible approach.
> >>>
> >>>I'll add a WRITE_ORDERED command to the #barrier branch, it already
> >>>contains the empty-bio barrier support I posted yesterday (well a
> >>>slightly modified and cleaned up version).
> >>>
> >>> 
> >>>  
> >>Wait. Do filesystems expect (depend on) anything but ordering now? Does 
> >>md? Having users of barriers as they currently behave suddenly getting 
> >>SYNC behavior where they expect ORDERED is likely to have a negative 
> >>effect on performance. Or do I misread what is actually guaranteed by 
> >>WRITE_BARRIER now, and a flush is currently happening in all cases?
> >>
> >
> >See the above stuff you quote, it's answered there. It's not a change,
> >this is how the Linux barrier write has always worked since I first
> >implemented it. What David and I are talking about is adding a more
> >relaxed version as well, that just implies ordering.
> >  
> 
> I was reading the documentation in block/biodoc.txt, which seems to just 
> say ordered:
> 
>    1.2.1 I/O Barriers
> 
>    There is a way to enforce strict ordering for i/os through barriers.
>    All requests before a barrier point must be serviced before the barrier
>    request and any other requests arriving after the barrier will not be
>    serviced until after the barrier has completed. This is useful for
>    higher level control on write ordering, e.g flushing a log of committed
>    updates to disk before the corresponding updates themselves.
> 
>    A flag in the bio structure, BIO_BARRIER is used to identify a barrier
>    i/o. The generic i/o scheduler would make sure that it places the
>    barrier request and all other requests coming after it after all the
>    previous requests in the queue. Barriers may be implemented in
>    different ways depending on the driver. A SCSI driver for example could
>    make use of ordered tags to preserve the necessary ordering with a
>    lower impact on throughput. For IDE this might be two sync cache flush:
>    a pre and post flush when encountering a barrier write.
> 
> The "flush" comment is associated with IDE, so it wasn't clear that the 
> device cache is always cleared to force the data to the platter.

The above should mention that the ordered tag comment for SCSI assumes
that the drive uses write through caching. If it does, then an ordered
tag is enough. If it doesn't, then you need a bit more than that (a post
flush, after the ordered tag has completed).
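As an aside, whether a particular drive is actually running with
write-back caching can be checked from userspace; a sketch, with
/dev/sda as a placeholder and assuming hdparm/sdparm are installed:

    hdparm -W /dev/sda           # ATA: reports the current write-caching setting
    sdparm --get=WCE /dev/sda    # SCSI: reads the WCE bit from the caching mode page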

-- 
Jens Axboe


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Jens Axboe
On Sat, Jun 02 2007, Tejun Heo wrote:
> Hello,
> 
> Jens Axboe wrote:
> >> Would that be very different from issuing barrier and not waiting for
> >> its completion?  For ATA and SCSI, we'll have to flush write back cache
> >> anyway, so I don't see how we can get performance advantage by
> >> implementing separate WRITE_ORDERED.  I think zero-length barrier
> >> (haven't looked at the code yet, still recovering from jet lag :-) can
> >> serve as genuine barrier without the extra write tho.
> > 
> > As always, it depends :-)
> > 
> > If you are doing pure flush barriers, then there's no difference. Unless
> > you only guarantee ordering wrt previously submitted requests, in which
> > case you can eliminate the post flush.
> > 
> > If you are doing ordered tags, then just setting the ordered bit is
> > enough. That is different from the barrier in that we don't need a flush
> > of FUA bit set.
> 
> Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
> flush to separate requests before and after it (haven't looked at the
> code yet, will soon).  Can you enlighten me?

Yeah, that's what the zero-length barrier implementation I posted does.
Not sure if you have a question beyond that, if so fire away :-)

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Tejun Heo
Hello,

Jens Axboe wrote:
>> Would that be very different from issuing barrier and not waiting for
>> its completion?  For ATA and SCSI, we'll have to flush write back cache
>> anyway, so I don't see how we can get performance advantage by
>> implementing separate WRITE_ORDERED.  I think zero-length barrier
>> (haven't looked at the code yet, still recovering from jet lag :-) can
>> serve as genuine barrier without the extra write tho.
> 
> As always, it depends :-)
> 
> If you are doing pure flush barriers, then there's no difference. Unless
> you only guarantee ordering wrt previously submitted requests, in which
> case you can eliminate the post flush.
> 
> If you are doing ordered tags, then just setting the ordered bit is
> enough. That is different from the barrier in that we don't need a flush
> of FUA bit set.

Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
flush to separate requests before and after it (haven't looked at the
code yet, will soon).  Can you enlighten me?

Thanks.

-- 
tejun
