Re: scsi vs ide performance on fsync's
On Wed, 7 Mar 2001, Stephen C. Tweedie wrote:

> Hi,
>
> On Tue, Mar 06, 2001 at 10:44:34AM -0800, Linus Torvalds wrote:
> > On Tue, 6 Mar 2001, Alan Cox wrote:
> > > You want a write barrier. Write buffering (at least for short
> > > intervals) in the drive is very sensible. The kernel needs to be
> > > able to send drivers a write barrier which will not be completed
> > > with outstanding commands before the barrier.
> >
> > But Alan is right - we need a "sync" command or something. I don't
> > know if IDE has one (it already might, for all I know).
>
> Sync and barrier are very different models. With barriers we can
> enforce some element of write ordering without actually waiting for
> the IOs to complete; with sync, we're explicitly asking to be told
> when the data has become persistent. We can make use of both of these.
>
> SCSI certainly lets us do both of these operations independently. IDE
> has the sync/flush command afaik, but I'm not sure whether the IDE
> tagged command stuff has the equivalent of SCSI's ordered tag bits.
> Andre?

ATA-TCQ sucks, to put it plain and simple. It really requires a special
host, and only the HPT366 series works. It is similar, but not clear as
to the nature. We are debating the usage of it now in T13.

Cheers,

Andre Hedrick
Linux ATA Development

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: scsi vs ide performance on fsync's
Hi,

Jens Axboe:
> > But most disks these days support IDE-SCSI, and SCSI does have ordered
> > tags, so...
>
> Any proof to back this up? To my knowledge, only some WDC ATA disks
> can be ATAPI driven.
>
Ummm, no, but that was my impression. If that's wrong, I apologize and
will state the opposite, next time.
--
Matthias Urlichs  |  noris network AG  |  http://smurf.noris.de/
--
You see things; and you say 'Why?'
But I dream things that never were; and I say 'Why not?'
                --George Bernard Shaw [Back to Methuselah]
Re: scsi vs ide performance on fsync's
On Fri, Mar 09 2001, Matthias Urlichs wrote:
> Matthias Urlichs:
> > On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > > SCSI certainly lets us do both of these operations independently. IDE
> > > has the sync/flush command afaik, but I'm not sure whether the IDE
> > > tagged command stuff has the equivalent of SCSI's ordered tag bits.
> > > Andre?
> >
> > IDE has no concept of ordered tags...
> >
> But most disks these days support IDE-SCSI, and SCSI does have ordered
> tags, so...

Any proof to back this up? To my knowledge, only some WDC ATA disks
can be ATAPI driven.

--
Jens Axboe
Re: scsi vs ide performance on fsync's
>> It's pretty clear that the IDE drive(r) is *not* waiting for the physical
>> write to take place before returning control to the user program, whereas
>> the SCSI drive(r) is.
>
> This would not be unexpected.
>
> IDE drives generally always do write buffering. I don't even know if you
> _can_ turn it off. So the drive claims to have written the data as soon as
> it has made the write buffer.
>
> It's definitely not the driver, but the actual drive.

As I suspected. However, testing shows that many drives, including most
IBMs, do respond to "hdparm -W0", which turns write-caching off (some
drives don't, including some Seagates). There are also drives in
existence that have no cache at all (mostly old sub-1G drives) and some
with too little for this to make a significant difference (the old 1.2G
TravelStar in one of my PowerBooks is an example).

So, is there a way to force (the majority of, rather than all) IDE
drives to wait until it's been truly committed to media? If so, will
this be integrated into the appropriate parts of the kernel,
particularly for certain members of the sync() family and FS unmounting?
--
from:     Jonathan "Chromatix" Morton
mail:     [EMAIL PROTECTED]  (not for attachments)
big-mail: [EMAIL PROTECTED]
uni-mail: [EMAIL PROTECTED]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----
Re: scsi vs ide performance on fsync's
Hi,

Matthias Urlichs:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > SCSI certainly lets us do both of these operations independently. IDE
> > has the sync/flush command afaik, but I'm not sure whether the IDE
> > tagged command stuff has the equivalent of SCSI's ordered tag bits.
> > Andre?
>
> IDE has no concept of ordered tags...
>
But most disks these days support IDE-SCSI, and SCSI does have ordered
tags, so...

Has anybody done speed comparisons between "native" IDE and IDE-SCSI?
--
Matthias Urlichs  |  noris network AG  |  http://smurf.noris.de/
--
Success is something I will dress for when I get there, and not until.
Re: scsi vs ide performance on fsync's
On Wednesday, March 07, 2001 08:56:59 PM + "Stephen C. Tweedie"
<[EMAIL PROTECTED]> wrote:

> Hi,
>
> On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> > On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > >
> > > For most fs'es, that's not an issue. The fs won't start writeback on
> > > the primary disk at all until the journal commit has been acknowledged
> > > as firm on disk.
> >
> > But do you then force wait on that journal commit?
>
> It doesn't matter too much --- it's only the writeback which is doing
> this (ext3 uses a separate journal thread for it), so any sleep is
> only there to wait for the moment when writeback can safely begin:
> users of the filesystem won't see any stalls.

It is similar under reiserfs, unless the log is full and new transactions
have to wait for flushes to free up the log space. It is probably valid
to assume the dedicated log device will be large enough that this won't
happen very often, or fast enough (nvram) that it won't matter when it
does happen.

> > A barrier operation is sufficient then. So you're saying don't
> > over design, a simple barrier is all you need?
>
> Pretty much so. The simple barrier is the only thing which can be
> effectively optimised at the hardware level with SCSI anyway.

The simple barrier is a good starting point regardless. If we can find
hardware where it makes sense to do cross queue barriers (big raid
controllers?), it might be worth trying.

-chris
Re: scsi vs ide performance on fsync's
Hi,

On Wed, Mar 07, 2001 at 10:36:38AM -0800, Linus Torvalds wrote:
> On Wed, 7 Mar 2001, Jeremy Hansen wrote:
> >
> > So in the meantime as this gets worked out on a lower level, we've
> > decided to take the fsync() out of berkeley db for mysql transaction
> > logs and mount the filesystem -o sync.
> >
> > Can anyone perhaps tell me why this may be a bad idea?
>
>  - it doesn't help. The disk will _still_ do write buffering. It's the
>    DISK, not the OS. It doesn't matter what you do.
>  - your performance will suck.

Added to which, "-o sync" only enables sync metadata updates. It still
doesn't force an fsync on data writes.

--Stephen
Re: scsi vs ide performance on fsync's
Hi!

> If not, then the drive could by all means optimise the access pattern
> provided it acked the data or provided the results in the same order as
> the instructions were given. This would probably shorten the time for a
> new pathological set (distributed evenly across the disk surface, but
> all on the worst-possible angular offset compared to the previous) to
> (8ms seek time + 5ms rotational delay) * 4000 writes ~= 52 seconds
> (compared with around 120 seconds for the previous set with rotational
> delay factored in). Great, so you only need half as big a power store
> to guarantee writing that much data, but it's still too much. Even with
> a 15000rpm drive and 5ms seek times, it would still be too much.

The drive can trivially seek to a reserved track and flush the data onto
it, all within 25 msec. Then it can move the data to its proper location
on the next powerup.

Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
Re: scsi vs ide performance on fsync's
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> > On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > >
> > > For most fs'es, that's not an issue. The fs won't start writeback on
> > > the primary disk at all until the journal commit has been acknowledged
> > > as firm on disk.
> >
> > But do you then force wait on that journal commit?
>
> It doesn't matter too much --- it's only the writeback which is doing
> this (ext3 uses a separate journal thread for it), so any sleep is
> only there to wait for the moment when writeback can safely begin:
> users of the filesystem won't see any stalls.

Ok, but even if this is true for ext3 it may not be true for other
journalled fs. AFAIR, reiser is doing an explicit wait_on_buffer which
would then amount to quite a performance hit (speculation, haven't
measured).

> > A barrier operation is sufficient then. So you're saying don't
> > over design, a simple barrier is all you need?
>
> Pretty much so. The simple barrier is the only thing which can be
> effectively optimised at the hardware level with SCSI anyway.

True

--
Jens Axboe
Re: scsi vs ide performance on fsync's
Hi,

On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> >
> > For most fs'es, that's not an issue. The fs won't start writeback on
> > the primary disk at all until the journal commit has been acknowledged
> > as firm on disk.
>
> But do you then force wait on that journal commit?

It doesn't matter too much --- it's only the writeback which is doing
this (ext3 uses a separate journal thread for it), so any sleep is
only there to wait for the moment when writeback can safely begin:
users of the filesystem won't see any stalls.

> A barrier operation is sufficient then. So you're saying don't
> over design, a simple barrier is all you need?

Pretty much so. The simple barrier is the only thing which can be
effectively optimised at the hardware level with SCSI anyway.

Cheers,
 Stephen
Re: scsi vs ide performance on fsync's
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> On Wed, Mar 07, 2001 at 07:51:52PM +0100, Jens Axboe wrote:
> > On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> >
> > My bigger concern is when the journalled fs has a log on a different
> > queue.
>
> For most fs'es, that's not an issue. The fs won't start writeback on
> the primary disk at all until the journal commit has been acknowledged
> as firm on disk.

But do you then force wait on that journal commit?

> Certainly for ext3, synchronisation between the log and the primary
> disk is no big thing. What really hurts is writing to the log, where
> we have to wait for the log writes to complete before submitting the
> commit write (which is sequentially allocated just after the rest of
> the log blocks). Specifying a barrier on the commit block would allow
> us to keep the log device streaming, and the fs can deal with
> synchronising the primary disk quite happily by itself.

A barrier operation is sufficient then. So you're saying don't
over design, a simple barrier is all you need?

--
Jens Axboe
Re: scsi vs ide performance on fsync's
Hi,

On Wed, Mar 07, 2001 at 07:51:52PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
>
> My bigger concern is when the journalled fs has a log on a different
> queue.

For most fs'es, that's not an issue. The fs won't start writeback on
the primary disk at all until the journal commit has been acknowledged
as firm on disk.

Certainly for ext3, synchronisation between the log and the primary
disk is no big thing. What really hurts is writing to the log, where
we have to wait for the log writes to complete before submitting the
commit write (which is sequentially allocated just after the rest of
the log blocks). Specifying a barrier on the commit block would allow
us to keep the log device streaming, and the fs can deal with
synchronising the primary disk quite happily by itself.

Cheers,
 Stephen
Re: scsi vs ide performance on fsync's
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > Yep, it's much harder than it seems. Especially because for the
> > barrier to be really useful, having inter-request dependencies becomes
> > a requirement. So you can say something like 'flush X and Y, but don't
> > flush Y before X is done'.
>
> Yes. Fortunately, the simplest possible barrier is just a matter of
> marking a request as non-reorderable, and then making sure that you
> both flush the elevator queue before servicing that request, and defer
> any subsequent requests until the barrier request has been satisfied.
> Once it has gone through, you can let through the deferred requests (in
> order, up to the point at which you encounter another barrier).

The above should have been inter-queue dependencies. For one queue it's
not a big issue; you basically described the whole sequence above.
Either sequence it as zero for a non-empty queue and make sure the low
level driver orders or flushes, or just hand it directly to the device.
My bigger concern is when the journalled fs has a log on a different
queue.

> Only if the queue is empty can you give a barrier request directly to
> the driver. The special optimisation you can do in this case with
> SCSI is to continue to allow new requests through even before the
> barrier has completed if the disk supports ordered queue tags.

Yep, IDE will have to pay the price of a flush.

--
Jens Axboe
Re: scsi vs ide performance on fsync's
Hi,

On Wed, Mar 07, 2001 at 03:12:41PM +0100, Jens Axboe wrote:
>
> Yep, it's much harder than it seems. Especially because for the barrier
> to be really useful, having inter-request dependencies becomes a
> requirement. So you can say something like 'flush X and Y, but don't
> flush Y before X is done'.

Yes. Fortunately, the simplest possible barrier is just a matter of
marking a request as non-reorderable, and then making sure that you
both flush the elevator queue before servicing that request, and defer
any subsequent requests until the barrier request has been satisfied.
Once it has gone through, you can let through the deferred requests (in
order, up to the point at which you encounter another barrier).

Only if the queue is empty can you give a barrier request directly to
the driver. The special optimisation you can do in this case with
SCSI is to continue to allow new requests through even before the
barrier has completed if the disk supports ordered queue tags.

--Stephen
Re: scsi vs ide performance on fsync's
On Wed, 7 Mar 2001, Jeremy Hansen wrote:
>
> So in the meantime as this gets worked out on a lower level, we've
> decided to take the fsync() out of berkeley db for mysql transaction
> logs and mount the filesystem -o sync.
>
> Can anyone perhaps tell me why this may be a bad idea?

Two reasons:
 - it doesn't help. The disk will _still_ do write buffering. It's the
   DISK, not the OS. It doesn't matter what you do.
 - your performance will suck.

Use fsync(). That's what it's there for. Tell people who don't have an
UPS to disable write caching. If they have one (of the many, apparently)
IDE disks that refuse to disable it, tell them to either get an UPS, or
to switch to another disk.

                Linus
Re: scsi vs ide performance on fsync's
So in the meantime as this gets worked out on a lower level, we've
decided to take the fsync() out of berkeley db for mysql transaction
logs and mount the filesystem -o sync.

Can anyone perhaps tell me why this may be a bad idea?

Thanks
-jeremy

On Tue, 6 Mar 2001, Jeremy Hansen wrote:

> Ahh, now we're getting somewhere.
>
> IDE:
>
> jeremy:~# time ./xlog file.out fsync
>
> real    0m33.739s
> user    0m0.010s
> sys     0m0.120s
>
> so now this corresponds to the performance we're seeing on SCSI.
>
> So I guess what I'm wondering now is can or should anything be done
> about this on the SCSI side?
>
> Thanks
> -jeremy
>
> On Tue, 6 Mar 2001, Mike Black wrote:
>
> > Write caching is the culprit for the performance diff:
> >
> > On IDE:
> > time xlog /blah.dat fsync
> > 0.000u 0.190s 0:01.72 11.0%  0+0k 0+0io 91pf+0w
> > # hdparm -W 0 /dev/hda
> >
> > /dev/hda:
> >  setting drive write-caching to 0 (off)
> > # time xlog /blah.dat fsync
> > 0.000u 0.220s 0:50.60 0.4%   0+0k 0+0io 91pf+0w
> > # hdparm -W 1 /dev/hda
> >
> > /dev/hda:
> >  setting drive write-caching to 1 (on)
> > # time xlog /blah.dat fsync
> > 0.010u 0.230s 0:01.88 12.7%  0+0k 0+0io 91pf+0w
> >
> > On my SCSI setup:
> > # time xlog /usr5/blah.dat fsync
> > 0.020u 0.230s 0:30.48 0.8%   0+0k 0+0io 91pf+0w
> >
> > Michael D. Black   Principal Engineer
> > [EMAIL PROTECTED]   321-676-2923,x203
> > http://www.csihq.com   Computer Science Innovations
> > http://www.csihq.com/~mike   My home page
> > FAX 321-676-2355
> >
> > ----- Original Message -----
> > From: "Andre Hedrick" <[EMAIL PROTECTED]>
> > To: "Linus Torvalds" <[EMAIL PROTECTED]>
> > Cc: "Douglas Gilbert" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> > Sent: Tuesday, March 06, 2001 2:12 AM
> > Subject: Re: scsi vs ide performance on fsync's
> >
> > On Mon, 5 Mar 2001, Linus Torvalds wrote:
> >
> > > Well, it's fairly hard for the kernel to do much about that - it's
> > > almost certainly just IDE doing write buffering on the disk itself.
> > > No OS involved.
> >
> > I am pushing for WC to be defaulted in the off state, but as you know
> > I have a bigger fight than caching on my hands...
> >
> > > I don't know if there is any way to turn off a write buffer on an
> > > IDE disk.
> >
> > You want a forced set of commands to kill caching at init?
> >
> > Andre Hedrick
> > Linux ATA Development
> > ASL Kernel Development
> >
> > ASL, Inc.            Toll free: 1-877-ASL-3535
> > 1757 Houret Court    Fax: 1-408-941-2071
> > Milpitas, CA 95035   Web: www.aslab.com

--
this is my sig.
Re: scsi vs ide performance on fsync's
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> SCSI certainly lets us do both of these operations independently. IDE
> has the sync/flush command afaik, but I'm not sure whether the IDE
> tagged command stuff has the equivalent of SCSI's ordered tag bits.
> Andre?

IDE has no concept of ordered tags...

--
Jens Axboe
Re: scsi vs ide performance on fsync's
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > SCSI has ordered tag, which fit the model Alan described quite nicely.
> > I've been meaning to implement this for some time, it would be handy
> > for journalled fs to use such a barrier.  Since ATA doesn't do queueing
> > (at least not in current Linux), a synchronize cache is probably the
> > only way to go there.
>
> Note that you also have to preserve the position of the barrier in the
> elevator queue, and you need to prevent LVM and soft raid from
> violating the barrier if different commands end up being sent to
> different disks.

Yep, it's much harder than it seems.  Especially because for the barrier
to be really useful, having inter-request dependencies becomes a
requirement.  So you can say something like 'flush X and Y, but don't
flush Y before X is done'.

--
Jens Axboe
Re: scsi vs ide performance on fsync's
Hi,

On Tue, Mar 06, 2001 at 09:37:20PM +0100, Jens Axboe wrote:
> SCSI has ordered tag, which fit the model Alan described quite nicely.
> I've been meaning to implement this for some time, it would be handy
> for journalled fs to use such a barrier.  Since ATA doesn't do queueing
> (at least not in current Linux), a synchronize cache is probably the
> only way to go there.

Note that you also have to preserve the position of the barrier in the
elevator queue, and you need to prevent LVM and soft raid from
violating the barrier if different commands end up being sent to
different disks.

--Stephen
Re: scsi vs ide performance on fsync's
Hi,

On Tue, Mar 06, 2001 at 10:44:34AM -0800, Linus Torvalds wrote:
> On Tue, 6 Mar 2001, Alan Cox wrote:
> > You want a write barrier.  Write buffering (at least for short intervals) in
> > the drive is very sensible.  The kernel needs to be able to send drivers a
> > write barrier which will not be completed with outstanding commands before
> > the barrier.
>
> But Alan is right - we need a "sync" command or something.  I don't know
> if IDE has one (it already might, for all I know).

Sync and barrier are very different models.  With barriers we can
enforce some element of write ordering without actually waiting for the
IOs to complete; with sync, we're explicitly asking to be told when
the data has become persistent.  We can make use of both of these.

SCSI certainly lets us do both of these operations independently.  IDE
has the sync/flush command afaik, but I'm not sure whether the IDE
tagged command stuff has the equivalent of SCSI's ordered tag bits.
Andre?

--Stephen
Re: scsi vs ide performance on fsync's
Andre Hedrick ([EMAIL PROTECTED]) wrote on Wed Mar 07 2001 - 01:58:44 EST:

> On Wed, 7 Mar 2001, Jonathan Morton wrote:
[ snip ]
> > > Since all OSes that enable WC at init will flush
> > > it at shutdown and do a periodic purge with in-activity.
> >
> > But Linux doesn't, as has been pointed out earlier.  We need to fix Linux.
>
> Friend, I have fixed this some time ago but it is bundled with TASKFILE
> that is not going to arrive until 2.5.  Because I need a way to execute
> this and hold the driver until it is complete, regardless of the shutdown
> method.

I don't understand 100%.  Is TASKFILE required to do proper write cache
flushing?

> > > Err, last time I check all good devices flush their write caching on their
> > > own to take advantage of having a maximum cache for prefetching.
> >
> > Which doesn't work if the buffer is filled up by the OS 0.5 seconds before
> > the power goes.
>
> Maybe that is why there is a vendor disk-cache dump zone on the edge of
> the platters...just maybe you need to buy your drives from somebody that
> does this and has a predictive sector stretcher as the energy from the
> inertia by the DC three-phase motor executes the dump.

So where is a list of drives that do this?
www.list-of-hardware-that-doesnt-suck.com is not responding ...

> Ever wondered why modern drives have open collectors on the databuss?

no :-)

--
David Balazic
--
"Be excellent to each other." - Bill & Ted
Re: scsi vs ide performance on fsync's
Hi,

On Wed, Mar 07, 2001 at 03:12:41PM +0100, Jens Axboe wrote:
> Yep, it's much harder than it seems.  Especially because for the barrier
> to be really useful, having inter-request dependencies becomes a
> requirement.  So you can say something like 'flush X and Y, but don't
> flush Y before X is done'.

Yes.  Fortunately, the simplest possible barrier is just a matter of
marking a request as non-reorderable, and then making sure that you
both flush the elevator queue before servicing that request, and defer
any subsequent requests until the barrier request has been satisfied.
Once it has gone through, you can let through the deferred requests (in
order, up to the point at which you encounter another barrier).

Only if the queue is empty can you give a barrier request directly to
the driver.  The special optimisation you can do in this case with SCSI
is to continue to allow new requests through even before the barrier
has completed if the disk supports ordered queue tags.

--Stephen
Re: scsi vs ide performance on fsync's
Hi,

On Wed, Mar 07, 2001 at 07:51:52PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
>
> My bigger concern is when the journalled fs has a log on a different
> queue.

For most fs'es, that's not an issue.  The fs won't start writeback on
the primary disk at all until the journal commit has been acknowledged
as firm on disk.

Certainly for ext3, synchronisation between the log and the primary
disk is no big thing.  What really hurts is writing to the log, where
we have to wait for the log writes to complete before submitting the
commit write (which is sequentially allocated just after the rest of
the log blocks).  Specifying a barrier on the commit block would allow
us to keep the log device streaming, and the fs can deal with
synchronising the primary disk quite happily by itself.

Cheers,
 Stephen
Re: scsi vs ide performance on fsync's
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> On Wed, Mar 07, 2001 at 07:51:52PM +0100, Jens Axboe wrote:
> > My bigger concern is when the journalled fs has a log on a different
> > queue.
>
> For most fs'es, that's not an issue.  The fs won't start writeback on
> the primary disk at all until the journal commit has been acknowledged
> as firm on disk.

But do you then force wait on that journal commit?

> Certainly for ext3, synchronisation between the log and the primary
> disk is no big thing.  What really hurts is writing to the log, where
> we have to wait for the log writes to complete before submitting the
> commit write (which is sequentially allocated just after the rest of
> the log blocks).  Specifying a barrier on the commit block would allow
> us to keep the log device streaming, and the fs can deal with
> synchronising the primary disk quite happily by itself.

A barrier operation is sufficient then.  So you're saying don't over
design, a simple barrier is all you need?

--
Jens Axboe
Re: scsi vs ide performance on fsync's
Hi,

On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > For most fs'es, that's not an issue.  The fs won't start writeback on
> > the primary disk at all until the journal commit has been acknowledged
> > as firm on disk.
>
> But do you then force wait on that journal commit?

It doesn't matter too much --- it's only the writeback which is doing
this (ext3 uses a separate journal thread for it), so any sleep is only
there to wait for the moment when writeback can safely begin: users of
the filesystem won't see any stalls.

> A barrier operation is sufficient then.  So you're saying don't over
> design, a simple barrier is all you need?

Pretty much so.  The simple barrier is the only thing which can be
effectively optimised at the hardware level with SCSI anyway.

Cheers,
 Stephen
Re: scsi vs ide performance on fsync's
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> > But do you then force wait on that journal commit?
>
> It doesn't matter too much --- it's only the writeback which is doing
> this (ext3 uses a separate journal thread for it), so any sleep is only
> there to wait for the moment when writeback can safely begin: users of
> the filesystem won't see any stalls.

Ok, but even if this is true for ext3 it may not be true for other
journalled fs.  AFAIR, reiser is doing an explicit wait_on_buffer which
would then amount to quite a performance hit (speculation, haven't
measured).

> > A barrier operation is sufficient then.  So you're saying don't over
> > design, a simple barrier is all you need?
>
> Pretty much so.  The simple barrier is the only thing which can be
> effectively optimised at the hardware level with SCSI anyway.

True

--
Jens Axboe
Re: scsi vs ide performance on fsync's
On Wed, 7 Mar 2001, Jonathan Morton wrote:

> Still doesn't make a difference - there is one revolution between writes,
> no matter where on disk it is.

Oh it does, because you are hitting the same sector with the same data.
Rotate your buffer and then you will see the difference.

> > Because of WinBench!
> > All the prefetch/caching are modeled to be optimized to that bench-mark.
>
> Lies, damn lies, statistics, benchmarks, delivery dates.  Especially a
> consumer-oriented benchmark like WinBench.  It's perfectly natural to
> optimise for particular access patterns, but IMHO that doesn't excuse
> breaking the drive just to get a better benchmark score.

Obviously you have never been in the bowels of drive industry hell.
Why do you think there was a change in ATA-6 to require the
Write-Verify-Read to always return stuff from the platter?
Because the SOB's in storage LIE!

A real wake-up call for you is that everything about the world of
storage is a big-fat-whopper of a LIE.  Storage devices are BLACK-BOXES
with the standards/rules to communicate being dictated by the device,
not the host.  Storage devices are no better than a Coke(tm) vending
machine.  You push "Coke", it gives you "Coke".  You have not a clue to
how it arrives or where it came from.  Same thing about reading from a
drive.

> That isn't the point!  I'm not talking about the physical mechanism, which
> indeed is often the same between one generation of SCSI and the next
> generation of IDE devices.  I'm talking about the IDE controller which is
> slapped on the bottom of said mechanism.  The mech can be of world-class
> quality, but if the controller is shot it doesn't cut the grain.

So there is a $5 difference in the cell-gates and the line drivers are
more powerful, 80GB ATA + $5 != 80GB SCSI.

> > Since all OSes that enable WC at init will flush
> > it at shutdown and do a periodic purge with in-activity.
>
> But Linux doesn't, as has been pointed out earlier.  We need to fix Linux.

Friend, I have fixed this some time ago but it is bundled with TASKFILE
that is not going to arrive until 2.5.  Because I need a way to execute
this and hold the driver until it is complete, regardless of the
shutdown method.

> > Err, last time I check all good devices flush their write caching on their
> > own to take advantage of having a maximum cache for prefetching.
>
> Which doesn't work if the buffer is filled up by the OS 0.5 seconds before
> the power goes.

Maybe that is why there is a vendor disk-cache dump zone on the edge of
the platters...just maybe you need to buy your drives from somebody
that does this and has a predictive sector stretcher as the energy from
the inertia by the DC three-phase motor executes the dump.

Ever wondered why modern drives have open collectors on the databuss?
Maybe to disconnect the power draw so that the motor, now a generator,
provides the needed power to complete the data dump...

> I'm sorry if this looks like another troll, but I really do like to clear
> up confusion.  I do accept that IDE now has good enough real performance
> for many purposes, but in terms of enforced quality it clearly lags behind
> the entire SCSI field.

I have no desire to debate the merits, but when your onboard host for
ATA starts shipping with Gigabit-Copper speeds then we can have a
pissing contest.

Cheers,

Andre Hedrick
Linux ATA Development
ASL Kernel Development
Re: scsi vs ide performance on fsync's
> I am not going to bite on your flame bait, and you are free to waste your
> money.

I don't flamebait.  I was trying to clear up some confusion...

> No, SCSI does with queuing.
> I am saying that the ata/ide driver rips the heart out of the
> io_request_lock way too darn long.  This means that upon executing a
> request virtually all interrupts are whacked and the driver is dominating
> the system.  Given that IO's are limited to 128 sectors or one DMA PRD,
> this is vastly smaller than the SCSI transfer limit.

Ah, so the ATA driver hogs interrupts.  Nice.  Kinda explains why I can't
use the mouse on some systems when I use cdparanoia.

> Okay, real short: limit to two zones that are equal in size.
> The inner and outer, and the latter will cover more physical media than
> the former.  Simple two zone model.

Still doesn't make a difference - there is one revolution between writes,
no matter where on disk it is.

> > Under those circumstances,
> > I would expect my 7200rpm Seagate to perform slower than my 10,000rpm IBM
> > *regardless* of seeking performance.  Seeking doesn't come into it!
>
> It does, because more RPM means more air-flow and more work to keep the
> position stable.

That's the engineers' problem, not ours.  In fact, it's not really a
problem because my IBM drive gave almost exactly the correct performance
result, even at 10,000rpm, therefore it's managing to keep the position
stable regardless of airflow.

> > Why does this sound familiar?
>
> Because of WinBench!
> All the prefetch/caching are modeled to be optimized to that bench-mark.

Lies, damn lies, statistics, benchmarks, delivery dates.  Especially a
consumer-oriented benchmark like WinBench.  It's perfectly natural to
optimise for particular access patterns, but IMHO that doesn't excuse
breaking the drive just to get a better benchmark score.

> > Personally, I feel the bottom line is rapidly turning into "if you have
> > critical data, don't put it on an IDE disk".  There are too many corners
> > cut when compared to ostensibly similar SCSI devices.  Call me a SCSI bigot
> > if you like - I realise SCSI is more expensive, but you get what you pay
> > for.
>
> Let me slap you in the face with a salami stick!
> ATA 7200 RPM Drives are using SCSI 7200 RPM Drive HDA's.
> So you say ATA is Lame?  Then so was your SCSI 7200's.

That isn't the point!  I'm not talking about the physical mechanism, which
indeed is often the same between one generation of SCSI and the next
generation of IDE devices.  I'm talking about the IDE controller which is
slapped on the bottom of said mechanism.  The mech can be of world-class
quality, but if the controller is shot it doesn't cut the grain.

> Since all OSes that enable WC at init will flush
> it at shutdown and do a periodic purge with in-activity.

But Linux doesn't, as has been pointed out earlier.  We need to fix Linux.

Also, as I and someone else have also pointed out, there are drives in
circulation which refuse to turn off write caching, including one sitting
in my main workstation - the one which is rebooted the most often, simply
because I need to use Windoze 95 for a few onerous tasks.  I haven't
suffered disk corruption yet, because Linux unmounts the filesystems and
flushes its own buffers several seconds before powering down, and uses a
non-pathological access pattern, but I sure don't want to see the first
time this doesn't work properly.

> Err, last time I check all good devices flush their write caching on their
> own to take advantage of having a maximum cache for prefetching.

Which doesn't work if the buffer is filled up by the OS 0.5 seconds before
the power goes.

I'm sorry if this looks like another troll, but I really do like to clear
up confusion.  I do accept that IDE now has good enough real performance
for many purposes, but in terms of enforced quality it clearly lags behind
the entire SCSI field.

--
from:     Jonathan "Chromatix" Morton
mail:     [EMAIL PROTECTED]  (not for attachments)
big-mail: [EMAIL PROTECTED]
uni-mail: [EMAIL PROTECTED]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----
Re: scsi vs ide performance on fsync's
> itself is a bad thing, particularly given the amount of CPU overhead that
> IDE drives demand while attached to the controller (orders of magnitude
> higher than a good SCSI controller) - the more overhead we can hand off to

I know this is just a troll by a scsi-believer, but I'm biting anyway.

on current machines and disks, ide costs a few % CPU, depending on which
CPU, disk, kernel, the sustained bandwidth, etc.  I've measured this
using the now-trendy method of noticing how much the IO costs a
separate, CPU-bound benchmark:

        load = 1 - (loadedPerf / unloadedPerf)

my cheesy duron/600 desktop typically shows ~2% actual cost when running
bonnie's block IO tests.
Re: scsi vs ide performance on fsync's
On Tue, Mar 06 2001, David Balazic wrote:
> > > Wrong model
> > >
> > > You want a write barrier.  Write buffering (at least for short intervals)
> > > in the drive is very sensible.  The kernel needs to be able to send
> > > drivers a write barrier which will not be completed with outstanding
> > > commands before the barrier.
> >
> > Agreed.
> >
> > Write buffering is incredibly useful on a disk - for all the same reasons
> > that an OS wants to do it.  The disk can use write buffering to speed up
> > writes a lot - not just lower the _perceived_ latency by the OS, but to
> > actually improve performance too.
> >
> > But Alan is right - we need a "sync" command or something.  I don't know
> > if IDE has one (it already might, for all I know).
>
> ATA, SCSI and ATAPI all have a FLUSH_CACHE command. (*)
> Whether the drives implement it is another question ...

(Usually called SYNCHRONIZE_CACHE btw)

SCSI has ordered tag, which fit the model Alan described quite nicely.
I've been meaning to implement this for some time, it would be handy
for journalled fs to use such a barrier.  Since ATA doesn't do queueing
(at least not in current Linux), a synchronize cache is probably the
only way to go there.

> (*) references :
> ATA-6 draft standard from www.t13.org
> MtFuji document from ftp.avc-pioneer.com

--
Jens Axboe
Re: scsi vs ide performance on fsync's
Linus Torvalds himself wrote:

> On Tue, 6 Mar 2001, Alan Cox wrote:
> >
> > > > I don't know if there is any way to turn of a write buffer on an IDE disk.
> > > You want a forced set of commands to kill caching at init?
> >
> > Wrong model
> >
> > You want a write barrier.  Write buffering (at least for short intervals) in
> > the drive is very sensible.  The kernel needs to be able to send drivers a
> > write barrier which will not be completed with outstanding commands before
> > the barrier.
>
> Agreed.
>
> Write buffering is incredibly useful on a disk - for all the same reasons
> that an OS wants to do it.  The disk can use write buffering to speed up
> writes a lot - not just lower the _perceived_ latency by the OS, but to
> actually improve performance too.
>
> But Alan is right - we need a "sync" command or something.  I don't know
> if IDE has one (it already might, for all I know).

ATA, SCSI and ATAPI all have a FLUSH_CACHE command. (*)
Whether the drives implement it is another question ...

(*) references :
ATA-6 draft standard from www.t13.org
MtFuji document from ftp.avc-pioneer.com

--
David Balazic
--
"Be excellent to each other." - Bill & Ted
Re: scsi vs ide performance on fsync's
Jonathan, I am not going to bite on your flame bate, and are free to waste you money. On Tue, 6 Mar 2001, Jonathan Morton wrote: > >> It's pretty clear that the IDE drive(r) is *not* waiting for the physical > >> write to take place before returning control to the user program, whereas > >> the SCSI drive(r) is. Both devices appear to be performing the write > > > >Wrong, IDE does not unplug thus the request is almost, I hate to admit it > >SYNC and not ASYNC :-( Thus if the drive acks that it has the data then > >the driver lets go. > > Uh, run that past me again? You are saying that because the IDE drive hogs > the bus until the write is complete or the driver forcibly disconnects, you > make the driver disconnect to save time? Or (more likely) have I totally > misread you... No, SCSI does with queuing. I am saying that the ata/ide driver rips the heart out of the io_request_lock what to darn long. This means that upon execution a request virtually all interrupts are wacked and the drivers in dominating the system. Given that IO's are limited to 128 sectors or one DMA PRD, this is vastly smaller than the SCSI trasfer limit. Since you are not using the test "Write Verify Read" all drives are going to lie. Only this command will force the stuff to hit the platters and return a read out of the dirty-cache. > >pre-seek. Thus the question is were is the drive leaving the heads when > >not active? It does not appear to be in the zone 1 region. > > Duh... I don't quite see what you're saying here, either. The test is a Okay real shortlimit to two zones that are equal in size. The inner and outer, and the latter will cover more physical media than the former. Simple Two zone model. > continuous rewrite of the same sector of the disk, so the head shouldn't be > moving *at all* until it's all over. In addition, the drive can't start True and you slip a rev. everytime. 
> writing the sector when it's just finished writing it, so it has to wait > for the rotation to bring it back round again. Under those circumstances, > I would expect my 7200rpm Seagate to perform slower than my 1rpm IBM > *regardless* of seeking performance. Seeking doesn't come into it! It does, because more RPM means more air-flow and more work to keep the position stable. > >Thus if your drive is one of those that does a stress test check that goes: > >"this bozo did not really mean to turn off write caching, re-enabling " > > Why does this sound familiar? Because of WinBench! All the prefetch/caching are modeled to be optimized to that bench-mark. > Personally, I feel the bottom line is rapidly turning into "if you have > critical data, don't put it on an IDE disk". There are too many corners > cut when compared to ostensibly similar SCSI devices. Call me a SCSI bigot > if you like - I realise SCSI is more expensive, but you get what you pay > for. Let me slap you in the face with a salami stick! ATA 7200 RPM Drives are using SCSI 7200 RPM Drive HDA's. So you say ATA is Lame? Then so were your SCSI 7200's. > Of course, under normal circumstances, you leave write-caching and UDMA on, > and you don't use a pathological stress-test like we've been doing. That > gives the best performance. But sometimes it's necessary to use these > "pathological" access patterns to achieve certain system functions. > Suppose, harking back to the Windows data-corruption scenario mentioned > earlier, that just before powering off you stuffed several MB of data, > scattered across the disk, into said disk and waited for said disk to say > "yup, i've got that", then powered down. Recent drives have very large > (2MB?) on-board caches, so how long does it take for a pathological pattern > of these to be committed to physical media? Can the drive sustain its own > power long enough to do this (highly unlikely)? 
So the drive *must* be > able to tell the OS when it's actually committed the data to media, or risk > *serious* data corruption. OH... you are talking about the one IBM drive that is goat-screwed... The one that is too stupid to use the energy of the platters to drop the data in the vendor power-down strip... yet it dumps the buffer in a panic. ERM, that is a bad drive, regardless of whether they publish an errata stating that only good HOSTS which issue a flush-cache prior to power-off are to be certified... and maybe if they did not default the WC to on, then it would be a NOP of the design error. Since all OSes that enable WC at init will flush it at shutdown and do a periodic purge during inactivity. > Pathological shutdown pattern: assuming scatter-gather is not allowed (for > IDE), and a 20ms full-stroke seek, write sectors at alternately opposite > ends of the disk, working inwards until the buffer is full. 512-byte > sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not > including rotational delay, either). Last time I checked, you'd need a > capacitor array the size of the entire computer case to store enough power > to allow the drive to do this after system shutdown.
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Alan Cox wrote: > > > > I don't know if there is any way to turn of a write buffer on an IDE disk. > > You want a forced set of commands to kill caching at init? > > Wrong model > > You want a write barrier. Write buffering (at least for short intervals) in > the drive is very sensible. The kernel needs to able to send drivers a write > barrier which will not be completed with outstanding commands before the > barrier. Agreed. Write buffering is incredibly useful on a disk - for all the same reasons that an OS wants to do it. The disk can use write buffering to speed up writes a lot - not just lower the _perceived_ latency by the OS, but to actually improve performance too. But Alan is right - we need a "sync" command or something. I don't know if IDE has one (it already might, for all I know). Linus
Re: scsi vs ide performance on fsync's
>Jonathan Morton ([EMAIL PROTECTED]) wrote : > >> The OS needs to know the physical act of writing data has finished >>before >> it tells the m/board to cut the power - period. Pathological data sets >> included - they are the worst case which every engineer must take into >> account. Out of interest, does Linux guarantee this, in the light of what >> we've uncovered? If so, perhaps it could use the same technique to fix >> fdatasync() and family... > >Linux currently ignores write-cache, AFAICT. >Recently I asked a similar question , about flushing drive caches at >shutdown : >On Mon, Feb 19, 2001 at 01:45:57PM +0100, David Balazic wrote: >> It is a good idea IMO to flush the write cache of storage devices >> at shutdown and other critical moments. > >Not needed. All device drivers should disable write caches of >their devices, that need another signal than switching it off by >the power button to flush themselves. Sounds like a sensible place to implement it - in the device driver. I also note the existence of an ATA flush-buffer command, which should probably be used in sync() and family. The call(s) to the sync() family on shutdown should probably be performed by the filesystem itself on unmount (or remount read-only), and if journalled filesystems need synchronisation they should use sync() (or a more fine-grained version) themselves as necessary. Doesn't sound like too much of a headache to implement, to me - unless some drives ignore the ATA FLUSH command, in which case said drives can be considered seriously broken. :P I don't agree that write-caching in itself is a bad thing, particularly given the amount of CPU overhead that IDE drives demand while attached to the controller (orders of magnitude higher than a good SCSI controller) - the more overhead we can hand off to dedicated hardware, the better. 
What does matter is that drives implementing write-caching are handled in a safe and efficient manner, especially in cases where they refuse to turn such caching off (eg. my Seagate Barracuda *glares at drive*). Recalling my recent comments on worst-case drive-shutdown timings, I also remember seeing drives with 18ms *average* seek times quite recently - this was a Quantum Bigfoot (yes, a 5.25" HD), found in a low-end Compaq desktop - if anyone still believes Compaq makes high-quality machines for their low-end market, they're totally mistaken. The machine sped up quite a lot when a new 3.5" IBM DeskStar was installed, with an 8.5ms average seek and an almost doubling in rotational speed. :) -- from: Jonathan "Chromatix" Morton mail: [EMAIL PROTECTED] (not for attachments) big-mail: [EMAIL PROTECTED] uni-mail: [EMAIL PROTECTED] The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/ -BEGIN GEEK CODE BLOCK- Version 3.12 GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+ -END GEEK CODE BLOCK-
Re: scsi vs ide performance on fsync's
On Tue, Mar 06, 2001 at 06:14:15PM +0100, David Balazic wrote: [snip] > Hardware Level caching is only good for OSes which have broken > drivers and broken caching (like plain old DOS). > > Linux does a good job in caching and cache control at software > level. Read caching, yes. But for writes, the drive can often do a lot more optimization because of its synchronous operation with the platter and greater knowledge of internal disk geometry. What would be useful, as Alan said, is a barrier operation.
Re: scsi vs ide performance on fsync's
(( please CC me, not subscribed, [EMAIL PROTECTED] )) Jonathan Morton ([EMAIL PROTECTED]) wrote : > The OS needs to know the physical act of writing data has finished before > it tells the m/board to cut the power - period. Pathological data sets > included - they are the worst case which every engineer must take into > account. Out of interest, does Linux guarantee this, in the light of what > we've uncovered? If so, perhaps it could use the same technique to fix > fdatasync() and family... Linux currently ignores write-cache, AFAICT. Recently I asked a similar question, about flushing drive caches at shutdown : Subject : "Flusing caches on shutdown" message archived at : http://boudicca.tux.org/hypermail/linux-kernel/2001week08/0157.html Body attached at end of this message. The answer ( and only reply ) was : [ archived at : http://boudicca.tux.org/hypermail/linux-kernel/2001week08/0211.html ] --- begin quote --- From: Ingo Oeser ([EMAIL PROTECTED]) On Mon, Feb 19, 2001 at 01:45:57PM +0100, David Balazic wrote: > It is a good idea IMO to flush the write cache of storage devices > at shutdown and other critical moments. Not needed. All device drivers should disable the write caches of devices that need any signal other than the power button being switched off in order to flush themselves. > Losing data at powerdown due to write caches has been reported, > so this is not a theoretical problem. Also the journaled filesystems > are safe only in theory if the journal is not stored on non-volatile > memory, which is not guaranteed in the current kernel. Fine. If users/admins have write caching enabled, they either know what they do, or should disable it (which is the default for all mass storage drivers AFAIK). Hardware Level caching is only good for OSes which have broken drivers and broken caching (like plain old DOS). Linux does a good job in caching and cache control at software level. 
Regards Ingo Oeser --- end quote --- My original mail : --- begin quote --- (( CC me the replies, as I'm not subscribed to LKML )) Hi! It is a good idea IMO to flush the write cache of storage devices at shutdown and other critical moments. I browsed through linux-2.4.1 and see no use of the SYNCHRONIZE CACHE SCSI command ( curiously it is defined in several other files besides include/scsi/scsi.h , grep returns : drivers/scsi/pci2000.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35 drivers/scsi/psi_dale.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35 drivers/scsi/psi240i.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35 ) I couldn't find evidence of the use of the equivalent ATA command either ( FLUSH CACHE , command code E7h ). Also add ATAPI to the list. ( and all other interfaces. I checked just SCSI and ATA ) Losing data at powerdown due to write caches has been reported, so this is not a theoretical problem. Also the journaled filesystems are safe only in theory if the journal is not stored on non-volatile memory, which is not guaranteed in the current kernel. What is the official word on this issue ? I think this is important to the "enterprise" guys, at the least. Sincerely, david PS: CC me , as I'm not subscribed to LKML --- end quote --- -- David Balazic -- "Be excellent to each other." - Bill & Ted - - - - - - - - - - - - - - - - - - - - - - -
Re: scsi vs ide performance on fsync's
>On Tue, 6 Mar 2001, Mike Black wrote: > >> Write caching is the culprit for the performance diff: Indeed, and my during-the-boring-lecture benchmark on my 18Gb IBM TravelStar bears this out. I was confused earlier by the fact that one of my Seagate drives blatantly ignores the no-write-caching request I sent it. :P At 4:02 pm + 6/3/2001, Jeremy Hansen wrote: >Ahh, now we're getting somewhere. >so now this corresponds to the performance we're seeing on SCSI. > >So I guess what I'm wondering now is can or should anything be done about >this on the SCSI side? Maybe, it depends on your perspective. In my personal opinion, the IDE behaviour is incorrect and some way of dealing with it (while still retaining the benefits of write-caching for normal applications) would be highly desirable. However, some applications may like or partially rely on that behaviour, to gain better on-disk data consistency while not suffering too much in performance (eg. the transaction database mentioned by at least one poster). The way to make all parties happy is to fix the IDE driver (or drives!) and make sure an *alternative* syscall is available which flushes the buffers asynchronously, as per the current IDE behaviour. It shouldn't be too hard to make the SCSI driver use that behaviour in the alternative syscall (which may already exist, I don't know Linux well enough to say). May this be a warning to all hardware manufacturers who "tweak" their hardware to gain better benchmark results without actually increasing performance - you *will* be found out!
Re: scsi vs ide performance on fsync's
Ahh, now we're getting somewhere. IDE: jeremy:~# time ./xlog file.out fsync real 0m33.739s user 0m0.010s sys 0m0.120s so now this corresponds to the performance we're seeing on SCSI. So I guess what I'm wondering now is can or should anything be done about this on the SCSI side? Thanks -jeremy On Tue, 6 Mar 2001, Mike Black wrote: > Write caching is the culprit for the performance diff: > > On IDE: > time xlog /blah.dat fsync > 0.000u 0.190s 0:01.72 11.0% 0+0k 0+0io 91pf+0w > # hdparm -W 0 /dev/hda > > /dev/hda: > setting drive write-caching to 0 (off) > # time xlog /blah.dat fsync > 0.000u 0.220s 0:50.60 0.4% 0+0k 0+0io 91pf+0w > # hdparm -W 1 /dev/hda > > /dev/hda: > setting drive write-caching to 1 (on) > # time xlog /blah.dat fsync > 0.010u 0.230s 0:01.88 12.7% 0+0k 0+0io 91pf+0w > > On my SCSI setup: > # time xlog /usr5/blah.dat fsync > 0.020u 0.230s 0:30.48 0.8% 0+0k 0+0io 91pf+0w > > > > Michael D. Black Principal Engineer > [EMAIL PROTECTED] 321-676-2923,x203 > http://www.csihq.com Computer Science Innovations > http://www.csihq.com/~mike My home page > FAX 321-676-2355 > - Original Message - > From: "Andre Hedrick" <[EMAIL PROTECTED]> > To: "Linus Torvalds" <[EMAIL PROTECTED]> > Cc: "Douglas Gilbert" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> > Sent: Tuesday, March 06, 2001 2:12 AM > Subject: Re: scsi vs ide performance on fsync's > > > On Mon, 5 Mar 2001, Linus Torvalds wrote: > > > Well, it's fairly hard for the kernel to do much about that - it's almost > > certainly just IDE doing write buffering on the disk itself. No OS > > involved. > > I am pushing for WC to be defaulted in the off state, but as you know I > have a bigger fight than caching on my hands... > > > I don't know if there is any way to turn of a write buffer on an IDE disk. > > You want a forced set of commands to kill caching at init? > > Andre Hedrick > Linux ATA Development > ASL Kernel Development > > - > ASL, Inc. 
Toll free: 1-877-ASL-3535 > 1757 Houret Court Fax: 1-408-941-2071 > Milpitas, CA 95035  Web: www.aslab.com -- this is my sig.
Re: scsi vs ide performance on fsync's
>> Pathological shutdown pattern: assuming scatter-gather is not allowed (for >> IDE), and a 20ms full-stroke seek, write sectors at alternately opposite >> ends of the disk, working inwards until the buffer is full. 512-byte >> sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not >> including rotational delay, either). Last time I checked, you'd need a >> capacitor array the size of the entire computer case to store enough power >> to allow the drive to do this after system shutdown, and I don't remember >> seeing LiIon batteries strapped to the bottom of my HDs. Admittedly, any >> sane OS doesn't actually use that kind of write pattern on shutdown, but >> the drive can't assume that. > >But since the drive has everything in cache, it can just write >out both bunches of sectors in an order which minimises disk >seek time ... > >(yes, the drives don't guarantee write ordering either, but that >shouldn't come as a big surprise when they don't guarantee that >data makes it to disk ;)) That would be true for SCSI devices - I understand the controllers and/or drives support "scatter-gather" which allows a drive to optimise its seek pattern in the manner you describe. However, I'm not sure whether an IDE drive is allowed to do this. I'm reasonably sure that I heard somewhere that IDE drives have to complete transactions in the specified order as far as the host is concerned - what I'm unsure of is whether this also applies to mechanical head movement. If not, then the drive could by all means optimise the access pattern provided it acked the data or provided the results in the same order as the instructions were given. This would probably shorten the time for a new pathological set (distributed evenly across the disk surface, but all on the worst-possible angular offset compared to the previous) to (8ms seek time + 5ms rotational delay) * 4000 writes ~= 52 seconds (compared with around 120 seconds for the previous set with rotational delay factored in). 
Great, so you only need half as big a power store to guarantee writing that much data, but it's still too much. Even with a 15000rpm drive and 5ms seek times, it would still be too much. The OS needs to know the physical act of writing data has finished before it tells the m/board to cut the power - period. Pathological data sets included - they are the worst case which every engineer must take into account. Out of interest, does Linux guarantee this, in the light of what we've uncovered? If so, perhaps it could use the same technique to fix fdatasync() and family...
Re: scsi vs ide performance on fsync's
Write caching is the culprit for the performance diff: On IDE: time xlog /blah.dat fsync 0.000u 0.190s 0:01.72 11.0% 0+0k 0+0io 91pf+0w # hdparm -W 0 /dev/hda /dev/hda: setting drive write-caching to 0 (off) # time xlog /blah.dat fsync 0.000u 0.220s 0:50.60 0.4% 0+0k 0+0io 91pf+0w # hdparm -W 1 /dev/hda /dev/hda: setting drive write-caching to 1 (on) # time xlog /blah.dat fsync 0.010u 0.230s 0:01.88 12.7% 0+0k 0+0io 91pf+0w On my SCSI setup: # time xlog /usr5/blah.dat fsync 0.020u 0.230s 0:30.48 0.8% 0+0k 0+0io 91pf+0w Michael D. Black Principal Engineer [EMAIL PROTECTED] 321-676-2923,x203 http://www.csihq.com Computer Science Innovations http://www.csihq.com/~mike My home page FAX 321-676-2355 - Original Message - From: "Andre Hedrick" <[EMAIL PROTECTED]> To: "Linus Torvalds" <[EMAIL PROTECTED]> Cc: "Douglas Gilbert" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Tuesday, March 06, 2001 2:12 AM Subject: Re: scsi vs ide performance on fsync's On Mon, 5 Mar 2001, Linus Torvalds wrote: > Well, it's fairly hard for the kernel to do much about that - it's almost > certainly just IDE doing write buffering on the disk itself. No OS > involved. I am pushing for WC to be defaulted in the off state, but as you know I have a bigger fight than caching on my hands... > I don't know if there is any way to turn of a write buffer on an IDE disk. You want a forced set of commands to kill caching at init? Andre Hedrick Linux ATA Development ASL Kernel Development - ASL, Inc. 
Toll free: 1-877-ASL-3535 1757 Houret Court Fax: 1-408-941-2071 Milpitas, CA 95035  Web: www.aslab.com
Re: scsi vs ide performance on fsync's
>> i assume you meant to time the xlog.c program? (or did i miss another >> program on the thread?) Yes. >> i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do >> *something* with the write cache flag -- it gets 0.10s elapsed real time >> in default config; and gets 2.91s if i do "hdparm -W 0". >> >> ditto for an IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s >> with write-cache to 1.8s without. >> >> and an IBM-DTLA-307075 (deskstar 75GB 7200rpm) varies from .03s to 1.67s. >> >> of course 1.8s is nowhere near enough time for 200 writes to complete. >hi, not enough sleep, can't do math. 1.67s is exactly the ballpark you'd >expect for 200 writes to a correctly functioning 7200rpm disk. and the >travelstar appears to be doing the right thing as well. I was just about to point that out. :) I ran the program with 2000 packets in order to magnify the difference. So, it appears that the IBM IDE drives are doing the "right thing" when write-caching is switched off, but the Seagate drive (at least the one I'm using) appears not to turn the write-caching off at all. I want to try this out with some other drives, including a Seagate SCSI drive and a different Seagate IDE drive (attached to a non-UDMA controller), and perhaps a couple of older drives which I just happen to have lying around (particularly a Maxtor and an old TravelStar with very little cache). That'll have to wait until later, though - university work beckons. :(
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Jonathan Morton wrote: > Pathological shutdown pattern: assuming scatter-gather is not allowed (for > IDE), and a 20ms full-stroke seek, write sectors at alternately opposite > ends of the disk, working inwards until the buffer is full. 512-byte > sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not > including rotational delay, either). Last time I checked, you'd need a > capacitor array the size of the entire computer case to store enough power > to allow the drive to do this after system shutdown, and I don't remember > seeing LiIon batteries strapped to the bottom of my HDs. Admittedly, any > sane OS doesn't actually use that kind of write pattern on shutdown, but > the drive can't assume that. But since the drive has everything in cache, it can just write out both bunches of sectors in an order which minimises disk seek time ... (yes, the drives don't guarantee write ordering either, but that shouldn't come as a big surprise when they don't guarantee that data makes it to disk ;)) regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, dean gaudet wrote: > i assume you meant to time the xlog.c program? (or did i miss another > program on the thread?) > > i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do > *something* with the write cache flag -- it gets 0.10s elapsed real time > in default config; and gets 2.91s if i do "hdparm -W 0". > > ditto for an IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s > with write-cache to 1.8s without. > > and an IBM-DTLA-307075 (deskstar 75GB 7200rpm) varies from .03s to 1.67s. > > of course 1.8s is nowhere near enough time for 200 writes to complete. hi, not enough sleep, can't do math. 1.67s is exactly the ballpark you'd expect for 200 writes to a correctly functioning 7200rpm disk. and the travelstar appears to be doing the right thing as well. -dean
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Jonathan Morton wrote: > Pathological shutdown pattern: assuming scatter-gather is not allowed > (for IDE), and a 20ms full-stroke seek, write sectors at alternately > opposite ends of the disk, working inwards until the buffer is full. > 512-byte sectors, 2MB of them, is 4000 writes * 20ms = around 80 > seconds i don't understand why the disk couldn't elevator in this case and be done in 20ms + rotational. > >Of course, whether you should even trust the harddisk is another question. > > I think this result in itself would lead me *not* to trust the hard disk, > especially an IDE one. Has anybody tried running this test with a recent > IBM DeskStar - one of the ones that is the same mech as the equivalent > UltraStar but with IDE controller? i assume you meant to time the xlog.c program? (or did i miss another program on the thread?) i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do *something* with the write cache flag -- it gets 0.10s elapsed real time in default config; and gets 2.91s if i do "hdparm -W 0". ditto for an IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s with write-cache to 1.8s without. and an IBM-DTLA-307075 (deskstar 75GB 7200rpm) varies from .03s to 1.67s. of course 1.8s is nowhere near enough time for 200 writes to complete. so who knows what that flag is doing. -dean
Re: scsi vs ide performance on fsync's
> > I don't know if there is any way to turn of a write buffer on an IDE disk. > You want a forced set of commands to kill caching at init? Wrong model You want a write barrier. Write buffering (at least for short intervals) in the drive is very sensible. The kernel needs to be able to send drivers a write barrier which will not be completed with outstanding commands before the barrier.
Re: scsi vs ide performance on fsync's
>> It's pretty clear that the IDE drive(r) is *not* waiting for the physical >> write to take place before returning control to the user program, whereas >> the SCSI drive(r) is. Both devices appear to be performing the write > >Wrong, IDE does not unplug thus the request is almost, I hate to admit it >SYNC and not ASYNC :-( Thus if the drive acks that it has the data then >the driver lets go. Uh, run that past me again? You are saying that because the IDE drive hogs the bus until the write is complete or the driver forcibly disconnects, you make the driver disconnect to save time? Or (more likely) have I totally misread you... >> immediately, however (judging from the device activity lights). Whether >> this is the correct behaviour or not, I leave up to you kernel hackers... > >Seagate has a better seek profile than ibm. >The second access is correct because the first one pushed the heads to the >pre-seek. Thus the question is where is the drive leaving the heads when >not active? It does not appear to be in the zone 1 region. Duh... I don't quite see what you're saying here, either. The test is a continuous rewrite of the same sector of the disk, so the head shouldn't be moving *at all* until it's all over. In addition, the drive can't start writing the sector when it's just finished writing it, so it has to wait for the rotation to bring it back round again. Under those circumstances, I would expect my 7200rpm Seagate to perform slower than my 1rpm IBM *regardless* of seeking performance. Seeking doesn't come into it! >> IMHO, if an application needs performance, it shouldn't be syncing disks >> after every write. Syncing means, in my book, "wait for the data to be >> committed to physical media" - note the *wait* involved there - so syncing >> should only be used where data integrity in the event of a system failure >> has a much higher importance than performance. 
> >I have only gotten the drive makers in the past 6 months to commit to >actively updating the contents of the identify page to reflect reality. >Thus if your drive is one of those that does a stress test check that goes: >"this bozo did not really mean to turn off write caching, re-enabling " Why does this sound familiar? Personally, I feel the bottom line is rapidly turning into "if you have critical data, don't put it on an IDE disk". There are too many corners cut when compared to ostensibly similar SCSI devices. Call me a SCSI bigot if you like - I realise SCSI is more expensive, but you get what you pay for. Of course, under normal circumstances, you leave write-caching and UDMA on, and you don't use a pathological stress-test like we've been doing. That gives the best performance. But sometimes it's necessary to use these "pathological" access patterns to achieve certain system functions. Suppose, harking back to the Windows data-corruption scenario mentioned earlier, that just before powering off you stuffed several MB of data, scattered across the disk, into said disk and waited for said disk to say "yup, i've got that", then powered down. Recent drives have very large (2MB?) on-board caches, so how long does it take for a pathological pattern of these to be committed to physical media? Can the drive sustain its own power long enough to do this (highly unlikely)? So the drive *must* be able to tell the OS when it's actually committed the data to media, or risk *serious* data corruption. Pathological shutdown pattern: assuming scatter-gather is not allowed (for IDE), and a 20ms full-stroke seek, write sectors at alternately opposite ends of the disk, working inwards until the buffer is full. 512-byte sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not including rotational delay, either). 
Last time I checked, you'd need a capacitor array the size of the entire
computer case to store enough power to allow the drive to do this after
system shutdown, and I don't remember seeing LiIon batteries strapped to the
bottom of my HDs. Admittedly, any sane OS doesn't actually use that kind of
write pattern on shutdown, but the drive can't assume that.

--
from: Jonathan "Chromatix" Morton
mail: [EMAIL PROTECTED] (not for attachments)
big-mail: [EMAIL PROTECTED]
uni-mail: [EMAIL PROTECTED]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-BEGIN GEEK CODE BLOCK-
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V?
PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-END GEEK CODE BLOCK-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: scsi vs ide performance on fsync's
> I don't know if there is any way to turn off a write buffer on an IDE
> disk. You want a forced set of commands to kill caching at init?

Wrong model. You want a write barrier. Write buffering (at least for short
intervals) in the drive is very sensible. The kernel needs to be able to
send drives a write barrier which will not be completed with outstanding
commands before the barrier.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Jonathan Morton wrote:

> Pathological shutdown pattern: assuming scatter-gather is not allowed (for
> IDE), and a 20ms full-stroke seek, write sectors at alternately opposite
> ends of the disk, working inwards until the buffer is full. 512-byte
> sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds

i don't understand why the disk couldn't elevator in this case and be done
in 20ms + rotational.

> Of course, whether you should even trust the harddisk is another question.
> I think this result in itself would lead me *not* to trust the hard disk,
> especially an IDE one. Has anybody tried running this test with a recent
> IBM DeskStar - one of the ones that is the same mech as the equivalent
> UltraStar but with IDE controller?

i assume you meant to time the xlog.c program? (or did i miss another
program on the thread?)

i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do
*something* with the write cache flag -- it gets 0.10s elapsed real time in
default config; and gets 2.91s if i do "hdparm -W 0".

ditto for an IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s
with write-cache to 1.8s without. and an IBM-DTLA-307075 (deskstar 75GB
7200rpm) varies from .03s to 1.67s.

of course 1.8s is nowhere near enough time for 200 writes to complete. so
who knows what that flag is doing.

-dean
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, dean gaudet wrote:

> i assume you meant to time the xlog.c program? (or did i miss another
> program on the thread?)
>
> i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do
> *something* with the write cache flag -- it gets 0.10s elapsed real time
> in default config; and gets 2.91s if i do "hdparm -W 0". ditto for an
> IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s with
> write-cache to 1.8s without. and an IBM-DTLA-307075 (deskstar 75GB
> 7200rpm) varies from .03s to 1.67s.
>
> of course 1.8s is nowhere near enough time for 200 writes to complete.

hi, not enough sleep, can't do math. 1.67s is exactly the ballpark you'd
expect for 200 writes to a correctly functioning 7200rpm disk. and the
travelstar appears to be doing the right thing as well.

-dean
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Jonathan Morton wrote:

> Pathological shutdown pattern: assuming scatter-gather is not allowed (for
> IDE), and a 20ms full-stroke seek, write sectors at alternately opposite
> ends of the disk, working inwards until the buffer is full. 512-byte
> sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not
> including rotational delay, either).
>
> Last time I checked, you'd need a capacitor array the size of the entire
> computer case to store enough power to allow the drive to do this after
> system shutdown, and I don't remember seeing LiIon batteries strapped to
> the bottom of my HDs. Admittedly, any sane OS doesn't actually use that
> kind of write pattern on shutdown, but the drive can't assume that.

But since the drive has everything in cache, it can just write out both
bunches of sectors in an order which minimises disk seek time ...

(yes, the drives don't guarantee write ordering either, but that shouldn't
come as a big surprise when they don't guarantee that data makes it to disk
;))

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
Re: scsi vs ide performance on fsync's
> i assume you meant to time the xlog.c program? (or did i miss another
> program on the thread?)

Yes.

> i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do
> *something* with the write cache flag -- it gets 0.10s elapsed real time
> in default config; and gets 2.91s if i do "hdparm -W 0". ditto for an
> IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s with
> write-cache to 1.8s without. and an IBM-DTLA-307075 (deskstar 75GB
> 7200rpm) varies from .03s to 1.67s.
>
> of course 1.8s is nowhere near enough time for 200 writes to complete.
>
> hi, not enough sleep, can't do math. 1.67s is exactly the ballpark you'd
> expect for 200 writes to a correctly functioning 7200rpm disk. and the
> travelstar appears to be doing the right thing as well.

I was just about to point that out. :) I ran the program with 2000 packets
in order to magnify the difference.

So, it appears that the IBM IDE drives are doing the "right thing" when
write-caching is switched off, but the Seagate drive (at least the one I'm
using) appears not to turn the write-caching off at all. I want to try this
out with some other drives, including a Seagate SCSI drive and a different
Seagate IDE drive (attached to a non-UDMA controller), and perhaps a couple
of older drives which I just happen to have lying around (particularly a
Maxtor and an old TravelStar with very little cache). That'll have to wait
until later, though - university work beckons. :(
Re: scsi vs ide performance on fsync's
Write caching is the culprit for the performance diff:

On IDE:

time xlog /blah.dat fsync
0.000u 0.190s 0:01.72 11.0% 0+0k 0+0io 91pf+0w
# hdparm -W 0 /dev/hda
/dev/hda: setting drive write-caching to 0 (off)
# time xlog /blah.dat fsync
0.000u 0.220s 0:50.60 0.4% 0+0k 0+0io 91pf+0w
# hdparm -W 1 /dev/hda
/dev/hda: setting drive write-caching to 1 (on)
# time xlog /blah.dat fsync
0.010u 0.230s 0:01.88 12.7% 0+0k 0+0io 91pf+0w

On my SCSI setup:

# time xlog /usr5/blah.dat fsync
0.020u 0.230s 0:30.48 0.8% 0+0k 0+0io 91pf+0w

Michael D. Black
Principal Engineer
[EMAIL PROTECTED] 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355

- Original Message -
From: "Andre Hedrick" [EMAIL PROTECTED]
To: "Linus Torvalds" [EMAIL PROTECTED]
Cc: "Douglas Gilbert" [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, March 06, 2001 2:12 AM
Subject: Re: scsi vs ide performance on fsync's

On Mon, 5 Mar 2001, Linus Torvalds wrote:

> Well, it's fairly hard for the kernel to do much about that - it's almost
> certainly just IDE doing write buffering on the disk itself. No OS
> involved.

I am pushing for WC to be defaulted in the off state, but as you know I
have a bigger fight than caching on my hands...

> I don't know if there is any way to turn off a write buffer on an IDE
> disk.

You want a forced set of commands to kill caching at init?

Andre Hedrick
Linux ATA Development
ASL Kernel Development
ASL, Inc. 
Toll free: 1-877-ASL-3535
1757 Houret Court
Fax: 1-408-941-2071
Milpitas, CA 95035
Web: www.aslab.com
Re: scsi vs ide performance on fsync's
>> Pathological shutdown pattern: assuming scatter-gather is not allowed
>> (for IDE), and a 20ms full-stroke seek, write sectors at alternately
>> opposite ends of the disk, working inwards until the buffer is full.
>> 512-byte sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds
>> (not including rotational delay, either).
>>
>> Last time I checked, you'd need a capacitor array the size of the entire
>> computer case to store enough power to allow the drive to do this after
>> system shutdown, and I don't remember seeing LiIon batteries strapped to
>> the bottom of my HDs. Admittedly, any sane OS doesn't actually use that
>> kind of write pattern on shutdown, but the drive can't assume that.

> But since the drive has everything in cache, it can just write out both
> bunches of sectors in an order which minimises disk seek time ...
>
> (yes, the drives don't guarantee write ordering either, but that
> shouldn't come as a big surprise when they don't guarantee that data
> makes it to disk ;))

That would be true for SCSI devices - I understand the controllers and/or
drives support "scatter-gather" which allows a drive to optimise its seek
pattern in the manner you describe. However, I'm not sure whether an IDE
drive is allowed to do this. I'm reasonably sure that I heard somewhere
that IDE drives have to complete transactions in the specified order as far
as the host is concerned - what I'm unsure of is whether this also applies
to mechanical head movement. If not, then the drive could by all means
optimise the access pattern provided it acked the data or provided the
results in the same order as the instructions were given.

This would probably shorten the time for a new pathological set
(distributed evenly across the disk surface, but all on the worst-possible
angular offset compared to the previous) to (8ms seek time + 5ms rotational
delay) * 4000 writes ~= 52 seconds (compared with around 120 seconds for
the previous set with rotational delay factored in). 
Great, so you only need half as big a power store to guarantee writing that
much data, but it's still too much. Even with a 15000rpm drive and 5ms seek
times, it would still be too much. The OS needs to know the physical act of
writing data has finished before it tells the m/board to cut the power -
period. Pathological data sets included - they are the worst case which
every engineer must take into account.

Out of interest, does Linux guarantee this, in the light of what we've
uncovered? If so, perhaps it could use the same technique to fix
fdatasync() and family...
Re: scsi vs ide performance on fsync's
Ahh, now we're getting somewhere.

IDE:

jeremy:~# time ./xlog file.out fsync

real    0m33.739s
user    0m0.010s
sys     0m0.120s

so now this corresponds to the performance we're seeing on SCSI. So I guess
what I'm wondering now is can or should anything be done about this on the
SCSI side?

Thanks
-jeremy

On Tue, 6 Mar 2001, Mike Black wrote:

> Write caching is the culprit for the performance diff:
> On IDE:
> time xlog /blah.dat fsync
> 0.000u 0.190s 0:01.72 11.0% 0+0k 0+0io 91pf+0w
> # hdparm -W 0 /dev/hda
> /dev/hda: setting drive write-caching to 0 (off)
> # time xlog /blah.dat fsync
> 0.000u 0.220s 0:50.60 0.4% 0+0k 0+0io 91pf+0w
> # hdparm -W 1 /dev/hda
> /dev/hda: setting drive write-caching to 1 (on)
> # time xlog /blah.dat fsync
> 0.010u 0.230s 0:01.88 12.7% 0+0k 0+0io 91pf+0w
> On my SCSI setup:
> # time xlog /usr5/blah.dat fsync
> 0.020u 0.230s 0:30.48 0.8% 0+0k 0+0io 91pf+0w
--
this is my sig.
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Mike Black wrote:

> Write caching is the culprit for the performance diff:

Indeed, and my during-the-boring-lecture benchmark on my 18Gb IBM
TravelStar bears this out. I was confused earlier by the fact that one of
my Seagate drives blatantly ignores the no-write-caching request I sent it.
:P

At 4:02 pm + 6/3/2001, Jeremy Hansen wrote:

> Ahh, now we're getting somewhere.
>
> so now this corresponds to the performance we're seeing on SCSI. So I
> guess what I'm wondering now is can or should anything be done about this
> on the SCSI side?

Maybe, it depends on your perspective. In my personal opinion, the IDE
behaviour is incorrect and some way of dealing with it (while still
retaining the benefits of write-caching for normal applications) would be
highly desirable. However, some applications may like or partially rely on
that behaviour, to gain better on-disk data consistency while not suffering
too much in performance (eg. the transaction database mentioned by at least
one poster).

The way to make all parties happy is to fix the IDE driver (or drives!) and
make sure an *alternative* syscall is available which flushes the buffers
asynchronously, as per the current IDE behaviour. It shouldn't be too hard
to make the SCSI driver use that behaviour in the alternative syscall
(which may already exist, I don't know Linux well enough to say).

May this be a warning to all hardware manufacturers who "tweak" their
hardware to gain better benchmark results without actually increasing
performance - you *will* be found out!
Re: scsi vs ide performance on fsync's
(( please CC me, not subscribed, [EMAIL PROTECTED] ))

Jonathan Morton ([EMAIL PROTECTED]) wrote :

> The OS needs to know the physical act of writing data has finished before
> it tells the m/board to cut the power - period. Pathological data sets
> included - they are the worst case which every engineer must take into
> account. Out of interest, does Linux guarantee this, in the light of what
> we've uncovered? If so, perhaps it could use the same technique to fix
> fdatasync() and family...

Linux currently ignores write-cache, AFAICT. Recently I asked a similar
question, about flushing drive caches at shutdown :

Subject : "Flusing caches on shutdown"
message archived at :
http://boudicca.tux.org/hypermail/linux-kernel/2001week08/0157.html
Body attached at end of this message.

The answer ( and only reply ) was :
[ archived at :
http://boudicca.tux.org/hypermail/linux-kernel/2001week08/0211.html ]

--- begin quote ---
From: Ingo Oeser ([EMAIL PROTECTED])

On Mon, Feb 19, 2001 at 01:45:57PM +0100, David Balazic wrote:

> It is a good idea IMO to flush the write cache of storage devices at
> shutdown and other critical moments.

Not needed. All device drivers should disable write caches of their
devices, that need another signal than switching it off by the power button
to flush themselves.

> Losing data at powerdown due to write caches has been reported, so this
> is not a theoretical problem. Also the journalled filesystems are safe
> only in theory if the journal is not stored on non-volatile memory, which
> is not guaranteed in the current kernel.

Fine. If users/admins have write caching enabled, they either know what
they do, or should disable it (which is the default for all mass storage
drivers AFAIK). Hardware level caching is only good for OSes which have
broken drivers and broken caching (like plain old DOS). Linux does a good
job in caching and cache control at software level. 
Regards, Ingo Oeser
--- end quote ---

My original mail :

--- begin quote ---
(( CC me the replies, as I'm not subscribed to LKML ))

Hi!

It is a good idea IMO to flush the write cache of storage devices at
shutdown and other critical moments. I browsed through linux-2.4.1 and see
no use of the SYNCHRONIZE CACHE SCSI command ( curiously it is defined in
several other files besides include/scsi/scsi.h , grep returns :

drivers/scsi/pci2000.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35
drivers/scsi/psi_dale.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35
drivers/scsi/psi240i.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35
)

I couldn't find evidence of the use of the equivalent ATA command either
( FLUSH CACHE , command code E7h ). Also add ATAPI to the list. ( and all
other interfaces. I checked just SCSI and ATA )

Losing data at powerdown due to write caches has been reported, so this is
not a theoretical problem. Also the journalled filesystems are safe only in
theory if the journal is not stored on non-volatile memory, which is not
guaranteed in the current kernel.

What is the official word on this issue ? I think this is important to the
"enterprise" guys, at the least.

Sincerely,
david

PS: CC me , as I'm not subscribed to LKML
--- end quote ---

--
David Balazic
"Be excellent to each other." - Bill & Ted
Re: scsi vs ide performance on fsync's
On Tue, Mar 06, 2001 at 06:14:15PM +0100, David Balazic wrote:

[snip]

> Hardware level caching is only good for OSes which have broken drivers
> and broken caching (like plain old DOS). Linux does a good job in caching
> and cache control at software level.

Read caching, yes. But for writes, the drive can often do a lot more
optimization because of its synchronous operation with the platter and
greater knowledge of internal disk geometry. What would be useful, as Alan
said, is a barrier operation.
Re: scsi vs ide performance on fsync's
Jonathan Morton ([EMAIL PROTECTED]) wrote :

>> The OS needs to know the physical act of writing data has finished
>> before it tells the m/board to cut the power - period. Pathological data
>> sets included - they are the worst case which every engineer must take
>> into account. Out of interest, does Linux guarantee this, in the light
>> of what we've uncovered? If so, perhaps it could use the same technique
>> to fix fdatasync() and family...

> Linux currently ignores write-cache, AFAICT. Recently I asked a similar
> question, about flushing drive caches at shutdown :

On Mon, Feb 19, 2001 at 01:45:57PM +0100, David Balazic wrote:

> It is a good idea IMO to flush the write cache of storage devices at
> shutdown and other critical moments.

> Not needed. All device drivers should disable write caches of their
> devices, that need another signal than switching it off by the power
> button to flush themselves.

Sounds like a sensible place to implement it - in the device driver. I also
note the existence of an ATA flush-buffer command, which should probably be
used in sync() and family. The call(s) to the sync() family on shutdown
should probably be performed by the filesystem itself on unmount (or
remount read-only), and if journalled filesystems need synchronisation they
should use sync() (or a more fine-grained version) themselves as necessary.
Doesn't sound like too much of a headache to implement, to me - unless some
drives ignore the ATA FLUSH command, in which case said drives can be
considered seriously broken. :P

I don't agree that write-caching in itself is a bad thing, particularly
given the amount of CPU overhead that IDE drives demand while attached to
the controller (orders of magnitude higher than a good SCSI controller) -
the more overhead we can hand off to dedicated hardware, the better. What
does matter is that drives implementing write-caching are handled in a safe
and efficient manner, especially in cases where they refuse to turn such
caching off (eg. my Seagate Barracuda *glares at drive*).

Recalling my recent comments on worst-case drive-shutdown timings, I also
remember seeing drives with 18ms *average* seek times quite recently - this
was a Quantum Bigfoot (yes, a 5.25" HD), found in a low-end Compaq desktop
- if anyone still believes Compaq makes high-quality machines for their
low-end market, they're totally mistaken. The machine sped up quite a lot
when a new 3.5" IBM DeskStar was installed, with an 8.5ms average seek and
an almost doubling in rotational speed. :)
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Alan Cox wrote:

> > I don't know if there is any way to turn off a write buffer on an IDE
> > disk. You want a forced set of commands to kill caching at init?
>
> Wrong model. You want a write barrier. Write buffering (at least for
> short intervals) in the drive is very sensible. The kernel needs to be
> able to send drives a write barrier which will not be completed with
> outstanding commands before the barrier.

Agreed. Write buffering is incredibly useful on a disk - for all the same
reasons that an OS wants to do it. The disk can use write buffering to
speed up writes a lot - not just lower the _perceived_ latency by the OS,
but to actually improve performance too.

But Alan is right - we need a "sync" command or something. I don't know if
IDE has one (it already might, for all I know).

Linus
Re: scsi vs ide performance on fsync's
Linus Torvalds himself wrote :

> On Tue, 6 Mar 2001, Alan Cox wrote:
>
> > > I don't know if there is any way to turn off a write buffer on an IDE
> > > disk. You want a forced set of commands to kill caching at init?
> >
> > Wrong model. You want a write barrier. Write buffering (at least for
> > short intervals) in the drive is very sensible. The kernel needs to be
> > able to send drives a write barrier which will not be completed with
> > outstanding commands before the barrier.
>
> Agreed. Write buffering is incredibly useful on a disk - for all the same
> reasons that an OS wants to do it. The disk can use write buffering to
> speed up writes a lot - not just lower the _perceived_ latency by the OS,
> but to actually improve performance too.
>
> But Alan is right - we need a "sync" command or something. I don't know
> if IDE has one (it already might, for all I know).

ATA , SCSI and ATAPI all have a FLUSH_CACHE command. (*) Whether the drives
implement it is another question ...

(*) references :
ATA-6 draft standard from www.t13.org
MtFuji document from

--
David Balazic
"Be excellent to each other." - Bill & Ted
Re: scsi vs ide performance on fsync's
Jonathan, I am not going to bite on your flame bait, and you are free to waste your money.

On Tue, 6 Mar 2001, Jonathan Morton wrote: It's pretty clear that the IDE drive(r) is *not* waiting for the physical write to take place before returning control to the user program, whereas the SCSI drive(r) is. Both devices appear to be performing the write

Wrong, IDE does not unplug, thus the request is almost, I hate to admit it, SYNC and not ASYNC :-( Thus if the drive acks that it has the data then the driver lets go.

Uh, run that past me again? You are saying that because the IDE drive hogs the bus until the write is complete or the driver forcibly disconnects, you make the driver disconnect to save time? Or (more likely) have I totally misread you...

No, SCSI does with queuing. I am saying that the ata/ide driver rips the heart out of the io_request_lock for way too darn long. This means that upon execution of a request virtually all interrupts are whacked and the driver is dominating the system. Given that IO's are limited to 128 sectors or one DMA PRD, this is vastly smaller than the SCSI transfer limit.

Since you are not using the test "Write Verify Read" all drives are going to lie. Only this command will force the stuff to hit the platters and return a read out of the dirty-cache.

The second access is correct because the first one pushed the heads to the pre-seek. Thus the question is where is the drive leaving the heads when not active? It does not appear to be in the zone 1 region.

Duh... I don't quite see what you're saying here, either. The test is a

Okay, real short: limit to two zones that are equal in size. The inner and outer, and the latter will cover more physical media than the former. Simple two-zone model.

continuous rewrite of the same sector of the disk, so the head shouldn't be moving *at all* until it's all over. In addition, the drive can't start

True, and you slip a rev every time.

writing the sector when it's just finished writing it, so it has to wait for the rotation to bring it back round again. 
Under those circumstances, I would expect my 7200rpm Seagate to perform slower than my 10,000rpm IBM *regardless* of seeking performance. Seeking doesn't come into it!

It does, because more RPM means more air-flow and more work to keep the position stable. Thus if your drive is one of those that does a stress test check that goes: "this bozo did not really mean to turn off write caching, re-enabling smurk"

Why does this sound familiar? Because of WinBench! All the prefetch/caching are modeled to be optimized to that bench-mark.

Personally, I feel the bottom line is rapidly turning into "if you have critical data, don't put it on an IDE disk". There are too many corners cut when compared to ostensibly similar SCSI devices. Call me a SCSI bigot if you like - I realise SCSI is more expensive, but you get what you pay for.

Let me slap you in the face with a salami stick! ATA 7200 RPM drives are using SCSI 7200 RPM drive HDA's. So you say ATA is lame? Then so were your SCSI 7200's.

Of course, under normal circumstances, you leave write-caching and UDMA on, and you don't use a pathological stress-test like we've been doing. That gives the best performance. But sometimes it's necessary to use these "pathological" access patterns to achieve certain system functions. Suppose, harking back to the Windows data-corruption scenario mentioned earlier, that just before powering off you stuffed several MB of data, scattered across the disk, into said disk and waited for said disk to say "yup, I've got that", then powered down. Recent drives have very large (2MB?) on-board caches, so how long does it take for a pathological pattern of these to be committed to physical media? Can the drive sustain its own power long enough to do this (highly unlikely)? So the drive *must* be able to tell the OS when it's actually committed the data to media, or risk *serious* data corruption.

OH... you are talking about the one IBM drive that is goat-screwed... The one that is too stupid to use the energy of the platters to drop the data in the vendor power-down strip... yet it dumps the buffer in a panic. ERM, that is a bad drive, regardless of whether they publish an errata that states only good HOSTS that issue a flush-cache prior to power-off are to be certified... maybe if they did not default the WC on, then it would be a NOP of the design error. Since all OSes that enable WC at init will flush it at shutdown and do a periodic purge with inactivity.

Pathological shutdown pattern: assuming scatter-gather is not allowed (for IDE), and a 20ms full-stroke seek, write sectors at alternately opposite ends of the disk, working inwards until the buffer is full. 512-byte sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not including rotational delay, either). Last time I checked, you'd need a capacitor array the size of the entire computer case to store enough power to allow the drive to do this after system
Re: scsi vs ide performance on fsync's
On Tue, Mar 06 2001, David Balazic wrote: Wrong model. You want a write barrier. Write buffering (at least for short intervals) in the drive is very sensible. The kernel needs to be able to send drivers a write barrier which will not be completed with outstanding commands before the barrier. Agreed. Write buffering is incredibly useful on a disk - for all the same reasons that an OS wants to do it. The disk can use write buffering to speed up writes a lot - not just lower the _perceived_ latency by the OS, but to actually improve performance too. But Alan is right - we need a "sync" command or something. I don't know if IDE has one (it already might, for all I know). ATA, SCSI and ATAPI all have a FLUSH_CACHE command. (*) Whether the drives implement it is another question ...

(Usually called SYNCHRONIZE_CACHE btw) SCSI has ordered tags, which fit the model Alan described quite nicely. I've been meaning to implement this for some time; it would be handy for a journalled fs to use such a barrier. Since ATA doesn't do queueing (at least not in current Linux), a synchronize cache is probably the only way to go there.

(*) references :
ATA-6 draft standard from www.t13.org
MtFuji document from ftp.avc-pioneer.com

-- Jens Axboe
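The distinction Jens and Stephen are drawing - a barrier constrains ordering without waiting, while a sync/flush waits for persistence - can be illustrated with a toy model. This is a sketch of the concept only, not kernel code: the drive may reorder queued writes freely, but no write may cross a barrier in either direction.

```python
def completion_segments(queue):
    """Split a queued command stream at each BARRIER.

    A write-buffering drive may permute writes within a segment,
    but everything before a barrier must complete before anything
    after it starts - that is the whole ordering guarantee.
    """
    segments, current = [], []
    for op in queue:
        if op == "BARRIER":
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)
    return segments

# journal commit pattern: data writes, barrier, then the commit block
queue = ["data1", "data2", "BARRIER", "commit"]
print(completion_segments(queue))  # [['data1', 'data2'], ['commit']]
```

Note that no segment is required to be on stable storage at any particular time - that is what distinguishes a barrier from SYNCHRONIZE_CACHE/FLUSH_CACHE, which forces the cache out and waits.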
Re: scsi vs ide performance on fsync's
itself is a bad thing, particularly given the amount of CPU overhead that IDE drives demand while attached to the controller (orders of magnitude higher than a good SCSI controller) - the more overhead we can hand off to

I know this is just a troll by a scsi-believer, but I'm biting anyway. on current machines and disks, ide costs a few % CPU, depending on which CPU, disk, kernel, the sustained bandwidth, etc. I've measured this using the now-trendy method of noticing how much the IO costs a separate, CPU-bound benchmark: load = 1 - (unloadedPerf / loadedPerf). my cheesy duron/600 desktop typically shows ~2% actual cost when running bonnie's block IO tests.
Re: scsi vs ide performance on fsync's
I am not going to bite on your flame bait, and you are free to waste your money.

I don't flamebait. I was trying to clear up some confusion...

No, SCSI does with queuing. I am saying that the ata/ide driver rips the heart out of the io_request_lock for way too darn long. This means that upon execution of a request virtually all interrupts are whacked and the driver is dominating the system. Given that IO's are limited to 128 sectors or one DMA PRD, this is vastly smaller than the SCSI transfer limit.

Ah, so the ATA driver hogs interrupts. Nice. Kinda explains why I can't use the mouse on some systems when I use cdparanoia.

Okay, real short: limit to two zones that are equal in size. The inner and outer, and the latter will cover more physical media than the former. Simple two-zone model.

Still doesn't make a difference - there is one revolution between writes, no matter where on disk it is.

Under those circumstances, I would expect my 7200rpm Seagate to perform slower than my 10,000rpm IBM *regardless* of seeking performance. Seeking doesn't come into it!

It does, because more RPM means more air-flow and more work to keep the position stable.

That's the engineers' problem, not ours. In fact, it's not really a problem because my IBM drive gave almost exactly the correct performance result, even at 10,000rpm, therefore it's managing to keep the position stable regardless of airflow.

Why does this sound familiar? Because of WinBench! All the prefetch/caching are modeled to be optimized to that bench-mark.

Lies, damn lies, statistics, benchmarks, delivery dates. Especially a consumer-oriented benchmark like WinBench. It's perfectly natural to optimise for particular access patterns, but IMHO that doesn't excuse breaking the drive just to get a better benchmark score.

Personally, I feel the bottom line is rapidly turning into "if you have critical data, don't put it on an IDE disk". There are too many corners cut when compared to ostensibly similar SCSI devices. Call me a SCSI bigot if you like - I realise SCSI is more expensive, but you get what you pay for.

Let me slap you in the face with a salami stick! ATA 7200 RPM drives are using SCSI 7200 RPM drive HDA's. So you say ATA is lame? Then so were your SCSI 7200's.

That isn't the point! I'm not talking about the physical mechanism, which indeed is often the same between one generation of SCSI and the next generation of IDE devices. I'm talking about the IDE controller which is slapped on the bottom of said mechanism. The mech can be of world-class quality, but if the controller is shot it doesn't cut the grain.

Since all OSes that enable WC at init will flush it at shutdown and do a periodic purge with inactivity.

But Linux doesn't, as has been pointed out earlier. We need to fix Linux. Also, as I and someone else have also pointed out, there are drives in circulation which refuse to turn off write caching, including one sitting in my main workstation - the one which is rebooted the most often, simply because I need to use Windoze 95 for a few onerous tasks. I haven't suffered disk corruption yet, because Linux unmounts the filesystems and flushes its own buffers several seconds before powering down, and uses a non-pathological access pattern, but I sure don't want to see the first time this doesn't work properly.

Err, last time I checked all good devices flush their write caching on their own to take advantage of having a maximum cache for prefetching.

Which doesn't work if the buffer is filled up by the OS 0.5 seconds before the power goes. I'm sorry if this looks like another troll, but I really do like to clear up confusion. I do accept that IDE now has good enough real performance for many purposes, but in terms of enforced quality it clearly lags behind the entire SCSI field. 
-- from: Jonathan "Chromatix" Morton mail: [EMAIL PROTECTED] (not for attachments) big-mail: [EMAIL PROTECTED] uni-mail: [EMAIL PROTECTED] The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/ -BEGIN GEEK CODE BLOCK- Version 3.12 GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+ -END GEEK CODE BLOCK-
Re: scsi vs ide performance on fsync's
On Wed, 7 Mar 2001, Jonathan Morton wrote: Still doesn't make a difference - there is one revolution between writes, no matter where on disk it is.

Oh it does, because you are hitting the same sector with the same data. Rotate your buffer and then you will see the difference.

Because of WinBench! All the prefetch/caching are modeled to be optimized to that bench-mark. Lies, damn lies, statistics, benchmarks, delivery dates. Especially a consumer-oriented benchmark like WinBench. It's perfectly natural to optimise for particular access patterns, but IMHO that doesn't excuse breaking the drive just to get a better benchmark score.

Obviously you have never been in the bowels of drive industry hell. Why do you think there was a change in ATA-6 to require Write-Verify-Read to always return stuff from the platter? Because the SOB's in storage LIE! A real wake-up call for you is that everything about the world of storage is a big-fat-whopper of a LIE. Storage devices are BLACK-BOXES with the standards/rules to communicate being dictated by the device, not the host. Storage devices are no better than a Coke(tm) vending machine. You push "Coke", it gives you "Coke". You have not a clue how it arrives or where it came from. Same thing about reading from a drive.

That isn't the point! I'm not talking about the physical mechanism, which indeed is often the same between one generation of SCSI and the next generation of IDE devices. I'm talking about the IDE controller which is slapped on the bottom of said mechanism. The mech can be of world-class quality, but if the controller is shot it doesn't cut the grain.

So there is a $5 difference in the cell-gates and the line drivers are more powerful, 80GB ATA + $5 != 80GB SCSI.

Since all OSes that enable WC at init will flush it at shutdown and do a periodic purge with inactivity. But Linux doesn't, as has been pointed out earlier. We need to fix Linux. 

Friend, I have fixed this some time ago but it is bundled with TASKFILE that is not going to arrive until 2.5. Because I need a way to execute this and hold the driver until it is complete, regardless of the shutdown method.

Err, last time I checked all good devices flush their write caching on their own to take advantage of having a maximum cache for prefetching. Which doesn't work if the buffer is filled up by the OS 0.5 seconds before the power goes.

Maybe that is why there is a vendor disk-cache dump zone on the edge of the platters... just maybe you need to buy your drives from somebody that does this and has a predictive sector stretcher as the energy from the inertia of the DC three-phase motor executes the dump. Ever wondered why modern drives have open collectors on the data bus? Maybe to disconnect the power draw so that the motor, now a generator, provides the needed power to complete the data dump...

I'm sorry if this looks like another troll, but I really do like to clear up confusion. I do accept that IDE now has good enough real performance for many purposes, but in terms of enforced quality it clearly lags behind the entire SCSI field.

I have no desire to debate the merits, but when your onboard host for ATA starts shipping with GigaBit-Copper speeds then we can have a pissing contest.

Cheers, Andre Hedrick Linux ATA Development ASL Kernel Development - ASL, Inc. Toll free: 1-877-ASL-3535 1757 Houret Court Fax: 1-408-941-2071 Milpitas, CA 95035 Web: www.aslab.com
Re: scsi vs ide performance on fsync's
On Mon, 5 Mar 2001, Linus Torvalds wrote: > Well, it's fairly hard for the kernel to do much about that - it's almost > certainly just IDE doing write buffering on the disk itself. No OS > involved. I am pushing for WC to be defaulted in the off state, but as you know I have a bigger fight than caching on my hands... > I don't know if there is any way to turn off a write buffer on an IDE disk. You want a forced set of commands to kill caching at init? Andre Hedrick Linux ATA Development ASL Kernel Development - ASL, Inc. Toll free: 1-877-ASL-3535 1757 Houret Court Fax: 1-408-941-2071 Milpitas, CA 95035 Web: www.aslab.com
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Jonathan Morton wrote: > It's pretty clear that the IDE drive(r) is *not* waiting for the physical > write to take place before returning control to the user program, whereas > the SCSI drive(r) is. Both devices appear to be performing the write Wrong, IDE does not unplug, thus the request is almost, I hate to admit it, SYNC and not ASYNC :-( Thus if the drive acks that it has the data then the driver lets go. > immediately, however (judging from the device activity lights). Whether > this is the correct behaviour or not, I leave up to you kernel hackers... Seagate has a better seek profile than IBM. The second access is correct because the first one pushed the heads to the pre-seek. Thus the question is where is the drive leaving the heads when not active? It does not appear to be in the zone 1 region. > IMHO, if an application needs performance, it shouldn't be syncing disks > after every write. Syncing means, in my book, "wait for the data to be > committed to physical media" - note the *wait* involved there - so syncing > should only be used where data integrity in the event of a system failure > has a much higher importance than performance. I have only gotten the drive makers in the past 6 months to commit to actively updating the contents of the identify page to reflect reality. Thus if your drive is one of those that does a stress test check that goes: "this bozo did not really mean to turn off write caching, re-enabling " Cheers, Andre Hedrick Linux ATA Development ASL Kernel Development - ASL, Inc. Toll free: 1-877-ASL-3535 1757 Houret Court Fax: 1-408-941-2071 Milpitas, CA 95035 Web: www.aslab.com
Re: scsi vs ide performance on fsync's
>I don't know if there is any way to turn off a write buffer on an IDE disk. hdparm has an option of this nature, but it makes no difference (as I reported). It's worth noting that even turning off UDMA to the disk on my machine doesn't help the situation - although it does slow things down a little, it's not "slow enough" to indicate that the drive is behaving properly. Might be worth running the test on some of my other machines, with their diverse collection of IDE controllers (mostly non-UDMA) and disks. >Of course, whether you should even trust the harddisk is another question. I think this result in itself would lead me *not* to trust the hard disk, especially an IDE one. Has anybody tried running this test with a recent IBM DeskStar - one of the ones that is the same mech as the equivalent UltraStar but with an IDE controller? I only have SCSI and laptop IBMs here - all my desktop IDE drives are Seagate. However I do have one SCSI Seagate, which might be worth firing up for the occasion... -- from: Jonathan "Chromatix" Morton mail: [EMAIL PROTECTED] (not for attachments) big-mail: [EMAIL PROTECTED] uni-mail: [EMAIL PROTECTED] The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Douglas Gilbert wrote: > > > On the other hand, it's also entirely possible that IDE is just a lot > > better than what the SCSI-bigots tend to claim. It's not all that > > surprising, considering that the PC industry has pushed untold billions of > > dollars into improving IDE, with SCSI as nary a consideration. The above > > may just simply be the Truth, with a capital T. > > What exactly do you think fsync() and fdatasync() should > do? If they need to wait for dirty buffers to get flushed > to the disk oxide then multiple reported IDE results to > this thread are defying physics.

Well, it's fairly hard for the kernel to do much about that - it's almost certainly just IDE doing write buffering on the disk itself. No OS involved. The kernel VFS and controller layers certainly wait for the disk to tell us that the data has been written, there's no question about that. But it's also not at all unlikely that the disk itself just lies. I don't know if there is any way to turn off a write buffer on an IDE disk.

I do remember that there were some reports of filesystem corruption with some version of Windows that turned off the machine at shutdown (using software power-off as supported by most modern motherboards), and shut down so fast that the drives had not actually written out all data. Whether the reports were true or not I do not know, but I think we can take for granted that write buffers exist.

Now, if you really care about your data integrity with a write-buffering disk, I suspect that you'd better have a UPS. At which point write buffering is a valid optimization, as long as you trust the harddisk itself not to crash even if the OS were to crash. Of course, whether you should even trust the harddisk is another question. 
Linus
Re: scsi vs ide performance on fsync's
Linus Torvalds wrote: > Well, it's entirely possible that the mid-level SCSI layer is doing > something horribly stupid. Well it's in good company as FreeBSD 4.2 on the same hardware returns the same result (including IDE timings that were too fast). My timepeg analysis showed that the SCSI disk was consuming the time, not any of the SCSI layers. > On the other hand, it's also entirely possible that IDE is just a lot > better than what the SCSI-bigots tend to claim. It's not all that > surprising, considering that the PC industry has pushed untold billions of > dollars into improving IDE, with SCSI as nary a consideration. The above > may just simply be the Truth, with a capital T. What exactly do you think fsync() and fdatasync() should do? If they need to wait for dirty buffers to get flushed to the disk oxide then multiple reported IDE results to this thread are defying physics. Doug Gilbert
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Jonathan Morton wrote: > > It's pretty clear that the IDE drive(r) is *not* waiting for the physical > write to take place before returning control to the user program, whereas > the SCSI drive(r) is. This would not be unexpected. IDE drives generally always do write buffering. I don't even know if you _can_ turn it off. So the drive claims to have written the data as soon as it has made it into the write buffer. It's definitely not the driver, but the actual drive. Linus
Re: scsi vs ide performance on fsync's
I've run the test on my own system and noted something interesting about the results: When the write() call extended the file (rather than just overwriting a section of a file already long enough), the performance drop was seen, and it was slower on SCSI than IDE - this is independent of whether IDE had hardware write-caching on or off. Where the file already existed, from an immediately-prior run of the same benchmark, both SCSI and IDE sped up to the same, relatively fast speed.

These runs are for the following code, writing 2000 blocks of 4096 bytes each:

fd = open("tst.txt", O_WRONLY | O_CREAT, 0644);
for (k = 0; k < NUM_BLKS; ++k) {
    write(fd, buff + (k * BLK_SIZE), BLK_SIZE);
    fdatasync(fd);
}
close(fd);

IDE: Seagate Barracuda 7200rpm UDMA/66
first run: 1.98 elapsed
second and further runs: 0.50 elapsed

SCSI: IBM UltraStar 10,000rpm Ultra/160
first run: 23.57 elapsed
second and further runs: 0.55 elapsed

If the test file is removed between runs, all show the longer timings. HOWEVER if I modify the benchmark to use 2000 blocks of *20* bytes each, the timings change.

IDE: Seagate Barracuda 7200rpm UDMA/66
first run: 1.46 elapsed
second and further runs: 1.45 elapsed

SCSI: IBM UltraStar 10,000rpm Ultra/160
first run: 18.30 elapsed
second and further runs: 11.88 elapsed

Notice that the time for the second run of the SCSI drive is almost exactly one-fifth of a minute, and remember that 2000 rotations / 10,000 rpm = 1/5 minute. IOW, the SCSI drive is performing *correctly* on the second run of the benchmark. The poorer performance on the first run *could* be attributed to writing metadata interleaved with the data writes. The better performance on the second run of the first benchmark can easily be attributed to the fact that the drive does not need to wait an entire revolution before writing the next block of a file, if that block arrives quickly enough (this is a Duron, so it darn well arrives quickly). 
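For anyone wanting to repeat this, the C loop quoted above can be sketched in Python as follows. This is an illustrative re-creation, not the original xlog program: the block count here is deliberately small so it finishes quickly, and it falls back to fsync() on platforms without fdatasync().

```python
import os
import tempfile
import time

# Scaled-down sketch of the benchmark above; the original used
# NUM_BLKS = 2000 and timed the whole loop with time(1).
NUM_BLKS, BLK_SIZE = 50, 4096

# fdatasync() is POSIX but not universal; fall back to fsync()
sync = getattr(os, "fdatasync", os.fsync)

path = os.path.join(tempfile.mkdtemp(), "tst.txt")
buf = b"\0" * BLK_SIZE

start = time.time()
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
for _ in range(NUM_BLKS):
    os.write(fd, buf)
    sync(fd)  # force this block toward the platter before the next write
os.close(fd)
elapsed = time.time() - start

print(f"{NUM_BLKS} synced {BLK_SIZE}-byte writes in {elapsed:.2f}s")
```

On a drive that honours the sync, each iteration should cost roughly one rotation once the file exists; on a drive that merely buffers the write, the loop completes far faster than the spindle allows.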
It's pretty clear that the IDE drive(r) is *not* waiting for the physical write to take place before returning control to the user program, whereas the SCSI drive(r) is. Both devices appear to be performing the write immediately, however (judging from the device activity lights). Whether this is the correct behaviour or not, I leave up to you kernel hackers... IMHO, if an application needs performance, it shouldn't be syncing disks after every write. Syncing means, in my book, "wait for the data to be committed to physical media" - note the *wait* involved there - so syncing should only be used where data integrity in the event of a system failure has a much higher importance than performance. -- from: Jonathan "Chromatix" Morton mail: [EMAIL PROTECTED] (not for attachments) big-mail: [EMAIL PROTECTED] uni-mail: [EMAIL PROTECTED] The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/
Re: scsi vs ide performance on fsync's
On Mon, 5 Mar 2001, Jeremy Hansen wrote: > > Right now I'm running 2.4.2-ac11 on both machines and getting the same > results: > > SCSI: > > [root@orville /root]# time /root/xlog file.out fsync > > real 0m21.266s > user 0m0.000s > sys 0m0.310s > > IDE: > > [root@kahlbi /root]# time /root/xlog file.out fsync > > real 0m8.928s > user 0m0.000s > sys 0m6.700s > > This behavior has been noticed by others, so I'm hoping I'm not just crazy > or that my test is somehow flawed. > > We're using MySQL with Berkeley DB for transaction log support. It was > really confusing when a simple ide workstation was out performing our > Ultra160 raid array.

Well, it's entirely possible that the mid-level SCSI layer is doing something horribly stupid. On the other hand, it's also entirely possible that IDE is just a lot better than what the SCSI-bigots tend to claim. It's not all that surprising, considering that the PC industry has pushed untold billions of dollars into improving IDE, with SCSI as nary a consideration. The above may just simply be the Truth, with a capital T. (And "bonnie" is not a very good benchmark. It's not exactly mirroring any real life access patterns. I would not be surprised if the SCSI driver performance has been tuned by bonnie alone, and maybe it just sucks at everything else) Maybe we should ask whether somebody like lnz is interested in seeing what SCSI does wrong here? Linus
Re: scsi vs ide performance on fsync's
On 2 Mar 2001, Linus Torvalds wrote: > In article <[EMAIL PROTECTED]>, > Jeremy Hansen <[EMAIL PROTECTED]> wrote: > > > >The SCSI adapter on the raid array is an Adaptec 39160, the raid > >controller is a CMD-7040. Kernel 2.4.0 using XFS for the filesystem on > >the raid array, kernel 2.2.18 on ext2 on the IDE drive. The filesystem is > >not the problem, as I get almost the exact same results running this on > >ext2 on the raid array. > > Did you try a 2.4.x kernel on both?

Finally got around to working on this. Right now I'm running 2.4.2-ac11 on both machines and getting the same results:

SCSI:

[root@orville /root]# time /root/xlog file.out fsync
real 0m21.266s
user 0m0.000s
sys 0m0.310s

IDE:

[root@kahlbi /root]# time /root/xlog file.out fsync
real 0m8.928s
user 0m0.000s
sys 0m6.700s

This behavior has been noticed by others, so I'm hoping I'm not just crazy or that my test is somehow flawed. We're using MySQL with Berkeley DB for transaction log support. It was really confusing when a simple ide workstation was out performing our Ultra160 raid array. Thanks -jeremy

> 2.4.0 has a bad elevator, which may show problems, so please check 2.4.2 > if the numbers change. Also, "fsync()" is very different indeed on 2.2.x > and 2.4.x, and I would not be 100% surprised if your IDE drive does > asynchronous write caching and your RAID does not... That would not show > up in bonnie. > > Also note how your bonnie file remove numbers for IDE seem to be much > better than for your RAID array, so it is not impossible that your RAID > unit just has a _huge_ setup overhead but good throughput, and that the > IDE numbers are better simply because your IDE setup is much lower > latency. Never mistake throughput for _speed_. 
> Linus

-- this is my sig.
Re: scsi vs ide performance on fsync's
Since the intention of fsync and fdatasync seems to be to write dirty fs buffers to persistent storage (i.e. the "oxide") then the best time is not necessarily the objective. Given the IDE times that people have been reporting, it is very unlikely that any of those IDE disks were really doing 2000 discrete IO operations involving waiting for those buffers to be written to the "oxide". [Reason: it should take at least 2000 revolutions of the disk to do it, since most of the 4KB writes are going to the same disk address as the prior write.]

As it stands, the Linux SCSI subsystem has no mechanism to force a disk cache write through. The SCSI WRITE(10) command has a Force Unit Access bit (FUA) to do exactly that, but we don't use it. Do the fs/block layers flag that they wish buffers written to the oxide?

The measurements that showed SCSI disks were taking a lot longer with the "xlog" test were more luck than good management. Here are some tests that show an IDE versus SCSI "xlog" comparison is very similar between FreeBSD 4.2 and lk 2.4.2 on the same hardware:

# IBM DCHS04U SCSI disk 7200 rpm <>
[root@free /var]# time /root/xlog tst.txt
real 0m0.043s
[root@free /var]# time /root/xlog tst.txt fsync
real 0m33.131s

# Quantum Fireball ST3.2A IDE disk 3600 rpm <>
[root@free dos]# time /root/xlog tst.txt
real 0m0.034s
[root@free dos]# time /root/xlog tst.txt fsync
real 0m5.737s

# IBM DCHS04U SCSI disk 7200 rpm <>
[root@tvilling extra]# time /root/xlog tst.txt
0:00.00 elapsed 125% CPU
[root@tvilling spare]# time /root/xlog tst.txt fsync
0:33.15 elapsed 0% CPU

# Quantum Fireball ST3.2A IDE disk 3600 rpm <>
[root@tvilling /root]# time /root/xlog tst.txt
0:00.02 elapsed 43% CPU
[root@tvilling /root]# time /root/xlog tst.txt fsync
0:05.99 elapsed 69% CPU

Notes: FreeBSD doesn't have fdatasync() so I changed xlog to use fsync(). Linux timings were the same with fsync() and fdatasync(). The xlog program crashed immediately in FreeBSD; it needed some sanity checks on its arguments. 
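Doug's "2000 revolutions" argument can be checked with back-of-envelope arithmetic: if each synchronous rewrite of the same sector must wait one full revolution, then N writes need at least N revolutions, giving a hard floor on elapsed time. The rpm figures below are the ones quoted in the timings above.

```python
def min_seconds(writes, rpm):
    """Lower bound on wall time for `writes` synchronous rewrites of
    the same disk address: one revolution per write."""
    return writes * 60.0 / rpm

print(round(min_seconds(2000, 7200), 1))  # 16.7 - 7200rpm SCSI disk floor
print(round(min_seconds(2000, 3600), 1))  # 33.3 - 3600rpm IDE disk floor
```

The measured 33.1s on the 7200rpm SCSI disk is comfortably above its ~16.7s floor, so it is physically plausible that every write reached the platter. The 5.7s measured on the 3600rpm IDE disk is far below its ~33.3s floor, which is exactly why the thread concludes that drive cannot be committing each write to the oxide.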
One further note. I wrote:

> [snip]
> So writing more data to the SCSI disk speeds it up!
> I suspect the critical point in the "20*200" test is
> that the same sequence of 8 512 byte sectors are being
> written to disk 200 times. BTW That disk spins at
> 15K rpm so one rotation takes 4 ms and it has a
> 4 MB cache.

A clarification: by "same sequence" I meant written to the same disk address. If the 4 KB lies on the same track, then a delay of one disk revolution would be expected before you could write the next 4 KB to the "oxide" at the same address.

Doug Gilbert
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: scsi vs ide performance on fsync's
Chris Mason <[EMAIL PROTECTED]> writes:

> filemap_fdatawait, filemap_fdatasync, and fsync_inode_buffers all restrict
> their scans to a list of dirty buffers for that specific file. Only
> file_fsync goes through all the dirty buffers on the device, and the ext2
> fsync path never calls file_fsync.
>
> Or am I missing something?

If the filesystems tested had blocksize < PAGE_SIZE, the fsync would try to sync everything rather than walk the dirty buffers directly. So, e.g., if one of the file systems tested was created with old ext2 utilities that do not use a 4K block size, some of the performance difference could be explained.

-Andi
Re: scsi vs ide performance on fsync's
I've run the test on my own system and noted something interesting about the results: when the write() call extended the file (rather than just overwriting a section of a file already long enough), the performance drop was seen, and it was slower on SCSI than IDE - this is independent of whether IDE had hardware write-caching on or off. Where the file already existed, from an immediately-prior run of the same benchmark, both SCSI and IDE sped up to the same, relatively fast speed.

These runs are for the following code, writing 2000 blocks of 4096 bytes each:

fd = open("tst.txt", O_WRONLY | O_CREAT, 0644);
for (k = 0; k < NUM_BLKS; ++k) {
        write(fd, buff + (k * BLK_SIZE), BLK_SIZE);
        fdatasync(fd);
}
close(fd);

IDE: Seagate Barracuda 7200rpm UDMA/66
first run:               1.98 elapsed
second and further runs: 0.50 elapsed

SCSI: IBM UltraStar 10000rpm Ultra/160
first run:               23.57 elapsed
second and further runs: 0.55 elapsed

If the test file is removed between runs, all show the longer timings. HOWEVER, if I modify the benchmark to use 2000 blocks of *20* bytes each, the timings change:

IDE: Seagate Barracuda 7200rpm UDMA/66
first run:               1.46 elapsed
second and further runs: 1.45 elapsed

SCSI: IBM UltraStar 10000rpm Ultra/160
first run:               18.30 elapsed
second and further runs: 11.88 elapsed

Notice that the time for the second run of the SCSI drive is almost exactly one-fifth of a minute, and remember that 2000 rotations / 10000 rpm = 1/5 minute. IOW, the SCSI drive is performing *correctly* on the second run of the benchmark. The poorer performance on the first run *could* be attributed to writing metadata interleaved with the data writes. The better performance on the second run of the first benchmark can easily be attributed to the fact that the drive does not need to wait an entire revolution before writing the next block of a file, if that block arrives quickly enough (this is a Duron, so it darn well arrives quickly).
It's pretty clear that the IDE drive(r) is *not* waiting for the physical write to take place before returning control to the user program, whereas the SCSI drive(r) is. Both devices appear to be performing the write immediately, however (judging from the device activity lights). Whether this is the correct behaviour or not, I leave up to you kernel hackers...

IMHO, if an application needs performance, it shouldn't be syncing disks after every write. Syncing means, in my book, "wait for the data to be committed to physical media" - note the *wait* involved there - so syncing should only be used where data integrity in the event of a system failure has a much higher importance than performance.

--
from: Jonathan "Chromatix" Morton
mail: [EMAIL PROTECTED] (not for attachments)
big-mail: [EMAIL PROTECTED]
uni-mail: [EMAIL PROTECTED]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Jonathan Morton wrote:
> It's pretty clear that the IDE drive(r) is *not* waiting for the physical
> write to take place before returning control to the user program, whereas
> the SCSI drive(r) is.

This would not be unexpected. IDE drives generally always do write buffering. I don't even know if you _can_ turn it off.

So the drive claims to have written the data as soon as it has made the write buffer. It's definitely not the driver, but the actual drive.

Linus
Re: scsi vs ide performance on fsync's
Linus Torvalds wrote:
> Well, it's entirely possible that the mid-level SCSI layer is doing
> something horribly stupid.

Well, it's in good company, as FreeBSD 4.2 on the same hardware returns the same result (including IDE timings that were too fast). My timepeg analysis showed that the SCSI disk was consuming the time, not any of the SCSI layers.

> On the other hand, it's also entirely possible that IDE is just a lot
> better than what the SCSI-bigots tend to claim. It's not all that
> surprising, considering that the PC industry has pushed untold billions
> of dollars into improving IDE, with SCSI as nary a consideration.
> The above may just simply be the Truth, with a capital T.

What exactly do you think fsync() and fdatasync() should do? If they need to wait for dirty buffers to get flushed to the disk oxide, then multiple IDE results reported to this thread are defying physics.

Doug Gilbert
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Douglas Gilbert wrote:
> > On the other hand, it's also entirely possible that IDE is just a lot
> > better than what the SCSI-bigots tend to claim. It's not all that
> > surprising, considering that the PC industry has pushed untold billions
> > of dollars into improving IDE, with SCSI as nary a consideration.
> > The above may just simply be the Truth, with a capital T.
>
> What exactly do you think fsync() and fdatasync() should do? If they need
> to wait for dirty buffers to get flushed to the disk oxide then multiple
> reported IDE results to this thread are defying physics.

Well, it's fairly hard for the kernel to do much about that - it's almost certainly just IDE doing write buffering on the disk itself. No OS involved.

The kernel VFS and controller layers certainly wait for the disk to tell us that the data has been written, there's no question about that. But it's also not at all unlikely that the disk itself just lies.

I don't know if there is any way to turn off the write buffer on an IDE disk. I do remember that there were some reports of filesystem corruption with some version of Windows that turned off the machine at shutdown (using software power-off as supported by most modern motherboards), and shut down so fast that the drives had not actually written out all data. Whether the reports were true or not I do not know, but I think we can take for granted that write buffers exist.

Now, if you really care about your data integrity with a write-buffering disk, I suspect you'd better have a UPS. At which point write buffering is a valid optimization, as long as you trust the hard disk itself not to crash even if the OS were to crash. Of course, whether you should even trust the hard disk is another question.

Linus
Re: scsi vs ide performance on fsync's
> I don't know if there is any way to turn off a write buffer on an IDE disk.

hdparm has an option of this nature, but it makes no difference (as I reported). It's worth noting that even turning off UDMA to the disk on my machine doesn't help the situation - although it does slow things down a little, it's not "slow enough" to indicate that the drive is behaving properly. Might be worth running the test on some of my other machines, with their diverse collection of IDE controllers (mostly non-UDMA) and disks.

> Of course, whether you should even trust the harddisk is another question.

I think this result in itself would lead me *not* to trust the hard disk, especially an IDE one. Has anybody tried running this test with a recent IBM DeskStar - one of the ones that is the same mech as the equivalent UltraStar but with an IDE controller? I only have SCSI and laptop IBMs here - all my desktop IDE drives are Seagate. However, I do have one SCSI Seagate, which might be worth firing up for the occasion...

--
from: Jonathan "Chromatix" Morton
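The hdparm option in question is -W, which asks the drive to toggle its write cache. A sketch of the invocation (the device path is an assumption for illustration, and as reported above, some period drives silently ignore or revert the setting):

```shell
# Device path assumed; adjust for your system (first IDE disk shown).
hdparm -W0 /dev/hda   # ask the drive to disable its write cache
# ... rerun the fsync benchmark here ...
hdparm -W1 /dev/hda   # re-enable the write cache afterwards
```

Whether the drive actually honours -W0 can only be judged from the benchmark timings, which is the crux of this thread.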
Re: scsi vs ide performance on fsync's
On Tue, 6 Mar 2001, Jonathan Morton wrote:
> It's pretty clear that the IDE drive(r) is *not* waiting for the physical
> write to take place before returning control to the user program, whereas
> the SCSI drive(r) is. Both devices appear to be performing the write

Wrong - IDE does not unplug, thus the request is almost, I hate to admit it, SYNC and not ASYNC :-( Thus, if the drive acks that it has the data, then the driver lets go.

> immediately, however (judging from the device activity lights). Whether
> this is the correct behaviour or not, I leave up to you kernel hackers...

Seagate has a better seek profile than IBM. The second access is correct because the first one pushed the heads to the pre-seek. Thus the question is: where is the drive leaving the heads when not active? It does not appear to be in the zone 1 region.

> IMHO, if an application needs performance, it shouldn't be syncing disks
> after every write. Syncing means, in my book, "wait for the data to be
> committed to physical media" - note the *wait* involved there - so syncing
> should only be used where data integrity in the event of a system failure
> has a much higher importance than performance.

I have only gotten the drive makers in the past 6 months to commit to actively updating the contents of the identify page to reflect reality. Thus your drive may be one of those that does a stress-test check that goes: "this bozo did not really mean to turn off write caching, re-enabling" smirk.

Cheers,

Andre Hedrick
Linux ATA Development
ASL Kernel Development
-----------------------------
ASL, Inc.             Toll free: 1-877-ASL-3535
1757 Houret Court     Fax: 1-408-941-2071
Milpitas, CA 95035    Web: www.aslab.com
Re: scsi vs ide performance on fsync's
On Mon, 5 Mar 2001, Linus Torvalds wrote:
> Well, it's fairly hard for the kernel to do much about that - it's
> almost certainly just IDE doing write buffering on the disk itself. No
> OS involved.

I am pushing for WC to be defaulted to the off state, but as you know I have a bigger fight than caching on my hands...

> I don't know if there is any way to turn off a write buffer on an IDE disk.

You want a forced set of commands to kill caching at init?

Andre Hedrick
Linux ATA Development
Re: scsi vs ide performance on fsync's
Douglas Gilbert wrote:
> There is definitely something strange going on here.
> As the bonnie test below shows, the SCSI disk used
> for my tests should vastly outperform the old IDE one:

First, thank you and others for helping with my clueless investigation of module loading under Debian GNU/Linux. (I should have known that Debian uses a very special module setup.)

Anyway, I used to think SCSI is better than IDE in general, and the post was quite surprising. So I ran the test on my PC. On my systems too, IDE beats SCSI hands down with the test case.

BTW, has anyone noticed that the elapsed time of the SCSI case is TWICE as long if we let the previous output of the test program stay in place before running the second test? (I suspect fdatasync takes time proportional to the then-current file size, but why the SCSI case is so long is still beyond me.) E.g.:

ishikawa@duron$ ls -l /tmp/t.out
ls: /tmp/t.out: No such file or directory
ishikawa@duron$ time ./xlog /tmp/t.out fsync
real    0m38.673s    <=== my scsi disk is a slow one to begin with...
user    0m0.050s
sys     0m0.140s
ishikawa@duron$ ls -l /tmp/t.out
-rw-r--r--    1 ishikawa users    112000 Mar  5 06:19 /tmp/t.out
ishikawa@duron$ time ./xlog /tmp/t.out fsync
real    1m16.928s    <=== See, TWICE as long!
user    0m0.060s
sys     0m0.160s
ishikawa@duron$ ls -l /tmp/t.out
-rw-r--r--    1 ishikawa users    112000 Mar  5 06:20 /tmp/t.out
ishikawa@duron$ rm /tmp/t.out    <==== REMOVE the file and try again.
ishikawa@duron$ time ./xlog /tmp/t.out fsync
real    0m40.667s    <==== Half as long, and back to the original.
user    0m0.040s
sys     0m0.120s
ishikawa@duron$ time ./xlog /tmp/t.out xxx
real    0m0.012s     <=== very fast without fdatasync, as it should be.
user    0m0.010s
sys     0m0.010s
ishikawa@duron$