Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jul 2, 2011, at 6:39 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> Conclusion: Yes it matters to enable the write_cache.
>
> Now the question of whether or not it matters to use the whole disk versus
> partitioning, and how to enable the write_cache automatically on sliced
> disks:
>
> I understand that different people have had different results based on which
> hardware and which OS rev they are using. So if this matters to you, you'll
> just need to check for yourself. But here is what I found:
>
> On a Sun(Oracle) X4270 and solaris 11 express, my behavior is pretty much as
> the man page describes. If I create a pool using the whole disks, then the
> write_cache is enabled automatically. When I destroy a pool, the
> write_cache is returned to its previous (disabled) state. When I create a
> pool using slices of the disks, then the write_cache is not automatically
> enabled.

Yes, this is annoying. In NexentaStor, we have a property that manages the write cache policy on a per-device basis.

> I would like to know: Is there some way to enable the write cache
> automatically on specific devices? I have a script that will enable the
> write_cache on all the devices (just a simple wrapper around format -f) and
> of course I can make it run at startup, or on a cron job, etc. But I'd like
> to know if there's a more native way to achieve that end result.

I would say change the way it compiles, but you are stuck without source for Solaris.

> I have one really specific reason to care about automatically enabling the
> write_cache on sliced disks:
>
> All the disks in the system are large disks. (2T). The OS only needs a few
> G, so we install the OS into mirrored slices. The rest of the disk is
> sliced and added to the storage pool. The default behavior in this
> situation is to disable write_cache on the first few disks.

Sounds like a reasonable request to me. Maybe Oracle will accept an RFE?
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>
> Conclusion: Yes it matters to enable the write_cache.

Now the question of whether or not it matters to use the whole disk versus partitioning, and how to enable the write_cache automatically on sliced disks:

I understand that different people have had different results based on which hardware and which OS rev they are using. So if this matters to you, you'll just need to check for yourself. But here is what I found:

On a Sun(Oracle) X4270 and solaris 11 express, my behavior is pretty much as the man page describes. If I create a pool using the whole disks, then the write_cache is enabled automatically. When I destroy a pool, the write_cache is returned to its previous (disabled) state. When I create a pool using slices of the disks, then the write_cache is not automatically enabled.

I would like to know: Is there some way to enable the write cache automatically on specific devices? I have a script that will enable the write_cache on all the devices (just a simple wrapper around format -f) and of course I can make it run at startup, or on a cron job, etc. But I'd like to know if there's a more native way to achieve that end result.

I have one really specific reason to care about automatically enabling the write_cache on sliced disks:

All the disks in the system are large disks. (2T). The OS only needs a few G, so we install the OS into mirrored slices. The rest of the disk is sliced and added to the storage pool. The default behavior in this situation is to disable write_cache on the first few disks.
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Ross Walker [mailto:rswwal...@gmail.com]
> Sent: Friday, June 17, 2011 9:48 PM
>
> The on-disk buffer is there so data is ready when the hard drive head lands,
> without it the drive's average rotational latency will trend higher due to
> missed landings because the data wasn't in buffer at the right time.
>
> The read buffer is to allow the disk to continuously read sectors whether the
> system bus is ready to transfer or not. Without it, sequential reads wouldn't
> last long enough to reach max throughput before they would have to pause
> because of bus contention and then suffer a rotation of latency hit which
> would kill read performance.

And it turns out ... Ross is the winner. ;-)

My hypothesis wasn't right, and whoever said a single disk would hog the bus in an idle state, that's also wrong.

Conclusion: Yes it matters to enable the write_cache. But the reason it matters is to ensure the right data is present at the right time. NOT because of any idle bus blocking.

Here's the test:

I tested writing to a bunch (4) of disks simultaneously at maximum throughput, with the write_cache enabled. This is on a 6Gbit bus, they all performed 1.0 Gbit/sec which was precisely the mfgr spec. Then I disabled write_cache on all the disks and repeated the test. They all dropped to 750 Mbit/sec.

If the idle bus contention were correct, then the total bus speed would have been limited to the max throughput of a single disk (1Gbit). But I was easily able to sustain 3Gbit, thus disproving the idle bus contention.

If the filesystem write buffer were making the disk write_cache irrelevant, as I conjectured, then the total throughput would have been the same, regardless of whether the write_cache was enabled or disabled. Since performance dropped with write_cache disabled, it disproves my hypothesis.

No further testing was necessary. I'm not interested in how much performance difference there is - or under which specific conditions they occur. I am only interested in the existence of a performance difference.

So the conclusion is yes, you want to enable your disk write cache (assuming all the data on your disk is managed by ZFS.)
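The arithmetic behind the two disproofs in the message above can be sketched as follows. This is only an illustration: the per-disk rates and the 6 Gbit bus are the figures reported in the message, while the function names are made up for this sketch and are not part of any tool.

```python
# Figures reported in the message; all names below are illustrative.
BUS_GBIT = 6.0
DISKS = 4
PER_DISK_CACHE_ON = 1.0    # Gbit/s per disk with write_cache enabled
PER_DISK_CACHE_OFF = 0.75  # Gbit/s per disk with write_cache disabled


def aggregate(per_disk_gbit: float, disks: int = DISKS) -> float:
    """Total throughput when all disks write simultaneously."""
    return per_disk_gbit * disks


def idle_bus_contention_holds(total_cache_off: float, single_disk_max: float) -> bool:
    """The contention hypothesis predicts the total is capped at one disk's rate."""
    return total_cache_off <= single_disk_max


def fs_buffer_makes_cache_irrelevant(total_on: float, total_off: float) -> bool:
    """The RAM-buffering hypothesis predicts no difference between the two runs."""
    return total_on == total_off


total_on = aggregate(PER_DISK_CACHE_ON)
total_off = aggregate(PER_DISK_CACHE_OFF)

print(f"cache on:  {total_on} Gbit/s, cache off: {total_off} Gbit/s")
print("idle bus contention holds:", idle_bus_contention_holds(total_off, PER_DISK_CACHE_ON))
print("fs buffer makes cache irrelevant:", fs_buffer_makes_cache_irrelevant(total_on, total_off))
```

Both totals stay below the 6 Gbit bus limit, so under these figures the bus itself is not the bottleneck in either run.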
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
2011-06-19 3:47, Richard Elling wrote:
> On Jun 16, 2011, at 8:05 PM, Daniel Carosone wrote:
>> On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
>>>> From: Daniel Carosone [mailto:d...@geek.com.au]
>>>> Sent: Thursday, June 16, 2011 10:27 PM
>>>>
>>>> Is it still the case, as it once was, that allocating anything other
>>>> than whole disks as vdevs forces NCQ / write cache off on the drive
>>>> (either or both, forget which, guess write cache)?
>>>
>>> I will only say, that regardless of whether or not that is or ever was
>>> true, I believe it's entirely irrelevant. Because your system performs
>>> read and write caching and buffering in ram, the tiny little ram on the
>>> disk can't possibly contribute anything.
>>
>> I disagree. It can vastly help improve the IOPS of the disk and keep
>> the channel open for more transactions while one is in progress.
>> Otherwise, the channel is idle, blocked on command completion, while
>> the heads seek.
>
> Actually, all of the data I've gathered recently shows that the number of
> IOPS does not significantly increase for HDDs running random workloads.
> However the response time does :-( My data is leading me to want to restrict
> the queue depth to 1 or 2 for HDDs.
>
> SSDs are another story, they scale much better in the response time and
> IOPS vs queue depth analysis.

Now, is there going to be a tunable which would allow us to set queue depths per-device? Or are tunables so evil that you'd "rather poke your eye out with a stick"? (C) Richard Elling ;)

-- 
Klimov Evgeny (Jim Klimov), CTO
JSC COS&HT
+7-903-7705859 (cellular)  mailto:jimkli...@cos.ru
CC: ad...@cos.ru, jimkli...@mail.ru
()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 21, 2011, at 8:18 AM, Garrett D'Amore wrote:
>>
>> Does that also go through disksort? Disksort doesn't seem to have any
>> concept of priorities (but I haven't looked in detail where it plugs in to
>> the whole framework).
>>
>>> So it might make better sense for ZFS to keep the disk queue depth small
>>> for HDDs.
>>> -- richard
>>>
>>
>
> disksort is much further down than zio priorities... by the time disksort
> sees them they have already been sorted in priority order.

Yes, disksort is at sd. So ZFS schedules I/Os, disksort reorders them, and the drive reorders them again. To get the best advantage out of the ZFS priority ordering, I can make an argument to disable disksort and keep the vdev_max_pending low to limit the reordering work done by the drive. I am not convinced that traditional benchmarks show the effects of ZFS priority ordering, though.
 -- richard
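To illustrate the argument above, here is a toy model (not the actual ZFS, sd, or disksort code) of what happens when I/Os are issued in priority order but a lower layer re-sorts each batch of outstanding I/Os by block address. The tuple layout, the batch-at-a-time behaviour, and the function names are all assumptions made for this sketch.

```python
from typing import List, Tuple

IO = Tuple[int, int]  # (priority, lba); a lower priority number is more urgent


def dispatch(ios: List[IO], queue_depth: int) -> List[IO]:
    """Issue I/Os in priority order, then let the lower layer sort each
    batch of `queue_depth` outstanding I/Os by LBA before servicing it."""
    by_priority = sorted(ios, key=lambda io: io[0])
    serviced: List[IO] = []
    for start in range(0, len(by_priority), queue_depth):
        batch = by_priority[start:start + queue_depth]
        serviced.extend(sorted(batch, key=lambda io: io[1]))
    return serviced


def position_of_most_urgent(serviced: List[IO]) -> int:
    """Index at which the most urgent I/O actually gets serviced."""
    most_urgent = min(serviced, key=lambda io: io[0])
    return serviced.index(most_urgent)


ios = [(0, 900), (1, 100), (2, 500), (3, 300), (4, 700), (5, 200)]
for depth in (1, 2, 6):
    order = dispatch(ios, depth)
    print(depth, position_of_most_urgent(order), order)
```

In this model a queue depth of 1 preserves the priority order exactly, while a depth equal to the whole backlog lets the address sort push the most urgent request to the very end. Real drives reorder continuously rather than in fixed batches, so this only shows the direction of the effect.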
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Sun, 19 Jun 2011, Richard Elling wrote:
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess). I observe that data from a fast,
> modern

I am still using 5 here. :-)

> I haven't formed an opinion yet, but I'm inclined towards wanting overall
> better latency.

Most properly implemented systems are not running at maximum capacity and so decreased latency is definitely desirable so that applications obtain the best CPU usage and short-lived requests do not clog the system. Typical benchmark scenarios (max sustained or peak throughput) do not represent most real-world usage. The 60 or 80% solution (with assured reasonable response time) is definitely better than the 99% solution when it comes to user satisfaction.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess). I observe that data from a fast,
> modern HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to
> 333 IOPS. But as we add threads, the average response time increases from
> 2.3ms to 137ms.

Interesting. What happens to total throughput, since that's the expected tradeoff against latency here? I might guess that in your tests with a constant io size, it's linear with IOPS - but I wonder if that remains so for larger IO or with mixed sizes?

> Since the whole idea is to get lower response time, and we know disks are not
> simple queues so there is no direct IOPS to response time relationship, maybe
> it is simply better to limit the number of outstanding I/Os.

I also wonder if we're seeing a form of "bufferbloat" here in these latencies.

As I wrote in another post yesterday, remember that you're not counting actual outstanding IO's here, because the write IO's are being acknowledged immediately and tracked internally. The disk may therefore be getting itself into a state where either the buffer/queue is effectively full, or the number of requests it is tracking internally becomes inefficient (as well as the head-thrashing).

Even before you get to that state and writes start slowing down too, your averages are skewed by write cache. All the writes are fast, while a longer queue exposes reads to contention with each other, as well as to a much wider window of writes. Can you look at the average response time for just the reads, even amongst a mixed r/w workflow? Perhaps some alternate statistic than average, too.

Can you repeat the tests with write-cache disabled, so you're more accurately exposing the controller's actual workload and backlog?

I hypothesise that this will avoid those latencies getting so ridiculously out of control, and potentially also show better (relative) results for higher concurrency counts. Alternately, it will show that your disk firmware really is horrible at managing concurrency even for small values :)

Whether it shows better absolute results than a shorter queue + write cache is an entirely different question. The write cache will certainly make things faster in the common case, which is another way of saying that your lower-bound average latencies are artificially low and making the degradation look worse.

> > This comment seems to indicate that the drive queues up a whole bunch of
> > requests, and since the queue is large, each individual response time has
> > become large. It's not that physical actual performance has degraded with
> > the cache enabled, it's that the queue has become long. For async writes,
> > you don't really care how long the queue is, but if you have a mixture of
> > async writes and occasional sync writes... Then the queue gets long, and
> > when you sync, the sync operation will take a long time to complete. You
> > might actually benefit by disabling the disk cache.
> >
> > Richard, have I gotten the gist of what you're saying?
>
> I haven't formed an opinion yet, but I'm inclined towards wanting overall
> better latency.

And, in particular, better latency for specific (read) requests that zfs prioritises; these are often the ones that contribute most to a system feeling unresponsive. If this prioritisation is lost once passed to the disk, both because the disk doesn't have a priority mechanism and because it's contending with the deferred cost of previous writes, then you'll get better latency for the requests you care most about with a shorter queue.

--
Dan.
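The point above about averages being skewed by cached writes, and the suggestion to look at reads alone and at a statistic other than the mean, can be sketched like this. The latency samples are synthetic and invented purely for illustration; they are not measurements from anyone in this thread, and `percentile` is a simple nearest-rank helper written for this sketch.

```python
import statistics
from typing import List

# Synthetic latencies in milliseconds (assumed values, for illustration only).
write_ms: List[float] = [0.2] * 80                 # writes acknowledged from cache
read_ms: List[float] = [5.0] * 15 + [150.0] * 5    # reads waiting behind other work
all_ms = write_ms + read_ms


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling of pct% of n
    return ordered[rank - 1]


mean_all = statistics.mean(all_ms)
mean_reads = statistics.mean(read_ms)

print(f"mean, all I/O:   {mean_all:.2f} ms")
print(f"mean, reads:     {mean_reads:.2f} ms")
print(f"median, reads:   {percentile(read_ms, 50)} ms")
print(f"95th pct, reads: {percentile(read_ms, 95)} ms")
```

With these made-up numbers the overall mean looks modest because the many fast writes dominate it, while the read-only mean and especially the read tail are far worse, which is the kind of effect the message asks to separate out.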
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
For SSD we have code in illumos that disables disksort. Ultimately, we believe that the cost of disksort is in the noise for performance.

 -- Garrett D'Amore

On Jun 20, 2011, at 8:38 AM, "Andrew Gabriel" wrote:

> Richard Elling wrote:
>> On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
>>
>>> Richard Elling wrote:
>>>> Actually, all of the data I've gathered recently shows that the number
>>>> of IOPS does not significantly increase for HDDs running random
>>>> workloads. However the response time does :-( My data is leading me to
>>>> want to restrict the queue depth to 1 or 2 for HDDs.
>>>
>>> Thinking out loud here, but if you can queue up enough random I/Os, the
>>> embedded disk controller can probably do a good job reordering them into
>>> less random elevator sweep pattern, and increase IOPs through reducing the
>>> total seek time, which may be why IOPs does not drop as much as one might
>>> imagine if you think of the heads doing random seeks (they aren't random
>>> anymore). However, this requires that there's a reasonable queue of I/Os
>>> for the controller to optimise, and processing that queue will necessarily
>>> increase the average response time. If you run with a queue depth of 1 or
>>> 2, the controller can't do this.
>>>
>>
>> I agree. And disksort is in the mix, too.
>>
>
> Oh, I'd never looked at that.
>
>>> This is something I played with ~30 years ago, when the OS disk driver was
>>> responsible for the queuing and reordering disc transfers to reduce total
>>> seek time, and disk controllers were dumb.
>>>
>>
>> ...and disksort still survives... maybe we should kill it?
>>
>
> It looks like it's possibly slightly worse than the pathologically worst
> response time case I described below...
>
>>> There are lots of options and compromises, generally weighing reduction in
>>> total seek time against longest response time. Best reduction in total seek
>>> time comes from planning out your elevator sweep, and inserting newly
>>> queued requests into the right position in the sweep ahead. That also gives
>>> the potentially worse response time, as you may have one transfer queued
>>> for the far end of the disk, whilst you keep getting new transfers queued
>>> for the track just in front of you, and you might end up reading or writing
>>> the whole disk before you get to do that transfer which is queued for the
>>> far end. If you can get a big enough queue, you can modify the insertion
>>> algorithm to never insert into the current sweep, so you are effectively
>>> planning two sweeps ahead. Then the worse response time becomes the time to
>>> process one queue full, rather than the time to read or write the whole
>>> disk. Lots of other tricks too (e.g. insertion into sweeps taking into
>>> account priority, such as if the I/O is a synchronous or asynchronous, and
>>> age of existing queue entries). I had much fun playing with this at the
>>> time.
>>>
>>
>> The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os
>> sent to the disk.
>>
>
> Does that also go through disksort? Disksort doesn't seem to have any concept
> of priorities (but I haven't looked in detail where it plugs in to the whole
> framework).
>
>> So it might make better sense for ZFS to keep the disk queue depth small for
>> HDDs.
>> -- richard
>>
>
> --
> Andrew Gabriel
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote:
> On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
>
>> Richard Elling wrote:
>>> Actually, all of the data I've gathered recently shows that the number
>>> of IOPS does not significantly increase for HDDs running random
>>> workloads. However the response time does :-( My data is leading me to
>>> want to restrict the queue depth to 1 or 2 for HDDs.
>>
>> Thinking out loud here, but if you can queue up enough random I/Os, the
>> embedded disk controller can probably do a good job reordering them into
>> less random elevator sweep pattern, and increase IOPs through reducing the
>> total seek time, which may be why IOPs does not drop as much as one might
>> imagine if you think of the heads doing random seeks (they aren't random
>> anymore). However, this requires that there's a reasonable queue of I/Os
>> for the controller to optimise, and processing that queue will necessarily
>> increase the average response time. If you run with a queue depth of 1 or
>> 2, the controller can't do this.
>
> I agree. And disksort is in the mix, too.

Oh, I'd never looked at that.

>> This is something I played with ~30 years ago, when the OS disk driver was
>> responsible for the queuing and reordering disc transfers to reduce total
>> seek time, and disk controllers were dumb.
>
> ...and disksort still survives... maybe we should kill it?

It looks like it's possibly slightly worse than the pathologically worst response time case I described below...

>> There are lots of options and compromises, generally weighing reduction in
>> total seek time against longest response time. Best reduction in total seek
>> time comes from planning out your elevator sweep, and inserting newly
>> queued requests into the right position in the sweep ahead. That also gives
>> the potentially worse response time, as you may have one transfer queued
>> for the far end of the disk, whilst you keep getting new transfers queued
>> for the track just in front of you, and you might end up reading or writing
>> the whole disk before you get to do that transfer which is queued for the
>> far end. If you can get a big enough queue, you can modify the insertion
>> algorithm to never insert into the current sweep, so you are effectively
>> planning two sweeps ahead. Then the worse response time becomes the time to
>> process one queue full, rather than the time to read or write the whole
>> disk. Lots of other tricks too (e.g. insertion into sweeps taking into
>> account priority, such as if the I/O is a synchronous or asynchronous, and
>> age of existing queue entries). I had much fun playing with this at the
>> time.
>
> The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os
> sent to the disk.

Does that also go through disksort? Disksort doesn't seem to have any concept of priorities (but I haven't looked in detail where it plugs in to the whole framework).

> So it might make better sense for ZFS to keep the disk queue depth small for
> HDDs.
> -- richard

--
Andrew Gabriel
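The elevator-sweep tradeoff described above can be sketched in a few lines. This is a simplified model written for illustration, not disksort or any drive firmware: tracks are plain integers, seek cost is the absolute track distance, and the `TwoSweepQueue` class is a hypothetical name for the "never insert into the current sweep" variant, here using ascending sweeps only.

```python
from typing import List


def total_seek(start: int, order: List[int]) -> int:
    """Sum of head movements (in tracks) to service requests in this order."""
    distance, head = 0, start
    for track in order:
        distance += abs(track - head)
        head = track
    return distance


def elevator_order(start: int, requests: List[int]) -> List[int]:
    """One upward sweep from the head position, then one downward sweep."""
    ahead = sorted(t for t in requests if t >= start)
    behind = sorted((t for t in requests if t < start), reverse=True)
    return ahead + behind


class TwoSweepQueue:
    """New requests never join the sweep in progress; they wait for the next one."""

    def __init__(self) -> None:
        self.current: List[int] = []
        self.pending: List[int] = []

    def submit(self, track: int) -> None:
        self.pending.append(track)

    def next_request(self) -> int:
        if not self.current:
            # Start a new sweep from everything that has accumulated so far.
            self.current, self.pending = sorted(self.pending), []
        return self.current.pop(0)  # raises IndexError if nothing is queued


requests = [95, 10, 60, 20, 80, 55]
print("arrival order:", total_seek(50, requests))
print("elevator:     ", total_seek(50, elevator_order(50, requests)))

q = TwoSweepQueue()
q.submit(90)
q.submit(10)
first = q.next_request()     # sweep [10, 90] begins
q.submit(11)                 # arrivals just ahead of the head...
q.submit(12)
second = q.next_request()    # ...do not delay the far-end request
print(first, second, q.next_request(), q.next_request())
```

The first half shows the seek-distance saving from sweeping; the second half shows that deferring new arrivals to the next sweep bounds how long the far-end request waits, at the cost of not servicing nearby requests immediately.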
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 20, 2011, at 6:31 AM, Gary Mills wrote:
> On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
>> On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
>>>> From: Richard Elling [mailto:richard.ell...@gmail.com]
>>>> Sent: Saturday, June 18, 2011 7:47 PM
>>>>
>>>> Actually, all of the data I've gathered recently shows that the number
>>>> of IOPS does not significantly increase for HDDs running random
>>>> workloads. However the response time does :-(
>>>
>>> Could you clarify what you mean by that?
>>
>> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
>> The old value was 35 (a guess, but a really bad guess) and the new value is
>> 10 (another guess, but a better guess). I observe that data from a fast,
>> modern HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to
>> 333 IOPS. But as we add threads, the average response time increases from
>> 2.3ms to 137ms. Since the whole idea is to get lower response time, and we
>> know disks are not simple queues so there is no direct IOPS to response
>> time relationship, maybe it is simply better to limit the number of
>> outstanding I/Os.
>
> How would this work for a storage device with an intelligent
> controller that provides only a few LUNs to the host, even though it
> contains a much larger number of disks? I would expect the controller
> to be more efficient with a large number of outstanding IOs because it
> could distribute those IOs across the disks. It would, of course,
> require a non-volatile cache to provide fast turnaround for writes.

Yes, I've set it as high as 4,000 for a fast storage array. One size does not fit all.

For normal operations, with a separate log and HDDs in the pool, I'm leaning towards 16. Except when resilvering or scrubbing, in which case 1 is better for HDDs.
 -- richard
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
> >> From: Richard Elling [mailto:richard.ell...@gmail.com]
> >> Sent: Saturday, June 18, 2011 7:47 PM
> >>
> >> Actually, all of the data I've gathered recently shows that the number of
> >> IOPS does not significantly increase for HDDs running random workloads.
> >> However the response time does :-(
> >
> > Could you clarify what you mean by that?
>
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess). I observe that data from a fast,
> modern HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to
> 333 IOPS. But as we add threads, the average response time increases from
> 2.3ms to 137ms. Since the whole idea is to get lower response time, and we
> know disks are not simple queues so there is no direct IOPS to response time
> relationship, maybe it is simply better to limit the number of outstanding
> I/Os.

How would this work for a storage device with an intelligent controller that provides only a few LUNs to the host, even though it contains a much larger number of disks? I would expect the controller to be more efficient with a large number of outstanding IOs because it could distribute those IOs across the disks. It would, of course, require a non-volatile cache to provide fast turnaround for writes.

--
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Richard Elling [mailto:richard.ell...@gmail.com]
> Sent: Sunday, June 19, 2011 11:03 AM
>
> > I was planning, in the near
> > future, to go run iozone on some system with, and without the disk cache
> > enabled according to format -e. If my hypothesis is right, it shouldn't
> > significantly affect the IOPS, which seems to be corroborated by your
> > message.
>
> iozone is a file system benchmark, won't tell you much about IOPS at the disk
> level.
> Be aware of all of the caching that goes on there.

Yeah, that's the whole point. The basis of my argument was: Due to the caching & buffering the system does in RAM, the disks' cache & buffer are not relevant. The conversation spawns from the premise of whole-disk versus partition-based pools, possibly toggling the disk cache to off. See the subject of this email. ;-)

Hopefully I'll have time to (dis) prove that conjecture this week.
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Fri, Jun 17, 2011 at 07:41:41AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:d...@geek.com.au]
> > Sent: Thursday, June 16, 2011 11:05 PM
> >
> > the [sata] channel is idle, blocked on command completion, while
> > the heads seek.
>
> I'm interested in proving this point. Because I believe it's false.
>
> Just hand waving for the moment ... Presenting the alternative viewpoint
> that I think is correct...
>
> All drives, regardless of whether or not their disk cache or buffer is
> enabled, support PIO and DMA. This means no matter the state of the cache
> or buffer, the bus will deliver information to/from the memory of the disk
> as fast as possible, and the disk will optimize the visible workload to the
> best of its ability, and the disk will report back an interrupt when each
> operation is completed out-of-order.

Yes, up to that last "out-of-order". Without NCQ, requests are in-order and wait for completion with the channel idle.

> It would be stupid for a disk to hog the bus in an idle state.

Yes, but remember that ATA was designed originally to be stupid (simple). The complexity has crept in over time. Understanding the history and development order is important here.

So, for older ATA disks, commands would transfer relatively quickly over the channel, which would then remain idle until a completion interrupt. Systems got faster. Write cache was added to make writes "complete" faster, read cache (with prefetch) was added in the hope of satisfying read requests faster and freeing up the channel. Systems got faster. NCQ was added (rather, TCQ was reinvented and crippled) to try and get better concurrency. NCQ supports only a few outstanding ops, in part because write-cache was by then established practice (turning it off would adversely impact benchmarks, especially for software that couldn't take advantage of concurrency).

So, today with NCQ, writes are again essentially in-order (to cache) until the cache is full and requests start blocking. NCQ may offer some benefit to concurrent reads, but again of little value if the cache is full.

Furthermore, the disk controllers may not be doing such a great job when given concurrent requests anyway, as Richard mentions elsewhere. Will reply to those points a little later.

--
Dan.
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
> Richard Elling wrote:
>> Actually, all of the data I've gathered recently shows that the number of
>> IOPS does not significantly increase for HDDs running random workloads.
>> However the response time does :-( My data is leading me to want to restrict
>> the queue depth to 1 or 2 for HDDs.
>>
>
> Thinking out loud here, but if you can queue up enough random I/Os, the
> embedded disk controller can probably do a good job reordering them into less
> random elevator sweep pattern, and increase IOPs through reducing the total
> seek time, which may be why IOPs does not drop as much as one might imagine
> if you think of the heads doing random seeks (they aren't random anymore).
> However, this requires that there's a reasonable queue of I/Os for the
> controller to optimise, and processing that queue will necessarily increase
> the average response time. If you run with a queue depth of 1 or 2, the
> controller can't do this.

I agree. And disksort is in the mix, too.

> This is something I played with ~30 years ago, when the OS disk driver was
> responsible for the queuing and reordering disc transfers to reduce total
> seek time, and disk controllers were dumb.

...and disksort still survives... maybe we should kill it?

> There are lots of options and compromises, generally weighing reduction in
> total seek time against longest response time. Best reduction in total seek
> time comes from planning out your elevator sweep, and inserting newly queued
> requests into the right position in the sweep ahead. That also gives the
> potentially worse response time, as you may have one transfer queued for the
> far end of the disk, whilst you keep getting new transfers queued for the
> track just in front of you, and you might end up reading or writing the whole
> disk before you get to do that transfer which is queued for the far end. If
> you can get a big enough queue, you can modify the insertion algorithm to
> never insert into the current sweep, so you are effectively planning two
> sweeps ahead. Then the worse response time becomes the time to process one
> queue full, rather than the time to read or write the whole disk. Lots of
> other tricks too (e.g. insertion into sweeps taking into account priority,
> such as if the I/O is a synchronous or asynchronous, and age of existing
> queue entries). I had much fun playing with this at the time.

The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os sent to the disk.

So it might make better sense for ZFS to keep the disk queue depth small for HDDs.
 -- richard
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
>> From: Richard Elling [mailto:richard.ell...@gmail.com]
>> Sent: Saturday, June 18, 2011 7:47 PM
>>
>> Actually, all of the data I've gathered recently shows that the number of
>> IOPS does not significantly increase for HDDs running random workloads.
>> However the response time does :-(
>
> Could you clarify what you mean by that?

Yes. I've been looking at what the value of zfs_vdev_max_pending should be. The old value was 35 (a guess, but a really bad guess) and the new value is 10 (another guess, but a better guess). In data from a fast, modern HDD, I observe that for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to 333. But as we add threads, the average response time increases from 2.3ms to 137ms. Since the whole idea is to get lower response time, and we know disks are not simple queues so there is no direct IOPS to response time relationship, maybe it is simply better to limit the number of outstanding I/Os. FWIW, I left disksort enabled (the default).

> I was planning, in the near
> future, to go run iozone on some system with, and without the disk cache
> enabled according to format -e. If my hypothesis is right, it shouldn't
> significantly affect the IOPS, which seems to be corroborated by your
> message.

iozone is a file system benchmark; it won't tell you much about IOPS at the disk level. Be aware of all of the caching that goes on there. I used a simple vdbench test on the raw device: vary I/O size from 512 bytes to 128KB, vary threads from 1 to 10, full stroke, 4KB random, read and write.

> I was also planning to perform sequential throughput testing on two disks
> simultaneously, with and without the disk cache enabled. If one disk is
> actually able to hog the bus in an idle state, it should mean the total
> combined throughput with cache disabled would be equal to a single disk.
> (Which I highly doubt.)
>
>> However the response time does [increase] :-(
>
> This comment seems to indicate that the drive queues up a whole bunch of
> requests, and since the queue is large, each individual response time has
> become large. It's not that physical actual performance has degraded with
> the cache enabled, it's that the queue has become long. For async writes,
> you don't really care how long the queue is, but if you have a mixture of
> async writes and occasional sync writes... Then the queue gets long, and
> when you sync, the sync operation will take a long time to complete. You
> might actually benefit by disabling the disk cache.
>
> Richard, have I gotten the gist of what you're saying?

I haven't formed an opinion yet, but I'm inclined towards wanting overall better latency.

> Incidentally, I have done extensive testing of enabling/disabling the HBA
> writeback cache. I found that as long as you have a dedicated log device
> for sync writes, your performance is significantly better by disabling the
> HBA writeback. Something on order of 15% better.

Yes, I recall these tests.
 -- richard
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
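As a rough first-order illustration of why response time climbs when IOPS stays flat, Little's law for a closed system gives mean response time R = N / X, where N is the number of outstanding I/Os and X is the throughput. The sketch below is only that idealized relationship: `mean_response_ms` is a hypothetical helper, 320 IOPS is an illustrative value inside the 309 to 333 range quoted above, and the output is not meant to reproduce the measured 2.3ms to 137ms figures (as noted in the message, disks are not simple queues).

```python
def mean_response_ms(outstanding_ios: int, iops: float) -> float:
    """Idealized Little's law for a closed system: R = N / X, in milliseconds."""
    if outstanding_ios < 1 or iops <= 0:
        raise ValueError("need at least one outstanding I/O and positive IOPS")
    return outstanding_ios / iops * 1000.0


if __name__ == "__main__":
    ILLUSTRATIVE_IOPS = 320.0  # assumption: a value within the quoted 309-333 range
    # 35 and 10 are the old and new zfs_vdev_max_pending values from the message.
    for depth in (1, 2, 4, 10, 35):
        r = mean_response_ms(depth, ILLUSTRATIVE_IOPS)
        print(f"outstanding I/Os {depth:2d}: ~{r:6.1f} ms mean response time")
```

If throughput really is flat, every extra outstanding I/O in this model only adds waiting time, which is the argument for a small queue depth on HDDs.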
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Richard Elling [mailto:richard.ell...@gmail.com]
> Sent: Saturday, June 18, 2011 7:47 PM
>
> Actually, all of the data I've gathered recently shows that the number of
> IOPS does not significantly increase for HDDs running random workloads.
> However the response time does :-(

Could you clarify what you mean by that? I was planning, in the near future, to go run iozone on some system with, and without the disk cache enabled according to format -e. If my hypothesis is right, it shouldn't significantly affect the IOPS, which seems to be corroborated by your message.

I was also planning to perform sequential throughput testing on two disks simultaneously, with and without the disk cache enabled. If one disk is actually able to hog the bus in an idle state, it should mean the total combined throughput with cache disabled would be equal to a single disk. (Which I highly doubt.)

> However the response time does [increase] :-(

This comment seems to indicate that the drive queues up a whole bunch of requests, and since the queue is large, each individual response time has become large. It's not that physical actual performance has degraded with the cache enabled, it's that the queue has become long. For async writes, you don't really care how long the queue is, but if you have a mixture of async writes and occasional sync writes... Then the queue gets long, and when you sync, the sync operation will take a long time to complete. You might actually benefit by disabling the disk cache.

Richard, have I gotten the gist of what you're saying?

Incidentally, I have done extensive testing of enabling/disabling the HBA writeback cache. I found that as long as you have a dedicated log device for sync writes, your performance is significantly better by disabling the HBA writeback. Something on order of 15% better.
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote:
> Actually, all of the data I've gathered recently shows that the number of
> IOPS does not significantly increase for HDDs running random workloads.
> However the response time does :-( My data is leading me to want to restrict
> the queue depth to 1 or 2 for HDDs.

Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS by reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this.

This is something I played with ~30 years ago, when the OS disk driver was responsible for queuing and reordering disk transfers to reduce total seek time, and disk controllers were dumb. There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. The best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives potentially the worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk.

Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time.

-- Andrew Gabriel
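The two insertion policies described above can be compared with a toy model. This is a minimal sketch under stated assumptions, not any real driver's algorithm: `SweepScheduler` and `services_until_far_request` are hypothetical names, the head only sweeps in one direction (ascending track numbers, then restarts), and the workload is the worst case from the message, where a new request always arrives for the track just in front of the head.

```python
import bisect
from typing import List, Optional


class SweepScheduler:
    """Toy one-directional elevator: serve tracks in ascending order, then restart."""

    def __init__(self, freeze_current_sweep: bool) -> None:
        self.freeze = freeze_current_sweep
        self.head = 0
        self.current: List[int] = []    # sorted tracks left in the sweep in progress
        self.following: List[int] = []  # sorted tracks deferred to the next sweep

    def submit(self, track: int) -> None:
        # Greedy policy: slot the request into the sweep ahead if the head
        # has not passed it yet. Frozen policy: never touch the current sweep.
        if not self.freeze and track >= self.head:
            bisect.insort(self.current, track)
        else:
            bisect.insort(self.following, track)

    def service(self) -> Optional[int]:
        if not self.current:  # sweep finished, start the deferred one
            self.current, self.following = self.following, []
            self.head = 0
        if not self.current:
            return None
        self.head = self.current.pop(0)
        return self.head


def services_until_far_request(freeze: bool, far_track: int = 1000) -> int:
    """Count transfers done before the far-end request is finally served."""
    sched = SweepScheduler(freeze)
    sched.submit(far_track)
    sched.submit(1)
    served = 0
    while True:
        track = sched.service()
        served += 1
        if track == far_track:
            return served
        sched.submit(track + 1)  # a new request lands just in front of the head


if __name__ == "__main__":
    print("insert into current sweep:", services_until_far_request(False))
    print("never insert into current sweep:", services_until_far_request(True))
```

In this model the greedy policy walks the whole distance to the far track before serving it, while the frozen policy bounds the wait by the length of one planned sweep, at the cost of extra seeking.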
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 16, 2011, at 8:05 PM, Daniel Carosone wrote:
> On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
>>> From: Daniel Carosone [mailto:d...@geek.com.au]
>>> Sent: Thursday, June 16, 2011 10:27 PM
>>>
>>> Is it still the case, as it once was, that allocating anything other
>>> than whole disks as vdevs forces NCQ / write cache off on the drive
>>> (either or both, forget which, guess write cache)?
>>
>> I will only say, that regardless of whether or not that is or ever was true,
>> I believe it's entirely irrelevant. Because your system performs read and
>> write caching and buffering in ram, the tiny little ram on the disk can't
>> possibly contribute anything.
>
> I disagree. It can vastly help improve the IOPS of the disk and keep
> the channel open for more transactions while one is in progress.
> Otherwise, the channel is idle, blocked on command completion, while
> the heads seek.

Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs.

SSDs are another story, they scale much better in the response time and IOPS vs queue depth analysis.

Has anyone else studied this?
 -- richard
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 17, 2011, at 7:06 AM, Edward Ned Harvey wrote:
> I will only say, that regardless of whether or not that is or ever was true,
> I believe it's entirely irrelevant. Because your system performs read and
> write caching and buffering in ram, the tiny little ram on the disk can't
> possibly contribute anything.

You would be surprised.

The on-disk buffer is there so data is ready when the hard drive head lands; without it the drive's average rotational latency will trend higher due to missed landings because the data wasn't in the buffer at the right time.

The read buffer is to allow the disk to continuously read sectors whether the system bus is ready to transfer or not. Without it, sequential reads wouldn't last long enough to reach max throughput before they would have to pause because of bus contention and then suffer a rotational latency hit, which would kill read performance.

Try disabling the on-board write or read cache and see how your sequential IO performs and you'll see just how valuable those puny caches are.

-Ross
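To put rough numbers on the cost of a missed landing: one full rotation takes 60000 / RPM milliseconds, and the average rotational latency is half of that. The sketch below is a deliberately simplified model, not a measurement. The function names are hypothetical, 7200 RPM and 120 MB/s are illustrative assumptions, and `throughput_with_misses` assumes each track-sized transfer takes one rotation and every miss costs exactly one extra rotation.

```python
def rotation_ms(rpm: float) -> float:
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def average_rotational_latency_ms(rpm: float) -> float:
    """On average the target sector is half a rotation away."""
    return rotation_ms(rpm) / 2.0


def throughput_with_misses(media_rate_mb_s: float, miss_fraction: float) -> float:
    """Simplified model: each miss adds one whole rotation to a one-rotation transfer."""
    if not 0.0 <= miss_fraction <= 1.0:
        raise ValueError("miss_fraction must be between 0 and 1")
    return media_rate_mb_s / (1.0 + miss_fraction)


if __name__ == "__main__":
    rpm = 7200.0  # assumption: a common desktop/nearline spindle speed
    print(f"one rotation: {rotation_ms(rpm):.2f} ms")
    print(f"average rotational latency: {average_rotational_latency_ms(rpm):.2f} ms")
    for miss in (0.0, 0.25, 1.0):
        rate = throughput_with_misses(120.0, miss)
        print(f"miss fraction {miss:.2f}: ~{rate:.0f} MB/s sequential")
```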
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
2011-06-17 15:06, Edward Ned Harvey wrote:
> When it comes to reads: The OS does readahead more intelligently than the
> disk could ever hope. Hardware readahead is useless.

Here's another (lame?) question to the experts, partly as a followup to my last post about large arrays and essentially a shared bus to be freed ASAP: can the OS request a disk readahead (send a small command and release the bus) and then later poll the disk('s cache) for the readahead results? That is, it would not "hold the line" between sending a request and receiving the result.

Alternatively, does it work in a packetized protocol (and in effect requests and responses do not "hold the line", but the controller must keep state - are these command queues?), and so the ability to transfer packets faster and free the shared ether between disks, backplanes and controllers, is critical per se?

Thanks,
//Jim

The more I know, the more I know how little I know ;)
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
2011-06-17 15:41, Edward Ned Harvey wrote:
>> From: Daniel Carosone [mailto:d...@geek.com.au]
>> Sent: Thursday, June 16, 2011 11:05 PM
>>
>> the [sata] channel is idle, blocked on command completion, while
>> the heads seek.
>
> I'm interested in proving this point. Because I believe it's false.
>
> Just hand waving for the moment ... Presenting the alternative viewpoint
> that I think is correct...

I'm also interested to hear from the in-the-trenches specialists and architects on this point. However, the way it was explained to me a while ago, disk caches and higher interface speeds really matter in large arrays, where you have one (okay, 8) links from your controller to a backplane with dozens of disks, and the faster any of these disks completes its bursty operation, the less latency is induced on the array as a whole. So even if the spinning drive cannot sustain 6Gbps, its 64MB of cache can readily spit out (or read in) its bit of data, free the bus, and let the many other drives spit out theirs.

I am not sure if this is relevant to, say, a motherboard controller where one chip processes 6-8 disks, but maybe there's something to it too...

//Jim
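The shared-link argument above comes down to how long one drive occupies the link for a given burst. A minimal sketch, assuming a 6Gbps link with 8b/10b encoding (about 600 MB/s of payload), a purely illustrative 120 MB/s sustained media rate, and a hypothetical helper `link_hold_ms`; protocol overhead and arbitration are ignored:

```python
def link_hold_ms(burst_bytes: int, rate_mb_s: float) -> float:
    """Milliseconds the link is occupied moving burst_bytes at rate_mb_s (MB = 10**6 bytes)."""
    if rate_mb_s <= 0:
        raise ValueError("rate must be positive")
    return burst_bytes / (rate_mb_s * 1_000_000.0) * 1000.0


if __name__ == "__main__":
    LINK_MB_S = 600.0   # 6 Gbit/s line rate with 8b/10b encoding
    MEDIA_MB_S = 120.0  # assumption: illustrative sustained platter rate
    burst = 1_000_000   # a 1 MB burst
    print(f"from drive cache at link speed: {link_hold_ms(burst, LINK_MB_S):.2f} ms")
    print(f"paced by the platter:           {link_hold_ms(burst, MEDIA_MB_S):.2f} ms")
```

Under these assumptions a burst served from the drive's cache releases the link several times sooner than one paced by the media, which is the effect described for backplanes with many drives per link.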
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Daniel Carosone [mailto:d...@geek.com.au]
> Sent: Thursday, June 16, 2011 11:05 PM
>
> the [sata] channel is idle, blocked on command completion, while
> the heads seek.

I'm interested in proving this point. Because I believe it's false.

Just hand waving for the moment ... Presenting the alternative viewpoint that I think is correct...

All drives, regardless of whether or not their disk cache or buffer is enabled, support PIO and DMA. This means no matter the state of the cache or buffer, the bus will deliver information to/from the memory of the disk as fast as possible, and the disk will optimize the visible workload to the best of its ability, and the disk will report back an interrupt when each operation is completed out-of-order.

The difference between enabling or disabling the disk write buffer is: If the write buffer is disabled... It still gets used temporarily ... but the disk doesn't interrupt "completed" until the buffer is flushed to platter. If the disk write buffer is enabled, the disk will immediately report "completed" as soon as it receives the data, before flushing to platter... And if your application happens to have issued the write in "sync" mode (or the fsync() command), your OS will additionally issue the hardware sync command, and your application will block until the hardware sync has completed.

It would be stupid for a disk to hog the bus in an idle state.
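The sync versus async distinction above can be observed from an application. The following is a minimal sketch with a hypothetical helper `timed_write`, using Python's standard `os.fsync()`. Whether the flush actually reaches the platter depends on the OS, the file system, and whether the drive honours cache-flush commands, so the timings are only indicative.

```python
import os
import tempfile
import time


def timed_write(path: str, payload: bytes, sync: bool) -> float:
    """Write payload to path; if sync is set, block until fsync() returns.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()                 # push Python's userspace buffer to the OS
        if sync:
            os.fsync(f.fileno())  # ask the OS to flush the file to stable storage
    return time.perf_counter() - start


if __name__ == "__main__":
    data = os.urandom(1 << 20)  # 1 MiB of test data
    with tempfile.TemporaryDirectory() as tmp:
        t_async = timed_write(os.path.join(tmp, "async.bin"), data, sync=False)
        t_sync = timed_write(os.path.join(tmp, "sync.bin"), data, sync=True)
    print(f"async write returned after {t_async * 1000:.2f} ms")
    print(f"sync write returned after  {t_sync * 1000:.2f} ms")
```

The async call normally returns as soon as the data sits in OS buffers; the sync call is the one whose latency depends on how quickly the device reports its cache flushed.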
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Daniel Carosone [mailto:d...@geek.com.au]
> Sent: Thursday, June 16, 2011 10:27 PM
>
> Is it still the case, as it once was, that allocating anything other
> than whole disks as vdevs forces NCQ / write cache off on the drive
> (either or both, forget which, guess write cache)?

I will only say, that regardless of whether or not that is or ever was true, I believe it's entirely irrelevant. Because your system performs read and write caching and buffering in ram, the tiny little ram on the disk can't possibly contribute anything.

When it comes to reads: The OS does readahead more intelligently than the disk could ever hope. Hardware readahead is useless.

When it comes to writes: Categorize as either async or sync.

When it comes to async writes: The OS will buffer and optimize, and the applications have long since marched onward before the disk even sees the data. It's irrelevant how much time has elapsed before the disk finally commits to platter.

When it comes to sync writes: The write will not be completed, and the application will block, until all the buffers have been flushed. Both ram and disk buffer. So neither the ram nor disk buffer is able to help you.

It's like selling usb fobs labeled USB2 or USB3. If you look up or measure the actual performance of any one of these devices, they can't come anywhere near the bus speed... In fact, I recently paid $45 for a USB3 16G fob, which is finally able to achieve 380 Mbit. Oh, thank goodness I'm no longer constrained by that slow 480 Mbit bus... ;-) Even so, my new fob is painfully slow compared to a normal cheap-o usb2 hard disk. They just put these labels on there because it's a marketing requirement. Something that formerly mattered one day, but people still use as a purchasing decider.
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On 06/16/11 20:26, Daniel Carosone wrote:
> On Thu, Jun 16, 2011 at 09:15:44PM -0400, Edward Ned Harvey wrote:
>> My personal preference, assuming 4 disks, since the OS is mostly reads and
>> only a little bit of writes, is to create a 4-way mirrored 100G partition
>> for the OS, and the remaining 900G of each disk (or whatever) becomes either
>> a stripe of mirrors or raidz, as appropriate in your case, for the
>> storagepool.
>
> Is it still the case, as it once was, that allocating anything other
> than whole disks as vdevs forces NCQ / write cache off on the drive
> (either or both, forget which, guess write cache)?

It was once the case that using a slice as a vdev forced the write cache off, but I just tried it and found it wasn't disabled - at least with the current source. In fact it looks like we no longer change the setting. You may want to experiment yourself on your ZFS version (see below for how to check).

> If so, can this be forced back on somehow to regain performance when
> known to be safe?

Yes: "format -e" -> select disk -> "cache" -> "write" -> "display"/"enable"/"disable"

> I think the original assumption was that zfs-in-a-partition likely
> implied the disk was shared with ufs, rather than another async-safe
> pool.

- Correct.

Neil.
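The menu path above can be scripted instead of typed. The sketch below only builds the pieces for such a wrapper and runs nothing. Everything in it should be checked against format(1M) and the interactive menus on your release before use: `build_format_invocation` is a hypothetical helper, the menu entries are copied verbatim from the message (they may be abbreviations of the real entry names), `-e` and `-f` are the options mentioned in this thread, `-d` for selecting the disk is an assumption, and the disk name and file path are placeholders.

```python
from typing import List, Tuple

VALID_ACTIONS = ("display", "enable", "disable")


def build_format_invocation(disk: str, action: str,
                            command_file: str) -> Tuple[List[str], str]:
    """Return (argv, command-file text) for a non-interactive format run.

    Nothing is executed; the caller writes the text to command_file and
    runs argv only after verifying both on the target system.
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {VALID_ACTIONS}")
    # Menu path as given in the message: "cache" -> "write" -> action.
    script = "\n".join(["cache", "write", action]) + "\n"
    # Assumption: -d selects the disk and -f names the command file.
    argv = ["format", "-e", "-d", disk, "-f", command_file]
    return argv, script


if __name__ == "__main__":
    # Placeholder disk name and path, for illustration only.
    argv, script = build_format_invocation("c0t0d0", "display", "/tmp/wcache.cmd")
    print(" ".join(argv))
    print(script, end="")
```

Starting with "display" on a single disk is a cheap way to confirm the menu path works before looping over devices with "enable".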
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:d...@geek.com.au]
> > Sent: Thursday, June 16, 2011 10:27 PM
> >
> > Is it still the case, as it once was, that allocating anything other
> > than whole disks as vdevs forces NCQ / write cache off on the drive
> > (either or both, forget which, guess write cache)?
>
> I will only say, that regardless of whether or not that is or ever was true,
> I believe it's entirely irrelevant. Because your system performs read and
> write caching and buffering in ram, the tiny little ram on the disk can't
> possibly contribute anything.

I disagree. It can vastly help improve the IOPS of the disk and keep the channel open for more transactions while one is in progress. Otherwise, the channel is idle, blocked on command completion, while the heads seek.

> When it comes to reads: The OS does readahead more intelligently than the
> disk could ever hope. Hardware readahead is useless.

Little argument here, although the disk is aware of physical geometry and may well read an entire track.

> When it comes to writes: Categorize as either async or sync.
>
> When it comes to async writes: The OS will buffer and optimize, and the
> applications have long since marched onward before the disk even sees the
> data. It's irrelevant how much time has elapsed before the disk finally
> commits to platter.

To the application in the short term, but not to the system. TXG closes have to wait for that, and applications have to wait for those to close so the next can open and accept new writes.

> When it comes to sync writes: The write will not be completed, and the
> application will block, until all the buffers have been flushed. Both ram
> and disk buffer. So neither the ram nor disk buffer is able to help you.

Yes. With write cache on in the drive, and especially with multiple outstanding commands, the async writes can all be streamed quickly to the disk. Then a cache sync can be issued, before the sync/FUA writes to close the txg are done.

Without write cache, each async write (though deferred and perhaps coalesced) is synchronous to platters. This adds latency and decreases IOPS, impacting other operations (reads) as well. Please measure it, you will find this impact significant and even perhaps drastic for some quite realistic workloads.

All this before the disk write cache has any chance to provide additional benefit by seek optimisations - ie, regardless of whether it is successful or not in doing so.

-- Dan.