Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Nathan Kroenert
>
> Bottom line is that 75 IOPS per spindle won't impress many people, and
> that's the sort of rate you get when you disable the disk cache.

It's the same rate that you get with the disk cache enabled. Enabling the cache doesn't magically decrease the access time of a drive.
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
>
> The disk write cache helps with the step where data is sent to the disks
> since it is much faster to write into the disk write cache than to write
> to the media. Besides helping with unburdening the I/O channel,

Having the disk cache disabled doesn't mean the cache isn't used. It only means the disk doesn't acknowledge the write until the data has been flushed to the media.

If you have a bunch of disks whose platters can sustain 500 Mbit/s, all connected to a 6 Gbit/s bus, the whole bus doesn't slow down to 500 Mbit/s while a single disk is writing. No matter what happens, the controller is going to send a chunk of data into the disk at 6 Gbit/s and wait for the disk to acknowledge it. Meanwhile, the controller is free to send data to other disks too.
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
> From: Jim Dunham [mailto:james.dun...@oracle.com]
>
> ZFS only uses system RAM for read caching,

If your email address didn't say oracle, I'd simply come out and say you're crazy, but I'm trying to keep an open mind here... Correct me where the following statement is wrong:

ZFS uses system RAM to buffer async writes. Sync writes must hit the ZIL first, and then the sync writes are put into the write buffer along with all the async writes, to be written to the main pool storage. So after sync writes hit the ZIL and the device write cache is flushed, they too are buffered in system RAM.

> as all writes must be written to some form of stable storage before
> acknowledged. If a vdev represents a whole disk, ZFS will attempt to
> enable write caching. If a device does not support write caching, the
> attempt to set wce fails silently.

Here is an easy analogy to remember basically what you said: format -e can control the cache settings for c0t0d0, but cannot control the cache settings for c0t0d0s0, because s0 is not actually a device.

I contend: Suppose you have a disk with the on-disk write cache enabled. Suppose a sync write comes along, so ZFS first performs a sync write to some ZIL sectors. Then ZFS will issue the cache flush command and wait for it to complete before acknowledging the sync write; hence the disk write cache does not benefit sync writes.

So then we start thinking about async writes, and conclude: the async writes were acknowledged long ago, when they were buffered in system RAM, so there is, once again, no benefit from the disk write cache in either situation.

That's my argument, unless somebody can tell me where my logic is wrong. Disk write cache offers zero benefit. And disk read cache only offers benefit in unusual cases that I would call esoteric.
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
On Tue, 8 Mar 2011, Edward Ned Harvey wrote:
> That's my argument, unless somebody can tell me where my logic is wrong.
> Disk write cache offers zero benefit. And disk read cache only offers
> benefit in unusual cases that I would call esoteric.

I was agreeing with your email until it came to this conclusion.

When ZFS writes a transaction group, it sends data to the disks, and then tells all the disks involved in the transaction to flush their write caches. The disk write cache helps with the step where data is sent to the disks, since it is much faster to write into the disk write cache than to write to the media. Besides unburdening the I/O channel, disks are already committing data from their write cache prior to receiving the cache-sync request. Having more disks actively writing to disk at once results in better parallel behavior.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
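If you want to watch that parallelism during a transaction group commit, a minimal sketch (the pool name "tank" is just an example, not from the thread):

  # Per-vdev bandwidth/IOPS, sampled once per second, while a big write is
  # in flight; every disk in the txg should show activity at the same time.
  zpool iostat -v tank 1

  # Per-device service times and queue depths over the same interval.
  iostat -xn 1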
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
Ed -

Simple test. Get onto a system where you *can* disable the disk cache, disable it, and watch the carnage. Until you do that, you can pose as many interesting theories as you like.

Bottom line is that 75 IOPS per spindle won't impress many people, and that's the sort of rate you get when you disable the disk cache.

Nathan.

On 8/03/2011 11:53 PM, Edward Ned Harvey wrote:
>> From: Jim Dunham [mailto:james.dun...@oracle.com]
>>
>> ZFS only uses system RAM for read caching,
>
> If your email address didn't say oracle, I'd simply come out and say you're
> crazy, but I'm trying to keep an open mind here... Correct me where the
> following statement is wrong:
>
> ZFS uses system RAM to buffer async writes. Sync writes must hit the ZIL
> first, and then the sync writes are put into the write buffer along with
> all the async writes, to be written to the main pool storage. So after
> sync writes hit the ZIL and the device write cache is flushed, they too
> are buffered in system RAM.
>
>> as all writes must be written to some form of stable storage before
>> acknowledged. If a vdev represents a whole disk, ZFS will attempt to
>> enable write caching. If a device does not support write caching, the
>> attempt to set wce fails silently.
>
> Here is an easy analogy to remember basically what you said: format -e can
> control the cache settings for c0t0d0, but cannot control the cache
> settings for c0t0d0s0, because s0 is not actually a device.
>
> I contend: Suppose you have a disk with the on-disk write cache enabled.
> Suppose a sync write comes along, so ZFS first performs a sync write to
> some ZIL sectors. Then ZFS will issue the cache flush command and wait for
> it to complete before acknowledging the sync write; hence the disk write
> cache does not benefit sync writes.
>
> So then we start thinking about async writes, and conclude: the async
> writes were acknowledged long ago, when they were buffered in system RAM,
> so there is, once again, no benefit from the disk write cache in either
> situation.
>
> That's my argument, unless somebody can tell me where my logic is wrong.
> Disk write cache offers zero benefit. And disk read cache only offers
> benefit in unusual cases that I would call esoteric.
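For anyone who wants to run Nathan's test, a rough sketch (hedged: the format -e menu entries are from memory, the pool and device names are examples, and the workload is whatever you care about):

  # 1. Toggle the on-disk write cache (expert mode, interactive):
  #      format -e -> select disk -> cache -> write_cache -> display / disable
  format -e

  # 2. Run a write-heavy workload against the pool and watch per-disk IOPS,
  #    then repeat with the cache re-enabled and compare.
  iostat -xn 1
  zpool iostat tank 1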
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
On Mon, Mar 7, 2011 at 1:50 PM, Yaverot <yave...@computermail.net> wrote:
> 1. While performance isn't my top priority, doesn't using slices make a
> significant difference?

Write caching will be disabled on devices that use slices. It can be turned back on by using format -e.

> 2. Doesn't snv_134 that I'm running already account for variances in these
> nominally-same disks?

It will allow some small differences. I'm not sure what the limit on the difference in size is.

> 3. The market refuses to sell disks under $50, therefore I won't be able to
> buy drives of 'matching' capacity anyway.

You can always use a larger drive. If you think you may want to go back to smaller drives, make sure that the autoexpand zpool property is disabled, though.

> 3. Assuming I want to do such an allocation, is this done with quota
> reservation? Or is it snapshots as you suggest?

I think Edward misspoke when he said to use snapshots, and probably meant reservation. I've taken to creating a dataset called "reserved" and giving it a 10G reservation. (10G isn't a special value; feel free to use 5% of your pool size or whatever else you're comfortable with.) It's unmounted and doesn't contain anything, but it ensures that there is a chunk of space I can make available if needed. Because it doesn't contain anything, there shouldn't be any concern about de-allocation of blocks when it's destroyed. Alternately, the reservation can be reduced to make space available.

> Would it make more sense to make another filesystem in the pool, fill it
> enough and keep it handy to delete? Or is there some advantage to zfs
> destroy (snapshot) over zfs destroy (filesystem)? While I am thinking about
> the system and have extra drives, like now, is the time to make plans for
> the next "system is full" event.

If a dataset contains data, the blocks will have to be freed when it's destroyed. If it's an empty dataset with a reservation, the only change is to fiddle some accounting bits.

I seem to remember seeing a fix for 100% full pools a while ago, so this may not be as critical as it used to be, but it's a nice safety net to have.

-B

--
Brandon High : bh...@freaks.com
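A minimal sketch of that "reserved" dataset (the pool and dataset names are examples, not from the thread):

  # An empty, unmounted dataset whose only job is to hold back some space.
  zfs create -o mountpoint=none tank/reserved
  zfs set reservation=10G tank/reserved

  # If the pool ever gets dangerously full, hand the space back:
  zfs set reservation=none tank/reserved
  # ...free whatever needs freeing, then restore the safety net:
  zfs set reservation=10G tank/reserved

And if you might move back to smaller drives later, "zpool set autoexpand=off tank" keeps the pool from growing into the bigger disks in the meantime.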
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Yaverot
>
>> I recommend: When creating your new pool, use slices of the new disks,
>> which are 99% of the size of the new disks, instead of using the whole
>> new disks. Because this is a more reliable way of avoiding the problem
>> "my new replacement disk for the failed disk is slightly smaller than
>> the failed disk and therefore I can't replace it."
>
> 1. While performance isn't my top priority, doesn't using slices make a
> significant difference?

Somewhere in some guide, it says so. But the answer is no. If you look more closely at that guide (what is it, the best practices guide? or something else?) you'll see what it *really* says is "we don't recommend using slices, because sharing the hardware cache across multiple pools hurts performance," or "sharing cache across zfs and ufs hurts performance," or something like that. But if you're only using one big slice for 99% of the whole disk and not using any other slice, then that whole argument is irrelevant.

Also, thanks to the system RAM cache, I contend that disk-based hardware cache is totally useless anyway. The disk hardware cache will never have a hit except in truly weird, esoteric cases. In normal cases, all the disk sectors that were read recently enough to be in the disk's hardware cache will also be in the system RAM cache, and therefore the system will not request those sectors from the disk again.

I know I did some benchmarking, with and without slices, and found no difference. I'd be interested if anyone in the world has a counterexample. I know how to generate such a scenario, but like I said, it's an esoteric corner case that's not important in reality.

> 2. Doesn't snv_134 that I'm running already account for variances in these
> nominally-same disks?

Yes. (I don't know which build introduced it, so I'm not confirming b134 specifically, but it's in some build and higher.) But as evidenced by a recent thread from Robert Hartzell, "cannot replace c10t0d0 with c10t0d0: device is too small," it doesn't always work.

> 3. The market refuses to sell disks under $50, therefore I won't be able to
> buy drives of 'matching' capacity anyway.

Normally, when replacing matching-capacity drives, it's either something you bought in advance (like a hotspare or coldspare), or received via warranty. Maybe it doesn't matter for you, but it matters for some people.

>> I also recommend: In every pool, create some space reservation. So when
>> and if you ever hit 100% usage again and start to hit the system crash
>> scenario, you can do a zfs destroy (snapshot) and delete the space
>> reservation, in order to avoid the system crash scenario you just
>> witnessed. Hopefully.
>
> 1. Why would tank being practically full affect management of other pools
> and start the crash scenario I encountered? rpool and rpool/swap remained
> at 1% use; the apparent trigger was doing a "zpool destroy others", which
> is neither the rpool the system runs out of, nor tank.

That wasn't the trigger - that was just the first symptom that you noticed. The actual trigger happened earlier, while tank was 100% full and some operations were still in progress. The precise trigger is difficult to identify, because it only sends the system into a long, slow, downward spiral. It doesn't cause immediate system failure. Generally, by the time you notice any symptoms, it's already been spiraling downward for some time, so even if you know the right buttons to pull it out of the spiral, you won't know that you know the right buttons, because after you press them, you still have to wait for some time for it to recover. I had the Sun support rep tell me someone else had the same problem, and they waited a week and eventually it recovered. I wasn't able to wait that long. I power cycled and fixed it instantly.

I don't know the answer to your question, why it would behave that way. And it doesn't always happen. But I've certainly seen it a few times before. Notice how precisely I told you exactly what you should expect to happen next. It's a clear pattern, but not clear enough or common enough to get enough attention to get fixed, apparently. Long ago I opened bug reports with Oracle support, but nobody seems to be doing anything about it.

> 2. How can a zfs destroy ($snapshot) complete when both zpool destroy and
> zfs list fail to complete?

Precisely the problem. The zfs destroy of a snapshot also hangs. You're hosed until you reboot. But destroying a snapshot isn't the only way in the world to free up some space. You can also do:

  zfs set reservation=5G tank    (in advance, as a safety net)
  zfs set reservation=none tank  (later, to free up space)

When you're in the failure mode that you experienced, nobody has yet confirmed the ability or inability to set the reservation to none. IF IT WORKS, then you could immediately afterward do a zfs destroy of a snapshot. But most likely the reservation won't do any good anyway. Still, it doesn't hurt anything, and it's worth trying.
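For the slice recommendation above, a rough sketch of what that looks like in practice (hedged: the disk names are examples, and the slice sizing is an interactive step in format, so the exact prompts may differ):

  # 1. Label each new disk with a single slice (s0) covering roughly 99% of
  #    its capacity. This part is interactive:
  #      format -> select disk -> partition -> 0 -> enter size -> label
  format

  # 2. Build the pool on the slices instead of the whole disks.
  zpool create tank mirror c0t1d0s0 c0t2d0s0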
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brandon High
>
> Write caching will be disabled on devices that use slices. It can be
> turned back on by using format -e.

My experience has been that, despite what the BPG (or whatever) says, this is not true. When I create pools using slices or not using slices, it doesn't seem to make any difference to the cache status of the drives. Also, when I go into format -e and attempt to toggle the cache status, toggling fails too.

Which brings me back to my former argument: who cares about the drive cache anyway? The system RAM makes it irrelevant.

>> 3. Assuming I want to do such an allocation, is this done with quota
>> reservation? Or is it snapshots as you suggest?
>
> I think Edward misspoke when he said to use snapshots, and probably meant
> reservation.

I meant that if you are able to reduce or eliminate your reservation, that should free up enough space to enable you to destroy a snapshot, and then re-enable your reservation.
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brandon High
>>
>> Write caching will be disabled on devices that use slices. It can be
>> turned back on by using format -e.
>
> My experience has been that, despite what the BPG (or whatever) says, this
> is not true. When I create pools using slices or not using slices, it
> doesn't seem to make any difference to the cache status of the drives.
> Also, when I go into format -e and attempt to toggle the cache status,
> toggling fails too.
>
> Which brings me back to my former argument: who cares about the drive
> cache anyway? The system RAM makes it irrelevant.

ZFS only uses system RAM for read caching, as all writes must be written to some form of stable storage before being acknowledged.

If a vdev represents a whole disk, ZFS will attempt to enable write caching. If a device does not support write caching, the attempt to set wce fails silently. As you made reference to above, one would need to use 'format -e' to manually inquire about this capability on a per-disk-type basis.

If a disk does support write caching, and ZFS enables it, there should be some measurable write I/O performance benefit, although how much is unclear. For those interested, one can trace back the ZFS code starting here:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#276

Jim

>>> 3. Assuming I want to do such an allocation, is this done with quota
>>> reservation? Or is it snapshots as you suggest?
>>
>> I think Edward misspoke when he said to use snapshots, and probably meant
>> reservation.
>
> I meant that if you are able to reduce or eliminate your reservation, that
> should free up enough space to enable you to destroy a snapshot, and then
> re-enable your reservation.
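For anyone who wants to inquire about (or toggle) a drive's write cache the way Jim describes, a sketch of the format -e session (hedged: the menu entries are from memory and vary by drive type; the disk name is an example):

  # The "cache" menu only appears in expert mode, and only for the whole
  # disk, not for a slice.
  format -e
  #   -> select c0t0d0
  #   -> cache
  #   -> write_cache
  #   -> display        (or enable / disable, if the drive allows it)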