Re: [zfs-discuss] ZFS Restripe
On Tue, 3 Aug 2010, Eduardo Bragatto wrote:

> You're a funny guy. :) Let me re-phrase it: I'm sure I'm getting
> degradation in performance, as my applications are waiting more on I/O
> now than they used to (based on CPU utilization graphs I have). The
> impression part is that the reason is the limited space in those two
> volumes -- as I said, I have already experienced bad performance on
> ZFS systems running nearly out of space.

Assuming that your impressions are correct, are you sure that your new
disk drives are similar to the older ones? Are they an identical model?
Design trade-offs now often result in larger-capacity drives with
reduced performance.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Restripe
On Aug 4, 2010, at 12:26 AM, Richard Elling wrote:

> The tipping point for the change in the first fit/best fit allocation
> algorithm is now 96%. Previously, it was 70%. Since you don't specify
> which OS, build, or zpool version, I'll assume you are on something
> modern.

I'm running Solaris 10 10/09 (s10x_u8wos_08a), zpool version 15.

> NB, "zdb -m" will show the pool's metaslab allocations. If there are
> no 100% free metaslabs, then it is a clue that the allocator might be
> working extra hard.

On the first two VDEVs there are no metaslabs 100% free (most are
nearly full). The two newer ones, however, do have several 128GB
metaslabs that are 100% free. If I understand correctly, in that
scenario the allocator will work extra hard -- is that correct?

> OK, so how long are they waiting? Try "iostat -zxCn" and look at the
> asvc_t column. This will show how the disk is performing, though it
> won't show the performance delivered by the file system to the
> application. To measure the latter, try "fsstat zfs" (assuming you are
> on a Solaris distro).

Checking with iostat, I noticed the average wait time to be between
40ms and 50ms for all disks, which doesn't seem too bad. And this is
the output of fsstat:

# fsstat zfs
  new  name   name  attr   attr lookup rddir  read  read write write
 file remov   chng   get    set    ops   ops   ops bytes   ops bytes
3.26M 1.34M  3.22M  161M  13.4M  1.36G  9.6M 10.5M  899G 22.0M  625G zfs

However, I did have CPU spikes at 100% where the kernel was taking all
CPU time. I have reduced my zfs_arc_max parameter, as it seemed the
applications were struggling for RAM, and things are looking better now.

Thanks for your time,
Eduardo Bragatto.
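[Editor's note: the thread never shows how the ARC was capped. On
Solaris 10 this is typically done with a tunable in /etc/system; the
4 GiB value below is illustrative only, not taken from the thread, and
a reboot is required for it to take effect.]

```
* /etc/system fragment -- cap the ZFS ARC so the rsync processes keep
* some RAM for themselves. 0x100000000 = 4 GiB (illustrative value).
set zfs:zfs_arc_max = 0x100000000
```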
Re: [zfs-discuss] ZFS Restripe
On Aug 4, 2010, at 12:20 AM, Khyron wrote:

> I notice you use the word "volume", which really isn't accurate or
> appropriate here.

Yeah, it didn't seem right to me, but I wasn't sure about the
nomenclature. Thanks for clarifying.

> You may want to get a bit more specific and choose from the oldest
> datasets, THEN find the smallest of those oldest datasets and
> send/receive it first. That way, the send/receive completes in less
> time, and when you delete the source dataset, you've created more
> free space on the entire pool without the risk of a single dataset
> exceeding your 10 TiB of workspace.

That makes sense. I'll try send/receiving a few of those datasets and
see how it goes. I believe I can find the ones that were created before
the two new VDEVs were added by comparing the creation times from
"zfs get creation".

> ZFS' copy-on-write nature really wants no less than 20% free, because
> you never update data in place; a new copy is always written to disk.

Right, and my problem is that I have two VDEVs with less than 10% free
at this point -- although the other two have around 50% free each.

> You might want to consider turning on compression on your new
> datasets too, especially if you have free CPU cycles to spare. I
> don't know how compressible your data is, but if it's fairly
> compressible, say lots of text, then you might get some added benefit
> when you copy the old data into the new datasets.

Unfortunately, the data taking most of the space is already compressed,
so while I would gain some space from the many text files I also have,
those are not the majority of my content, and the effort would probably
not justify the small gain.

Thanks,
Eduardo Bragatto
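[Editor's note: the "find the oldest datasets" step above can be
scripted. "zfs get -Hp creation" prints epoch timestamps in a
tab-separated, script-friendly form; since no pool is available here,
the sketch below pipes in hypothetical sample output (dataset names and
timestamps are invented) so the sort/filter itself is runnable.]

```shell
# Sample of `zfs get -Hp -r creation backup` output
# (fields: name, property, creation time in epoch seconds, source).
# The dataset names and timestamps are hypothetical.
printf '%s\n' \
  'backup/srv03 creation 1270000000 -' \
  'backup/srv01 creation 1230000000 -' \
  'backup/srv02 creation 1250000000 -' |
sort -k3,3n |         # oldest creation time first
awk '{ print $1 }'    # keep just the dataset name
```

Datasets created before the vdev expansion sort to the top, giving the
migration order Khyron suggests.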
Re: [zfs-discuss] ZFS Restripe
On Wed, 4 Aug 2010, Eduardo Bragatto wrote:

> Checking with iostat, I noticed the average wait time to be between
> 40ms and 50ms for all disks, which doesn't seem too bad.

Actually, this is quite high. I would not expect such long wait times
except under extreme load, such as a benchmark. If the wait times are
this long under normal use, then there is something wrong.

> However, I did have CPU spikes at 100% where the kernel was taking
> all CPU time. I have reduced my zfs_arc_max parameter, as it seemed
> the applications were struggling for RAM, and things are looking
> better now.

Odd. What type of applications are you running on this system? Are
applications running on the server competing with client accesses?

Bob
Re: [zfs-discuss] ZFS Restripe
On Aug 4, 2010, at 11:18 AM, Bob Friesenhahn wrote:

> Assuming that your impressions are correct, are you sure that your
> new disk drives are similar to the older ones? Are they an identical
> model? Design trade-offs now often result in larger-capacity drives
> with reduced performance.

Yes, the disks are the same, no problems there.

On Aug 4, 2010, at 2:11 PM, Bob Friesenhahn wrote:

> Actually, this is quite high. I would not expect such long wait times
> except under extreme load, such as a benchmark. If the wait times are
> this long under normal use, then there is something wrong.

That's a backup server. I usually have 10 rsync instances running
simultaneously, so there's a lot of random disk access going on -- I
think that explains the high average wait time. Also, I recently
enabled graphing of the IOPS per disk (reading it via net-snmp), and I
see most disks are operating near their limit -- except for some disks
from the older VDEVs, which is what I'm trying to address here.

> Odd. What type of applications are you running on this system? Are
> applications running on the server competing with client accesses?

I noticed some of those rsync processes were using almost 1GB of RAM
each, and the server has only 8GB. I started seeing the server swapping
a bit during the CPU spikes at 100%, so I figured it would be better to
cap the ARC and leave some room for the rsync processes.

I will also start using rsync v3 to reduce the memory footprint, so I
might be able to give back some RAM to the ARC. I'm also thinking of
going to 16GB RAM, as the pool is quite large and I'm sure more ARC
wouldn't hurt.

Thanks,
Eduardo Bragatto.
Re: [zfs-discuss] ZFS Restripe
On Wed, 4 Aug 2010, Eduardo Bragatto wrote:

> I will also start using rsync v3 to reduce the memory footprint, so I
> might be able to give back some RAM to the ARC, and I'm thinking of
> going to 16GB RAM, as the pool is quite large and I'm sure more ARC
> wouldn't hurt.

It is definitely a wise idea to use rsync v3. Previous versions had to
recurse the whole tree on both sides (storing what was learned in
memory) before doing anything.

Bob
Re: [zfs-discuss] ZFS Restripe
On Aug 4, 2010, at 9:03 AM, Eduardo Bragatto wrote:

>> The tipping point for the change in the first fit/best fit
>> allocation algorithm is now 96%. Previously, it was 70%. Since you
>> don't specify which OS, build, or zpool version, I'll assume you are
>> on something modern.
>
> I'm running Solaris 10 10/09 (s10x_u8wos_08a), zpool version 15.

Then the first fit/best fit threshold is 96%.

>> NB, "zdb -m" will show the pool's metaslab allocations. If there are
>> no 100% free metaslabs, then it is a clue that the allocator might
>> be working extra hard.
>
> On the first two VDEVs there are no metaslabs 100% free (most are
> nearly full). The two newer ones, however, do have several 128GB
> metaslabs that are 100% free. If I understand correctly, in that
> scenario the allocator will work extra hard -- is that correct?

Yes, and this can be measured, but...

>> OK, so how long are they waiting? Try "iostat -zxCn" and look at the
>> asvc_t column.
>
> Checking with iostat, I noticed the average wait time to be between
> 40ms and 50ms for all disks, which doesn't seem too bad.

... actually, that is pretty bad. Look for an average around 10ms and
peaks around 20ms. Solve this problem first -- the system can do a huge
number of allocations, for any algorithm, in 1ms.

> And this is the output of fsstat:
>
> # fsstat zfs
>   new  name   name  attr   attr lookup rddir  read  read write write
>  file remov   chng   get    set    ops   ops   ops bytes   ops bytes
> 3.26M 1.34M  3.22M  161M  13.4M  1.36G  9.6M 10.5M  899G 22.0M  625G zfs

Unfortunately, that first line is useless: it is the summary since
boot. Try adding a sample interval to see how things are moving now.

> However, I did have CPU spikes at 100% where the kernel was taking
> all CPU time.

Again, this can be analyzed using baseline performance analysis
techniques. The prstat command should show how CPU is being used. I'm
not running Solaris 10 10/09, but IIRC it has the ZFS enhancement where
CPU time is attributed to the pool, as seen in prstat.
 -- richard

--
Richard Elling
rich...@nexenta.com  +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
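[Editor's note: the asvc_t check Richard describes is easy to automate.
The iostat output below is hypothetical sample data (column order as in
"iostat -xn": r/s, w/s, kr/s, kw/s, wait, actv, wsvc_t, asvc_t, %w, %b,
device), piped in so the awk filter is runnable without the original
host; it flags any disk whose average service time exceeds the ~20ms
peak Richard suggests as a ceiling.]

```shell
# Flag disks with high average service time (asvc_t is field 8,
# device name is field 11 in `iostat -xn` output). Sample lines
# and device names are hypothetical.
printf '%s\n' \
  '  138.0  49.0 2990.4 2330.1  0.0  2.1   0.0   45.3   0  62 c4t0d0' \
  '  116.2  38.4 2160.7 1410.9  0.0  0.9   0.0    8.7   0  21 c4t1d0' |
awk '$8 > 20 { printf "%s: asvc_t %.1f ms is high\n", $11, $8 }'
```

In live use, replace the printf with "iostat -zxn 10" and watch the
per-interval samples rather than the since-boot summary.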
[zfs-discuss] ZFS Restripe
Hi,

I have a large pool (~50TB total, ~42TB usable), composed of 4 raidz1
volumes (of 7 x 2TB disks each):

# zpool iostat -v | grep -v c4
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup      35.2T  15.3T    602    272  15.3M  11.1M
  raidz1    11.6T  1.06T    138     49  2.99M  2.33M
  raidz1    11.8T   845G    163     54  3.82M  2.57M
  raidz1    6.00T  6.62T    161     84  4.50M  3.16M
  raidz1    5.88T  6.75T    139     83  4.01M  3.09M
----------  -----  -----  -----  -----  -----  -----

Originally there were only the first two raidz1 volumes; the two at the
bottom were added later. You can see that from the amount of used/free
space: the first two volumes have ~11TB used and ~1TB free, while the
other two have around ~6TB used and ~6TB free.

I have hundreds of ZFS filesystems storing backups from several
servers. Each one has about 7 snapshots of older backups.

I have the impression I'm getting degradation in performance due to the
limited space in the first two volumes, especially the second, which
has only 845GB free.

Is there any way to re-stripe the pool, so I can take advantage of all
spindles across the raidz1 volumes? Right now it looks like the newer
volumes are doing the heavy lifting while the other two just hold old
data.

Thanks,
Eduardo Bragatto
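[Editor's note: the imbalance is easier to see as a percentage. The
sketch below feeds the used/avail columns from the zpool iostat output
above (as literal sample text, so it runs anywhere) through awk to
compute how full each raidz1 vdev is; the tb() helper assumes only T
and G suffixes appear, as in this pool.]

```shell
# Percent-full per raidz1 vdev, from the used/avail columns of
# `zpool iostat -v`. Sample values are taken from the thread.
printf '%s\n' \
  'raidz1 11.6T 1.06T' \
  'raidz1 11.8T 845G' \
  'raidz1 6.00T 6.62T' \
  'raidz1 5.88T 6.75T' |
awk 'function tb(s) { n = s + 0; return (s ~ /G$/) ? n / 1024 : n }
     { u = tb($2); a = tb($3)
       printf "%s: %.0f%% full\n", $1, 100 * u / (u + a) }'
```

This prints roughly 92%, 93%, 48%, and 47%, which is exactly the skew
the thread is about: two vdevs past the 90% mark while two sit near
half full.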
Re: [zfs-discuss] ZFS Restripe
Short answer: No.

Long answer: Not without rewriting the previously written data. Data is
being striped over all of the top-level VDEVs, or at least it should
be. But there is no way, at least not built into ZFS, to re-allocate
the storage to perform I/O balancing. You would basically have to do
this manually.

Either way, I'm guessing this isn't the answer you wanted, but hey, you
get what you get.

On Tue, Aug 3, 2010 at 13:52, Eduardo Bragatto edua...@bragatto.com wrote:

> Hi,
>
> I have a large pool (~50TB total, ~42TB usable), composed of 4 raidz1
> volumes (of 7 x 2TB disks each). [zpool iostat output snipped]
>
> Originally there were only the first two raidz1 volumes; the two at
> the bottom were added later. The first two volumes have ~11TB used
> and ~1TB free, while the other two have around ~6TB used and ~6TB
> free.
>
> I have hundreds of ZFS filesystems storing backups from several
> servers. Each one has about 7 snapshots of older backups. I have the
> impression I'm getting degradation in performance due to the limited
> space in the first two volumes, especially the second, which has only
> 845GB free.
>
> Is there any way to re-stripe the pool, so I can take advantage of
> all spindles across the raidz1 volumes? Right now it looks like the
> newer volumes are doing the heavy lifting while the other two just
> hold old data.

--
You can choose your friends, you can choose the deals. - Equity Private

If Linux is faster, it's a Solaris bug. - Phil Harman

Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
Re: [zfs-discuss] ZFS Restripe
On Aug 3, 2010, at 10:08 PM, Khyron wrote:

> Long answer: Not without rewriting the previously written data. Data
> is being striped over all of the top-level VDEVs, or at least it
> should be. But there is no way, at least not built into ZFS, to
> re-allocate the storage to perform I/O balancing. You would basically
> have to do this manually. Either way, I'm guessing this isn't the
> answer you wanted, but hey, you get what you get.

Actually, that was the answer I was expecting, yes.

The real question, then, is: what data should I rewrite? I want to
rewrite the data that's on the nearly full volumes so it gets spread to
the volumes with more space available.

Should I simply do a "zfs send | zfs receive" on all the filesystems I
have (we are talking about 400 filesystems with about 7 snapshots each,
here)? Or is there a way to rearrange specifically the data from the
nearly full volumes?

Thanks,
Eduardo Bragatto
Re: [zfs-discuss] ZFS Restripe
On Aug 3, 2010, at 10:57 PM, Richard Elling wrote:

> Unfortunately, "zpool iostat" is completely useless at describing
> performance. The only thing it can do is show device bandwidth, and
> everyone here knows that bandwidth is not performance, right? Nod
> along, thank you.

I totally understand that. I only used the output to show the space
utilization per raidz1 volume.

> Yes, and you also notice that the writes are biased towards the
> raidz1 sets that are less full. This is exactly what you want :-)
> Eventually, when the less empty sets become more empty, the writes
> will rebalance.

Actually, if we are going to consider the values from zpool iostat,
they are only slightly biased towards the volumes I would want. For
example, in the first post I made, the volume with the least free space
had 845GB free; that same volume now has 833GB. I really would like to
just stop writing to that volume at this point, as I've experienced
very bad performance in the past when a volume gets nearly full.

As a reference, here's the information I posted less than 12 hours ago:

# zpool iostat -v | grep -v c4
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup      35.2T  15.3T    602    272  15.3M  11.1M
  raidz1    11.6T  1.06T    138     49  2.99M  2.33M
  raidz1    11.8T   845G    163     54  3.82M  2.57M
  raidz1    6.00T  6.62T    161     84  4.50M  3.16M
  raidz1    5.88T  6.75T    139     83  4.01M  3.09M
----------  -----  -----  -----  -----  -----  -----

And here's the info from the same system, as I write now:

# zpool iostat -v | grep -v c4
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup      35.3T  15.2T    541    208  9.90M  6.45M
  raidz1    11.6T  1.06T    116     38  2.16M  1.41M
  raidz1    11.8T   833G    122     39  2.28M  1.49M
  raidz1    6.02T  6.61T    152     64  2.72M  1.78M
  raidz1    5.89T  6.73T    149     66  2.73M  1.77M
----------  -----  -----  -----  -----  -----  -----

As you can see, the second raidz1 volume is not being spared and has
been receiving almost as many writes as the others (and even more
compared to the first volume).

> Impressions work well for dating, but not so well for performance.
> Does your application run faster or slower?

You're a funny guy. :) Let me re-phrase it: I'm sure I'm getting
degradation in performance, as my applications are waiting more on I/O
now than they used to (based on CPU utilization graphs I have). The
impression part is that the reason is the limited space in those two
volumes -- as I said, I have already experienced bad performance on ZFS
systems running nearly out of space.

>> Is there any way to re-stripe the pool, so I can take advantage of
>> all spindles across the raidz1 volumes?
>
> Yes, of course. But it requires copying the data, which probably
> isn't feasible.

I'm willing to copy data around to get this accomplished. I'm really
just looking for the best method -- I have more than 10TB free, so I
have some space to play with if I have to duplicate some data and erase
the old copy, for example.

Thanks,
Eduardo Bragatto
Re: [zfs-discuss] ZFS Restripe
I notice you use the word "volume", which really isn't accurate or
appropriate here. If all of these VDEVs are part of the same pool,
which as I recall you said they are, then writes are striped across all
of them (with bias for the more empty, a.k.a. less full, VDEVs).

You probably want to "zfs send" the oldest dataset (ZFS terminology for
a file system) into a new dataset. That oldest dataset was most likely
created when there were only 2 top-level VDEVs. If you have multiple
datasets created when you had only 2 VDEVs, then send/receive them
(in serial fashion, one after the other). If you have room for the
snapshots too, then send all of it, and then delete the source dataset
when done. I think this will achieve what you want.

You may want to get a bit more specific and choose from the oldest
datasets, THEN find the smallest of those oldest datasets and
send/receive it first. That way, the send/receive completes in less
time, and when you delete the source dataset, you've created more free
space on the entire pool without the risk of a single dataset exceeding
your 10 TiB of workspace. ZFS' copy-on-write nature really wants no
less than 20% free, because you never update data in place; a new copy
is always written to disk.

You might want to consider turning on compression on your new datasets
too, especially if you have free CPU cycles to spare. I don't know how
compressible your data is, but if it's fairly compressible, say lots of
text, then you might get some added benefit when you copy the old data
into the new datasets. Saving more space, then deleting the source
dataset, should help your pool have more free space, and thus influence
your writes for better I/O balancing when you do the next (and the
next) dataset copies.

HTH.

On Tue, Aug 3, 2010 at 22:48, Eduardo Bragatto edua...@bragatto.com wrote:

> Actually, that was the answer I was expecting, yes. The real
> question, then, is: what data should I rewrite? I want to rewrite the
> data that's on the nearly full volumes so it gets spread to the
> volumes with more space available. Should I simply do a "zfs send |
> zfs receive" on all the filesystems I have (we are talking about 400
> filesystems with about 7 snapshots each, here)? Or is there a way to
> rearrange specifically the data from the nearly full volumes?
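[Editor's note: the send-then-destroy-then-rename cycle described above
can be sketched as a small script. The dataset name is hypothetical,
and the commands are echoed rather than executed so the sequence can be
reviewed safely; remove the "echo"s to run it on a real pool, and note
that "zfs send -R" carries the dataset's snapshots along with it.]

```shell
#!/bin/sh
# Sketch of the per-dataset rebalance loop: copy an old dataset (with
# its snapshots, via -R) so its blocks land on the emptier vdevs, then
# free the original. "backup/srv01" is an illustrative name.
for ds in backup/srv01; do
    echo "zfs snapshot -r ${ds}@migrate"
    echo "zfs send -R ${ds}@migrate | zfs receive ${ds}.new"
    echo "zfs destroy -r ${ds}"           # frees space on the full vdevs
    echo "zfs rename ${ds}.new ${ds}"
done
```

Running datasets one at a time, smallest-oldest first as suggested,
keeps the peak extra space needed to one dataset's worth.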
Re: [zfs-discuss] ZFS Restripe
On Aug 3, 2010, at 8:55 PM, Eduardo Bragatto wrote:

> I totally understand that. I only used the output to show the space
> utilization per raidz1 volume. Actually, if we are going to consider
> the values from zpool iostat, they are only slightly biased towards
> the volumes I would want -- for example, in the first post I made,
> the volume with the least free space had 845GB free; that same volume
> now has 833GB. I really would like to just stop writing to that
> volume at this point, as I've experienced very bad performance in the
> past when a volume gets nearly full.

The tipping point for the change in the first fit/best fit allocation
algorithm is now 96%. Previously, it was 70%. Since you don't specify
which OS, build, or zpool version, I'll assume you are on something
modern.

NB, "zdb -m" will show the pool's metaslab allocations. If there are no
100% free metaslabs, then it is a clue that the allocator might be
working extra hard.

> As you can see, the second raidz1 volume is not being spared and has
> been receiving almost as many writes as the others (and even more
> compared to the first volume).

Yes, perhaps 1.5-2x the data written to the less full raidz1 sets. The
exact amount of data is not shown, because zpool iostat doesn't show
how much data is written; it shows the bandwidth.

> You're a funny guy. :) Let me re-phrase it: I'm sure I'm getting
> degradation in performance, as my applications are waiting more on
> I/O now than they used to (based on CPU utilization graphs I have).
> The impression part is that the reason is the limited space in those
> two volumes -- as I said, I have already experienced bad performance
> on ZFS systems running nearly out of space.

OK, so how long are they waiting? Try "iostat -zxCn" and look at the
asvc_t column. This will show how the disk is performing, though it
won't show the performance delivered by the file system to the
application. To measure the latter, try "fsstat zfs" (assuming you are
on a Solaris distro).

Also, if these are HDDs, the media bandwidth decreases and seeks
increase as they fill. ZFS tries to favor the outer cylinders (lower
numbered metaslabs) to take this into account.

> I'm willing to copy data around to get this accomplished. I'm really
> just looking for the best method -- I have more than 10TB free, so I
> have some space to play with if I have to duplicate some data and
> erase the old copy, for example.

zfs send/receive is usually the best method.
 -- richard
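[Editor's note: the "are there any 100% free metaslabs?" check from
"zdb -m" can be automated. The excerpt below is an approximation of the
command's per-metaslab lines (field layout and values are illustrative,
not captured from a real pool); with 128G metaslabs as in this thread,
a fully free metaslab shows "free 128G", so counting those lines gives
the answer at a glance.]

```shell
# Count fully free metaslabs in (approximated, hypothetical) `zdb -m`
# output; in this pool a metaslab is 128G, so "free 128G" means empty.
printf '%s\n' \
  'metaslab   0  offset           0  spacemap  39  free  1.21G' \
  'metaslab   1  offset  2000000000  spacemap   0  free   128G' \
  'metaslab   2  offset  4000000000  spacemap   0  free   128G' |
awk '$1 == "metaslab" && $NF == "128G" { n++ }
     END { print n + 0, "metaslab(s) 100% free" }'
```

Zero here, on a mostly full vdev, is the "allocator working extra hard"
clue Richard describes.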