Re: [zfs-discuss] ZFS Restripe
On Aug 4, 2010, at 12:26 AM, Richard Elling wrote:

> The tipping point for the change in the first fit/best fit allocation
> algorithm is now 96%. Previously, it was 70%. Since you don't specify
> which OS, build, or zpool version, I'll assume you are on something
> modern.

I'm running Solaris 10 10/09 (s10x_u8wos_08a), ZFS pool version 15.

> NB, "zdb -m" will show the pool's metaslab allocations. If there are no
> 100% free metaslabs, then it is a clue that the allocator might be
> working extra hard.

On the first two VDEVs there are no metaslabs 100% free (most are nearly full). The two newer VDEVs, however, do have several 128GB metaslabs that are 100% free. If I understand correctly, in that scenario the allocator has to work extra hard -- is that correct?

> OK, so how long are they waiting? Try "iostat -zxCn" and look at the
> asvc_t column. This will show how the disk is performing, though it
> won't show the performance delivered by the file system to the
> application. To measure the latter, try "fsstat zfs" (assuming you are
> on a Solaris distro).

Checking with iostat, I noticed the average wait time to be between 40ms and 50ms for all disks, which doesn't seem too bad. And this is the output of fsstat:

# fsstat zfs
  new  name   name  attr  attr lookup rddir  read  read write write
 file remov   chng   get   set    ops   ops   ops bytes   ops bytes
3.26M 1.34M  3.22M  161M 13.4M  1.36G  9.6M 10.5M  899G 22.0M  625G zfs

However, I did have CPU spikes at 100% where the kernel was taking all CPU time. I have reduced my zfs_arc_max parameter, as it seemed the applications were struggling for RAM, and things are looking better now.

Thanks for your time,
Eduardo Bragatto

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
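[Editor's note: the "no 100% free metaslabs" check Richard suggests can be scripted. This is only a sketch: the exact "zdb -m" output format varies by build, and the sample below is hypothetical data standing in for real "zdb -m <pool>" output, assuming one line per metaslab ending in a "free <size>" column.

```shell
#!/bin/sh
# Sketch: count metaslabs that are completely free in "zdb -m"-style
# output. Sample data (hypothetical) stands in for: zdb -m backup
cat <<'EOF' > /tmp/zdb-m.sample
metaslab      0   offset            0   spacemap         37   free    14.3M
metaslab      1   offset         2000   spacemap         38   free     128G
metaslab      2   offset         4000   spacemap          0   free     128G
EOF
# A metaslab whose "free" column equals its full size (128G here) has
# never been written; none left on a vdev suggests the allocator is
# past the first-fit/best-fit tipping point and working harder.
awk '$1 == "metaslab" && $NF == "128G" { n++ }
     END { print n " metaslabs 100% free" }' /tmp/zdb-m.sample
```

Run against real zdb output, a count of zero on the old vdevs and a large count on the new ones would match Eduardo's description.]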
Re: [zfs-discuss] ZFS Restripe
On Aug 4, 2010, at 12:20 AM, Khyron wrote:

> I notice you use the word "volume" which really isn't accurate or
> appropriate here.

Yeah, it didn't seem right to me, but I wasn't sure about the nomenclature. Thanks for clarifying.

> You may want to get a bit more specific and choose from the oldest
> datasets, THEN find the smallest of those oldest datasets and
> send/receive it first. That way, the send/receive completes in less
> time, and when you delete the source dataset, you've now created more
> free space on the entire pool, but without the risk of a single
> dataset exceeding your 10 TiB of workspace.

That makes sense. I'll try send/receiving a few of those datasets and see how it goes. I believe I can find the ones that were created before the two new VDEVs were added by comparing the creation times from "zfs get creation".

> ZFS' copy-on-write nature really wants no less than 20% free, because
> you never update data in place; a new copy is always written to disk.

Right, and my problem is that I have two VDEVs with less than 10% free at this point -- although the other two have around 50% free each.

> You might want to consider turning on compression on your new datasets
> too, especially if you have free CPU cycles to spare. I don't know how
> compressible your data is, but if it's fairly compressible, say lots
> of text, then you might get some added benefit when you copy the old
> data into the new datasets. Saving more space, then deleting the
> source dataset, should help your pool have more free space, and thus
> influence your writes for better I/O balancing when you do the next
> (and the next) dataset copies.

Unfortunately, the data taking most of the space is already compressed, so while I would gain some space from the many text files I also have, those are not the majority of my content, and the effort would probably not justify the small gain.

Thanks,
Eduardo Bragatto
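[Editor's note: picking the oldest-then-smallest dataset can be automated. A sketch under the assumption that creation epoch and used bytes have already been collected (e.g. via "zfs get -Hpo name,value creation" and "zfs list -Hpo name,used"); the dataset names and figures below are hypothetical sample data:

```shell
#!/bin/sh
# Columns: dataset, creation time (epoch seconds), used (bytes).
# In real use this file would be built from zfs get/zfs list output.
cat <<'EOF' > /tmp/datasets.txt
backup/hostA 1199145600 5000000000
backup/hostB 1199145600 1200000000
backup/hostC 1262304000 800000000
EOF
# Sort by creation first, then by size, so the smallest of the oldest
# datasets comes out on top -- the cheapest send/receive to start with.
oldest=$(sort -k2n -k3n /tmp/datasets.txt | head -1 | awk '{print $1}')
echo "next candidate for send/receive: $oldest"
```

Here backup/hostB wins: it ties for oldest but is the smaller of the two, so copying and deleting it frees pool space with the shortest send/receive.]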
Re: [zfs-discuss] ZFS Restripe
On Aug 4, 2010, at 11:18 AM, Bob Friesenhahn wrote:

> Assuming that your impressions are correct, are you sure that your new
> disk drives are similar to the older ones? Are they an identical
> model? Design trade-offs are now often resulting in larger capacity
> drives with reduced performance.

Yes, the disks are the same, no problems there.

On Aug 4, 2010, at 2:11 PM, Bob Friesenhahn wrote:

> On Wed, 4 Aug 2010, Eduardo Bragatto wrote:
>> Checking with iostat, I noticed the average wait time to be between
>> 40ms and 50ms for all disks. Which doesn't seem too bad.
>
> Actually, this is quite high. I would not expect such long wait times
> except when under extreme load, such as a benchmark. If the wait times
> are this long under normal use, then there is something wrong.

That's a backup server. I usually have 10 rsync instances running simultaneously, so there's a lot of random disk access going on -- I think that explains the high average time. Also, I recently enabled graphing of the IOPS per disk (reading it using net-snmp) and I see most disks are operating near their limit -- except for some disks in the older VDEVs, which is what I'm trying to address here.

>> However I did have CPU spikes at 100% where the kernel was taking all
>> CPU time. I have reduced my zfs_arc_max parameter as it seemed the
>> applications were struggling for RAM and things are looking better
>> now.
>
> Odd. What type of applications are you running on this system? Are
> applications running on the server competing with client accesses?

I noticed some of those rsync processes were using almost 1GB of RAM each, and the server has only 8GB. I started seeing the server swap a bit during the CPU spikes at 100%, so I figured it would be better to cap the ARC and leave some room for the rsync processes. I will also start using rsync v3 to reduce the memory footprint, so I might be able to give back some RAM to the ARC. I'm also thinking of going to 16GB RAM, as the pool is quite large and I'm sure more ARC wouldn't hurt.
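[Editor's note: on Solaris 10 the ARC cap Eduardo mentions is typically set persistently via /etc/system and takes effect after a reboot. A sketch -- the 4 GB value (0x100000000 bytes) is only an example, not the value Eduardo used:

```
* /etc/system -- cap the ZFS ARC at 4 GB (example value, in bytes)
set zfs:zfs_arc_max = 0x100000000
```

Leaving the remainder of RAM for the rsync processes avoids the swapping he describes.]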
Thanks,
Eduardo Bragatto
[zfs-discuss] ZFS Restripe
Hi,

I have a large pool (~50TB total, ~42TB usable), composed of 4 raidz1 volumes (of 7 x 2TB disks each):

# zpool iostat -v | grep -v c4
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup      35.2T  15.3T    602    272  15.3M  11.1M
  raidz1    11.6T  1.06T    138     49  2.99M  2.33M
  raidz1    11.8T   845G    163     54  3.82M  2.57M
  raidz1    6.00T  6.62T    161     84  4.50M  3.16M
  raidz1    5.88T  6.75T    139     83  4.01M  3.09M
----------  -----  -----  -----  -----  -----  -----

Originally there were only the first two raidz1 volumes; the two at the bottom were added later. You can notice that by the amount of used/free space: the first two volumes have ~11TB used and ~1TB free, while the other two have around ~6TB used and ~6TB free.

I have hundreds of ZFS filesystems storing backups from several servers. Each one has about 7 snapshots of older backups.

I have the impression I'm getting degradation in performance due to the limited space in the first two volumes, especially the second, which has only 845GB free.

Is there any way to re-stripe the pool, so I can take advantage of all spindles across the raidz1 volumes? Right now it looks like the newer volumes are doing the heavy lifting while the other two just hold old data.

Thanks,
Eduardo Bragatto
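[Editor's note: the imbalance is easy to quantify from the used/avail columns above. A sketch that computes percent-full per raidz1 vdev; the figures are copied from the iostat output above, and the unit conversion assumes only T and G suffixes appear:

```shell
#!/bin/sh
# used/avail pairs per raidz1 vdev, from the "zpool iostat -v" output.
cat <<'EOF' > /tmp/vdevs.txt
11.6T 1.06T
11.8T 845G
6.00T 6.62T
5.88T 6.75T
EOF
# Convert T/G suffixed sizes to gigabytes and print fullness per vdev.
awk 'function gb(s) { n = s + 0; return (s ~ /T$/) ? n * 1024 : n }
     { u = gb($1); a = gb($2)
       printf "raidz1 %d: %.0f%% full\n", NR, 100 * u / (u + a) }' /tmp/vdevs.txt
```

This yields roughly 92%, 93%, 48%, and 47% full: the first two vdevs are well past the ~70% point where ZFS allocation historically slowed down.]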
Re: [zfs-discuss] ZFS Restripe
On Aug 3, 2010, at 10:08 PM, Khyron wrote:

> Long answer: Not without rewriting the previously written data. Data
> is being striped over all of the top level VDEVs, or at least it
> should be. But there is no way, at least not built into ZFS, to
> re-allocate the storage to perform I/O balancing. You would basically
> have to do this manually. Either way, I'm guessing this isn't the
> answer you wanted, but hey, you get what you get.

Actually, that was the answer I was expecting, yes.

The real question, then, is: what data should I rewrite? I want to rewrite the data that's on the nearly full volumes, so it gets spread to the volumes with more space available.

Should I simply do a "zfs send | zfs receive" on all ZFS filesystems I have? (We are talking about 400 filesystems with about 7 snapshots each, here.) Or is there a way to rearrange specifically the data from the nearly full volumes?

Thanks,
Eduardo Bragatto
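[Editor's note: the manual rewrite Khyron describes can be planned per dataset. The sketch below is a dry run only -- it prints the commands rather than executing them; the dataset names are hypothetical, -R carries the snapshots along, and a real run should verify the copy before any destroy:

```shell
#!/bin/sh
# Dry run: emit the send/receive + rename steps that would rewrite each
# dataset, re-striping its blocks across all vdevs (new writes favor
# the emptier vdevs). Hypothetical dataset names.
for ds in backup/hostA backup/hostB; do
    echo "zfs snapshot $ds@move"
    echo "zfs send -R $ds@move | zfs receive backup/new-$(basename $ds)"
    echo "zfs destroy -r $ds"
    echo "zfs rename backup/new-$(basename $ds) $ds"
done > /tmp/restripe-plan.txt
cat /tmp/restripe-plan.txt
```

Reviewing the generated plan before swapping echo for real execution keeps a typo from destroying the wrong dataset.]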
Re: [zfs-discuss] ZFS Restripe
On Aug 3, 2010, at 10:57 PM, Richard Elling wrote:

> Unfortunately, "zpool iostat" is completely useless at describing
> performance. The only thing it can do is show device bandwidth, and
> everyone here knows that bandwidth is not performance, right? Nod
> along, thank you.

I totally understand that. I only used the output to show the space utilization per raidz1 volume.

> Yes, and you also notice that the writes are biased towards the raidz1
> sets that are less full. This is exactly what you want :-) Eventually,
> when the less empty sets become more empty, the writes will rebalance.

Actually, if we are going to consider the values from zpool iostat, they are only slightly biased towards the volumes I would want. For example, in the first post I made, the volume with the least free space had 845GB free; that same volume now has 833GB. I really would like to just stop writing to that volume at this point, as I've experienced very bad performance in the past when a volume gets nearly full.

As a reference, here's the information I posted less than 12 hours ago:

# zpool iostat -v | grep -v c4
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup      35.2T  15.3T    602    272  15.3M  11.1M
  raidz1    11.6T  1.06T    138     49  2.99M  2.33M
  raidz1    11.8T   845G    163     54  3.82M  2.57M
  raidz1    6.00T  6.62T    161     84  4.50M  3.16M
  raidz1    5.88T  6.75T    139     83  4.01M  3.09M
----------  -----  -----  -----  -----  -----  -----

And here's the info from the same system, as I write now:

# zpool iostat -v | grep -v c4
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup      35.3T  15.2T    541    208  9.90M  6.45M
  raidz1    11.6T  1.06T    116     38  2.16M  1.41M
  raidz1    11.8T   833G    122     39  2.28M  1.49M
  raidz1    6.02T  6.61T    152     64  2.72M  1.78M
  raidz1    5.89T  6.73T    149     66  2.73M  1.77M
----------  -----  -----  -----  -----  -----  -----

As you can see, the second raidz1 volume is not being spared, and has been providing almost as much space as the others (even more compared to the first volume).
>> I have the impression I'm getting degradation in performance due to
>> the limited space in the first two volumes, specially the second,
>> which has only 845GB free.
>
> Impressions work well for dating, but not so well for performance.
> Does your application run faster or slower?

You're a funny guy. :) Let me re-phrase it: I'm sure I'm getting degradation in performance, as my applications are waiting more on I/O now than they used to (based on CPU utilization graphs I have). The "impression" part is that the cause is the limited space in those two volumes -- as I said, I have already experienced bad performance on ZFS systems running nearly out of space before.

>> Is there any way to re-stripe the pool, so I can take advantage of
>> all spindles across the raidz1 volumes? Right now it looks like the
>> newer volumes are doing the heavy lifting while the other two just
>> hold old data.
>
> Yes, of course. But it requires copying the data, which probably isn't
> feasible.

I'm willing to copy data around to get this accomplished; I'm really just looking for the best method. I have more than 10TB free, so I have some space to play with if I have to duplicate some data and erase the old copy, for example.

Thanks,
Eduardo Bragatto
Re: [zfs-discuss] zpool import hanging - SOLVED
Hi,

I fixed this problem a couple of weeks ago, but hadn't found the time to report it until now.

Cindy Swearingen was very kind in contacting me to resolve this issue; I would like to take this opportunity to express my gratitude to her.

We have not found the root cause of the error. Cindy suspected some known bugs in release 5/09 that have been fixed in 10/09, but we could not confirm that as the real cause of the problem. Anyway, I went ahead and re-installed the operating system with the latest Solaris release (10/09), and "zpool import" worked like there was nothing wrong. I have scrubbed the pool and no errors were found. I have been using the system since the OS was re-installed (exactly 10 days now) without any problems.

If you get yourself into a situation where "zpool import" hangs and never finishes because it stalls while mounting some of the ZFS filesystems, make sure you try to import that pool on the newest stable release before wasting too much time debugging the problem.

Thanks,
Eduardo Bragatto
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
On May 12, 2010, at 3:23 AM, Brandon High wrote:

> On Tue, May 11, 2010 at 10:13 PM, Richard Elling
> <richard.ell...@gmail.com> wrote:
>> But who needs usability? This is unix, man.
>> I must have missed something. For the past few years I have routinely
>> booted with unimportable pools because I often use ramdisks. Sure, I
>> get FMA messages, but that doesn't affect the boot. OTOH, I don't try
>> to backup using mirrors. (..)
>
> If it was possible to pass in a flag from grub to ignore the cache, it
> would make life a little easier in such cases.

Recently I have been working on a zpool that refuses to import. During my work I had to boot the server many times in failsafe mode to be able to remove the zpool.cache file, so Brandon's suggestion sounds very reasonable at first.

However, I realized that if you import using "zpool import -R /altroot your_pool", it does NOT create a new zpool.cache. So, as long as you use -R, you can safely import pools without creating a new zpool.cache file, and your next reboot will not screw up the system.

Basically, there's no real need for a grub option (actually a kernel parameter): if you have a problem, you boot into failsafe mode and remove the file once, and then in your tests you import using -R so the cache is not re-created and you never need failsafe mode again.

Best regards,
Eduardo Bragatto
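[Editor's note: before rebooting, it is worth confirming that the -R import really left no cache behind. A small sketch; the cache path is parameterized so it can be exercised against any file, while on a live Solaris system the cache lives at /etc/zfs/zpool.cache:

```shell
#!/bin/sh
# Warn if a zpool.cache exists -- pools recorded there are re-imported
# at boot, which is exactly what you want to avoid while debugging.
check_cache() {
    if [ -f "$1" ]; then
        echo "WARNING: $1 exists; boot will try to import its pools"
    else
        echo "OK: no cache at $1; next boot will not auto-import"
    fi
}
# On a live system: check_cache /etc/zfs/zpool.cache
check_cache /tmp/no-such-cache > /tmp/cache-check.out
cat /tmp/cache-check.out
```

An "OK" here after an import done with -R confirms Eduardo's observation that -R skips writing the cache file.]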
Re: [zfs-discuss] zpool import hanging
Hi again,

As for the NFS issue I mentioned before, I made sure the NFS server was working and able to export before I attempted to import anything, and then started a new "zpool import backup". My hope was that the NFS share was causing the issue, since the only filesystem shared is the one causing the problem, but that doesn't seem to be the case.

I've done a lot of research and could not find a similar case to mine. The most similar one I've found was this, from 2008:

http://opensolaris.org/jive/thread.jspa?threadID=70205&tstart=15

I simply cannot import the pool, although ZFS reports it as OK. In that old thread, the user was also having the "zpool import" hang issue; however, he was able to run these two commands (his pool was named data1):

zdb -e -bb data1
zdb -e -ddd data1

While my system returns:

# zdb -e -bb backup
zdb: can't open backup: File exists
# zdb -e -ddd backup
zdb: can't open backup: File exists

All the documentation assumes you will be able to run "zpool import" before troubleshooting; my problem is exactly with that command, and I don't even know where to find more detailed documentation. I believe there are very knowledgeable people on this list -- could someone be kind enough to take a look and at least point me in the right direction?

Thanks,
Eduardo Bragatto
Re: [zfs-discuss] zpool import hanging
On May 10, 2010, at 4:46 PM, John Balestrini wrote:

> Recently I had a similar issue where the pool wouldn't import and
> attempting to import it would essentially lock the server up. Finally
> I used "pfexec zpool import -F pool1" and simply let it do its thing.
> After almost 60 hours the import finished and all has been well since
> (except my backup procedures have improved!).

Hey John,

Thanks a lot for answering. I already allowed the "zpool import" command to run from Friday to Monday and it did not complete. I also made sure to start it under truss, and literally nothing happened during that time (the truss output file does not have anything new). While the "zpool import" command runs, I don't see any CPU or disk I/O usage. "zpool iostat" shows very little I/O too:

# zpool iostat -v
                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
backup        31.4T  19.1T     11      2  29.5K  11.8K
  raidz1      11.9T   741G      2      0  3.74K  3.35K
    c3t102d0      -      -      0      0  23.8K  1.99K
    c3t103d0      -      -      0      0  23.5K  1.99K
    c3t104d0      -      -      0      0  23.0K  1.99K
    c3t105d0      -      -      0      0  21.3K  1.99K
    c3t106d0      -      -      0      0  21.5K  1.98K
    c3t107d0      -      -      0      0  24.2K  1.98K
    c3t108d0      -      -      0      0  23.1K  1.98K
  raidz1      12.2T   454G      3      0  6.89K  3.94K
    c3t109d0      -      -      0      0  43.7K  2.09K
    c3t110d0      -      -      0      0  42.9K  2.11K
    c3t111d0      -      -      0      0  43.9K  2.11K
    c3t112d0      -      -      0      0  43.8K  2.09K
    c3t113d0      -      -      0      0  47.0K  2.08K
    c3t114d0      -      -      0      0  42.9K  2.08K
    c3t115d0      -      -      0      0  44.1K  2.08K
  raidz1      3.69T  8.93T      3      0  9.42K    610
    c3t87d0       -      -      0      0  43.6K  1.50K
    c3t88d0       -      -      0      0  43.9K  1.48K
    c3t89d0       -      -      0      0  44.2K  1.49K
    c3t90d0       -      -      0      0  43.4K  1.49K
    c3t91d0       -      -      0      0  42.5K  1.48K
    c3t92d0       -      -      0      0  44.5K  1.49K
    c3t93d0       -      -      0      0  44.8K  1.49K
  raidz1      3.64T  8.99T      3      0  9.40K  3.94K
    c3t94d0       -      -      0      0  31.9K  2.09K
    c3t95d0       -      -      0      0  31.6K  2.09K
    c3t96d0       -      -      0      0  30.8K  2.08K
    c3t97d0       -      -      0      0  34.2K  2.08K
    c3t98d0       -      -      0      0  34.4K  2.08K
    c3t99d0       -      -      0      0  35.2K  2.09K
    c3t100d0      -      -      0      0  34.9K  2.08K
------------  -----  -----  -----  -----  -----  -----

Also, the third raidz1 entry shows less write bandwidth (610). This is actually the first time it's a non-zero value.
My last attempt to import it was with this command:

zpool import -o failmode=panic -f -R /altmount backup

However, it did not panic. As I mentioned in the first message, it mounts 189 filesystems and hangs on #190. While the command is hanging, I can use "zfs mount" to mount filesystems #191 and above (only one filesystem does not mount, and it causes the import procedure to hang).

Before trying the command above, I was using only "zpool import backup", and the iostat output was showing ZERO for the third raidz1 from the list above (not sure if that means something, but it does look odd).

I'm really at a dead end here; any help is appreciated.

Thanks,
Eduardo Bragatto
Re: [zfs-discuss] zpool import hanging
On May 10, 2010, at 6:28 PM, Cindy Swearingen wrote:

> Hi Eduardo,
>
> Please use the following steps to collect more information:
>
> 1. Use the following command to get the PID of the zpool import
> process, like this:
>
> # ps -ef | grep zpool
>
> 2. Use the actual PID of zpool import found in step 1 in the following
> command, like this:
>
> # echo "0t<PID of zpool import>::pid2proc|::walk thread|::findstack" | mdb -k
>
> Then, send the output.

Hi Cindy,

First of all, thank you for taking your time to answer my question. Here's the output of the command you requested:

# echo "0t733::pid2proc|::walk thread|::findstack" | mdb -k
stack pointer for thread 94e4db40: fe8000d3e5b0
[ fe8000d3e5b0 _resume_from_idle+0xf8() ]
  fe8000d3e5e0 swtch+0x12a()
  fe8000d3e600 cv_wait+0x68()
  fe8000d3e640 txg_wait_open+0x73()
  fe8000d3e670 dmu_tx_wait+0xc5()
  fe8000d3e6a0 dmu_tx_assign+0x38()
  fe8000d3e700 dmu_free_long_range_impl+0xe6()
  fe8000d3e740 dmu_free_long_range+0x65()
  fe8000d3e790 zfs_trunc+0x77()
  fe8000d3e7e0 zfs_freesp+0x66()
  fe8000d3e830 zfs_space+0xa9()
  fe8000d3e850 zfs_shim_space+0x15()
  fe8000d3e890 fop_space+0x2e()
  fe8000d3e910 zfs_replay_truncate+0xa8()
  fe8000d3e9b0 zil_replay_log_record+0x1ec()
  fe8000d3eab0 zil_parse+0x2ff()
  fe8000d3eb30 zil_replay+0xde()
  fe8000d3eb50 zfsvfs_setup+0x93()
  fe8000d3ebc0 zfs_domount+0x2e4()
  fe8000d3ecc0 zfs_mount+0x15d()
  fe8000d3ecd0 fsop_mount+0xa()
  fe8000d3ee00 domount+0x4d7()
  fe8000d3ee80 mount+0x105()
  fe8000d3eec0 syscall_ap+0x97()
  fe8000d3ef10 _sys_sysenter_post_swapgs+0x14b()

The first message in this thread has three files attached with information from truss (tracing "zpool import"), zdb output, and the entire list of threads taken from 'echo ::threadlist -v | mdb -k'.

Thanks,
Eduardo Bragatto
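[Editor's note: the two steps Cindy gives can be glued together once the PID is known. A sketch that only builds the mdb command string (the "0t" prefix tells mdb the PID is decimal); actually running it requires root and "mdb -k" on a live Solaris kernel, so the pipe is left as a comment:

```shell
#!/bin/sh
# Build the mdb findstack command for a given PID. 733 matches the
# output in this message; in general: pid=$(pgrep -ox zpool)
pid=733
echo "0t${pid}::pid2proc|::walk thread|::findstack" > /tmp/mdb-cmd.txt
cat /tmp/mdb-cmd.txt    # then: cat /tmp/mdb-cmd.txt | mdb -k
```

The resulting stack (txg_wait_open under zil_replay, as shown above) is what identifies where the mount is stuck.]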
Re: [zfs-discuss] zpool import hanging
Additionally, I would like to mention that the only ZFS filesystem not mounting -- causing the entire "zpool import backup" command to hang -- is the only filesystem configured to be exported via NFS:

backup/insightiq  sharenfs  root=*  local

Is there any chance the NFS share is the culprit here? If so, how do I avoid it?

Thanks,
Eduardo Bragatto
[zfs-discuss] Mismatched replication levels
Hi everyone,

I just joined the list after finding an unanswered message from Ray Van Dolson in the archives. I'm reproducing his question here, as I'm wondering about the same issue and have not found an answer for it anywhere yet. Can anyone shed any light on this subject?

-- Original Message --

What are the technical reasons to not have mismatched replication levels?

For example, I am creating a zpool with three raidz vdevs: two with 8 disks and one with only 7. zpool allows me to do this with -f, of course, but I can't find much documentation on why I shouldn't, other than "it's not recommended".

I can understand why, perhaps, for situations where you add new vdevs to your pool later and accidentally use some that aren't redundant to the same degree the others are -- you might unknowingly compromise your zpool that way. But as long as we're aware, is there any performance or other technical reason I shouldn't set up my vdevs as I have above?

Thanks,
Ray

-- End of Original Message --

According to the documentation here:

http://docs.sun.com/app/docs/doc/819-5461/gavwn?a=view

(..) The command also warns you about creating a mirrored or RAID-Z pool using devices of different sizes. While this configuration is allowed, mismatched levels of redundancy result in unused space on the larger device (..)
However, when I compare a pool made of two raidz groups of 7 disks each against one made of raidz groups of 7 and 8 disks, I do get more space in the latter, indicating the extra space in the larger raidz set is available (which I wouldn't expect based on the statement above):

2 x raidz (7 + 8) using 1TB disks:

backup2.nbg:~ root# zfs list benchpool78
NAME       USED   AVAIL  REFER  MOUNTPOINT
benchpool  155K   11.5T  31.0K  /benchpool78
backup2.nbg:~ root# zpool list benchpool78
NAME       SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
benchpool  13.6T  354K   13.6T   0%  ONLINE  -

2 x raidz (7 + 7) using 1TB disks:

backup2.nbg:~ root# zfs list benchpool77
NAME       USED   AVAIL  REFER  MOUNTPOINT
benchpool  117K   10.6T  1.70K  /benchpool77
backup2.nbg:~ root# zpool list benchpool
NAME       SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
benchpool  12.6T  146K   12.6T   0%  ONLINE  -

So, is there any real reason for not using mismatched replication levels? Is there any performance penalty?

Thanks,
Eduardo
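[Editor's note: the reported sizes are consistent with raidz1 simply losing one disk per vdev to parity: "zpool list" counts all disks including parity, while "zfs list" shows data capacity only. A quick arithmetic check, assuming marketing-1TB disks (10^12 bytes, i.e. ~0.909 TiB each):

```shell
#!/bin/sh
# raidz1 of N disks: N disks of raw pool size, N-1 disks of data space.
# 7+8 config: 15 disks total, 6+7 = 13 data disks.
# 7+7 config: 14 disks total, 6+6 = 12 data disks.
awk 'BEGIN {
    tib = 1e12 / 2^40    # TiB per 1TB disk
    printf "7+8 pool size:  %.1f TiB (vs 13.6T reported)\n", 15 * tib
    printf "7+8 data space: %.1f TiB (vs 11.5T reported, before overhead)\n", 13 * tib
    printf "7+7 pool size:  %.1f TiB (vs 12.6T reported)\n", 14 * tib
    printf "7+7 data space: %.1f TiB (vs 10.6T reported, before overhead)\n", 12 * tib
}'
```

So the extra 8th disk's data capacity is indeed usable; the "unused space" warning in the quoted documentation concerns mismatched device sizes within a mirror or raidz vdev, not vdevs of different widths.]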