Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)
The whole RAID does not fail -- we are talking about corruption here. If you lose some inodes, your whole partition is not gone. My ZFS pool could not be salvaged -- poof, the whole thing was gone (granted, it was a test pool and not a raidz or mirror yet). Still, for what happened, I cannot believe that 20G of data got messed up because a 1GB cache was not correctly flushed. Chad, I think what you're asking for is a way for a zpool to let you salvage whatever remaining data passes its checksums.
--
Regards,
Jeremy
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Limitations of ZFS
Hi Folks,

The man pages for zfs and zpool clearly say that it is not good (recommended) to use only a portion of a device for ZFS file system creation. Exactly what problems arise if we use only part of the disk space for a ZFS FS? Why can't I use one partition of a device for a ZFS file system and another partition for some other purpose? Will it cause any problems if I use one partition of a device for ZFS and another partition for something else? Why is everyone strongly recommending the whole disk (not part of the disk) for creating zpools / ZFS file systems?

Your help is appreciated.

Thanks & Regards
Masthan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
I've got a Thumper doing nothing but serving NFS. Its using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat: 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75 21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8 20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0 25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0 37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0 16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0 17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0 27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75 These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP shared via ZFS sharenfs on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome! benr. === Step 1 was to disable any ZFS features that might consume large amounts of CPU: # zfs set compression=off joyous # zfs set atime=off joyous # zfs set checksum=off joyous These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed dns was specified in /etc/nsswitch.conf which won't work given that no DNS servers are accessable from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF. Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves: [private:/tmp] root# sar 1 100 SunOS private.thumper1 5.11 snv_43 i86pc12/07/2006 10:38:05%usr%sys%wio %idle 10:38:06 0 27 0 73 10:38:07 0 27 0 73 10:38:09 0 27 0 73 10:38:10 1 26 0 73 10:38:11 0 26 0 74 10:38:12 0 26 0 74 10:38:13 0 24 0 76 10:38:14 0 6 0 94 10:38:15 0 7 0 93 10:38:22 0 99 0 1 -- 10:38:23 0 94 0 6 -- 10:38:24 0 28 0 72 10:38:25 0 27 0 73 10:38:26 0 27 0 73 10:38:27 0 27 0 73 10:38:28 0 27 0 73 10:38:29 1 30 0 69 10:38:30 0 27 0 73 And so we consider whether or not there is a pattern to the frequency. 
The following is sar output from any lines in which sys is above 90%: 10:40:04%usr%sys%wio %idleDelta 10:40:11 0 97 0 3 10:40:45 0 98 0 2 34 seconds 10:41:02 0 94 0 6 17 seconds 10:41:26 0 100 0 0 24 seconds 10:42:00 0 100 0 0 34 seconds 10:42:25 (end of sample) 25 seconds Looking at the congestion in the run queue: [private:/tmp] root# sar -q 5 100 10:45:43 runq-sz %runocc swpq-sz %swpocc 10:45:5127.0 85 0.0 0 10:45:57 1.0 20 0.0 0 10:46:02 2.0 60 0.0 0 10:46:1319.8 99 0.0 0 10:46:2317.7 99 0.0 0 10:46:3424.4 99 0.0 0 10:46:4122.1 97 0.0 0 10:46:4813.0 96 0.0 0 10:46:5525.3 102 0.0 0 Looking at the per-CPU breakdown: CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 00 00 324 224000 1540 00 100 0 0 10 00 1140 2260 10 130860 1 0 99 20 00 162 138 1490540 00 1 0 99 30 00556 460430 00 1 0 99 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 00 00 310 210 340 17 1717 50 100 0 0 10 00 1521 2000 17 265591 65 0 34 20 00 271 197 1751 13 202 00 66 0 34 30 00 120
Re: [zfs-discuss] Limitations of ZFS
On 07 December, 2006 - dudekula mastan sent me these 2,9K bytes:

Hi Folks, The man pages for zfs and zpool clearly say that it is not good (recommended) to use only a portion of a device for ZFS file system creation. Exactly what problems arise if we use only part of the disk space for a ZFS FS? Why can't I use one partition of a device for a ZFS file system and another partition for some other purpose?

You can.

Will it cause any problems if I use one partition of a device for ZFS and another partition for something else?

No.

Why is everyone strongly recommending the whole disk (not part of the disk) for creating zpools / ZFS file systems?

One thing is performance; ZFS can enable/disable the disk's write cache at will if it has full control over the entire disk.

/Tomas
--
Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
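For anyone comparing the two setups, a minimal sketch (pool and device names here are placeholders): given a whole disk, ZFS puts an EFI label on it and manages the drive's write cache itself; given a slice, it leaves the cache setting alone.

  # zpool create tank c1t1d0       (whole disk: EFI label, ZFS manages the write cache)
  # zpool create tank c1t1d0s0     (slice: supported, but the write cache is left as-is)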
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hey Ben - I need more time to look at this and connect some dots, but real quick Some nfsstat data that we could use to potentially correlate to the local server activity would be interesting. zfs_create() seems to be the heavy hitter, but a periodic kernel profile (especially if we can catch a 97% SYS period) would help: #lockstat -i997 -Ik -s 10 sleep 60 Alternatively: #dtrace -n 'profile-997hz / arg0 != 0 / { @s[stack()]=count(); }' It would also be interesting to see what the zfs_create()'s are doing. Perhaps a quick: #dtrace -n 'zfs_create:entry { printf(ZFS Create: %s\n, stringof(args[0]-v_path)); }' It would also be interesting to see the network stats. Grab Brendan's nicstat and collect some samples You're reference to low traffic is in bandwidth, which, as you indicate, is really, really low. But the data, at least up to this point, suggests the workload is not data/bandwidth intensive, but more attribute intensive. Note again zfs_create() is the heavy ZFS function, along with zfs_getattr. Perhaps it's the attribute-intensive nature of the load that is at the root of this. I can spend more time on this tomorrow (traveling today). Thanks, /jim Ben Rockwood wrote: I've got a Thumper doing nothing but serving NFS. Its using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat: 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75 21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8 20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0 25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0 37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0 16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0 17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0 27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75 These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP shared via ZFS sharenfs on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome! benr. === Step 1 was to disable any ZFS features that might consume large amounts of CPU: # zfs set compression=off joyous # zfs set atime=off joyous # zfs set checksum=off joyous These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed dns was specified in /etc/nsswitch.conf which won't work given that no DNS servers are accessable from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF. 
Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves: [private:/tmp] root# sar 1 100 SunOS private.thumper1 5.11 snv_43 i86pc12/07/2006 10:38:05%usr%sys%wio %idle 10:38:06 0 27 0 73 10:38:07 0 27 0 73 10:38:09 0 27 0 73 10:38:10 1 26 0 73 10:38:11 0 26 0 74 10:38:12 0 26 0 74 10:38:13 0 24 0 76 10:38:14 0 6 0 94 10:38:15 0 7 0 93 10:38:22 0 99 0 1 -- 10:38:23 0 94 0 6 -- 10:38:24 0 28 0 72 10:38:25 0 27 0 73 10:38:26 0 27 0 73 10:38:27 0 27 0 73 10:38:28 0 27 0 73 10:38:29 1 30 0 69 10:38:30 0 27 0 73 And so we consider whether or not there is a pattern to the frequency. The following is sar output from any lines in which sys is above 90%: 10:40:04%usr%sys%wio %idleDelta 10:40:11 0 97 0 3 10:40:45 0 98 0 2 34 seconds 10:41:02 0 94 0 6 17 seconds 10:41:26 0 100 0 0 24 seconds 10:42:00 0 100 0 0 34 seconds 10:42:25 (end of sample) 25 seconds Looking at the
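One crude way to gather the op-mix data Jim asks about is to snapshot the server-side RPC counters around one of the high-sys periods and compare the two dumps by hand (file names below are arbitrary):

  # nfsstat -s > /tmp/nfsstat.before
  # sleep 60
  # nfsstat -s > /tmp/nfsstat.after
  # diff /tmp/nfsstat.before /tmp/nfsstat.after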
[zfs-discuss] ZFS bootability target
Hi I am about to plan an upgrade of about 500 systems (sparc) to Solaris 10 and would like to go for ZFS to manage the rootdisk. But what timeframe are we looking at? and what should we take into account to be able to migrate to it later on? -- // Flemming Danielsen ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limitations of ZFS
Why is everyone strongly recommending the whole disk (not part of the disk) for creating zpools / ZFS file systems?

One thing is performance; ZFS can enable/disable the write cache in the disk at will if it has full control over the entire disk.

ZFS will also flush the write cache when necessary, and if applications are waiting for an I/O to complete, that is typically a point where data must be flushed out. So I don't expect much in the way of application performance gains here. There is a subset of SATA drives that do not handle concurrent I/O requests, and staging I/Os through the cache can be a way to drive more data throughput on them. But for many devices the write cache is not a big performance factor.

ZFS also does some intelligent I/O scheduling, and giving it entire disks allows that code to be more effective. Building pools from many slices of one disk means more head movement and lost performance.

-r
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
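To see what ZFS (or you) ended up with on a particular drive, format(1M) in expert mode exposes the cache settings on most drives -- menu entries can vary by controller and release, and c1t1d0 is just a placeholder:

  # format -e c1t1d0
  format> cache
  cache> write_cache
  write_cache> display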
[zfs-discuss] System pause peculiarity with mysql on zfs
Hey all, I run a netra X1 as the mysql db server for my small personal web site. This X1 has two drives in it with SVM-mirrored UFS slices for / and /var, a swap slice, and slice 7 is zfs. There is one zfs mirror pool called local on which there are a few file systems, one of which is for mysql. slice 7 used to be ufs, and I had no performance problems when that was the case. There is 1152MB of RAM on this box, half of which is in use. Solaris 10 FCS + all the latest patches as of today. So anyway, after moving mysql to live on zfs (with compression turned on for the volume in question), I noticed that web pages on my site took a bit of time, sometimes up to 20 seconds to load. I'd jump on to my X1, and notice that according to top, kernel was hogging 80-100% of the 500Mhz CPU, and mysqld was the top process in CPU use. The load average would shoot from a normal 0.something up to 6 or even 8. Command-line response was stop and go. Then I'd notice my page would finally load, and that corresponded with load and kernel CPU usage decreasing back to normal levels. I am able to reliably replicate this, and I ran lockstat while this was going on, the output of which is here: http://elektronkind.org/osol/lockstat-zfs-0.txt Part of me is kind of sure that this is 6421427 as there appears to be long and copious trips through ata_wait() as that bug illustrates, but I just want to be sure of it (and when is that bug seeing a solaris 10 patch, btw?) TIA, /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS bootability target
I am about to plan an upgrade of about 500 systems (sparc) to Solaris 10 and would like to go for ZFS to manage the rootdisk. But what timeframe are we looking at? I've heard update 5, so several months at least. and what should we take into account to be able to migrate to it later on? Are you isolating OS/root data from application/user data? Do you have 2 dedicated disks to mirror root on? If so, my guess is that later upgrading will be easy. Root ZFS pools will be restricted, so you won't be sharing them with your non-root data in most cases. That suggests to me that I'd be able to break my existing SVM mirror and install a ZFS root, later mirroring with the other disk. -- Darren Dunham [EMAIL PROTECTED] Senior Technical Consultant TAOShttp://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area This line left intentionally blank to confuse you. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: A Plea for Help: Thumper/ZFS/NFS/B43
Hi Ben,

Your sar output shows one core pegged pretty much constantly! From the Solaris Performance and Tools book, that SLP state value holds "the remainder of important events such as disk and network waits" along with other kernel wait events; kernel locks or condition variables also accumulate time in this state.

ZFS COUNT     zfs_create          4178
ZFS AVG TIME  zfs_create      71215587
ZFS SUM TIME  zfs_create  297538724997

I think it looks like the system must be spinning in zfs_create(). Looking in usr/src/uts/common/fs/zfs/zfs_vnops.c, there are a couple of places it could loop:

1129   /*
1130    * Create a new file object and update the directory
1131    * to reference it.
1132    */
...
1154   error = dmu_tx_assign(tx, zfsvfs->z_assign);
1155   if (error) {
1156           zfs_dirent_unlock(dl);
1157           if (error == ERESTART &&
1158               zfsvfs->z_assign == TXG_NOWAIT) {
1159                   dmu_tx_wait(tx);
1160                   dmu_tx_abort(tx);
1161                   goto top;
1162           }

and

1201   /*
1202    * Truncate regular files if requested.
1203    */
1204   if ((ZTOV(zp)->v_type == VREG) &&
1205       (zp->z_phys->zp_size != 0) &&
1206       (vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
1207           error = zfs_freesp(zp, 0, 0, mode, TRUE);
1208           if (error == ERESTART &&
1209               zfsvfs->z_assign == TXG_NOWAIT) {
1210                   /* NB: we already did dmu_tx_wait() */
1211                   zfs_dirent_unlock(dl);
1212                   VN_RELE(ZTOV(zp));
1213                   goto top;

I think the snoop would be very useful to pore over.

Cheers, Alan

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
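A quick way to confirm that the ERESTART/goto-top retry path is actually being taken is to count how often dmu_tx_wait() fires and who is calling it (this assumes the fbt probes on this build carry these function names):

  # dtrace -n 'fbt::dmu_tx_wait:entry { @[stack(10)] = count(); }'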
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Ben, The attached dscript might help determining the zfs_create issue. It prints: - a count of all functions called from zfs_create - average wall count time of the 30 highest functions - average cpu time of the 30 highest functions Note, please ignore warnings of the following type: dtrace: 1346 dynamic variable drops with non-empty dirty list Neil. Ben Rockwood wrote On 12/07/06 06:01,: I've got a Thumper doing nothing but serving NFS. Its using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat: 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75 21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8 20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0 25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0 37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0 16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0 17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0 27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75 These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP shared via ZFS sharenfs on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome! benr. === Step 1 was to disable any ZFS features that might consume large amounts of CPU: # zfs set compression=off joyous # zfs set atime=off joyous # zfs set checksum=off joyous These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed dns was specified in /etc/nsswitch.conf which won't work given that no DNS servers are accessable from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF. Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves: [private:/tmp] root# sar 1 100 SunOS private.thumper1 5.11 snv_43 i86pc12/07/2006 10:38:05%usr%sys%wio %idle 10:38:06 0 27 0 73 10:38:07 0 27 0 73 10:38:09 0 27 0 73 10:38:10 1 26 0 73 10:38:11 0 26 0 74 10:38:12 0 26 0 74 10:38:13 0 24 0 76 10:38:14 0 6 0 94 10:38:15 0 7 0 93 10:38:22 0 99 0 1 -- 10:38:23 0 94 0 6 -- 10:38:24 0 28 0 72 10:38:25 0 27 0 73 10:38:26 0 27 0 73 10:38:27 0 27 0 73 10:38:28 0 27 0 73 10:38:29 1 30 0 69 10:38:30 0 27 0 73 And so we consider whether or not there is a pattern to the frequency. 
The following is sar output from any lines in which sys is above 90%: 10:40:04%usr%sys%wio %idleDelta 10:40:11 0 97 0 3 10:40:45 0 98 0 2 34 seconds 10:41:02 0 94 0 6 17 seconds 10:41:26 0 100 0 0 24 seconds 10:42:00 0 100 0 0 34 seconds 10:42:25 (end of sample) 25 seconds Looking at the congestion in the run queue: [private:/tmp] root# sar -q 5 100 10:45:43 runq-sz %runocc swpq-sz %swpocc 10:45:5127.0 85 0.0 0 10:45:57 1.0 20 0.0 0 10:46:02 2.0 60 0.0 0 10:46:1319.8 99 0.0 0 10:46:2317.7 99 0.0 0 10:46:3424.4 99 0.0 0 10:46:4122.1 97 0.0 0 10:46:4813.0 96 0.0 0 10:46:5525.3 102 0.0 0 Looking at the per-CPU breakdown: CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 00 00 324 224000 1540 00 100 0 0 10 00 1140 2260 10 130860 1 0 99 20 00 162 138 1490540 00
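The attached script doesn't survive the digest, but a rough sketch of the kind of D script Neil describes looks like this (fbt-based, hence the dynamic-variable-drop warnings he mentions; the averages are rough per-call figures under zfs_create, not exact accounting):

  #!/usr/sbin/dtrace -s
  /* Count every function called beneath zfs_create() and report rough
   * average wall-clock and on-CPU times for each. */

  fbt::zfs_create:entry
  {
          self->in_create = 1;
  }

  fbt:::entry
  /self->in_create/
  {
          @calls[probefunc] = count();
          self->ts[probefunc] = timestamp;
          self->vts[probefunc] = vtimestamp;
  }

  fbt:::return
  /self->in_create && self->ts[probefunc]/
  {
          @wall[probefunc] = avg(timestamp - self->ts[probefunc]);
          @cpu[probefunc] = avg(vtimestamp - self->vts[probefunc]);
          self->ts[probefunc] = 0;
          self->vts[probefunc] = 0;
  }

  fbt::zfs_create:return
  {
          self->in_create = 0;
  }

  END
  {
          trunc(@wall, 30);
          trunc(@cpu, 30);
          printa(@calls);
          printa(@wall);
          printa(@cpu);
  }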
Re: [zfs-discuss] System pause peculiarity with mysql on zfs
Hi Dale, Are you using MyISAM or InnoDB? Also, what's your zpool configuration? Best Regards, Jason On 12/7/06, Dale Ghent [EMAIL PROTECTED] wrote: Hey all, I run a netra X1 as the mysql db server for my small personal web site. This X1 has two drives in it with SVM-mirrored UFS slices for / and /var, a swap slice, and slice 7 is zfs. There is one zfs mirror pool called local on which there are a few file systems, one of which is for mysql. slice 7 used to be ufs, and I had no performance problems when that was the case. There is 1152MB of RAM on this box, half of which is in use. Solaris 10 FCS + all the latest patches as of today. So anyway, after moving mysql to live on zfs (with compression turned on for the volume in question), I noticed that web pages on my site took a bit of time, sometimes up to 20 seconds to load. I'd jump on to my X1, and notice that according to top, kernel was hogging 80-100% of the 500Mhz CPU, and mysqld was the top process in CPU use. The load average would shoot from a normal 0.something up to 6 or even 8. Command-line response was stop and go. Then I'd notice my page would finally load, and that corresponded with load and kernel CPU usage decreasing back to normal levels. I am able to reliably replicate this, and I ran lockstat while this was going on, the output of which is here: http://elektronkind.org/osol/lockstat-zfs-0.txt Part of me is kind of sure that this is 6421427 as there appears to be long and copious trips through ata_wait() as that bug illustrates, but I just want to be sure of it (and when is that bug seeing a solaris 10 patch, btw?) TIA, /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS failover without multipathing
Luke Schwab wrote: Hi, I am running Solaris 10 ZFS and I do not have STMS multipathing enabled. I have dual FC connections to storage using two ports on an Emulex HBA. The Solaris ZFS admin guide says that a ZFS file system tracks disks by their path and their device ID, so if a disk is switched between controllers, ZFS can pick up the disk on a secondary controller. I tested this by creating a zpool on the first controller and then pulling the cable on the back of the server. The server took about 3-5 minutes to fail over. But it did fail over!!

By default, the [s]sd driver will retry [3]5 times with a timeout of 60 seconds. STMS understands the lower-level FC stuff, and can make better decisions, faster.
-- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
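If the 3-5 minute window is too long and STMS isn't an option, the timeout side of that retry behaviour can usually be shortened in /etc/system; treat the names below as a sketch to verify against your driver (sd vs. ssd) and release before applying:

  * shrink the per-command timeout from the default of 60 seconds
  set sd:sd_io_time=20
  set ssd:ssd_io_time=20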
[zfs-discuss] ZFS compression / ARC interaction
Quick question about the interaction of ZFS filesystem compression and the filesystem cache. We have an Opensolaris (actually Nexenta alpha-6) box running RRD collection. These files seem to be quite compressible. A test filesystem containing about 3,000 of these files shows a compressratio of 12.5x. My question is about how the filesystem cache works with compressed files. Does the fscache keep a copy of the compressed data, or the uncompressed blocks? To update one of these RRD files, I believe the whole contents are read into memory, modified, and then written back out. If the filesystem cache maintained a copy of the compressed data, a lot more, maybe more than 10x more, of these files could be maintained in the cache. That would mean we could have a lot more data files without ever needing to do a physical read. Looking at the source code overview, it looks like the compression happens underneath the ARC layer, so by that I am assuming the uncompressed blocks are cached, but I wanted to ask to be sure. Thanks! -Andy This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
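For reference, the ratio quoted above comes straight from the dataset properties, so it is easy to track as the files fill in (the pool/fs name is a placeholder):

  # zfs get compression,compressratio,used,referenced pool/rrd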
Re: [zfs-discuss] Re: zpool import takes to long with large numbers of file systems
Hi Luke, That's terrific! You know you might be able to tell ZFS which disks to look at. I'm not sure. It would be interesting, if anyone with a Thumper could comment on whether or not they see the import time issue. What are your load times now with MPXIO? Best Regards, Jason On 12/7/06, Luke Schwab [EMAIL PROTECTED] wrote: Jason, Sorry, I don't have IM. I did make some progress on testing. The solaris 10 OS allows you to kind of lun mask within the fp.conf file by creating a list of luns you don't want to see. This has improved my time greatly. I can now create/export within a second and import only takes about 10 seconds. What a differnce compared to the 5-8 minutes I've been seeing! but this not good if my machine needs to see LOTs of luns. It would be nice if there was a feature in zfs where you could specify which disk the the pool resides instead of zfs looking through every disk attached to the machine. Luke Schwab --- Jason J. W. Williams [EMAIL PROTECTED] wrote: Hey Luke, Do you have IM? My Yahoo IM ID is [EMAIL PROTECTED] -J On 12/6/06, Luke Schwab [EMAIL PROTECTED] wrote: Rats, I think I know where your going? We use LSIs exclusively. LSI performs lun masking as the driver level. You can specifically tell the LSI HBA to only bind to specific luns on the array. The array doesn't appear to support lun masking by itself. I believe you can also mask at the SAN switch but most of our connections are direct connect to the array. I just thought you know a quick and easy way to mask in Solaris 10. I tried the fp.conf file with a blackout list to prevent certain luns from being viewed but I couldn't get the OS to have a list of luns that it can only allow. Thanks, Luke --- Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Luke, Who makes your array? IBM, SGI or StorageTek? Best Regards, Jason On 12/6/06, Luke Schwab [EMAIL PROTECTED] wrote: Jason, could you give me a tip on how to do lun masking. I used to do it via the /kernel/drv/ssd.conf file with an LSI HBA. Now I have Emulex HBAs with the Leadville. I saw on sunsolve a way to mask using a black list in the /kernel/drv/fp.conf file but that isn't what I was looking for. Do you know any other ways? Thanks, Luke Scwhab --- Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Luke, I think you'll really like it. We moved from UFS/SVM and its a night and day management difference. Though I understand SVM itself is easier to deal with than VxVM, so it may be an order of magnitude easier. Best Regards, Jason On 12/6/06, Luke Schwab [EMAIL PROTECTED] wrote: The 4884 as well as the V280 server is using 2 ports each. I don't have any FS's at this point. I'm trying to keep it simple for now. We are beta testing to go away from VxVM and such. --- Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Luke, Is the 4884 using two or four ports? Also, how many FSs are involved? Best Regards, Jason On 12/6/06, Luke Schwab [EMAIL PROTECTED] wrote: I, too, experienced a long delay while importing a zpool on a second machine. I do not have any filesystems in the pool. Just the Solaris 10 Operating system, Emulex 10002DC HBA, and a 4884 LSI array (dual attached). I don't have any file systems created but when STMS(mpxio) is enabled I see # time zpool import testpool real 6m41.01s user 0m.30s sys 0m0.14s When I disable STMS(mpxio), the times are much better but still not that great? # time zpool import testpool real 1m15.01s user 0m.15s sys 0m0.35s Are these normal symproms?? 
Can anyone explain why I too see delays even though I don't have any file systems in the zpool? This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
=== message truncated ===
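On the "tell ZFS which disks to look at" idea: zpool import already accepts a -d option that restricts the device scan to a single directory, so a directory of symlinks pointing at just the pool's LUNs may be worth trying (the device path below is made up, and whether it helps once MPxIO is enabled is untested):

  # mkdir /zfsdev
  # ln -s /dev/dsk/c6t0d0s0 /zfsdev/c6t0d0s0
  # zpool import -d /zfsdev testpool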
Re: [zfs-discuss] System pause peculiarity with mysql on zfs
You said you are running Solaris 10 FCS but zfs was not released until Solaris 10 6/06 which is Solaris 10U2. On 12/7/06, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Dale, Are you using MyISAM or InnoDB? Also, what's your zpool configuration? Best Regards, Jason On 12/7/06, Dale Ghent [EMAIL PROTECTED] wrote: Hey all, I run a netra X1 as the mysql db server for my small personal web site. This X1 has two drives in it with SVM-mirrored UFS slices for / and /var, a swap slice, and slice 7 is zfs. There is one zfs mirror pool called local on which there are a few file systems, one of which is for mysql. slice 7 used to be ufs, and I had no performance problems when that was the case. There is 1152MB of RAM on this box, half of which is in use. Solaris 10 FCS + all the latest patches as of today. So anyway, after moving mysql to live on zfs (with compression turned on for the volume in question), I noticed that web pages on my site took a bit of time, sometimes up to 20 seconds to load. I'd jump on to my X1, and notice that according to top, kernel was hogging 80-100% of the 500Mhz CPU, and mysqld was the top process in CPU use. The load average would shoot from a normal 0.something up to 6 or even 8. Command-line response was stop and go. Then I'd notice my page would finally load, and that corresponded with load and kernel CPU usage decreasing back to normal levels. I am able to reliably replicate this, and I ran lockstat while this was going on, the output of which is here: http://elektronkind.org/osol/lockstat-zfs-0.txt Part of me is kind of sure that this is 6421427 as there appears to be long and copious trips through ata_wait() as that bug illustrates, but I just want to be sure of it (and when is that bug seeing a solaris 10 patch, btw?) TIA, /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS compression / ARC interaction
Andrew Miller wrote: Quick question about the interaction of ZFS filesystem compression and the filesystem cache. We have an Opensolaris (actually Nexenta alpha-6) box running RRD collection. These files seem to be quite compressible. A test filesystem containing about 3,000 of these files shows a compressratio of 12.5x. My question is about how the filesystem cache works with compressed files. Does the fscache keep a copy of the compressed data, or the uncompressed blocks? To update one of these RRD files, I believe the whole contents are read into memory, modified, and then written back out. If the filesystem cache maintained a copy of the compressed data, a lot more, maybe more than 10x more, of these files could be maintained in the cache. That would mean we could have a lot more data files without ever needing to do a physical read. Looking at the source code overview, it looks like the compression happens underneath the ARC layer, so by that I am assuming the uncompressed blocks are cached, but I wanted to ask to be sure. Thanks! -Andy Yup, your assumption is correct. We currently do compression below the ARC. We have contemplated caching data in compressed form, but have not really explored the idea fully yet. -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS compression / ARC interaction
On 12/8/06, Mark Maybee [EMAIL PROTECTED] wrote: Yup, your assumption is correct. We currently do compression below the ARC. We have contemplated caching data in compressed form, but have not really explored the idea fully yet. Hmm... interesting idea. That will incur CPU to do a decompress when the page is reclaimed but reduce memory pressure. What implications this will have on encryption? -- Just me, Wire ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS compression / ARC interaction
Looking at the source code overview, it looks like the compression happens underneath the ARC layer, so by that I am assuming the uncompressed blocks are cached, but I wanted to ask to be sure. Thanks! -Andy

Yup, your assumption is correct. We currently do compression below the ARC. We have contemplated caching data in compressed form, but have not really explored the idea fully yet. -Mark
___

Mark,

Thanks for the quick response! I imagine the compression will still help quite a bit anyway, since ultimately there's a lot less data to write back to the disk. A compressed cache would be an interesting tunable parameter - it would be great for these types of files, and also for some of the things we keep here in databases (a lot of text/blobs, and as such highly compressible).

My colleagues and I are really impressed with the design and performance of ZFS - keep up the good work!

-Andy

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: A Plea for Help: Thumper/ZFS/NFS/B43
I'm still confused though, I believe that locking an adaptive mutex will spin for a short period then context switch and so they shouldn't be burning CPU - at least not .4s worth! An adaptive mutex will spin as long as the thread which holds the mutex is on CPU. If the lock is moderately contended, you can wind up with threads spinning for quite a while as ownership of the lock passes from thread to thread across CPUs. Mutexes in Solaris tend to be most useful when they're held for very short periods of time; they also work pretty well if the owning thread blocks. If somebody is computing for quite a while while holding them (e.g. if dnode_next_offset is repeatedly called and is slow), they can waste a lot of time on other CPUs. In this case an rwlock usually works better. Anton This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
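A contention profile taken over one of the slow spells would show directly whether those adaptive mutexes are mostly spinning or blocking, and on which locks; something like (top 20 entries, 30-second sample):

  # lockstat -C -D 20 sleep 30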
Re: [zfs-discuss] System pause peculiarity with mysql on zfs
On Dec 7, 2006, at 1:46 PM, Jason J. W. Williams wrote:

Hi Dale, Are you using MyISAM or InnoDB?

InnoDB.

Also, what's your zpool configuration?

A basic mirror:

[EMAIL PROTECTED] zpool status
  pool: local
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        local         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s7  ONLINE       0     0     0
            c0t2d0s7  ONLINE       0     0     0

errors: No known data errors
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: System pause peculiarity with mysql on zfs
This does look like the ATA driver bug rather than a ZFS issue per se. (For the curious, the reason ZFS triggers this when UFS doesn't is because ZFS sends a synchronize cache command to the disk, which is not handled in DMA mode by the controller; and for this particular controller, switching between DMA and PIO mode has some quirks which were worked around by adding delays. The fix involves a new quirk-work-around.) Anton This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
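Watching per-device service times while reproducing the stall should make the driver-level explanation visible from userland too; long asvc_t (and %b pegged) on the mirror's two disks during a slow page load would line up with the cache-flush/PIO behaviour described above:

  # iostat -xn 1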
Re: [zfs-discuss] ZFS compression / ARC interaction
On 12/7/06, Andrew Miller [EMAIL PROTECTED] wrote:

Quick question about the interaction of ZFS filesystem compression and the filesystem cache. We have an Opensolaris (actually Nexenta alpha-6) box running RRD collection. These files seem to be quite compressible. A test filesystem containing about 3,000 of these files shows a compressratio of 12.5x.

Be careful here. If you are using files that have no data in them yet, you will get much better compression than later in life. Judging by the fact that you got only 12.5x, I suspect that your files are at least partially populated. Expect the compression to get worse over time. Looking at some RRD files that come from a very active (e.g. numbers vary frequently) server with data filling about 2/3 of the configured time periods, I see the following rates:

 1.8 mpstat.rrd
 1.8 vmstat.rrd
 1.9 exacct_PROJECT_user.oracle.rrd
 2.0 net-ce2.rrd
 2.1 iostat-c14.rrd
 2.1 iostat-c15.rrd
 2.1 iostat-c16.rrd
 . . .
 7.6 net-ce912005.rrd
 7.7 net-ce912016.rrd
 9.1 exacct_PROJECT_user.gemsadm.rrd
12.2 exacct_PROJECT_exacct_interval.rrd
18.1 exacct_PROJECT_user.patrol.rrd
18.1 exacct_PROJECT_user.precise.rrd
18.1 exacct_PROJECT_user.precise6.rrd
31.8 net-ce8.rrd
39.6 net-eri3.rrd
45.1 net-eri2.rrd

The first column is the compression ratio. The net-eri{2,3} files are almost empty.

My question is about how the filesystem cache works with compressed files. Does the fscache keep a copy of the compressed data, or the uncompressed blocks? To update one of these RRD files, I believe the whole contents are read into memory, modified, and then written back out. If the filesystem cache maintained a copy of the compressed data, a lot more, maybe more than 10x more, of these files could be maintained in the cache. That would mean we could have a lot more data files without ever needing to do a physical read.

Here is an insert of a value:

25450: open(/opt/perfstat/rrd/somehost/iostat-c4.rrd, O_RDWR) = 3
25450: fstat64(3, 0xFFBFF5E0) = 0
25450: fstat64(3, 0xFFBFF640) = 0
25450: fstat64(3, 0xFFBFF4E8) = 0
25450: ioctl(3, TCGETA, 0xFFBFF5CC) Err#25 ENOTTY
25450: read(3, R R D\0 0 0 0 1\0\0\0\0.., 8192) = 8192
25450: llseek(3, 0, SEEK_CUR) = 8192
25450: lseek(3, 0xFC68, SEEK_CUR) = 7272
25450: fcntl(3, F_SETLK, 0xFFBFF7D0) = 0
25450: llseek(3, 0, SEEK_CUR) = 7272
25450: lseek(3, 2230952, SEEK_SET) = 2230952
25450: write(3, @ x S = pA3D7\v ?E6 f f.., 64) = 64
25450: lseek(3, 1864, SEEK_SET) = 1864
25450: write(3, E xA0 # U N K N\0\0\0\0.., 5408) = 5408
25450: close(3) = 0

Notice that it does the following:

  - Open the file
  - Read the first 8K
  - Seek to a particular spot
  - Take a lock
  - Seek
  - Write 64 bytes
  - Seek
  - Write 5408 bytes
  - Close

The rrd file in question is 8.6 MB. There was 8KB of reads and 5472 bytes of writes. This is one of the big wins of the current binary rrd format over the original ASCII version that came with MRTG.

Mike
--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
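Given that an update rewrites only a few kilobytes in place inside a multi-megabyte file, one knob that sometimes helps this access pattern is setting the dataset's recordsize closer to the update size before the files are created (the value and fs name below are only an illustration, and the property affects newly written blocks only):

  # zfs set recordsize=8k pool/rrd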
Re: [zfs-discuss] Re: System pause peculiarity with mysql on zfs
That's gotta be what it is. All our MySQL IOP issues went away once we moved to RAID-1 from RAID-Z. -J

On 12/7/06, Anton B. Rang [EMAIL PROTECTED] wrote: This does look like the ATA driver bug rather than a ZFS issue per se. (For the curious, the reason ZFS triggers this when UFS doesn't is because ZFS sends a synchronize-cache command to the disk, which is not handled in DMA mode by the controller; and for this particular controller, switching between DMA and PIO mode has some quirks which were worked around by adding delays. The fix involves a new quirk workaround.) Anton This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] System pause peculiarity with mysql on zfs
On Dec 7, 2006, at 5:22 PM, Nicholas Senedzuk wrote: You said you are running Solaris 10 FCS but zfs was not released until Solaris 10 6/06 which is Solaris 10U2. Look at a Solaris 10 6/06 CD/DVD. Check out the Solaris_10/ UpgradePatches directory. ah! well whaddya know... Yes, apply those (you have to do them in the right order to do it in one run with 'patchadd -M') and you can bring your older box up to date with the update release. /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: System pause peculiarity with mysql on zfs
On Dec 7, 2006, at 6:14 PM, Anton B. Rang wrote: This does look like the ATA driver bug rather than a ZFS issue per se. Yes indeed. Well, that answers that. FWIW, I'm hour 2 of a mysql configure script run. Yow! (For the curious, the reason ZFS triggers this when UFS doesn't is because ZFS sends a synchronize cache command to the disk, which is not handled in DMA mode by the controller; and for this particular controller, switching between DMA and PIO mode has some quirks which were worked around by adding delays. The fix involves a new quirk-work-around.) Ah, so I suppose this would affect the V100, too. The same ALi IDE controller in that box. Thanks for the insight. Since the fix for this made it into snv_52, I suppose it's too recent for a backport and patch release for s10 :( /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS compression / ARC interaction
Be careful here. If you are using files that have no data in them yet you will get much better compression than later in life. Judging by the fact that you got only 12.5x, I suspect that your files are at least partially populated. Expect the compression to get worse over time.

I do expect it to get somewhat worse over time -- I don't expect such compression forever but didn't want to get too detailed in my original question. :-) A lot of the data points I'm collecting (40%+) are quite static or change slowly over time - representing, for example, disk space, TCP errors (hopefully always zero! :-)) or JVM jstats of development JVM instances that get only small occasional bursts in activity.

(snip) Read the first 8K / Seek to a particular spot / Take a lock / Seek / Write 64 bytes / Seek / Write 5408 bytes / Close

Interesting, that looks a lot different than what I'm seeing. Maybe something different in the implementation (I'm using perl RRDs and RRD 1.2.11). Note that RRDFILE.rrd is 125504 bytes on disk. I'll have to look into it a little deeper as it certainly would help performance to just read the preamble and modify the pieces of the RRD that need to change. Maybe it's because my RRDs are quite small; they contain only one DS and I tuned down the number and length of the RRAs while fighting the performance issues that ultimately ended in me moving this to an opensolaris based box.

open(RRDFILE.rrd, O_RDONLY) = 14
fstat64(14, 0x080473E0) = 0
fstat64(14, 0x08047310) = 0
ioctl(14, TCGETA, 0x080473AC) Err#25 ENOTTY
read(14, R R D\0 0 0 0 3\0\0\0\0.., 125952) = 125504
llseek(14, 0, SEEK_CUR) = 125504
lseek(14, 21600, SEEK_SET) = 21600
lseek(14, 1600, SEEK_SET) = 1600
read(14, \0\0\0\0\0\0F8FF\0\0\0\0.., 125952) = 123904
llseek(14, 0xFFFE24F8, SEEK_CUR) = 3896
close(14) = 0
open(RRDFILE.rrd, O_RDWR) = 14
fstat64(14, 0x08047320) = 0
fstat64(14, 0x08047250) = 0
ioctl(14, TCGETA, 0x080472EC) Err#25 ENOTTY
read(14, R R D\0 0 0 0 3\0\0\0\0.., 125952) = 125504
llseek(14, 0, SEEK_CUR) = 125504
lseek(14, 0, SEEK_END) = 125504
llseek(14, 0, SEEK_CUR) = 125504
llseek(14, 0, SEEK_CUR) = 125504
lseek(14, 1504, SEEK_SET) = 1504
fcntl(14, F_SETLK, 0x08047430) = 0
mmap(0x, 125504, PROT_READ|PROT_WRITE, MAP_SHARED, 14, 0) = 0xFE25F000
munmap(0xFE25F000, 125504) = 0
llseek(14, 0, SEEK_CUR) = 1504
lseek(14, 880, SEEK_SET) = 880
write(14, :B0 x E\0\0\0\0 1 5\0\0.., 624) = 624
close(14) = 0

(I did an strace on linux, also, which is using RRD 1.0.49, and it looks about the same - appears to read the whole thing. Maybe it's something in RRDs or the way I'm using it)

Thanks for spending some of your time analyzing my problem. :-)

-Andy

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: System pause peculiarity with mysql on zfs
Hi Dale, For what its worth, the SX releases tend to be pretty stable. I'm not sure if snv_52 has made a SX release yet. We ran for over 6 months on SX 10/05 (snv_23) with no downtime. Best Regards, Jason On 12/7/06, Dale Ghent [EMAIL PROTECTED] wrote: On Dec 7, 2006, at 6:14 PM, Anton B. Rang wrote: This does look like the ATA driver bug rather than a ZFS issue per se. Yes indeed. Well, that answers that. FWIW, I'm hour 2 of a mysql configure script run. Yow! (For the curious, the reason ZFS triggers this when UFS doesn't is because ZFS sends a synchronize cache command to the disk, which is not handled in DMA mode by the controller; and for this particular controller, switching between DMA and PIO mode has some quirks which were worked around by adding delays. The fix involves a new quirk-work-around.) Ah, so I suppose this would affect the V100, too. The same ALi IDE controller in that box. Thanks for the insight. Since the fix for this made it into snv_52, I suppose it's too recent for a backport and patch release for s10 :( /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS failover without multipathing
Jason,

I am no longer looking at dropping STMS multipathing, because without STMS you lose the binding to the array and I lose all transmissions between the server and the array. The binding does come back after a few minutes, but this is not acceptable in our environment. Load times vary depending on my configuration.

Scenario 1: No STMS: Really fast zpool create and zpool import/export. Less than 1 second for create/export and 5-15 seconds for an import.

Scenario 2: STMS (mpxio) enabled and no blacklists being used for LUN masking: zpool create takes 5-15 seconds, zpool imports take from 5-7 minutes.

Scenario 3: STMS enabled and blacklists enabled via /kernel/drv/fp.conf: It took at least 15 minutes to do a zpool create before I finally stopped it. This does not appear to be a viable solution.

If you have any ideas about how to improve performance, I am all ears. I'm not sure why ZFS takes so long to create pools with STMS. Does anyone else have problems using LSI arrays? I already had problems using my LSI HBA with ZFS because the LSI HBA does not work with the Leadville stack.

R/ ljs

Hi Luke, That's terrific! You know you might be able to tell ZFS which disks to look at. I'm not sure. It would be interesting, if anyone with a Thumper could comment on whether or not they see the import time issue. What are your load times now with MPXIO? Best Regards, Jason

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS failover without multipathing
Hi Luke, I wonder if it is the HBA. We had issues with Solaris and LSI HBAs back when we were using an Xserve RAID. Haven't had any of the issues you're describing between our LSI array and the Qlogic HBAs we're using now. If you have another type of HBA I'd try it. MPXIO and ZFS haven't ever caused what you're seeing for us. -J On 12/7/06, Luke Schwab [EMAIL PROTECTED] wrote: Jason, I am no longer looking at not using STMS multipathing because without STMS you loose the binding to the array and I loose all transmissions between the server and array. The binding does come back after a few minutes but this is not acceptable in our environment. Load times vary depending on my configuration. Senario 1: No STMS: Really fast zpool create and zpool import/export. Less then 1 second for create/export and 5-15 seconds for an import. Senario 2:STMS(mpxio)enabled and no blacks being used to LUN masking: zpool create takes 5-15 seconds, zpool imports take from 5-7 minutes. Senario 3: STMS enabled and blacklists enabled via /kernel/drv/fp.conf: It look at least 15 minutes to do a zpool create before I finially stopped it. This does not appear to be a viable solution. If you have any ideas about how to improve performance I am all ears. I'm not sure why ZFS takes so long to create pools with STMS? Does anyone have problems using LSI arrays. I already had problems using my LSI HBA with ZFS because the LSI HBA does not work with the Leadville stack. R/ ljs Hi Luke, That's terrific! You know you might be able to tell ZFS which disks to look at. I'm not sure. It would be interesting, if anyone with a Thumper could comment on whether or not they see the import time issue. What are your load times now with MPXIO? Best Regards, Jason This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Ben Rockwood wrote: Eric Kustarz wrote: Ben Rockwood wrote: I've got a Thumper doing nothing but serving NFS. Its using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat: We made several performance fixes in the NFS/ZFS area in recent builds, so if possible it would be great to upgrade you from snv_43. That said, there might be something else going on that we haven't accounted for. ... Step 1 was to disable any ZFS features that might consume large amounts of CPU: # zfs set compression=off joyous # zfs set atime=off joyous # zfs set checksum=off joyous In our performance testing, we haven't found checksums to be anywhere near a large consumer of CPU, so i would recommend leaving that on (due to its benefits). I suspect your apps/clients don't depend on atime, so i think its a good idea to turn that off. We've gotten better NFS performance with this off. More of a heads up as it sounds like compression on/off isn't your problem. If you are not getting good I/O BW with compression turned on, its most likely due to: 6460622 zio_nowait() doesn't live up to its name As Jim, mentioned, using lockstat to figure out where your CPU is being spent is the first step. I've been using 'lockstat -kgIW -D 60 sleep 60'. That collects data for the top 60 callers for a 1 minute period. If you see 'mutex_enter' high up in the results, then we have at least mutex lock contention. ... Interestingly, using prstat -mL to monitor thread latency, we see that a handful of threads are the culprates for consuming mass CPU: [private:/tmp] root# prstat -mL PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 416 1 0 0 nfsd/1506 22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 415 0 0 0 nfsd/1563 22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 417 0 0 0 nfsd/1554 22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 419 0 0 0 nfsd/1551 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 26 74 418 0 0 0 nfsd/1553 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1536 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1555 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1539 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1562 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1545 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1559 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419 1 0 0 nfsd/1541 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1546 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1543 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1560 Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86 The high SYS times being charged to the userland nfsd threads is representative of what the kernel threads are doing (which is most likely going to be NFS and ZFS). Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an idea of what ZFS is doing: [private:/] root# dtrace -s zvop_times.d dtrace: script 'zvop_times.d' matched 66 probes ^C CPU IDFUNCTION:NAME 1 2 :END ZFS COUNT zfs_getsecattr4 zfs_space70 zfs_rename 111 zfs_readdir 284 zfs_read367 zfs_mkdir 670 zfs_setattr1054 zfs_frlock 1562 zfs_putpage3916 zfs_write 4110 zfs_create 4178 zfs_remove 7794 zfs_fid 14960 zfs_inactive 17637 zfs_access20809 zfs_fsync 29668 zfs_lookup31273 zfs_getattr 175457 ZFS AVG TIME zfs_fsync 2337 zfs_getattr2735 zfs_access 2774 zfs_fid2948 zfs_inactive