bug? ZFS crypto vs. scrub
Sorry for abusing the mailing list, but I don't know how to report bugs anymore and have no visibility of whether this is a known/resolved issue. So, just in case it is not...

With Solaris 11 Express, scrubbing a pool that contains encrypted datasets for which no key is currently loaded reports unrecoverable read errors. The error count is charged to the pool, not to any specific device, which is also somewhat at odds with the helpful message text for diagnostic status and suggested action:

  pool: geek
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 3h8m with 280 errors on Tue May 10 17:12:15 2011
config:

        NAME         STATE     READ WRITE CKSUM
        geek         ONLINE     280     0     0
          raidz2-0   ONLINE       0     0     0
            c13t0d0  ONLINE       0     0     0
            c13t1d0  ONLINE       0     0     0
            c13t2d0  ONLINE       0     0     0
            c13t3d0  ONLINE       0     0     0
            c13t4d0  ONLINE       0     0     0
            c13t5d0  ONLINE       0     0     0
            c0t0d0   ONLINE       0     0     0
            c0t1d0   ONLINE       0     0     0
            c1t0d0   ONLINE       0     0     0
            c1t1d0   ONLINE       0     0     0

Using -v lists an error for the same two hex IDs in each snapshot, as in the following example:

geek/crypt@zfs-auto-snap_weekly-2011-03-28-22h39:0xfffe
geek/crypt@zfs-auto-snap_weekly-2011-03-28-22h39:0x

When this has happened previously (on this and other pools), mounting the dataset by supplying the key and rerunning the scrub removed the errors. For some reason, I can't in this case (it keeps complaining that the key is wrong). That may be a different issue that has also happened before, and I will post about it separately, once I'm sure I didn't just make a typo (twice) when first setting the key.

-- Dan.

___ zfs-crypto-discuss mailing list zfs-crypto-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-crypto-discuss
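[Editor's note: for reference, the recovery described above looks roughly like this on Solaris 11 Express. This is only a sketch; the pool and dataset names are taken from the post, and using "zfs key -l" as the key-loading command is an assumption about that release.]

# load the wrapping key and mount the encrypted dataset
zfs key -l geek/crypt
zfs mount geek/crypt

# rerun the scrub, re-check, and clear the counters once it completes cleanly
zpool scrub geek
zpool status -v geek
zpool clear geek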
Re: [zfs-discuss] ZFS on HP MDS 600
On Mon, May 9, 2011 at 8:33 AM, Darren Honeyball ml...@spod.net wrote:

> I'm just mulling over the best configuration for this system - our work load is mostly writing millions of small files (around 50k) with occasional reads; we need to keep as much space as possible.

If space is a priority, then raidz or raidz2 are probably the best bets. If you're going to have a lot of random iops, then mirrors are best. You have some control over the performance : space ratio with raidz by adjusting the width of the raidz vdevs. For instance, mirrors will provide 34TB of space and the best random iops. 24 x 3-disk raidz vdevs will have 48TB of space but still have pretty strong random iops performance. 13 x 5-disk raidz vdevs will give 52TB of space at the cost of lower random iops. Testing will help you find the best configuration for your environment.

> HP's recommendations for configuring the MDS 600 with ZFS is to let the P212 do the raid functions (raid 1+0 is recommended here) by configuring each half of the MDS 600 as a single logical drive (35 drives) then use a basic zfs pool on top to provide the zfs functionality - to me this would seem to lose a lot of the error checking functions of zfs?

If you configured the two logical drives as a mirror in ZFS, then you'd still have full protection. Your overhead would be really high though - 3/4 of your original capacity would be used for data protection if I understand the recommendation correctly. (You'd use 1/2 of the original capacity for RAID1 in the MDS, then 1/2 of the remainder for the ZFS mirror.) You could use a non-redundant pool in ZFS to reduce the overhead, but you sacrifice the self-healing properties of ZFS when you do that.

> Another option is to use raidz and let zfs handle the smart stuff - as the P212 doesn't support a true dumb JBOD function I'd need to create each drive as a single raid 0 logical drive - are there any drawbacks to doing this? Or would it be better to create slightly larger logical drives using say 2 physical drives per logical drive?

Single-device logical drives are required when you can't configure a card or device as JBOD, and I believe it's usually the recommended solution. Once you have the LUNs created, you can use ZFS to create mirrors or raidz vdevs (see the sketch below).

> I'm planning on having 2 hot spares - one in each side of the MDS 600, is it also worth using a dedicated ZIL spindle or 2?

It would depend on your workload. (How's that for helpful?) If you're experiencing a lot of synchronous writes, then a dedicated ZIL device will help. If you aren't seeing a lot of sync writes, then it won't. The ZIL doesn't have to be very large, since it's flushed on a regular basis. From the Best Practices guide: for a target throughput of X MB/sec, and given that ZFS pushes transaction groups every 5 seconds (and has 2 outstanding), we expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service 100MB/sec of synchronous writes, 1 GB of log device should be sufficient.

If the MDS has a non-volatile cache, there should be little or no need to use a separate ZIL device. However, some reports have shown ZFS with a dedicated ZIL to be faster than using the non-volatile cache. You should test performance using your workload.

> Is it worth tweaking zfs_nocacheflush or zfs_vdev_max_pending?

As I mentioned above, if the MDS has a non-volatile cache, then setting zfs_nocacheflush might help performance. If you're exporting one LUN per device then you shouldn't need to adjust the max_pending. If you're exporting larger RAID10 luns from the MDS, then increasing the value might help for read workloads.

-B

-- Brandon High : bh...@freaks.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
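[Editor's note: to make the raidz option concrete, a rough zpool(1M) sketch follows. The device names are placeholders, not the real c#t#d# names on this system, and the vdev groups would be repeated until all the single-drive LUNs are included.]

# 24 x 3-disk raidz vdevs (only the first three groups shown)
zpool create tank \
    raidz c2t0d0 c2t1d0 c2t2d0 \
    raidz c2t3d0 c2t4d0 c2t5d0 \
    raidz c2t6d0 c2t7d0 c2t8d0

# one hot spare in each half of the MDS 600
zpool add tank spare c2t34d0 c3t34d0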
[zfs-discuss] GPU acceleration of ZFS
Good day, I think ZFS can take advantage of using a GPU for sha256 calculation, encryption and maybe compression. Modern video cards, like the ATI HD 5xxx or 6xxx series, can calculate sha256 50-100 times faster than a modern 4-core CPU. The kgpu project for Linux shows nice results. 'zfs scrub' would run freely on high-performance ZFS pools. The only problem is that there are no AMD/Nvidia drivers for Solaris that support hardware-assisted OpenCL. Is anyone interested in it? Best regards, Anatoly Legkodymov. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] GPU acceleration of ZFS
On Tue, May 10, 2011 at 11:29 AM, Anatoly legko...@fastmail.fm wrote:

> Good day, I think ZFS can take advantage of using GPU for sha256 calculation, encryption and maybe compression. Modern video card, like 5xxx or 6xxx ATI HD Series can do calculation of sha256 50-100 times faster than modern 4 cores CPU.

Ignoring optimizations from SIMD extensions like SSE and friends, this is probably true. However, the GPU also has to deal with the overhead of data transfer to itself before it can even begin crunching data. Granted, a Gen. 2 x16 link is quite speedy, but is CPU performance really that poor where a GPU can still out-perform it? My undergrad thesis dealt with computational acceleration utilizing CUDA, and the datasets had to scale quite a ways before there was a noticeable advantage in using a Tesla or similar over a bog-standard i7-920.

> The only problem that there is no AMD/Nvidia drivers for Solaris that support hardware-assisted OpenCL.

This, and keep in mind that most of the professional users here will likely be using professional hardware, where a simple 8MB Rage XL gets the job done thanks to the magic of out-of-band management cards and other such facilities. Even as a home user, I have not placed a high-end videocard into my machine; I use a $5 ATI PCI videocard that saw about an hour of use whilst I installed Solaris 11.

-- --khd

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
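[Editor's note: one way to sanity-check the premise before reaching for a GPU is simply to time the CPU hashing a large, cached file. A rough sketch using the digest(1) utility that ships with Solaris; the file name and size are arbitrary, and /tmp is swap-backed, so this mostly measures hashing rather than disk.]

# build a 1 GB test file in tmpfs, then time a SHA-256 pass over it
mkfile 1g /tmp/hashtest
ptime digest -a sha256 /tmp/hashtest
rm /tmp/hashtest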
Re: [zfs-discuss] GPU acceleration of ZFS
IMHO, ZFS needs to run on all kinds of hardware. T-series CMT servers have had hardware that can help with SHA calculation since the T1 days, but I have not seen any work in ZFS to take advantage of it.

On 5/10/2011 11:29 AM, Anatoly wrote:

> Good day, I think ZFS can take advantage of using GPU for sha256 calculation, encryption and maybe compression. Modern video card, like 5xxx or 6xxx ATI HD Series can do calculation of sha256 50-100 times faster than modern 4 cores CPU. kgpu project for linux shows nice results. 'zfs scrub' would work freely on high performance ZFS pools. The only problem that there is no AMD/Nvidia drivers for Solaris that support hardware-assisted OpenCL. Is anyone interested in it? Best regards, Anatoly Legkodymov.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] GPU acceleration of ZFS
On Tue, May 10, 2011 at 10:29 PM, Anatoly legko...@fastmail.fm wrote:

> Good day, I think ZFS can take advantage of using GPU for sha256 calculation, encryption and maybe compression. Modern video card, like 5xxx or 6xxx ATI HD Series can do calculation of sha256 50-100 times faster than modern 4 cores CPU. kgpu project for linux shows nice results. 'zfs scrub' would work freely on high performance ZFS pools. The only problem that there is no AMD/Nvidia drivers for Solaris that support hardware-assisted OpenCL. Is anyone interested in it?

This isn't technically true. The NVIDIA drivers support compute, but there are other parts of the toolchain missing. /* I don't know about ATI/AMD, but I'd guess they likely don't support compute across platforms */ /* Disclaimer - The company I work for has a working HMPP compiler for Solaris/FreeBSD and we may soon support CUDA */

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Tuning disk failure detection?
We recently had a disk fail on one of our whitebox (SuperMicro) ZFS arrays (Solaris 10 U9). The disk began throwing errors like this:

May 5 04:33:44 dev-zfs4 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
May 5 04:33:44 dev-zfs4   mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110610

And errors for the drive were incrementing in iostat -En output. Nothing was seen in fmdump. Unfortunately, it took about three hours for ZFS (or maybe it was MPT) to decide the drive was actually dead:

May 5 07:41:06 dev-zfs4 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c5002cbc76c0 (sd4):
May 5 07:41:06 dev-zfs4   drive offline

During these three hours the I/O performance on this server was pretty bad and caused issues for us. Once the drive failed completely, ZFS pulled in a spare and all was well.

My question is -- is there a way to tune the MPT driver or even ZFS itself to be more/less aggressive about what it treats as a failure scenario? I suppose this would have been handled differently / better if we'd been using real Sun hardware?

Our other option is to watch better for log entries similar to the above and either alert someone or take some sort of automated action. I'm hoping there's a better way to tune this via driver or ZFS settings, however.

Thanks, Ray

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] primarycache=metadata seems to force behaviour of secondarycache=metadata
On Mon, May 9, 2011 at 2:54 PM, Tomas Ögren st...@acc.umu.se wrote: Slightly off topic, but we had an IBM RS/6000 43P with a PowerPC 604e cpu, which had about 60MB/s memory bandwidth (which is kind of bad for a 332MHz cpu) and its disks could do 70-80MB/s or so.. in some other machine.. It wasn't that long ago when 66MB/s ATA was considered a waste because no drive could use that much bandwidth. These days a slow drive has max throughput greater than 110MB/s. (OK, looking at some online reviews, it was about 13 years ago. Maybe I'm just old.) -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] COW question
przemol...@poczta.fm wrote:

> On Fri, Jul 07, 2006 at 11:59:29AM +0800, Raymond Xiong wrote:
>> It doesn't. Page 11 of the following slides illustrates how COW works in ZFS: http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf
>> "Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required." (from http://en.wikipedia.org/wiki/ZFS)
>> In the snapshot scenario, COW consumes much less disk space and is much faster.
>
> It also says that updating the uberblock is an atomic operation. How is that achieved?
>
> przemol

The email thread and document below give you information about that:

http://www.opensolaris.org/jive/thread.jspa?messageID=19264#19264
http://www.opensolaris.org/os/community/zfs/docs/ondiskformatfinal.pdf

Francois.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- This message posted from opensolaris.org
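[Editor's note: paraphrasing the on-disk format document linked above - each vdev label holds an array of uberblocks, and committing a transaction group writes the new uberblock into a rotating slot rather than over the currently active one; on pool open, the uberblock with the highest transaction group number that passes its checksum wins, so a torn or failed write simply leaves the previous uberblock active. You can look at this with zdb; the pool name and device below are placeholders.]

# show the active uberblock (txg, timestamp, root block pointer) of a pool
zdb -u mypool

# dump the vdev labels straight from one of the pool's disks
zdb -l /dev/rdsk/c0t0d0s0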
Re: [zfs-discuss] ZFS and Storage
On Thu, Jun 29, 2006 at 10:01:15AM +0200, Robert Milkowski wrote:
> Hello przemolicc,
> Thursday, June 29, 2006, 8:01:26 AM, you wrote:
> ppf> On Wed, Jun 28, 2006 at 03:30:28PM +0200, Robert Milkowski wrote:
> ppf> What I wanted to point out is Al's example: he wrote about damaged data. Data
> ppf> were damaged by firmware, _not_ the disk surface! In such a case ZFS doesn't help. ZFS can
> ppf> detect (and repair) errors on the disk surface, bad cables, etc. But it cannot detect and repair
> ppf> errors in its (ZFS) code.
>
> Not in its code, but definitely in the firmware code in a controller.
>
> ppf> As Jeff pointed out: if you mirror two different storage arrays.
>
> Not only, I believe. There are some classes of problems where even in one array ZFS could help with fw problems (with many controllers in active-active config, like Symmetrix).

Any real example?

przemol

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- This message posted from opensolaris.org
Re: [zfs-discuss] DTrace IO provider and oracle
On Tue, Aug 08, 2006 at 11:33:28AM -0500, Tao Chen wrote:
> On 8/8/06, przemol...@poczta.fm przemol...@poczta.fm wrote:
>> Hello, Solaris 10 GA + latest recommended patches: while running dtrace:
>>
>> bash-3.00# dtrace -n 'io:::start {@[execname, args[2]->fi_pathname] = count();}'
>> ...
>> oracle  <none>  2096052
>>
>> How can I interpret '<none>'? Is it possible to get the full path (like in vim)?
>
> Section 27.2.3 fileinfo_t of the DTrace Guide explains in detail why you see '<none>' in many cases.
> http://www.sun.com/bigadmin/content/dtrace/d10_latest.pdf or http://docs.sun.com/app/docs/doc/817-6223/6mlkidllf?a=view
> The execname part can also be misleading, as many I/O activities are asynchronous (including but not limited to Asynchronous I/O), so whatever thread is running on CPU may have nothing to do with the I/O that's occurring. This is working as designed and not a problem limited to ZFS, IMO.

Thanks Tao for the doc pointers. I hadn't noticed them.

przemol

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- This message posted from opensolaris.org
Re: [zfs-discuss] raidz DEGRADED state
So there is no current way to specify the creation of a 3-disk raid-z array with a known missing disk?

On 12/5/06, David Bustos david.bus...@sun.com wrote:
> Quoth Thomas Garner on Thu, Nov 30, 2006 at 06:41:15PM -0500:
>> I currently have a 400GB disk that is full of data on a linux system. If I buy 2 more disks and put them into a raid-z'ed zfs under solaris, is there a generally accepted way to build a degraded array with the 2 disks, copy the data to the new filesystem, and then move the original disk to complete the array?
>
> No, because we currently can't add disks to a raidz array. You could create a mirror instead and then add in the other disk to make a three-way mirror, though. Even doing that would be dicey if you only have a single machine, though, since Solaris can't natively read the popular Linux filesystems. I believe there is freeware to do it, but nothing supported.
>
> David

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- This message posted from opensolaris.org
Re: [zfs-discuss] raidz DEGRADED state
Ah, did not see your follow-up. Thanks.

Chris

On Thu, 30 Nov 2006, Cindy Swearingen wrote:

> Sorry, Bart is correct:
>
> "If new_device is not specified, it defaults to old_device. This form of replacement is useful after an existing disk has failed and has been physically replaced. In this case, the new disk may have the same /dev/dsk path as the old device, even though it is actually a different disk. ZFS recognizes this."
>
> cs
>
> Cindy Swearingen wrote:
>> One minor comment is to identify the replacement drive, like this:
>> # zpool replace mypool2 c3t6d0 c3t7d0
>> Otherwise, zpool will error...
>> cs
>
> Bart Smaalders wrote:
>> Krzys wrote:
>>> my drive did go bad on me, how do I replace it? I am running solaris 10 U2 (by the way, I thought U3 would be out in November, will it be out soon? does anyone know?)
>>>
>>> [11:35:14] server11: /export/home/me > zpool status -x
>>>   pool: mypool2
>>>  state: DEGRADED
>>> status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
>>> action: Attach the missing device and online it using 'zpool online'.
>>>    see: http://www.sun.com/msg/ZFS-8000-D3
>>>  scrub: none requested
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         mypool2     DEGRADED     0     0     0
>>>           raidz     DEGRADED     0     0     0
>>>             c3t0d0  ONLINE       0     0     0
>>>             c3t1d0  ONLINE       0     0     0
>>>             c3t2d0  ONLINE       0     0     0
>>>             c3t3d0  ONLINE       0     0     0
>>>             c3t4d0  ONLINE       0     0     0
>>>             c3t5d0  ONLINE       0     0     0
>>>             c3t6d0  UNAVAIL      0   679     0  cannot open
>>>
>>> errors: No known data errors
>>
>> Shut down the machine, replace the drive, reboot and type:
>>
>> zpool replace mypool2 c3t6d0
>>
>> On earlier versions of ZFS I found it useful to do this at the login prompt; it seemed fairly memory intensive.
>>
>> - Bart

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- This message posted from opensolaris.org
Re: [zfs-discuss] DTrace IO provider and oracle
I use this construct to get something better than "<none>":

args[2]->fi_pathname != "<none>" ? args[2]->fi_pathname : args[1]->dev_pathname

In the latest versions of Solaris 10, you'll see IOs not directly issued by the app show up as being owned by 'zpool-POOLNAME', where POOLNAME is the real name of the pool. In this case, it appears the IOs are being done by the issuing process, which means they're almost certainly reads. If that is the case, you could capture the pathname in the read call and pass that down to the start routine (left as an exercise for the reader).

I also find, especially with oracle, that using the psargs string is much more informative - curpsinfo->pr_psargs.

Jim

---

- Original Message -
From: przemol...@poczta.fm
To: zfs-discuss@opensolaris.org
Sent: Tuesday, May 10, 2011 10:27:55 AM GMT -08:00 US/Canada Pacific
Subject: Re: [zfs-discuss] DTrace IO provider and oracle

> On Tue, Aug 08, 2006 at 11:33:28AM -0500, Tao Chen wrote:
>> On 8/8/06, przemol...@poczta.fm przemol...@poczta.fm wrote:
>>> Hello, Solaris 10 GA + latest recommended patches: while running dtrace:
>>> bash-3.00# dtrace -n 'io:::start {@[execname, args[2]->fi_pathname] = count();}'
>>> ...
>>> oracle  <none>  2096052
>>> How can I interpret '<none>'? Is it possible to get the full path (like in vim)?
>> Section 27.2.3 fileinfo_t of the DTrace Guide explains in detail why you see '<none>' in many cases.
>> http://www.sun.com/bigadmin/content/dtrace/d10_latest.pdf or http://docs.sun.com/app/docs/doc/817-6223/6mlkidllf?a=view
>> The execname part can also be misleading, as many I/O activities are asynchronous (including but not limited to Asynchronous I/O), so whatever thread is running on CPU may have nothing to do with the I/O that's occurring. This is working as designed and not a problem limited to ZFS, IMO.
> Thanks Tao for the doc pointers. I hadn't noticed them.
> przemol

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- This message posted from opensolaris.org
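[Editor's note: putting Jim's two suggestions together, the original one-liner might become something like the following. This is a sketch only, not tested against any particular Solaris build.]

dtrace -n 'io:::start {
    @[curpsinfo->pr_psargs,
      args[2]->fi_pathname != "<none>" ? args[2]->fi_pathname
                                        : args[1]->dev_pathname] = count();
}'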
Re: [zfs-discuss] fuser vs. zfs
On 23 November, 2005 - Benjamin Lewis sent me these 3,0K bytes: Hello, I'm running Solaris Express build 27a on an amd64 machine and fuser(1M) isn't behaving as I would expect for zfs filesystems. Various google and ... #fuser -c / /:[lots of other PIDs] 20617tm [others] 20412cm [others] #fuser -c /opt /opt: # Nothing at all for /opt. So it's safe to unmount? Nope: ... Has anyone else seen something like this? Try something less ancient, Solaris 10u9 reports it just fine for example. ZFS was pretty new-born when snv27 got out.. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] fuser vs. zfs
On 10 May, 2011 - Tomas Ögren sent me these 0,9K bytes: On 23 November, 2005 - Benjamin Lewis sent me these 3,0K bytes: Hello, I'm running Solaris Express build 27a on an amd64 machine and fuser(1M) isn't behaving as I would expect for zfs filesystems. Various google and ... #fuser -c / /:[lots of other PIDs] 20617tm [others] 20412cm [others] #fuser -c /opt /opt: # Nothing at all for /opt. So it's safe to unmount? Nope: ... Has anyone else seen something like this? Try something less ancient, Solaris 10u9 reports it just fine for example. ZFS was pretty new-born when snv27 got out.. And for someone who is able to read as well, that mail was from 2005 - when snv27 actually was less ancient ;) Seems like the moderator queue from yesteryears just got flushed.. Sorry for the noise from my side.. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Old posts to zfs-discuss
Sorry for the old posts that some of you are seeing to zfs-discuss. The link between Jive and mailman was broken so I fixed that. However, once this was fixed Jive started sending every single post from the zfs-discuss board on Jive to the mail list. Quite a few posts were sent before I realized what was happening and was able to kill the process. Bill Rushmore ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance problem suggestions?
I've been going through my iostat, zilstat, and other outputs all to no avail. None of my disks ever seem to show outrageous service times, the load on the box is never high, and if the darned thing is CPU bound, I'm not even sure where to look.

> (traversing DDT blocks even if in memory, etc - and kernel times indeed are above 50%) as I'm zeroing deleted blocks inside the internal pool. This took several days already, but recovered lots of space in my main pool also...

When you say you are zeroing deleted blocks - how are you going about doing that?

Despite claims to the contrary, I can understand ZFS needing some tuning. What I can't understand are the baffling differences in performance I see. For example, after deleting a large volume, suddenly my performance will skyrocket, then gradually degrade - but the question is why? I'm not running dedup. My disks seem to be largely idle. I have 8 3GHz cores that also seem to be idle. I seem to have enough memory. What is ZFS doing during this time?

Everything I've read suggests one of two possible causes: too full, or bad hardware. Is there anything else that might be an issue here? Another ZFS factor I haven't taken into account? Space seems to be the biggest factor in my performance difference - more free space = more performance - but as my fullest disks are less than 70% full, and my emptiest disks are less than 10% full, I can't understand why space is an issue.

I have a few hardware errors for one of my pool disks - but we're talking about a very small number of errors over a long period of time. I'm considering replacing this disk, but the pool is so slow at times I'm loathe to slow it down further by doing a replace unless I can be more certain that is going to fix the problem.

-- This message posted from opensolaris.org

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance problem suggestions?
Well, as I wrote in other threads - I have a pool named "pool" on physical disks, and a compressed volume in this pool which I loopback-mount over iSCSI to make another pool named "dcpool". When files in dcpool are deleted, blocks are not zeroed out by current ZFS and they remain allocated in the physical pool. Now I'm doing essentially this to clean up the parent pool:

# dd if=/dev/zero of=/dcpool/nodedup/bigzerofile

This file is in a non-deduped dataset, so from the point of view of dcpool, it has a growing huge file filled with zeroes - and its referenced blocks overwrite garbage left over from older deleted files no longer referenced by dcpool. However, for the pool this is a write of a compressed zeroed block, which is not going to be referenced, so the pool releases a volume block and its referencing metadata block. This has already released over half a terabyte in my physical pool (compressed blocks filled with zeroes are a special case for ZFS and require no, or fewer than usual, reference metadata blocks) ;)

However, since I have millions of 4kb blocks for volume data and its metadata, I guess fragmentation is quite high, maybe even interlacing one-to-one? One way or another, this dcpool never saw IOs faster than, say, 15Mb/s, and usually lingers in the 1-5Mb/s range, while I can get 30-50Mb/s in the pool easily in other datasets (with dynamic block sizes and lengthier contiguous data stretches). Writes had been relatively quick for the first virtual terabyte or so, but it has been doing the last 100gb for several days now, at several megabytes per minute in the dcpool iostat. There are several Mb/sec of IOs on the hardware disks to back this deletion and clean-up, however (as in my examples in the previous post)...

As for disks with different fill ratios - it is a commonly discussed performance problem. It seems to boil down to this: free space on all disks (actually on top-level VDEVs) is considered for round-robining writes to stripes. Disks that have been in use for a longer time may have very fragmented free space on one hand, and not so much of it on the other, but ZFS is still trying to push bits around evenly. And while it's waiting on some disks, others may be blocked as well. Something like that...

People on this forum have seen and reported that adding a 100Mb file tanked their multi-terabyte pool's performance, and removing the file boosted it back up. I don't want to mix up other writers' findings; better to search the recent 5-10 pages of forum post headings yourself. It's within the last hundred threads or so, I think ;)

-- This message posted from opensolaris.org

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning disk failure detection?
In a recent post r-mexico wrote that they had to parse system messages and manually fail the drives on a similar, though different, occasion: http://opensolaris.org/jive/message.jspa?messageID=515815#515815 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Modify stmf_sbd_lu properties
Don,

> Is it possible to modify the GUID associated with a ZFS volume imported into STMF? To clarify - I have a ZFS volume I have imported into STMF and export via iscsi. I have a number of snapshots of this volume. I need to temporarily go back to an older snapshot without removing all the more recent ones. I can delete the current sbd LU, clone the snapshot I want to test, and then bring that back in to sbd. The problem is that you need to use sbdadm create-lu and that creates a new GUID. (sbdadm import-lu on a clone will give you a metafile error.)

Take a look at the command set associated with stmfadm, and you should see that it has taken on all sbdadm options, and more. I believe you are looking for the functionality associated with stmfadm offline-lu, ... online-lu.

- Jim

> Is it possible to change the GUID of the newly imported volume to match the old volume (even if that means changing the guid of the old volume first)? I had hoped this could be done by dumping the stmf_sdb_lu property from zfs and setting the clone's property to this value - but that does not seem to work. Changing the guid is not an option for these tests. Any ideas?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
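[Editor's note: for reference, the stmfadm subcommands Jim mentions look roughly like this. A sketch only; the GUID shown is a placeholder for the value reported by stmfadm list-lu on the system in question.]

# find the GUID of the LU backing the iSCSI target
stmfadm list-lu -v

# take the LU offline while swapping the backing data, then bring it back online
stmfadm offline-lu 600144F0XXXXXXXXXXXXXXXXXXXXXXXX
stmfadm online-lu 600144F0XXXXXXXXXXXXXXXXXXXXXXXX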
Re: [zfs-discuss] Tuning disk failure detection?
On Tue, May 10, 2011 at 02:42:40PM -0700, Jim Klimov wrote: In a recent post r-mexico wrote that they had to parse system messages and manually fail the drives on a similar, though different, occasion: http://opensolaris.org/jive/message.jspa?messageID=515815#515815 Thanks Jim, good pointer. It sounds like our use of SATA disks is likely the problem and we'd have better error reporting with SAS or some of the nearline SAS drives (SATA drives with a real SAS controller on them). Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance problem suggestions?
It is my understanding that for faster writes you should consider faster HDDs or an SSD for the ZIL, and for faster reads, faster HDDs or an SSD for the L2ARC. There have also been many discussions that for virtualization environments RAID-1 (mirrors) is better than raidz.

On 5/10/2011 3:31 PM, Don wrote:

> I've been going through my iostat, zilstat, and other outputs all to no avail. None of my disks ever seem to show outrageous service times, the load on the box is never high, and if the darned thing is CPU bound- I'm not even sure where to look. When you say you are zeroing deleted blocks- how are you going about doing that? Despite claims to the contrary- I can understand ZFS needing some tuning. What I can't understand are the baffling differences in performance I see. For example- after deleting a large volume- suddenly my performance will skyrocket- then gradually degrade- but the question is why? I'm not running dedup. My disks seem to be largely idle. I have 8 3GHz cores that also seem to be idle. I seem to have enough memory. What is ZFS doing during this time? Everything I've read suggests one of two possible causes- too full, or bad hardware. Is there anything else that might be an issue here? Another ZFS factor I haven't taken into account? Space seems to be the biggest factor in my performance difference- more free space = more performance- but as my fullest disks are less than 70% full, and my emptiest disks are less than 10% full- I can't understand why space is an issue. I have a few hardware errors for one of my pool disks- but we're talking about a very small number of errors over a long period of time. I'm considering replacing this disk but the pool is so slow at times I'm loathe to slow it down further by doing a replace unless I can be more certain that is going to fix the problem.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
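[Editor's note: as a concrete sketch of adding such devices to an existing pool; the pool and device names are placeholders for whatever SSDs are available.]

# dedicated log (slog) device for synchronous writes; a mirrored pair is safer
zpool add tank log mirror c4t0d0 c4t1d0

# L2ARC cache device for reads
zpool add tank cache c4t2d0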
Re: [zfs-discuss] Tuning disk failure detection?
On Tue, May 10, 2011 at 03:57:28PM -0700, Brandon High wrote:
> On Tue, May 10, 2011 at 9:18 AM, Ray Van Dolson rvandol...@esri.com wrote:
>> My question is -- is there a way to tune the MPT driver or even ZFS itself to be more/less aggressive on what it sees as a failure scenario?
>
> You didn't mention what drives you had attached, but I'm guessing they were normal desktop drives. I suspect (but can't confirm) that using enterprise drives with TLER / ERC / CCTL would have reported the failure up the stack faster than a consumer drive. The drives will report an error after 7 seconds rather than retry for several minutes. You may be able to enable the feature on your drives, depending on the manufacturer and firmware revision.
>
> -B

Yup, shoulda included that. These are regular SATA drives -- supposedly "enterprise", whatever that gives us (most likely a higher MTBF number).

We'll probably look at going with nearline SAS drives (only increases cost slightly) and write a small SEC rule on our syslog server to watch for 0x3000 errors on servers with SATA disks, so we can at least be alerted more quickly.

Ray

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
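[Editor's note: short of a full SEC rule, even a trivial shell watcher keyed on the mpt_sas messages from the original post can provide the alert. A rough sketch; the recipient address and match strings are placeholders to adapt to your own logs.]

#!/bin/sh
# alert on mpt_sas IOCLogInfo warnings as they appear in the system log
tail -0f /var/adm/messages | while read line; do
    case "$line" in
        *mpt_sas*IOCLogInfo*|*mptsas_handle_event_sync*)
            echo "$line" | mailx -s "possible disk fault on `hostname`" admin@example.com
            ;;
    esac
done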
Re: [zfs-discuss] Performance problem suggestions?
> # dd if=/dev/zero of=/dcpool/nodedup/bigzerofile

Ah - I misunderstood your pool layout earlier. Now I see what you were doing.

> People on this forum have seen and reported that adding a 100Mb file tanked their multiterabyte pool's performance, and removing the file boosted it back up.

Sadly, I think several of those posts were mine or those of coworkers.

> Disks that have been in use for a longer time may have very fragmented free space on one hand, and not so much of it on another, but ZFS is still trying to push bits around evenly. And while it's waiting on some disks, others may be blocked as well. Something like that...

This could explain why performance would go up after a large delete, but I've not seen large wait times for any of my disks. The service time, percent busy, and every other metric continues to show nearly idle disks. If this is the problem, it would be nice if there were a simple zfs or dtrace query that would show it to you.

-- This message posted from opensolaris.org

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
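[Editor's note: along those lines, a rough per-device I/O latency histogram with the io provider - adapted from the standard DTrace Guide idiom, not specific to ZFS internals - can at least show whether any single vdev's latencies spike during the slow periods.]

dtrace -n '
io:::start { ts[args[0]->b_edev, args[0]->b_blkno] = timestamp; }
io:::done /ts[args[0]->b_edev, args[0]->b_blkno]/ {
    @[args[1]->dev_statname] =
        quantize(timestamp - ts[args[0]->b_edev, args[0]->b_blkno]);
    ts[args[0]->b_edev, args[0]->b_blkno] = 0;
}'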