[zfs-discuss] strange pool disks usage pattern
Hi, I have a PC with a MARVELL AOC-SAT2-MV8 controller and a pool made up of six disks in a raidz configuration with a hot spare. (Output below translated from Italian locale messages.)

-bash-3.2$ /sbin/zpool status
  pool: nas
 state: ONLINE
 scrub: scrub in progress for 9h4m, 81,59% done, 2h2m to go
config:

        NAME        STATE     READ WRITE CKSUM
        nas         ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
        spares
          c2t7d0    AVAIL

errors: no known data errors

Now, the problem is that when running iostat -Cmnx 10 (or with any other time interval), I have sometimes seen a complete stall of disk I/O due to one disk in the pool (not always the same one) being 100% busy.

$ iostat -Cmnx 10
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,0    0,3     0,0     2,0  0,0  0,0    0,0    0,1   0   0 c1
    0,0    0,3     0,0     2,0  0,0  0,0    0,0    0,1   0   0 c1t0d0
 1852,1  297,0 13014,9  4558,4  9,2  1,6    4,3    0,7   2 158 c2
  311,8   61,3  2185,3   750,7  2,0  0,3    5,5    0,7  17  25 c2t0d0
  309,5   34,7  2207,2   769,5  1,6  0,5    4,7    1,4  41  47 c2t1d0
  309,3   36,3  2173,0   770,0  1,0  0,3    2,9    0,7  18  26 c2t2d0
  296,0   65,5  2057,3   749,2  2,1  0,2    5,9    0,6  16  23 c2t3d0
  313,3   64,1  2187,3   748,8  1,7  0,2    4,6    0,5  15  21 c2t4d0
  311,9   35,1  2204,8   770,1  0,7  0,2    2,1    0,5  11  17 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,4   14,7     3,2    30,4  0,0  0,2    0,0   13,2   0   2 c1
    0,4   14,7     3,2    30,4  0,0  0,2    0,0   13,2   0   2 c1t0d0
    1,7    0,0    58,9     0,0  3,0  1,0 1766,4  593,1   2 101 c2
    0,3    0,0     7,7     0,0  0,0  0,0    0,3    0,4   0   0 c2t0d0
    0,3    0,0    11,5     0,0  0,0  0,0    4,4    8,4   0   0 c2t1d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
    0,4    0,0    14,1     0,0  0,0  0,0    0,4    6,6   0   0 c2t3d0
    0,4    0,0    14,1     0,0  0,0  0,0    0,3    2,5   0   0 c2t4d0
    0,3    0,0    11,5     0,0  0,0  0,0    3,6    6,9   0   0 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,0    3,1     0,0     3,1  0,0  0,0    0,0    0,7   0   0 c1
    0,0    3,1     0,0     3,1  0,0  0,0    0,0    0,7   0   0 c1t0d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0   2 100 c2
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t0d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t1d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t3d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t4d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,0    0,1     0,0     0,4  0,0  0,0    0,0    1,2   0   0 c1
    0,0    0,1     0,0     0,4  0,0  0,0    0,0    1,2   0   0 c1t0d0
    0,0   29,5     0,0   320,2  3,4  1,0  113,9   34,6   2 102 c2
    0,0    6,9     0,0    63,3  0,1  0,0   12,6    0,7   0   0 c2t0d0
    0,0    4,4     0,0    65,5  0,0  0,0    8,7    0,8   0   0 c2t1d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
    0,0    7,4     0,0    62,7  0,1  0,0   15,4    0,8   1   1 c2t3d0
    0,0    6,8     0,0    63,6  0,1  0,0   13,2    0,7   0   0 c2t4d0
    0,0    4,0     0,0    65,1  0,0  0,0    7,9    0,7   0   0 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,0    0,3     0,0     2,4  0,0  0,0    0,0    0,1   0   0 c1
    0,0    0,3     0,0     2,4  0,0  0,0    0,0    0,1   0   0 c1t0d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0   2 100 c2
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t0d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t1d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t3d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t4d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w
Re: [zfs-discuss] RAIDZ v. RAIDZ1
Cindy: I believe I was mistaken. When I recreated the zpools, you are correct: zpool list and zfs list report different numbers for the sizes. I must have typed one command and then the other when creating the different pools. Thanks for the assist. Sheepish grin. David -- This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool is very slow
I created a raidz zpool and shares, and now the OS is very slow. I timed it: I get about eight seconds of use before ten seconds of a frozen screen. This happens whether I am doing real work or barely anything (moving the mouse an inch from side to side repeatedly). It makes the machine unusable. If I detach the SATA card that the raidz zpool is attached to, everything is fine. The slowdown occurs regardless of which user I log in as (admin, regular user), and the speed-up occurs only when the SATA card is removed. This leads me to believe that something is going on with the zpool. There are no files on the zpool (I don't have the patience, given the constant freezing, to copy files over to it). The zpool is 4TB in size. I previously had the system up and running for a week before I did something stupid and decided to start from scratch, reinstalling and recreating the zpool. zpool status shows no errors with the zpool, and zpool iostat reports:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
mediaz       470K  5.44T      0      0     37     44

How do I find what is accessing the zpool and stop it? David
Re: [zfs-discuss] zpool is very slow
David, maybe you can use the iosnoop script from the DTrace toolkit: http://www.solarisinternals.com/wiki/index.php/DTraceToolkit#Scripts ..Remco

David Stewart wrote: I created a raidz zpool and shares and now the OS is very slow. ... How do I find what is accessing the zpool and stop it? David
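A session with the toolkit might look like the following transcript; the install path is an assumption, and iosnoop must run as root since it uses DTrace:

```
cd /opt/DTT        # wherever the DTraceToolkit was unpacked (assumed path)
./iosnoop -e       # trace each disk I/O as it happens; -e adds a
                   # device-name column, so I/O landing on the pool's
                   # disks (and the process issuing it) stands out
```

Once the offending process is visible, it can be stopped or investigated directly.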
Re: [zfs-discuss] strange pool disks usage pattern
Maurilio Longo wrote: Hi, I have a PC with a MARVELL AOC-SAT2-MV8 controller and a pool made up of six disks in a raidz configuration with a hot spare. ... Now, the problem is that when running iostat -Cmnx 10 (or with any other time interval), I have sometimes seen a complete stall of disk I/O due to one disk in the pool (not always the same one) being 100% busy. ... In this case it was c2t2d0, and it blocked the pool for 30 or 40 seconds. /var/adm/messages does not contain anything related to the pool. What can it be?

This usually means you have either a driver bug, a bad controller, or a bad disk. The Marvell driver bug sometimes manifested in this way, but you would have seen bus resets in your error logs. Given that you have exactly one outstanding transaction on the stuck disk, I suspect the disk is busy doing error recovery. Speaking from my recent, extremely painful experience: replace that disk ASAP. -- Carson
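Spotting the stuck disk in a long iostat listing is easier with a filter. Here is a small sketch (the 95% threshold and the function name are my own choices, not from the thread) that prints any device whose %b column is at or above the threshold in `iostat -xn`-style output; the gsub handles the decimal commas of the Italian locale shown above:

```shell
# Print devices pegged at/above 95% busy from iostat -xn style lines.
# Data lines have 11 fields; %b is the second-to-last, device name last.
busy() {
    awk '$NF != "device" && NF == 11 {
        pct = $(NF-1); gsub(",", ".", pct)   # it_IT locale uses commas
        if (pct + 0 >= 95) print $NF
    }'
}

# Two sample lines from the output above: a healthy disk and the stuck one.
printf '%s\n' \
  '    0,3    0,0   11,5    0,0  0,0  0,0    4,4    8,4   0   0 c2t1d0' \
  '    0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0' | busy
# prints: c2t2d0
```

In practice one would pipe `iostat -xn 10` straight into the filter.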
Re: [zfs-discuss] strange pool disks usage pattern
Carson, the strange thing is that this is happening on several disks (can it be that they are all failing?). What is the controller bug you're talking about? I'm running snv_114 on this PC, so it is fairly recent. Best regards. Maurilio.
Re: [zfs-discuss] strange pool disks usage pattern
Maurilio Longo wrote: the strange thing is that this is happening on several disks (can it be that they are all failing?)

Possible, but less likely. I'd suggest running some disk I/O tests and looking at the drive error counters before and after.

What is the controller bug you're talking about? I'm running snv_114 on this pc, so it is fairly recent.

There was a bug in the marvell driver for the controller used on the X4500 that caused bus hangs/resets. It was fixed around U6, so it should be long gone from OpenSolaris. But perhaps there's a different bug? You could also have a firmware bug on your disks. You might try lowering the number of tagged commands per disk and see if that helps at all. -- Carson
Re: [zfs-discuss] strange pool disks usage pattern
Possible, but less likely. I'd suggest running some disk I/O tests, looking at the drive error counters before/after.

These disks are only a few months old and are scrubbed weekly; no errors so far. I did try to use smartmontools, but it can neither report SMART logs nor start SMART tests, so I don't know how to look at their internal state.

You could also have a firmware bug on your disks. You might try lowering the number of tagged commands per disk and see if that helps at all.

From man marvell88sx I read that this driver has no tunable parameters, so I don't know how I could change the NCQ depth. Best regards. Maurilio.
Re: [zfs-discuss] strange pool disks usage pattern
Maurilio Longo wrote: I did try to use smartmontools, but it can neither report SMART logs nor start SMART tests, so I don't know how to look at their internal state.

Really? That's odd...

You could also have a firmware bug on your disks. You might try lowering the number of tagged commands per disk and see if that helps at all.

From man marvell88sx I read that this driver has no tunable parameters, so I don't know how I could change the NCQ depth.

ZFS has a per-block-device outstanding-I/O tunable - I think it's in the Evil Tuning Guide. -- Carson
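For reference, the tunable Carson seems to mean is zfs_vdev_max_pending from the ZFS Evil Tuning Guide, which caps the number of I/Os ZFS keeps queued per leaf vdev. A hedged /etc/system sketch follows; the value 10 is only an illustration (the default in builds of this era was 35), and a reboot is required for it to take effect:

```
* /etc/system fragment (sketch): limit per-vdev queued I/Os so a disk
* stuck in error recovery holds fewer commands hostage. The value 10
* is illustrative only; the era's default was 35.
set zfs:zfs_vdev_max_pending = 10
```

Lowering this trades some throughput on healthy disks for shorter queues behind a misbehaving one.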
Re: [zfs-discuss] strange pool disks usage pattern
Maurilio Longo wrote: Carson, the strange thing is that this is happening on several disks (can it be that they are all failing?) ...

See 'iostat -En' output.
Re: [zfs-discuss] ZFS caching of compressed data
Stuart Anderson wrote: I am wondering if the following idea makes any sense as a way to get ZFS to cache compressed data in DRAM. In particular, given a 2-way zvol mirror of highly compressible data on persistent storage devices, what would go wrong if I dynamically added a ramdisk as a third mirror device at boot time? Would ZFS route most (or all) of the reads to the lower-latency DRAM device? In the case of an unclean shutdown, where there was no opportunity to actively remove the ramdisk from the pool before shutdown, would there be any problem at boot time when the ramdisk is still registered but unavailable? Note, this Gedankenexperiment is for highly compressible (~9x) metadata for a non-ZFS filesystem.

You would only get about 33% of I/Os served from the ramdisk. However, at the KCA conference Bill and Jeff mentioned just-in-time decompression/decryption planned for ZFS. If I understand it correctly, some percentage of pages in the ARC will be kept compressed/encrypted and will be decompressed/decrypted only when accessed. This could be especially useful with prefetch. I would imagine that one will be able to tune what percentage of the ARC should keep compressed pages. I don't remember whether they mentioned the L2ARC here, but it would probably be useful to have a tunable which puts compressed or uncompressed data onto the L2ARC depending on its value. Which approach is better always depends on a given environment and on where the actual bottleneck is. -- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] Unable to import pool: invalid vdev configuration
Osvald Ivarsson wrote: On Thu, Oct 1, 2009 at 7:40 PM, Victor Latushkin victor.latush...@sun.com wrote: On 01.10.09 17:54, Osvald Ivarsson wrote: I'm running OpenSolaris build svn_101b. I have 3 SATA disks connected to my motherboard. The raid, a raidz, which is called rescamp, had worked well before a power failure yesterday. I'm now unable to import the pool. I can't export the raid, since it isn't imported.

# zpool import rescamp
cannot import 'rescamp': invalid vdev configuration

# zpool import
  pool: rescamp
    id: 12297694211509104163
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

        rescamp     UNAVAIL  insufficient replicas
          raidz1    UNAVAIL  corrupted data
            c15d0   ONLINE
            c14d0   ONLINE
            c14d1   ONLINE

I've tried using zdb -l on all three disks, but in all cases it fails to unpack the labels.

# zdb -l /dev/dsk/c14d0
LABEL 0
failed to unpack label 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3

If I run # zdb -l /dev/dsk/c14d0s0 I do find 4 labels, but c14d0, c14d1 and c15d0 is what I created the raid with. I do find labels this way for all three disks. Is this of any help?

# zdb -l /dev/dsk/c14d1s0
LABEL 0
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
    top_guid=9479723326726871122
    guid=17774184411399278071
    vdev_tree
        type='raidz'
        id=0
        guid=9479723326726871122
        nparity=1
        metaslab_array=23
        metaslab_shift=34
        ashift=9
        asize=3000574672896
        is_log=0
        children[0]
            type='disk'
            id=0
            guid=9020535344824299914
            path='/dev/dsk/c15d0s0'
            devid='id1,c...@ast31000333as=9te0dglf/a'
            phys_path='/p...@0,0/pci-...@11/i...@1/c...@0,0:a'
            whole_disk=1
            DTL=102
        children[1]
            type='disk'
            id=1
            guid=14384361563876398475
            path='/dev/dsk/c14d0s0'
            devid='id1,c...@asamsung_hd103uj=s13pjdws690618/a'
            phys_path='/p...@0,0/pci-...@11/i...@0/c...@0,0:a'
            whole_disk=1
            DTL=216
        children[2]
            type='disk'
            id=2
            guid=17774184411399278071
            path='/dev/dsk/c14d1s0'
            devid='id1,c...@ast31000333as=9te0de8w/a'
            phys_path='/p...@0,0/pci-...@11/i...@0/c...@1,0:a'
            whole_disk=1
            DTL=100
LABEL 1
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
    top_guid=9479723326726871122
    guid=17774184411399278071
    vdev_tree
        type='raidz'
        id=0
        guid=9479723326726871122
        nparity=1
        metaslab_array=23
        metaslab_shift=34
        ashift=9
        asize=3000574672896
        is_log=0
        children[0]
            type='disk'
            id=0
            guid=9020535344824299914
            path='/dev/dsk/c15d0s0'
            devid='id1,c...@ast31000333as=9te0dglf/a'
            phys_path='/p...@0,0/pci-...@11/i...@1/c...@0,0:a'
            whole_disk=1
            DTL=102
        children[1]
            type='disk'
            id=1
            guid=14384361563876398475
            path='/dev/dsk/c14d0s0'
            devid='id1,c...@asamsung_hd103uj=s13pjdws690618/a'
            phys_path='/p...@0,0/pci-...@11/i...@0/c...@0,0:a'
            whole_disk=1
            DTL=216
        children[2]
            type='disk'
            id=2
            guid=17774184411399278071
            path='/dev/dsk/c14d1s0'
            devid='id1,c...@ast31000333as=9te0de8w/a'
            phys_path='/p...@0,0/pci-...@11/i...@0/c...@1,0:a'
            whole_disk=1
            DTL=100
LABEL 2
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
    top_guid=9479723326726871122
    guid=17774184411399278071
    vdev_tree
        type='raidz'
        id=0
        guid=9479723326726871122
Re: [zfs-discuss] cachefile for snail zpool import mystery?
Max Holm wrote: Hi, we are seeing more long delays in zpool import - say 4-5 or even 25-30 minutes - especially when backup jobs are running in the FC SAN where the LUNs reside (no iSCSI LUNs yet). On the same node, for LUNs of the same array, some pools take a few seconds but others take minutes; the pattern seems random to me so far. We first noticed it soon after being upgraded to Solaris 10 U6 (10/08, on SPARC: M4000, Vx90, using some IBM and Sun arrays). I'd appreciate it if someone could comment on this. Thanks. We have a few VCS clusters, each with a set of service groups that import/export some zpools at the proper events on the proper node (with the '-R /' option). To fix the long delays, it seems I can use 'zpool set cachefile=/x/... ...' for each pool, deploy all cachefiles to every node of a cluster in a persistent location /y/, and then have the agent's online script do 'zpool import -c /y/...' if /y/... exists. Any better fix? 1. Why would it ever take so long (20-30 minutes!) to import a pool? I/O on the FC SAN seemed just fine, and there were no error messages either. Is it a problem in other stacks, or because I deleted some LUNs on the array without removing them from the device trees?

This is probably your problem. Try devfsadm -vC.

2. We now have the burden of maintaining these cachefiles whenever we change a zpool, say add/drop a LUN. Any advice? It'd be nice if ZFS kept a cache file (other than /etc/zfs/zpool.cache) for pools imported under an altroot, made it persistent, and verified/updated its entries at the proper events.

IIRC, when you change a pool config, its cache file is automatically updated on the same node.

At the least, I wish ZFS allowed us to create the cachefiles while the pools are not currently imported, so that a simple daily job could maintain the cache files on every node of a cluster automatically.
What you can do is put a script in a crontab which checks whether a pool is currently imported on this node and, if it is, copies the pool's cache file to the other nodes. BTW: IIRC, the Sun Cluster HAS+ agent will automatically make use of cache files. -- Robert Milkowski http://milek.blogspot.com
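Robert's crontab idea might look something like the sketch below. The pool name, cachefile path, and peer node names are all assumptions, not from the thread, and the script is only an outline:

```
#!/bin/sh
# Sketch: if the pool is imported on this node, push its cachefile
# to the other cluster nodes. POOL, CACHE and PEERS are placeholders.
POOL=nas01
CACHE=/var/cluster/zfs/${POOL}.cache
PEERS="node2 node3"

if zpool list -H -o name "$POOL" >/dev/null 2>&1; then
    for peer in $PEERS; do
        scp -q "$CACHE" "${peer}:${CACHE}" || echo "copy to $peer failed" >&2
    done
fi
```

Run from cron on every node; only the node currently holding the pool will push its copy.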
Re: [zfs-discuss] Can't rm file when No space left on device...
Chris Ridd wrote: On 1 Oct 2009, at 19:34, Andrew Gabriel wrote: Pick a file which isn't in a snapshot (either because it was created since the most recent snapshot, or because it has been rewritten since the most recent snapshot, so it no longer shares blocks with the snapshot version). Out of curiosity, is there an easy way to find such a file?

Find files with a modification or creation time later than the last snapshot's creation. Files modified after the snapshot may still have most of their blocks referenced by it, though. -- Robert Milkowski http://milek.blogspot.com
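That check can be sketched with find -newer, using a file carrying the snapshot's creation time as the reference. The demo below is self-contained: a temporary directory and a fake snapshot-time marker stand in for the real filesystem and its .zfs/snapshot/<name> directory (those paths, and the timestamps, are assumptions for illustration):

```shell
# Demo: list files modified after a snapshot-time reference.
# In real use the reference would be e.g. /tank/fs/.zfs/snapshot/lastsnap
tmp=$(mktemp -d)
touch -t 200909010000 "$tmp/snap-reference"   # stand-in for snapshot time
touch -t 200908310000 "$tmp/old-file"         # modified before the snapshot
touch -t 200909020000 "$tmp/new-file"         # modified after the snapshot

find "$tmp" -type f -newer "$tmp/snap-reference"   # lists only new-file
```

Files that the listing reports are candidates for freeing space with rm, since their current blocks are not (fully) pinned by the snapshot.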
Re: [zfs-discuss] Unable to import pool: invalid vdev configuration
Osvald Ivarsson wrote: On Fri, Oct 2, 2009 at 2:36 PM, Victor Latushkin victor.latush...@sun.com wrote: Osvald Ivarsson wrote: On Thu, Oct 1, 2009 at 7:40 PM, Victor Latushkin victor.latush...@sun.com wrote: On 01.10.09 17:54, Osvald Ivarsson wrote: I'm running OpenSolaris build svn_101b. I have 3 SATA disks connected to my motherboard. The raid, a raidz, which is called rescamp, had worked well before a power failure yesterday. I'm now unable to import the pool. ... Is this of any help? ...
Re: [zfs-discuss] strange pool disks usage pattern
Milek, here it is:

# iostat -En
c1t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST3808110AS      Revision: D    Serial No:
Size: 80,03GB <80026361856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 91 Predictive Failure Analysis: 0
c2t0d0   Soft Errors: 0 Hard Errors: 11 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 101 Predictive Failure Analysis: 0
c2t1d0   Soft Errors: 0 Hard Errors: 4 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 96 Predictive Failure Analysis: 0
c2t2d0   Soft Errors: 0 Hard Errors: 69 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 105 Predictive Failure Analysis: 0
c2t3d0   Soft Errors: 0 Hard Errors: 5 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 96 Predictive Failure Analysis: 0
c2t4d0   Soft Errors: 0 Hard Errors: 90 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 96 Predictive Failure Analysis: 0
c2t5d0   Soft Errors: 0 Hard Errors: 30 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 96 Predictive Failure Analysis: 0
c2t7d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 94 Predictive Failure Analysis: 0
#

What are hard errors? Maurilio.
Re: [zfs-discuss] strange pool disks usage pattern
Erratum: they're ST31000333AS, not 340AS. Maurilio.
Re: [zfs-discuss] Best way to convert checksums
Data security. I migrated my organization from Linux to Solaris, driven away from Linux by the shortfalls of fsck on TB-sized filesystems, and toward Solaris by the features of ZFS. At the time I tried to dig up information on the tradeoffs between fletcher2 vs. fletcher4 vs. SHA-256 and found nothing. Studying the algorithms, I decided that fletcher2 would tend to be weak for periodic data, which characterizes my data. I ran throughput tests and got 67MB/sec for fletcher2 and fletcher4, and 48MB/sec for SHA-256. I projected (perhaps without basis) SHA-256's cryptographic strength to also mean strength as a hash, and chose it, since 48MB/sec is more than I need. 21 months later (9/15/09) I lost everything to corrupt metadata (not sure where this was printed): ZFS-8000-CS. No clue why to date; I will never know. The person who restored from tape was not informed to set checksum=sha256, so it all went in with the default, fletcher2. Before taking rather disruptive actions to correct this, I decided to question my original decision and found schlie's post stating that a bug in fletcher2 makes it essentially a one-bit parity on the entire block: http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30 While this is twice as good as any other filesystem in the world, which has NO such checksum, it does not provide the security I migrated for. Especially given that I do not know what caused the original data loss, it is all I have to lean on. Convinced that I need to convert all of the checksums to sha256 to get the data security ZFS purports to deliver, and in the absence of a checksum-conversion capability, I need to copy the data.
It appears that all of the implementations of the various means of copying data - tar, cpio, cp, rsync, pax - have ghosts in their closets, each living in a glass house and throwing stones at the others over various issues with file sizes, filename lengths, pathname lengths, ACLs, extended attributes, sparse files, etc. It seems like zfs send/receive *should* be safe from all such issues, being part of the ZFS family, but the questions raised here become ambiguous once one starts to think about it. If the filesystem were faithfully duplicated, it would also duplicate all properties, including the checksum used on each block. It appears (to my advantage) that this is not what is done. This enables the filesystem spontaneously created by zfs receive to inherit from the pool, which evidently can be set to sha256 even though it is a pool, not a filesystem in the pool. The present question is the protection on the base pool. This can be set when the pool is created, though not with U4, which I am running. It is not clear (yet) whether this is simply not documented in the current release, or whether the version that supports it has not been released yet. If I were to upgrade (which I cannot do in a timely fashion), it would only be to U7. I cannot run a weekly-build type of OS on my production server. Any way it goes, I am hosed. In short, there is surely some structure - some blocks with stuff written in them - when a pool is created but before anything else is done, or else it would be a blank disk, not a ZFS pool. Are these protected by fletcher2 as the default? I have learned that the uberblock is protected by SHA-256 and other parts by fletcher4. Is this everything? In U4 was it fletcher4, or was this a recent change stemming from Schlie's report? In short, what is the situation with regard to the data security I switched to Solaris/ZFS for, and what can I do to achieve it? What *do* the tools do?
Are there tools for what needs to be done - to convert things, to copy things, to verify things - and to do so completely and correctly? So here is where I am: I should use zfs send/receive, but I cannot be confident that there are no fletcher2-protected blocks (one-bit parity) at the most fundamental levels of the zpool. To verify the data, I cannot depend on existing tools, since diff is not large-file aware. My best idea at this point is to calculate and compare MD5 sums of every file and spot-check other properties as best I can. Given this rather full perspective, help or comments are very much appreciated. I still think ZFS is the way to go, but the road is a little bumpy at the moment.
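The MD5 pass described above can be sketched with standard tools: hash every file under each tree, then diff the sorted digest lists. The demo below builds two small stand-in trees (with one deliberately corrupted file) in place of the real original and restored filesystems; GNU md5sum is assumed, and on Solaris `digest -a md5` would be substituted:

```shell
# Sketch of the per-file MD5 verification pass. The mktemp trees are
# stand-ins for the original tree and the restored copy.
old=$(mktemp -d); new=$(mktemp -d)
printf 'hello' > "$old/a.txt"; printf 'hello' > "$new/a.txt"   # intact copy
printf 'AAAA'  > "$old/b.txt"; printf 'BBBB'  > "$new/b.txt"   # corrupted copy

# Hash every regular file, sort by path so the lists line up.
( cd "$old" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$old.sums"
( cd "$new" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$new.sums"

# Any differing or missing file shows up in the diff.
diff "$old.sums" "$new.sums" || echo "trees differ"
```

For millions of files this is slow but mechanical, and it sidesteps diff's large-file limits since only digests are compared.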
Re: [zfs-discuss] Best way to convert checksums
Apologies that the preceding post appears out of context. I expected it to indent when I pushed the reply button on myxiplx's Oct 1, 2009 1:47 post; it was in response to his question. I will try to remember to provide links internal to my messages.
Re: [zfs-discuss] Best way to convert checksums
On 02 October, 2009 - Ray Clark sent me these 4,4K bytes: Data security. I migrated my organization from Linux to Solaris, driven away from Linux by the shortfalls of fsck on TB-sized filesystems, and toward Solaris by the features of ZFS. [...] Before taking rather disruptive actions to correct this, I decided to question my original decision and found schlie's post stating that a bug in fletcher2 makes it essentially a one-bit parity on the entire block: http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30 While this is twice as good as any other filesystem in the world, which has NO such checksum, it does not provide the security I migrated for. [...]

That post refers to bug 6740597 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6740597 which also refers to http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=2178540 So it seems it's fixed in snv_114 and s10u8, which won't help your s10u4 unless you update. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] Best way to convert checksums
Replying to Cindys' Oct 1, 2009 3:34 PM post: Thank you. The second part was my attempt to guess my way out of this. If the fundamental structure of the pool (that which was created before I set the checksum=sha256 property) is using fletcher2, then perhaps, as I use the pool, all of this structure will be rewritten and therefore automatically migrate to the new checksum. It would be very difficult for me to recreate the pool, but I have space to duplicate the user files (and so get the new checksum). Perhaps this will also result in the underlying structure of the pool being converted in the course of normal use. Comments for or against?
Re: [zfs-discuss] Best way to convert checksums
Replying to relling's October 1, 2009 3:34 post: Richard, regarding "when a pool is created, there is only metadata, which uses fletcher4": was this true in U4, or is this a new change of default, with U4 using fletcher2? Similarly, did the uberblock use sha256 in U4? I am running U4. --Ray
Re: [zfs-discuss] Unable to import pool: invalid vdev configuration
On Thu, Oct 1, 2009 at 7:40 PM, Victor Latushkin victor.latush...@sun.com wrote: On 01.10.09 17:54, Osvald Ivarsson wrote:

I'm running OpenSolaris build svn_101b. I have 3 SATA disks connected to my motherboard. The raid, a raidz, which is called rescamp, had worked well until a power failure yesterday. I'm now unable to import the pool. I can't export the raid, since it isn't imported.

# zpool import rescamp
cannot import 'rescamp': invalid vdev configuration

# zpool import
  pool: rescamp
    id: 12297694211509104163
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

        rescamp     UNAVAIL  insufficient replicas
          raidz1    UNAVAIL  corrupted data
            c15d0   ONLINE
            c14d0   ONLINE
            c14d1   ONLINE

I've tried using zdb -l on all three disks, but in all cases it fails to unpack the labels.

# zdb -l /dev/dsk/c14d0
--------------------------------------------
LABEL 0
--------------------------------------------
failed to unpack label 0
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3

If I run # zdb -l /dev/dsk/c14d0s0 I do find 4 labels, but c14d0, c14d1 and c15d0 is what I created the raid with. I do find labels this way for all three disks. Is this of any help?
# zdb -l /dev/dsk/c14d1s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
    top_guid=9479723326726871122
    guid=17774184411399278071
    vdev_tree
        type='raidz'
        id=0
        guid=9479723326726871122
        nparity=1
        metaslab_array=23
        metaslab_shift=34
        ashift=9
        asize=3000574672896
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=9020535344824299914
                path='/dev/dsk/c15d0s0'
                devid='id1,c...@ast31000333as=9te0dglf/a'
                phys_path='/p...@0,0/pci-...@11/i...@1/c...@0,0:a'
                whole_disk=1
                DTL=102
        children[1]
                type='disk'
                id=1
                guid=14384361563876398475
                path='/dev/dsk/c14d0s0'
                devid='id1,c...@asamsung_hd103uj=s13pjdws690618/a'
                phys_path='/p...@0,0/pci-...@11/i...@0/c...@0,0:a'
                whole_disk=1
                DTL=216
        children[2]
                type='disk'
                id=2
                guid=17774184411399278071
                path='/dev/dsk/c14d1s0'
                devid='id1,c...@ast31000333as=9te0de8w/a'
                phys_path='/p...@0,0/pci-...@11/i...@0/c...@1,0:a'
                whole_disk=1
                DTL=100
--------------------------------------------
LABEL 1
--------------------------------------------
    [same as LABEL 0]
--------------------------------------------
LABEL 2
--------------------------------------------
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
Re: [zfs-discuss] Can't rm file when No space left on device...
It seems like the appropriate solution would be to have a tool that allows removing a file from one or more snapshots at the same time as removing the source ... That would make them not really snapshots. And such a tool would have to fix clones too. While I concur that being able to remove files from snapshots is somewhat against the concept behind snapshots, I feel that there is a tradeoff here for the administrator: Let's say we accidentally snapshotted a very large temporary file. We don't need the file and we don't need its snapshot. Yet the only way to free the space taken up by this accidentally snapshotted file is to delete the WHOLE snapshot, including all the files of which snapshots may be required. To paraphrase: that would make this snapshot not really a snapshot ANYMORE. At this point having a separate tool that allows you to do spring cleaning and deleting files from snapshots would quite possibly be more in the spirit of snapshotting than having to delete snapshots. Just my $.02, Rudolf
[zfs-discuss] Find out file changes by comparing snapshots?
Hi, Is there a way or script that helps to find out what files have changed by comparing two snapshots? Thanks, Simon
Re: [zfs-discuss] Best way to convert checksums
Interesting answer, thanks :) I'd like to dig a little deeper if you don't mind, just to further my own understanding (which is usually rudimentary compared to a lot of the guys on here). My belief is that ZFS stores two copies of the metadata for any block, so corrupt metadata really shouldn't happen often. Could I ask what the structure of your pool is, and what level of redundancy you have there? The very fact that you had a 'corrupt metadata' error implies to me that the checksums have done their job in finding an error, and I'm wondering if the true cause could be further down the line. I'm still taking all this in though - we'll be using sha256 on our secondary system, just in case :)
Re: [zfs-discuss] ZFS caching of compressed data
On Oct 2, 2009, at 5:05 AM, Robert Milkowski wrote:

Stuart Anderson wrote: I am wondering if the following idea makes any sense as a way to get ZFS to cache compressed data in DRAM? In particular, given a 2-way zvol mirror of highly compressible data on persistent storage devices, what would go wrong if I dynamically added a ramdisk as a 3rd mirror device at boot time? Would ZFS route most (or all) of the reads to the lower latency DRAM device? In the case of an un-clean shutdown where there was no opportunity to actively remove the ramdisk from the pool before shutdown, would there be any problem at boot time when the ramdisk is still registered but unavailable? Note, this Gedanken experiment is for highly compressible (~9x) metadata for a non-ZFS filesystem.

You would only get about 33% of IOs served from the ram-disk.

With SVM you are allowed to specify a read policy on sub-mirrors for just this reason, e.g., http://wikis.sun.com/display/BigAdmin/Using+a+SVM+submirror+on+a+ramdisk+to+increase+read+performance Is there no equivalent in ZFS?

However, at the KCA conference Bill and Jeff mentioned just-in-time decompression/decryption planned for ZFS. If I understand it correctly, some % of pages in the ARC will be kept compressed/encrypted and will be decompressed/decrypted only if accessed. This could be especially useful with prefetch.

I thought the optimization being discussed there was simply to avoid decompressing/decrypting unused data. I missed the part about keeping compressed data around in the ARC.

Now I would imagine that one will be able to tune what percentage of the ARC should keep compressed pages.

That would be nice.

Now I don't remember if they mentioned L2ARC here, but it would probably be useful to have a tunable which would put compressed or uncompressed data onto the L2ARC depending on its value. Which approach is better would always depend on a given environment and on where the actual bottleneck is.
I agree something like this would be preferable to the SVM ramdisk solution. Thanks. -- Stuart Anderson ander...@ligo.caltech.edu http://www.ligo.caltech.edu/~anderson
Re: [zfs-discuss] Find out file changes by comparing snapshots?
Simon Gao wrote: Hi, Is there a way or script that helps to find out what files have changed by comparing two snapshots? http://blogs.sun.com/chrisg/entry/zfs_versions_of_a_file is something along those lines, but since the snapshots are visible under .zfs/snapshot/snapshot_name/ as filesystems, you could just use basic UNIX tools like find/diff etc. -- Darren J Moffat
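Darren's find/diff suggestion can be sketched as a small script. The snapshot directories default to empty temp dirs here so the sketch is runnable anywhere; on a real pool they would be paths like /storagepool/.zfs/snapshot/monday (snapshot names hypothetical):

```shell
#!/bin/sh
# Diff two snapshot trees of the same filesystem. OLD/NEW default to
# empty temp dirs; on a real pool, point them at the two snapshots,
# e.g. OLD=/storagepool/.zfs/snapshot/monday (hypothetical name).
OLD=${OLD:-$(mktemp -d)}
NEW=${NEW:-$(mktemp -d)}

# -r walks both trees; -q prints one line per file that differs or
# exists on only one side, instead of full content diffs.
diff -rq "$OLD" "$NEW"
```

This walk reads both snapshot trees, so it is slow on large filesystems, but it needs nothing beyond POSIX tools; a dedicated snapshot-diffing command only appeared in later ZFS releases.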
[zfs-discuss] Performance issue on a zpool
I have an HP DL380G4 w/ 3 GB of RAM and a slow MSA15 (SATA discs to a single u320 interface). I was using this with 10u7 as an SMB-over-ZFS file server for a few clients with mild needs. I never benchmarked it, as these unattended workstations just wrote a slow, steady stream of data and had no issues. I now need to use it for backup storage and noticed how absolutely bad the performance is. Just as a non-technical way to grasp how bad: a dd of 4 GB from /dev/zero on the root rpool, a mirrored u320 pair of 72 GB discs on the Smart Array 6i, takes ~1.5 minutes. The same dd on a raidz2 of 6 discs, all exported as simple volumes on a Smart Array 6400 controller, takes almost 20 minutes. With Windows installed on this aging server, the times are nearly identical. The data I intend to write out to this machine will be Bacula disk volumes, so it needs to sustain large amounts of streamed data; I don't think even a ZIL will help, as the files will be 300-400 GB :) What's my next best option? Thanks! jlc
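For reference, the rough throughput check described above looks like the following. This is a scaled-down, hedged sketch: the target path and sizes are examples only, /dev/zero data is maximally compressible, and a single dd stream is only a crude proxy for real backup traffic:

```shell
#!/bin/sh
# Crude sequential-write check of the kind used in the post. TARGET
# defaults to a temp file so this is runnable anywhere; for the real
# test, point it at a file on the pool in question and raise COUNT
# (4096 x 1 MB blocks is roughly the 4 GB test described above).
# Prefix the dd with time(1) to get elapsed-time numbers.
TARGET=${TARGET:-$(mktemp)}
COUNT=${COUNT:-64}                  # 64 x 1 MB = 64 MB for the sketch

dd if=/dev/zero of="$TARGET" bs=1048576 count="$COUNT"
sync
ls -l "$TARGET"
```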
Re: [zfs-discuss] Best way to convert checksums
My pool was created with the default checksum (fletcher2). The default has two copies of all metadata (as I understand it), and one copy of user data. It was a raidz2 with eight 750GB drives, yielding just over 4TB of usable space. I am not happy with the situation, but I recognize that I am 2x better off (1-bit parity) than I would be with any other file system.
Re: [zfs-discuss] strange pool disks usage pattern
For the archives... On Oct 2, 2009, at 12:41 AM, Maurilio Longo wrote:

[...]

                    extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0,4   14,7    3,2   30,4  0,0  0,2    0,0   13,2   0   2 c1
   0,4   14,7    3,2   30,4  0,0  0,2    0,0   13,2   0   2 c1t0d0
   1,7    0,0   58,9    0,0  3,0  1,0 1766,4  593,1   2 101 c2
   0,3    0,0    7,7    0,0  0,0  0,0    0,3    0,4   0   0 c2t0d0
   0,3    0,0   11,5    0,0  0,0  0,0    4,4    8,4   0   0 c2t1d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0

This is a symptom of an I/O getting dropped in the data path. You can clearly see 1 IOP in the actv queue (which is the queue between the interface card and the target). The %busy is calculated by counting the percentage of time that at least one IOP is in the actv queue.
The higher level device drivers have timeouts and will try to reset and re-issue IOPs as needed. -- richard

   0,4    0,0   14,1    0,0  0,0  0,0    0,4    6,6   0   0 c2t3d0
   0,4    0,0   14,1    0,0  0,0  0,0    0,3    2,5   0   0 c2t4d0
   0,3    0,0   11,5    0,0  0,0  0,0    3,6    6,9   0   0 c2t5d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0,0    3,1    0,0    3,1  0,0  0,0    0,0    0,7   0   0 c1
   0,0    3,1    0,0    3,1  0,0  0,0    0,0    0,7   0   0 c1t0d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0   2 100 c2
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t0d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t1d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t3d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t4d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t5d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0,0    0,1    0,0    0,4  0,0  0,0    0,0    1,2   0   0 c1
   0,0    0,1    0,0    0,4  0,0  0,0    0,0    1,2   0   0 c1t0d0
   0,0   29,5    0,0  320,2  3,4  1,0  113,9   34,6   2 102 c2
   0,0    6,9    0,0   63,3  0,1  0,0   12,6    0,7   0   0 c2t0d0
   0,0    4,4    0,0   65,5  0,0  0,0    8,7    0,8   0   0 c2t1d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
   0,0    7,4    0,0   62,7  0,1  0,0   15,4    0,8   1   1 c2t3d0
   0,0    6,8    0,0   63,6  0,1  0,0   13,2    0,7   0   0 c2t4d0
   0,0    4,0    0,0   65,1  0,0  0,0    7,9    0,7   0   0 c2t5d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0,0    0,3    0,0    2,4  0,0  0,0    0,0    0,1   0   0 c1
   0,0    0,3    0,0    2,4  0,0  0,0    0,0    0,1   0   0 c1t0d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0   2 100 c2
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t0d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t1d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0
Re: [zfs-discuss] Best way to convert checksums
webcl...@rochester.rr.com said: To verify data, I cannot depend on existing tools, since diff is not large-file aware. My best idea at this point is to calculate and compare MD5 sums of every file and spot-check other properties as best I can.

Ray, I recommend that you use rsync's -c option to compare copies. It reads all the source files, computes a checksum for each, then does the same for the destination and compares checksums. As far as I know, the only thing rsync can't handle in your situation is the ZFS/NFSv4 ACLs. I've used it to migrate many TBs of data. Regards, Marion
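Ray's plan of comparing per-file digests can be sketched with POSIX tools. This sketch uses cksum (a CRC, not MD5) purely so it runs anywhere; on Solaris, `digest -a md5`, or Marion's `rsync -c`, would do the real job. SRC and DST default to empty temp dirs and stand in for the two copies of the data:

```shell
#!/bin/sh
# Build a sorted per-file checksum manifest for each tree and compare.
# SRC/DST default to empty temp dirs; point them at the original and
# the duplicated copy of the data.
SRC=${SRC:-$(mktemp -d)}
DST=${DST:-$(mktemp -d)}

manifest() {
    # CRC, size and relative path of every regular file, sorted by path.
    ( cd "$1" && find . -type f -exec cksum {} + < /dev/null | sort -k3 )
}

if [ "$(manifest "$SRC")" = "$(manifest "$DST")" ]; then
    echo "trees match"
else
    echo "trees differ"
fi
```

Note that, like rsync -c, this verifies file content only; ownership, timestamps, and ACLs still need separate spot checks.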
Re: [zfs-discuss] Replacing a failed drive
Does the same thing apply for a failing drive? I have a drive that has not failed, but by all indications it's about to. Can I do the same thing here? -dan

Jeff Bonwick wrote: Yep, you got it. Jeff

On Fri, Jun 19, 2009 at 04:15:41PM -0700, Simon Breden wrote: Hi, I have a ZFS storage pool consisting of a single RAIDZ2 vdev of 6 drives, and I have a question about replacing a failed drive, should it occur in future. If a drive fails in this double-parity vdev, then am I correct in saying that I would need to (1) unplug the old drive once I've identified the drive id (c1t0d0 etc), (2) plug in the new drive on the same SATA cable, and (3) issue a 'zpool replace pool_name drive_id' command etc, at which point ZFS will resilver the new drive from the parity data? Thanks, Simon

-- http://www.java.com * Dan Transue * *Sun Microsystems, Inc.* 495 S. High Street, #200 Columbus, OH 43215 US Phone x30944 / 877-932-9964 Mobile 484-554-6951 Fax 877-932-9964 Email dan.tran...@sun.com
[zfs-discuss] .zfs snapshots on subdirectories?
Suppose I have a storage pool, /storagepool, and I have snapshots on it. Then I can access the snaps under /storagepool/.zfs/snapshot. But is there any way to enable this within all the subdirs? For example: cd /storagepool/users/eharvey/some/foo/dir; cd .zfs. I don't want to create a new filesystem for every subdir. I just want to automatically have the .zfs hidden directory available within all the existing subdirs, if that's possible. Thanks..
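For what it's worth, there is no per-subdirectory .zfs: the control directory exists only at the root of each filesystem. But a snapshot contains the whole tree, so the snapshot view of any subdirectory is reachable by path from the root's .zfs. A sketch (the snapshot name is hypothetical; ZFS-only commands, shown for illustration rather than run here):

```shell
#!/bin/sh
# .zfs exists only at each filesystem root, but the snapshot tree
# mirrors the live tree, so a deep subdir's snapshot is just a path:
#
#   <fs-mountpoint>/.zfs/snapshot/<snapname>/<relative-subdir-path>
#
# For the directory in the question ('mysnap' is a hypothetical name):
ls /storagepool/.zfs/snapshot/mysnap/users/eharvey/some/foo/dir

# Optionally make .zfs show up in directory listings at the fs root
# (it is always reachable by name even while hidden):
zfs set snapdir=visible storagepool
```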
Re: [zfs-discuss] Replacing a failed drive
Yes, you can use the zpool replace process with any kind of drive: failed, failing, or even healthy. cs

On 10/02/09 12:15, Dan Transue wrote: [...]
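The replace procedure from the thread, as a sketch (pool and device names are hypothetical; ZFS-only commands, shown for illustration rather than run here):

```shell
#!/bin/sh
# Replacing a drive in a pool (names hypothetical).

# Case 1: the new disk goes into the same slot and keeps the same
# device name. Physically swap the disk, then:
zpool replace tank c1t0d0

# Case 2: the replacement is attached under a different name. ZFS
# resilvers onto the new device and detaches the old one when done:
zpool replace tank c1t0d0 c2t0d0

# Either way, watch the resilver:
zpool status -v tank
```

Case 2 is the safer route for a failing-but-not-failed drive, since the old disk stays in the pool and can still serve reads until the resilver completes.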
Re: [zfs-discuss] bigger zfs arc
zfs will use as much memory as is necessary, but how is "necessary" calculated? Using arc_summary.pl from http://www.cuddletech.com/blog/pivot/entry.php?id=979 my tiny system shows:

Current Size:             4206 MB (arcsize)
Target Size (Adaptive):   4207 MB (c)
Min Size (Hard Limit):     894 MB (zfs_arc_min)
Max Size (Hard Limit):    7158 MB (zfs_arc_max)

so arcsize is close to the desired c; no pressure here, but it would be nice to know how c is calculated, as it's much smaller than zfs_arc_max on a system like yours with nothing else on it. When an L2ARC is attached, does it get used if there is no memory pressure? My guess is no, for the same reason an L2ARC takes so long to fill. arc_summary.pl from the same system shows:

Most Recently Used Ghost:   0%  9367837 (mru_ghost)  [ Return Customer Evicted, Now Back ]
Most Frequently Used Ghost: 0% 11138758 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]

so with no ghosts, this system wouldn't benefit from an L2ARC even if one were added. In review (audit welcome):

- if arcsize = c and is much less than zfs_arc_max, there is no point in adding system RAM in hopes of increasing the ARC.
- if m?u_ghost is a small %, there is no point in adding an L2ARC.
- if you do add an L2ARC, you must have RAM between c and zfs_arc_max for its pointers.

Rob
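Rob's numbers can also be read straight from the kstats on Solaris, without arc_summary.pl (statistic names as found under zfs:0:arcstats; Solaris-only commands, shown for illustration rather than run here):

```shell
#!/bin/sh
# The raw counters behind arc_summary.pl. size vs c vs c_max answers
# "is the ARC under pressure?"; the ghost-list hits estimate how many
# misses would have been hits with a bigger cache, which is the same
# signal used above to decide whether more RAM or an L2ARC would pay.
kstat -p zfs:0:arcstats:size
kstat -p zfs:0:arcstats:c
kstat -p zfs:0:arcstats:c_max
kstat -p zfs:0:arcstats:mru_ghost_hits
kstat -p zfs:0:arcstats:mfu_ghost_hits
```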
Re: [zfs-discuss] ZFS caching of compressed data
Stuart Anderson wrote: [...] With SVM you are allowed to specify a read policy on sub-mirrors for just this reason, e.g., http://wikis.sun.com/display/BigAdmin/Using+a+SVM+submirror+on+a+ramdisk+to+increase+read+performance Is there no equivalent in ZFS?

Nope, at least not right now. -- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] Best way to convert checksums
Ray, The checksums are set on the file systems, not the pool. If a new checksum is set and *you* rewrite the data, then the rewritten data will carry the new checksum. If your pool has the space for you to duplicate the user data after the new checksum is set, then the duplicated data will have the new checksum. ZFS doesn't rewrite existing data as part of normal operations. I confirmed with a simple test (like Darren's) that even if you have a single-disk pool, the disk is replaced, all the data is resilvered, and a new checksum is set, you'll still see data with the previous checksum as well as the new checksum. Cindy

On 10/02/09 08:44, Ray Clark wrote: [...]
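Since only newly written blocks pick up the new checksum, the duplicate-and-replace idea can be sketched as below. The loop operates on an empty temp directory by default so the sketch itself is runnable; on the real system DIR would be the filesystem's mountpoint and the zfs command would be run first (the filesystem name is hypothetical). Note this breaks hard links and cannot rewrite blocks pinned by snapshots:

```shell
#!/bin/sh
# Force a rewrite of file data so it picks up a newly set checksum.
# On the real system you would first run (filesystem name hypothetical):
#   zfs set checksum=sha256 tank/data
# DIR defaults to an empty temp dir so the loop itself is runnable.
DIR=${DIR:-$(mktemp -d)}
cd "$DIR" || exit 1

# Copy each file and rename the copy over the original: same content,
# freshly written blocks (and therefore the new checksum).
find . -type f | while IFS= read -r f; do
    cp -p "$f" "$f.rw$$" && mv "$f.rw$$" "$f"
done
```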
Re: [zfs-discuss] Best way to convert checksums
On Oct 2, 2009, at 7:46 AM, Ray Clark wrote: [...]

ZFS uses different checksums for different things. Briefly:

use          checksum
-----------  ---------------------------------------------------
uberblock    SHA-256, self-checksummed
labels       SHA-256
metadata     fletcher4
data         fletcher2 (default), set with the checksum parameter
ZIL log      fletcher2, self-checksummed
gang block   SHA-256, self-checksummed

The parent holds the checksum for any entity that is not self-checksummed. The big question, which is currently unanswered, is: do we see single-bit faults in disk-based storage systems? The answer to this question must be known before the effectiveness of a checksum can be evaluated. The overwhelming empirical evidence suggests that fletcher2 catches many storage system corruptions. -- richard
Re: [zfs-discuss] Best way to convert checksums
re == Richard Elling richard.ell...@gmail.com writes: r == Ross myxi...@googlemail.com writes: re The answer to this question must be known before the re effectiveness of a checksum can be evaluated. ...well...we can use math to know that a checksum is effective. What you are really suggesting we evaluate ``empirically'' is the degree of INeffectiveness of the broken checksum. r ZFS stores two copies of the metadata for any block, so r corrupt metadata really shouldn't happen often. the other copy probably won't be read if the first copy read has a valid checksum. I think it'll more likely just lazy-panic instead. If that's the case, the two copies won't help cover up the broken checksum bug. but Richard's table says metadata has fletcher4 which the OP said is as good as the correct algorithm would have been, even in its broken implementation, so long as it's only used up to 128kByte. It's only data and ZIL that has the relevantly-broken checksum, according to his math. re The overwhelming empirical evidence suggests that fletcher2 re catches many storage system corruptions. What do you mean by the word ``many''? It's a weasel-word. It basically means, AFAICT, ``the broken checksum still trips sometimes.'' But have you any empirical evidence about the fraction of real world errors which are still caught by the broken checksum vs. those that are not? I don't see how you could. How about cases where checksums are not used to correct bit-flip gremlins but relied upon to determine whether a data structure is fully present (committed) yet, like in the ZIL, or to determine which half of a mirror is stale---these are cases where checksums could be wrong even if the storage subsystem is functioning in an ideal way. Checksum weakness on ZFS where checksums are presumed good by other parts of the design could potentially be worse overall than a checksumless design. That's not my impression, but it's the right place to put the bar. 
Ray's ``well at least it's better than no checksums'' is wrong because it presumes ZFS could function as well as another filesystem if ZFS were using a hypothetical null checksum. It couldn't.

Anyway, I'm glad the problem is both fixed and also avoidable on the broken systems. I just think the doublespeak after the fact is, once again, not helping anyone.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to convert checksums
Hi Miles, good to hear from you again.

On Oct 2, 2009, at 1:20 PM, Miles Nordin wrote:

>     re> The answer to this question must be known before the
>     re> effectiveness of a checksum can be evaluated.
>
> ...well... we can use math to know that a checksum is effective. What
> you are really suggesting we evaluate ``empirically'' is the degree of
> INeffectiveness of the broken checksum.

By your logic, SECDED ECC for memory is broken because it only corrects 1 bit per symbol and only detects brokenness of 2 bits per symbol. However, the empirical evidence suggests that ECC provides a useful function for many people. Do we know how many triple-bit errors occur in memories? I can compute the probability, but have never seen a field failure analysis. So, if ECC is good enough for DRAM, is fletcher2 good enough for storage? NB: for DRAM the symbol size is usually 64 bits; for the ZFS case, the symbol size is 4,096 to 1,048,576 bits. AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist.

>     r> ZFS stores two copies of the metadata for any block, so
>     r> corrupt metadata really shouldn't happen often.
>
> The other copy probably won't be read if the first copy read has a
> valid checksum. I think it'll more likely just lazy-panic instead. If
> that's the case, the two copies won't help cover up the broken-checksum
> bug.
>
> But Richard's table says metadata has fletcher4, which the OP said is
> as good as the correct algorithm would have been, even in its broken
> implementation, so long as it's only used on blocks up to 128 kByte.
> It's only data and the ZIL that have the relevantly-broken checksum,
> according to his math.
>
>     re> The overwhelming empirical evidence suggests that fletcher2
>     re> catches many storage system corruptions.
>
> What do you mean by the word ``many''? It's a weasel-word.

I'll blame the lawyers.
They are causing me to remove certain words from my vocabulary :-(

> It basically means, AFAICT, ``the broken checksum still trips
> sometimes.'' But have you any empirical evidence about the fraction of
> real-world errors which are still caught by the broken checksum vs.
> those that are not? I don't see how you could.

Question for the zfs-discuss participants: have you seen a data corruption that was not detected when using fletcher2? Personally, I've seen many corruptions of data stored on file systems lacking checksums.

> How about cases where checksums are not used to correct bit-flip
> gremlins but relied upon to determine whether a data structure is
> fully present (committed) yet, like in the ZIL, or to determine which
> half of a mirror is stale---these are cases where checksums could be
> wrong even if the storage subsystem is functioning in an ideal way.
> Checksum weakness on ZFS where checksums are presumed good by other
> parts of the design could potentially be worse overall than a
> checksumless design. That's not my impression, but it's the right
> place to put the bar.
>
> Ray's ``well at least it's better than no checksums'' is wrong because
> it presumes ZFS could function as well as another filesystem if ZFS
> were using a hypothetical null checksum. It couldn't.

I'm in Ray's camp. I've got far too many scars from data corruption, and I'd rather not add more.
 -- richard

> Anyway I'm glad the problem is both fixed and also avoidable on the
> broken systems. I just think the doublespeak after the fact is, once
> again, not helping anyone.
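[Editor's note: Richard's ECC analogy can be made concrete with a toy model. The sketch below is an illustration only -- a Hamming(8,4) SECDED code, the same single-error-correct / double-error-detect scheme ECC DRAM applies per word -- and shows the failure mode he alludes to: a triple-bit error is silently "corrected" into wrong data.]

```python
# Toy Hamming(8,4) SECDED: 4 data bits, 3 Hamming parity bits (positions
# 1, 2, 4), plus one overall-parity bit at position 0.

def encode(d3, d5, d6, d7):
    """Build a codeword with data bits at positions 3, 5, 6, 7."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d3, d5, d6, d7
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c

def decode(c):
    """Return (status, data).  SECDED corrects 1 flip, detects 2 --
    but 3 flips can masquerade as a correctable single error."""
    c = list(c)
    s = 0                               # Hamming syndrome
    for i in range(1, 8):
        if c[i]:
            s ^= i
    p = c[0]                            # overall parity check (0 = holds)
    for i in range(1, 8):
        p ^= c[i]
    if s == 0 and p == 0:
        status = "ok"
    elif p == 1:                        # odd flip count: assume single error
        if s:
            c[s] ^= 1                   # "correct" the indicated position
        status = "corrected"
    else:                               # even flips, s != 0: double error
        status = "uncorrectable"
    return status, (c[3], c[5], c[6], c[7])

clean = encode(1, 0, 1, 1)
bad = list(clean)
for pos in (1, 2, 3):                   # a triple-bit error
    bad[pos] ^= 1
status, data = decode(bad)
# status is "corrected", yet data != (1, 0, 1, 1): silent corruption.
```

A single flip really is repaired and a double flip really is flagged; only the three-bit pattern slips through as falsely "corrected" -- which is exactly why the rate of triple-bit faults in the field matters to the analogy.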
Re: [zfs-discuss] Best way to convert checksums
Replying to hakanson's Oct 2, 2009 2:01 post:

Thanks. I suppose it is true that I am not even trying to compare the peripheral stuff, and the simple presence of a file with matching data covers some of it. Using these tools to move data, one encounters a longer list: sparse files, ACL handling, extended attributes, length of filenames, length of pathnames, large files. And probably other interesting things that can be handled incorrectly.

Most information on misbehavior of the various archive / backup / data-movement utilities is very old. One wonders how they behave today. This would be a useful compilation, but I can't do it.
--
This message posted from opensolaris.org
Re: [zfs-discuss] Can't rm file when No space left on device...
Rudolf Potucek wrote:
>>> It seems like the appropriate solution would be to have a tool that
>>> allows removing a file from one or more snapshots at the same time
>>> as removing the source ...
>>
>> That would make them not really snapshots. And such a tool would have
>> to fix clones too.
>
> While I concur that being able to remove files from snapshots is
> somewhat against the concept behind snapshots, I feel that there is a
> tradeoff here for the administrator: let's say we accidentally
> snapshotted a very large temporary file. We don't need the file and we
> don't need its snapshot. Yet the only way to free the space taken up
> by this accidentally snapshotted file is to delete the WHOLE snapshot,
> including all the files of which snapshots may be required. To
> paraphrase: that would make this snapshot not really a snapshot
> ANYMORE. At this point, having a separate tool that allows you to do
> spring cleaning and delete files from snapshots would quite possibly
> be more in the spirit of snapshotting than having to delete snapshots.
>
> Just my $.02,
> Rudolf

NO. Snapshotting is sacred - once you break the model where a snapshot is a point-in-time picture, all sorts of bad things can happen. You've changed a fundamental assumption of snapshots, and this then impacts how we view them from all sorts of angles; it's a huge loss to trade away for a very small gain.

Should you want to modify a snapshot for some reason, that's what the 'zfs clone' function is for: clone your snapshot, promote it, and make your modifications.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Best way to convert checksums
Cindys Oct 2, 2009 2:59: Thanks for staying with me.

Re: "The checksums are set on the file systems, not the pool": but previous responses seem to indicate that I can set them, at the pool level, for files stored in the filesystem that appears to be the pool, before I create any new ones. One post seems to indicate that there is a checksum property for this file system, and independently for the pool. (This topic needs a picture.)

Re: "If a new checksum is set and *you* rewrite the data ... then the duplicated data will have the new checksum": understood. Now I am on to being concerned about the blocks that comprise the zpool that *contains* the file system.

Re: "ZFS doesn't rewrite data as part of normal operations. I confirmed with a simple test (like Darren's) that even if you have a single-disk pool and the disk is replaced and all the data is resilvered and a new checksum is set, you'll see data with the previous checksum and the new checksum": yes ... a resilver duplicates exactly.

Darren's example showed that without the -R, no properties were sent, and the zfs receive had no choice but to use the pool default for the zfs filesystem that it created. This also implies that there was a property associated with the pool. So my previous comment about zfs send/receive not duplicating exactly was not fair. The man page / admin guide should be clear about what is sent without -R. I would have guessed everything, just not descendent file systems.

It is a shame that zdb is totally undocumented. I thought I had discovered a gold mine when I first read Darren's note!

--Ray
Re: [zfs-discuss] Best way to convert checksums
Re: relling's Oct 2, 2009 3:26 post:

(1) Is this list everything?
(2) Is this the same for U4?
(3) If I change the zpool checksum property on creation as you indicated in your Oct 1, 12:51 post (evidently very recent versions only), does this change the checksums used for this list? Why would the strongest checksum not be used for the most fundamental data, rather than fooling around, allowing the user to compromise only when the tradeoff pays back on the 99% bulk of the data?

Re: "The big question, that is currently unanswered, is do we see single bit faults in disk-based storage systems?": I don't think this is the question. I believe the implication of schlie's post is not that single-bit faults will get through, but that the current fletcher2 is equivalent to a single-bit checksum. You could have 1,000 bits in error, or 4,095, and still have a 50-50 chance of detecting it. A single-bit error would be certain to be detected (I think), even with the current code.
Re: [zfs-discuss] Best way to convert checksums
Re: Miles Nordin Oct 2, 2009 4:20:

Re: "Anyway, I'm glad the problem is both fixed...": I want to know HOW it can be fixed. If they fixed it, this would invalidate every pool that has not been changed from the default (probably almost all of them!). This can't be! So what WAS done? In the interest of honesty in advertising, and of enabling people to evaluate their own risks, I think we should know how it was fixed. Something either ingenious or potentially misleading must have been done. I am not suggesting that it was not the best way to handle a difficult situation, but I don't see how it can be transparent. If the string fletcher2 does the same thing, it is not fixed. If it does something different, it is misleading.

Re: "... and avoidable on the broken systems": please tell me how! Without destroying and recreating my zpool, I can only fix the zfs file system blocks, not the underlying zpool blocks. WITH destroying and recreating my zpool, I can only control the checksum on the underlying zpool using a version of Solaris that is not yet available. And even then (pending relling's response), it may or may not *still* affect the blocks I am concerned about. So how is this avoidable? It is partially avoidable (so far) IF I have the luxury of doing significant rebuilding. No?
Re: [zfs-discuss] Best way to convert checksums
On Oct 2, 2009, at 3:05 PM, Ray Clark wrote:
> Re: relling's Oct 2, 2009 3:26 post:
>
> (1) Is this list everything?

AFAIK.

> (2) Is this the same for U4?

Yes. This hasn't changed in a very long time.

> (3) If I change the zpool checksum property on creation as you
> indicated in your Oct 1, 12:51 post (evidently very recent versions
> only), does this change the checksums used for this list? Why would
> not the strongest checksum be used for the most fundamental data
> rather than fool around, allowing the user to compromise only when
> the tradeoff pays back on the 99% bulk of the data?

Performance. Many people value performance over dependability.

> Re: "The big question, that is currently unanswered, is do we see
> single bit faults in disk-based storage systems?": I don't think this
> is the question. I believe the implication of schlie's post is not
> that single bit faults will get through, but that the current
> fletcher2 is equivalent to a single bit checksum. You could have
> 1,000 bits in error, or 4095, and still have a 50-50 chance of
> detecting it. A single bit error would be certain to be detected (I
> think) even with the current code.

I don't believe schlie posted the number of fletcher2 collisions for the symbol size used by ZFS. I do not believe it will be anywhere near 50% collisions.
 -- richard
Re: [zfs-discuss] Best way to convert checksums
Re: relling's Oct 2 5:06 post, the analogy to ECC memory:

I appreciate the support, but the ECC memory analogy does not hold water. ECC memory is designed to correct for multiple independent events, such as electrical noise, bits flipped due to alpha particles from the DRAM package, or cosmic rays. The probability of these independent events coinciding in time and space is very small indeed. It works well.

ZFS does purport to cover errors such as these in the crummy double-layer boards without sufficient decoupling, the microcontrollers and memories without parity or ECC, etc., found in the cost-reduced-to-the-razor's-edge hardware most of us run on, but it also covers system-level errors such as entire blocks being replaced, or large fractions of them being corrupted by high-level bugs. With the current fletcher2 we have only a 50-50 chance of catching these multi-bit errors. The probability of multiple bits being changed is not small, because the probabilities of the error mechanism affecting the 4,096 to 1,048,576 bits in the block are not independent. Indeed, in many of the show-cased mechanisms it is a sure bet: the entire disk sector is written with the wrong data, for sure! Although there is a good chance that many of the bits in the sector happen to match, there is an excellent chance that many are different. And the mechanisms that caused these differences were not independent.

Re: "AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist": for sure they exist. By the pigeonhole principle, there are 2^1,048,576 possible 1,048,576-bit blocks but only 2^256 digests, so on average 2^1,048,320 blocks map to every SHA-256 digest. One hopes that the same properties that make SHA-256 a good cryptographic hash also make it a good hash, period. This, I admit, is a leap of ignorance (at least I know what cliff I am leaping off of).
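[Editor's note: the pigeonhole count is plain arithmetic, nothing ZFS-specific, and easy to check -- on average, each SHA-256 digest is shared by 2^1,048,320 maximum-size ZFS records:]

```python
# Count the average number of 128 KB blocks per SHA-256 digest.
block_bits = 128 * 1024 * 8             # largest ZFS record: 1,048,576 bits
digest_bits = 256
avg_preimages = 2 ** (block_bits - digest_bits)   # pigeonhole average
print(avg_preimages.bit_length() - 1)   # prints 1048320, i.e. 2**1048320
```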
Regarding the question of what people have seen: I have seen lots of unexplained things happen, and by definition one never knows why. I am not interested in seeing any more. I see the potential for disaster, and my time, and the time of my group, is better spent doing other things. That is why I moved to ZFS.
Re: [zfs-discuss] Best way to convert checksums
>>>>> re == Richard Elling richard.ell...@gmail.com writes:

    re> By your logic, SECDED ECC for memory is broken because it only
    re> corrects

ECC is not a checksum. Go ahead, get out your dictionary, enter severe-pedantry-mode, but it is relevantly different. In, for example, data transmission scenarios, FECs like ECC are often used along with a strong non-correcting checksum over a larger block. The OP further described scenarios plausible for storage, like ``long string of zeroes with 1 bit flipped,'' that produce collisions with the misimplemented fletcher2 (but, obviously, not with any strong checksum like correct-fletcher2).

    re> is fletcher2 good enough for storage?

yes, it probably is good enough, but ZFS implements some other broken algorithm and calls it fletcher2. so, please stop saying fletcher2.

    re> I'll blame the lawyers. They are causing me to remove certain
    re> words from my vocabulary :-(

yeah, well, allow me to add a word back to the vocabulary: BROKEN. If you are not legally allowed to use words like broken and working, then find another identity from which to talk, please.

    re> Question for the zfs-discuss participants, have you seen a
    re> data corruption that was not detected when using fletcher2?

This is ridiculous. It's not fletcher2, it's brokenfletcher2. It's avoidably extremely weak. It's reasonable to want to use a real checksum, and this PR game you are playing is frustrating and confidence-harming for people who want that. This does not have to become a big deal, unless you try to spin it with a 7200rpm PR machine like IBM did with their broken Deathstar drives before they became HGST. Please: what we need to do is admit that the checksum is relevantly broken in a way that compromises the integrity guarantees with which ZFS was sold to many customers, fix the checksum, and learn how to conveniently migrate our data.
Based on the table you posted, I guess file data can be set to fletcher4 or sha256 using filesystem properties to work around the bug on Solaris versions with the broken implementation.

1. What's needed to avoid fletcher2 on the ZIL on broken Solaris versions?

2. I understand the workaround, but not the fix. How does the fix included in S10u8 and snv_114 work? Is there a ZFS version bump? Does the fix work by implementing fletcher2 correctly? Or does it just disable fletcher2 and force everything to use brokenfletcher4, which is good enough? If the former, how are the broken and correct versions of fletcher2 distinguished---do they show up with different names in the pool properties? Once you have the fixed software, how do you make sure fixed checksums are actually covering data blocks originally written by old broken software? I assume you have to use rsync or zfs send/recv to rewrite all the data with the new checksum? If yes, what do you have to do before rewriting---upgrade Solaris and then 'zfs upgrade' each filesystem one by one? Will zfs send/recv work across the filesystem versions, or does the copying have to be done with rsync?

3. Speaking of which, what about the checksum in zfs send streams? Is it also fletcher2, and if so, was it also fixed in s10u8/snv_114, and how does this affect compatibility for people who have ignored my advice and stored streams instead of zpools? Will a newer 'zfs recv' always work with an older 'zfs send', but not the other way around?

There is basically no information about implementing the fix in the bug, and we can't write to the bug from outside Sun. Whatever sysadmins need to do to get their data under the strength of checksum they thought it was under, it might be nice to describe it in the bug for whoever gets referred to the bug and has an affected version.
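[Editor's note: the weakness Miles describes is easy to demonstrate. The sketch below assumes the widely reported structure of the pre-fix implementation -- two interleaved second-order sums over 64-bit words, overflowing mod 2^64 with no Fletcher-style modular reduction. Under that assumption, flipping the top bit of two words in the same lane, an even number of chunks apart, collides with the untouched block:]

```python
MASK = (1 << 64) - 1   # lazy mod-2**64 arithmetic, the crux of the bug

def zfs_fletcher2(words):
    """The checksum ZFS calls fletcher2 (assumed structure): two
    interleaved second-order sums over 64-bit words, two words
    (one 128-bit chunk) per iteration, all sums overflowing mod 2**64."""
    a0 = a1 = b0 = b1 = 0
    for i in range(0, len(words), 2):
        a0 = (a0 + words[i]) & MASK
        a1 = (a1 + words[i + 1]) & MASK
        b0 = (b0 + a0) & MASK
        b1 = (b1 + a1) & MASK
    return a0, a1, b0, b1

# Miles's ``long string of zeroes'' scenario, two-bit variant: each MSB
# flip contributes 2**63, the pair sums to 2**64 in a0, and because the
# flips sit an even number of chunks apart, b0's accumulated carries
# cancel mod 2**64 as well.  The corruption is completely undetected.
clean = [0] * 8                  # tiny all-zero block: four 128-bit chunks
corrupt = list(clean)
corrupt[0] ^= 1 << 63            # MSB of word 0 (chunk 0, lane 0)
corrupt[4] ^= 1 << 63            # MSB of word 4 (chunk 2, lane 0)
assert zfs_fletcher2(corrupt) == zfs_fletcher2(clean)   # collision!
```

A correct Fletcher (with modular reduction below the word size) or fletcher4's wider accumulators would catch this pattern; a single MSB flip is still detected even by the broken sum, which is why the constructed collision needs two flips.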
Re: [zfs-discuss] Best way to convert checksums
Let me try to refocus. Given that I have a U4 system with a zpool created with fletcher2: what blocks in the system are protected by fletcher2, or even fletcher4 (although that does not worry me so much)?

Given that I only have 1.6 TB of data in a 4 TB pool, what can I do to change those blocks to sha256 or fletcher4:

(1) Without destroying and recreating the zpool under U4
(2) With destroying and recreating the zpool under U4 (which I don't really have the resources to pull off)
(3) With upgrading to U7 (perhaps in a few months)
(4) With upgrading to U8

Thanks.
Re: [zfs-discuss] Can't rm file when No space left on device...
> NO. Snapshotting is sacred

LOL! Ok, ok, I admit that snapshotting the whole ZFS root filesystem (yes, we have ZFS root in production, oops) instead of creating individual snapshots for *each* individual ZFS is against the code of good sysadmin-ing. I bow to the developer gods and will only follow the approved gospel in the future ;)

> once you break the model where a snapshot is a point-in-time picture,
> all sorts of bad things can happen. You've changed a fundamental
> assumption of snapshots, and this then impacts how we view them from
> all sorts of angles; it's a huge loss to trade away for a very small
> gain.

Hmm ... I can see how the assumption of a snapshot being unalterable could provide some programming shortcuts and opportunities for optimization of ZFS code. Not sure that I understand the huge-loss perspective, though. I think at the point where I am desperately scrabbling to free the 30% of my root FS held hostage by an accidental snapshot, while keeping my on-line backup strategy intact, I won't be too worried about performance ;)

> Should you want to modify a snapshot for some reason, that's what the
> 'zfs clone' function is for. clone your snapshot, promote it, and make
> your modifications.

Err ... hello ... filesystem already full ... hello?
Re: [zfs-discuss] bigger zfs arc
On Fri, Oct 2, 2009 at 1:45 PM, Rob Logan r...@logan.com wrote:
> zfs will use as much memory as is necessary but how is necessary
> calculated? using arc_summary.pl from
> http://www.cuddletech.com/blog/pivot/entry.php?id=979 my tiny system
> shows:
>
> Current Size: 4206 MB (arcsize)
> Target Size (Adaptive): 4207 MB (c)

That looks a lot like ~ 4 * 1024 MB. Is this a 64-bit capable system that you have booted from a 32-bit kernel?

--
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Best way to convert checksums
On Oct 2, 2009, at 3:44 PM, Ray Clark wrote:
> Let me try to refocus: Given that I have a U4 system with a zpool
> created with fletcher2: What blocks in the system are protected by
> fletcher2, or even fletcher4, although that does not worry me so much.
> Given that I only have 1.6 TB of data in a 4 TB pool, what can I do to
> change those blocks to sha256 or fletcher4:
> (1) Without destroying and recreating the zpool under U4
> (2) With destroying and recreating the zpool under U4 (which I don't
> really have the resources to pull off)
> (3) With upgrading to U7 (perhaps in a few months)
> (4) With upgrading to U8

This has been answered several times in this thread already:

    zfs set checksum=sha256 <filesystem>

then copy your files -- all newly written data will have the sha256 checksums.
 -- richard
Re: [zfs-discuss] Best way to convert checksums
On Oct 2, 2009, at 3:36 PM, Miles Nordin wrote:
>     re> By your logic, SECDED ECC for memory is broken because it only
>     re> corrects
>
> ECC is not a checksum.

SHA-256 is not a checksum either, but that isn't the point. The concern is that corruption can be detected. ECC has very, very limited detection capabilities, yet it is good enough for many people. We know that MOS memories have certain failure modes that cause bit flips, and by using ECC and interleaving, the dependability is improved. The big question is: what does the corrupted data look like in storage? Random bit flips? Big chunks of zeros? 55aa patterns? Since the concern with the broken fletcher2 is restricted to the most significant bits, we are most concerned with failures where the most significant bits are set to ones. But as I said, we have no real idea what the corrupted data should look like, and if it is zero-filled, then fletcher2 will catch it.

> Go ahead, get out your dictionary, enter severe-pedantry-mode, but it
> is relevantly different. In for example data transmission scenarios,
> FECs like ECC are often used along with a strong noncorrecting
> checksum over a larger block. The OP further described scenarios
> plausible for storage, like ``long string of zeroes with 1 bit
> flipped'', that produce collisions with the misimplemented fletcher2
> (but, obviously, not with any strong checksum like correct-fletcher2).
>
>     re> is fletcher2 good enough for storage?
>
> yes, it probably is good enough, but ZFS implements some other broken
> algorithm and calls it fletcher2. so, please stop saying fletcher2.

If I were to refer to Fletcher's algorithm, I would use "Fletcher." When I am referring to the ZFS checksum setting of fletcher2, I will continue to use "fletcher2."

>     re> I'll blame the lawyers. They are causing me to remove certain
>     re> words from my vocabulary :-(
>
> yeah, well, allow me to add a word back to the vocabulary: BROKEN.
> If you are not legally allowed to use words like broken and working,
> then find another identity from which to talk, please.
>
>     re> Question for the zfs-discuss participants, have you seen a
>     re> data corruption that was not detected when using fletcher2?
>
> This is ridiculous. It's not fletcher2, it's brokenfletcher2. It's
> avoidably extremely weak. It's reasonable to want to use a real
> checksum, and this PR game you are playing is frustrating and
> confidence-harming for people who want that.

There is no PR campaign. It is what it is. What is done is done.

> This does not have to become a big deal, unless you try to spin it
> with a 7200rpm PR machine like IBM did with their broken Deathstar
> drives before they became HGST. Please, what we need to do is admit
> that the checksum is relevantly broken in a way that compromises the
> integrity guarantees with which ZFS was sold to many customers, fix
> the checksum, and learn how to conveniently migrate our data.

Unfortunately, there is a backwards-compatibility issue that requires the current fletcher2 to live for a very long time. The only question for debate is whether it should be the default. To date, I see no field data that suggests it is not detecting corruption.

> Based on the table you posted, I guess file data can be set to
> fletcher4 or sha256 using filesystem properties to work around the bug
> on Solaris versions with the broken implementation.
>
> 1. What's needed to avoid fletcher2 on the ZIL on broken Solaris
> versions?

Please file RFEs at bugs.opensolaris.org.

> 2. I understand the workaround, but not the fix. How does the fix
> included in S10u8 and snv_114 work? Is there a ZFS version bump? Does
> the fix work by implementing fletcher2 correctly? or does it just
> disable fletcher2 and force everything to use brokenfletcher4 which is
> good enough? If the former, how are the broken and correct versions of
> fletcher2 distinguished---do they show up with different names in the
> pool properties?
The best I can tell, the comments are changed to indicate fletcher2 is deprecated. However, it must live on (forever) because of backwards compatibility. I presume one day the default will change to fletcher4 or something else. This is implied by zfs(1m):

    checksum=on | off | fletcher2 | fletcher4 | sha256

        Controls the checksum used to verify data integrity. The
        default value is on, which automatically selects an
        appropriate algorithm (currently fletcher2, but this may
        change in future releases). The value off disables integrity
        checking on user data. Disabling checksums is NOT a
        recommended practice.

> Once you have the fixed software, how do you make sure fixed checksums
> are actually covering data blocks originally written by old broken
> software? I assume you have to use rsync or zfs send/recv to rewrite
> all the data with the new checksum? If yes, what do you have to do
> before rewriting---upgrade
Re: [zfs-discuss] bigger zfs arc
On Oct 2, 2009, at 11:45 AM, Rob Logan wrote:
> zfs will use as much memory as is necessary but how is necessary
> calculated? using arc_summary.pl from
> http://www.cuddletech.com/blog/pivot/entry.php?id=979 my tiny system
> shows:
>
> Current Size: 4206 MB (arcsize)
> Target Size (Adaptive): 4207 MB (c)
> Min Size (Hard Limit): 894 MB (zfs_arc_min)
> Max Size (Hard Limit): 7158 MB (zfs_arc_max)
>
> so arcsize is close to the desired c, no pressure here but it would be
> nice to know how c is calculated as its much smaller than zfs_arc_max
> on a system like yours with nothing else on it.

c is the current target size of the ARC. c will change dynamically, as memory pressure and demand change.

> When an L2ARC is attached does it get used if there is no memory
> pressure? My guess is no, for the same reason an L2ARC takes so long
> to fill. arc_summary.pl from the same system is:
>
> Most Recently Used Ghost: 0% 9367837 (mru_ghost) [ Return Customer
> Evicted, Now Back ]
> Most Frequently Used Ghost: 0% 11138758 (mfu_ghost) [ Frequent
> Customer Evicted, Now Back ]
>
> so with no ghosts, this system wouldn't benefit from an L2ARC even if
> added.

You want to cache stuff closer to where it is being used. Expect the L2ARC to contain ARC evictions.

> In review: (audit welcome) if arcsize = c and is much less than
> zfs_arc_max, there is no point in adding system ram in hopes of
> increasing arc.

If you add RAM, arc_c_max will change unless you limit it by setting zfs_arc_max. In other words, c will change dynamically between the limits: arc_c_min <= c <= arc_c_max. By default for 64-bit machines, arc_c_max is the greater of 3/4 of physical memory or all but 1 GB. If zfs_arc_max is set and is less than arc_c_max and greater than 64 MB, then arc_c_max is set to zfs_arc_max. This allows you to reasonably cap arc_c_max. Note: if you pick an unreasonable value for zfs_arc_max, you will not be notified -- check current values with kstat -n arcstats.

> if m?u_ghost is a small %, there is no point in adding an L2ARC.

Yes, to the first order.
Ghosts are those whose data is evicted, but whose pointer remains.

> if you do add a L2ARC, one must have ram between c and zfs_arc_max for
> its pointers.

No. The pointers are part of c. Herein lies the rub: if you have a very large L2ARC and limited RAM, then you could waste L2ARC space because the pointers run out of space. SWAG the pointers at 200 bytes each per record. For example, suppose you use a Seagate 2 TB disk for L2ARC:

+ Disk size = 3,907,029,168 512-byte sectors - 4.5 MB for labels and reserve
+ Workload uses an 8 KB fixed record size (e.g. Oracle OLTP database)
+ RAM needed to support this L2ARC on this workload is approximately
  1 GB + application space + ((3,907,029,168 - 9,232) * 200 / 16) bytes,
  or at least 48 GBytes, practically speaking

Do not underestimate the amount of RAM needed to address lots of stuff :-)
 -- richard
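[Editor's note: Richard's 2 TB example works out as follows. This restates his own numbers as a sketch; the 200-byte per-record header size is his SWAG, not a measured value.]

```python
# Back-of-envelope RAM cost of the ARC headers needed to address a 2 TB
# L2ARC full of 8 KB records.
SECTOR = 512
disk_sectors = 3_907_029_168        # Seagate 2 TB disk
reserved = 9_232                    # sectors of labels + reserve (~4.5 MB)
recordsize = 8 * 1024               # fixed 8 KB records (OLTP workload)
header_bytes = 200                  # per-record ARC header (SWAG)

records = (disk_sectors - reserved) * SECTOR // recordsize
header_ram = records * header_bytes
print(f"{records:,} records -> {header_ram / 1e9:.1f} GB of headers")
# prints "244,188,746 records -> 48.8 GB of headers"
```

So the header overhead alone is roughly 49 GB, before the 1 GB baseline and application memory; a larger recordsize divides the figure proportionally, which is why recordsize matters so much when sizing RAM for a big L2ARC.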