Re: [zfs-discuss] any more efficient way to transfer snapshot between two hosts than ssh tunnel?
Hi Fred,

Try mbuffer (http://www.maier-komor.de/mbuffer.html).

On 14 December 2012 15:01, Fred Liu fred_...@issi.com wrote:
> Assuming in a secure and trusted env, we want to get the maximum transfer
> speed without the overhead from ssh.
>
> Thanks.
>
> Fred

--
Adrian Smith (ISUnix), Ext: 55070

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
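A typical way to pair zfs send with mbuffer is sketched below. This is an illustrative sketch only: the pool, dataset, snapshot, host and port names are hypothetical, and the buffer/block sizes are common starting points rather than tuned values.

```shell
# On the receiving host: listen on TCP port 9090, buffering up to 1 GB
# in memory so bursty network input doesn't stall the zfs receive.
mbuffer -I 9090 -s 128k -m 1G | zfs receive tank/backup

# On the sending host: stream the snapshot through mbuffer to the receiver.
zfs send tank/data@snap1 | mbuffer -O receiver:9090 -s 128k -m 1G
```

Because the stream travels in the clear, this only makes sense on the kind of secure, trusted network Fred describes.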
Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'
You could list by inode, then use find with rm:

# ls -i
7223 -O
# find . -inum 7223 -exec rm {} \;

David

On 11/23/11 2:00 PM, Jason King (Gmail) jason.brian.k...@gmail.com wrote:
> Did you try rm -- filename ?
>
> Sent from my iPhone
>
> On Nov 23, 2011, at 1:43 PM, Harry Putnam rea...@newsguy.com wrote:
>> Somehow I touched some rather peculiar file names in ~. Experimenting with
>> something I've now forgotten, I guess. Anyway, I now have 3 zero-length
>> files with names -O, -c, -k.
>>
>> I've tried as many styles of escaping as I could come up with, but all are
>> rejected like this:
>>
>> rm \-c
>> rm: illegal option -- c
>> usage: rm [-fiRr] file ...
>>
>> Ditto for: [\-]c '-c' *c '-'c \075c
>>
>> OK, I'm out of escapes... or other tricks, other than using emacs, but I
>> haven't installed emacs as yet. I can just ignore them of course, until
>> such time as I do get emacs installed, but by now I just want to know how
>> it might be done from a shell prompt.
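Both working approaches from this thread can be demonstrated in any POSIX shell; the scratch directory and file names below are just for illustration:

```shell
# Work in a throwaway directory containing the three awkward names
cd "$(mktemp -d)"
touch ./-O ./-c ./-k

# A leading ./ stops rm from treating the name as an option
rm ./-c

# Alternatively, '--' marks the end of option processing
rm -- -O -k

ls -A    # prints nothing: all three files are gone
```

The `find . -inum N -exec rm {} \;` route from David's post works too, and is handy when the name is too mangled to type at all.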
Re: [zfs-discuss] How to recover -- LUNs go offline, now permanent errors?
Cindy, I gave your suggestion a try. I did the zpool clear and then ran another zpool scrub, and all is happy now. Thank you for your help.

David

-- This message posted from opensolaris.org
Re: [zfs-discuss] How to recover -- LUNs go offline, now permanent errors?
Cindy, Thanks for the reply. I'll give that a try and then send an update.

Thanks, David
[zfs-discuss] How to recover -- LUNs go offline, now permanent errors?
I recently had an issue with the LUNs from our storage unit going offline. This caused the zpool to accumulate numerous errors on the LUNs. The pool is online, and I did a scrub, but one of the raid sets is degraded:

        raidz2-3                             DEGRADED 0 0 0
          c7t60001FF011C6F3103B00011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F3023900011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F2F53700011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F2E43500011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F2D23300011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F2A93100011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F29A2F00011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F2682D00011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F24C2B00011D1BF1d0  DEGRADED 0 0 0  too many errors
          c7t60001FF011C6F2192900011D1BF1d0  DEGRADED 0 0 0  too many errors

Also I have the following:

errors: Permanent errors have been detected in the following files:
        0x3a:0x3b04

Originally, there was a file and then a directory listed, but I removed them. Now I'm stuck with the hex codes above. How do I interpret them? Can this pool be recovered, or basically how do I proceed? The system is Solaris 10 U9 with all recent patches.

Thanks, David
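For context: a hex pair like 0x3a:0x3b04 is ZFS's internal <dataset id>:<object id> for a damaged object whose path can no longer be resolved, here because the file and directory were deleted. As the rest of this thread confirms, once the damaged objects are gone the stale record can be cleared by resetting the error counters and re-scrubbing; 'tank' stands in for the real pool name:

```shell
zpool clear tank       # reset per-vdev error counters
zpool scrub tank       # a clean scrub drops the stale permanent-error entry
zpool status -v tank   # verify the errors: list is now empty
```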
Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express
On 6/22/11 10:28 PM, Fajar A. Nugraha w...@fajar.net wrote:
> On Thu, Jun 23, 2011 at 9:28 AM, David W. Smith smith...@llnl.gov wrote:
>> When I tried out Solaris 11, I just exported the pool prior to the install
>> of Solaris 11. I was lucky in that I had mirrored the boot drive, so after
>> I had installed Solaris 11 I still had the other disk in the mirror with
>> Solaris 10 still installed. I didn't install any additional software in
>> either environment with regards to volume management, etc. From the format
>> command, I remember seeing 60 LUNs coming from the DDN, and as I recall I
>> did see multiple paths as well under Solaris 11. I think you are correct,
>> however, in that for some reason Solaris 11 could not read the devices.
>
> So you mean the root cause of the problem is Solaris Express failed to see
> the disks? Or are the disks available on Solaris Express as well? When you
> boot with the Solaris Express Live CD, what does zpool import show?

Under Solaris 11 Express, disks were seen with the format command, or with luxadm probe, etc. So I'm not sure why zpool import failed, or why, as I assume, it could not read the devices. I have not tried the Solaris Express Live CD; I was booted off an installed version.

David
Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express
        path='/dev/dsk/c3t59d0s0'
        devid='id1,sd@TATA_STEC_ZeusIOPS___018_GBytes__STMD039A/a'
        phys_path='/pci@7c,0/pci10de,376@e/pci1000,3150@0/sd@3b,0:a'
        whole_disk=1
        create_txg=269718
    children[1]
        type='disk'
        id=1
        guid=2456972971894251597
        path='/dev/dsk/c3t60d0s0'
        devid='id1,sd@TATA_STEC_ZeusIOPS___018_GBytes__STMCFFC0/a'
        phys_path='/pci@7c,0/pci10de,376@e/pci1000,3150@0/sd@3c,0:a'
        whole_disk=1
        create_txg=269718
    rewind_txg_ts=1308690257
    bad config type 7 for seconds_of_rewind
    verify_data_errors=0

LABEL 3

    version=22
    name='tank'
    state=0
    txg=402415
    pool_guid=13155614069147461689
    hostid=799263814
    hostname='Chaiten'
    top_guid=7929625263716612584
    guid=12265708552998034011
    vdev_children=8
    vdev_tree
        type='mirror'
        id=7
        guid=7929625263716612584
        metaslab_array=171
        metaslab_shift=27
        ashift=9
        asize=18240241664
        is_log=1
        create_txg=269718
        children[0]
            type='disk'
            id=0
            guid=12265708552998034011
            path='/dev/dsk/c3t59d0s0'
            devid='id1,sd@TATA_STEC_ZeusIOPS___018_GBytes__STMD039A/a'
            phys_path='/pci@7c,0/pci10de,376@e/pci1000,3150@0/sd@3b,0:a'
            whole_disk=1
            create_txg=269718
        children[1]
            type='disk'
            id=1
            guid=2456972971894251597
            path='/dev/dsk/c3t60d0s0'
            devid='id1,sd@TATA_STEC_ZeusIOPS___018_GBytes__STMCFFC0/a'
            phys_path='/pci@7c,0/pci10de,376@e/pci1000,3150@0/sd@3c,0:a'
            whole_disk=1
            create_txg=269718
    rewind_txg_ts=1308690257
    bad config type 7 for seconds_of_rewind
    verify_data_errors=0

Please let me know if you need more info...

Thanks,
David W. Smith
[zfs-discuss] Zpool metadata corruption from S10U9 to S11 express
I was recently running Solaris 10 U9 and I decided that I would like to go to Solaris 11 Express, so I exported my zpool, hoping that I would just do an import once I had the new system installed with Solaris 11. Now when I try to do an import I'm getting the following:

# zpool import
  pool: tank
    id: 13155614069147461689
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        tank        FAULTED  corrupted data
        logs
          mirror-6  ONLINE
            c9t57d0 ONLINE
            c9t58d0 ONLINE
          mirror-7  ONLINE
            c9t59d0 ONLINE
            c9t60d0 ONLINE

Is there something else I can do to see what is wrong? The original attempt, specifying the name, resulted in:

# zpool import tank
cannot import 'tank': I/O error
        Destroy and re-create the pool from a backup source.

I verified that I have all 60 of my LUNs. The controller numbers have changed, but I don't believe that should matter. Any suggestions about getting additional information about what is happening would be greatly appreciated.

Thanks,
David
Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express
An update: I had mirrored my boot drive when I installed Solaris 10 U9 originally, so I went ahead and rebooted the system to this disk instead of my Solaris 11 install. After getting the system up, I imported the zpool, and everything worked normally.

So I guess there is some sort of incompatibility between Solaris 10 and Solaris 11. I would have thought that Solaris 11 could import an older pool version. Any other insight on importing pools between these two versions of Solaris would be helpful.

Thanks, David
Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express
On Wed, Jun 22, 2011 at 06:32:49PM -0700, Daniel Carosone wrote:
> On Wed, Jun 22, 2011 at 12:49:27PM -0700, David W. Smith wrote:
>> # zpool import
>>   pool: tank
>>     id: 13155614069147461689
>>  state: FAULTED
>> status: The pool metadata is corrupted.
>> action: The pool cannot be imported due to damaged devices or data.
>>    see: http://www.sun.com/msg/ZFS-8000-72
>> config:
>>
>>         tank        FAULTED  corrupted data
>>         logs
>>           mirror-6  ONLINE
>>             c9t57d0 ONLINE
>>             c9t58d0 ONLINE
>>           mirror-7  ONLINE
>>             c9t59d0 ONLINE
>>             c9t60d0 ONLINE
>>
>> Is there something else I can do to see what is wrong?
>
> Can you tell us more about the setup, in particular the drivers and
> hardware on the path? There may be labelling, block size, offset or even
> bad drivers or other issues getting in the way, preventing ZFS from doing
> what should otherwise be expected to work. Was there something else in the
> storage stack on the old OS, like a different volume manager or some
> multipathing? Can you show us the zfs labels with zdb -l /dev/foo ?
> Does import -F get any further?
>
>> Original attempt when specifying the name resulted in:
>> # zpool import tank
>> cannot import 'tank': I/O error
>
> Some kind of underlying driver problem odour here.
>
> --
> Dan.

The system is an x4440 with two dual-port Qlogic 8 Gbit FC cards connected to a DDN 9900 storage unit. There are 60 LUNs configured from the storage unit; we're using raidz1 across these LUNs in a 9+1 configuration. Under Solaris 10 U9, multipathing is enabled.
For example, here is one of the devices:

# luxadm display /dev/rdsk/c8t60001FF010DC50AA2E00081D1BF1d0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c8t60001FF010DC50AA2E00081D1BF1d0s2
  Vendor:               DDN
  Product ID:           S2A 9900
  Revision:             6.11
  Serial Num:           10DC50AA002E
  Unformatted capacity: 15261576.000 MBytes
  Write Cache:          Enabled
  Read Cache:           Enabled
  Minimum prefetch:     0x0
  Maximum prefetch:     0x0
  Device Type:          Disk device
  Path(s):
    /dev/rdsk/c8t60001FF010DC50AA2E00081D1BF1d0s2
    /devices/scsi_vhci/disk@g60001ff010dc50aa2e00081d1bf1:c,raw
  Controller             /dev/cfg/c5
    Device Address       2401ff051232,2e
    Host controller port WWN 2101001b32bfe1d3
    Class                secondary
    State                ONLINE
  Controller             /dev/cfg/c7
    Device Address       2801ff0510dc,2e
    Host controller port WWN 2101001b32bd4f8f
    Class                primary
    State                ONLINE

Here is the output of the zdb command:

# zdb -l /dev/dsk/c8t60001FF010DC50AA2E00081D1BF1d0s0

LABEL 0

    version=22
    name='tank'
    state=0
    txg=402415
    pool_guid=13155614069147461689
    hostid=799263814
    hostname='Chaiten'
    top_guid=7879214599529115091
    guid=9439709931602673823
    vdev_children=8
    vdev_tree
        type='raidz'
        id=5
        guid=7879214599529115091
        nparity=1
        metaslab_array=35
        metaslab_shift=40
        ashift=12
        asize=160028491776000
        is_log=0
        create_txg=22
        children[0]
            type='disk'
            id=0
            guid=15738823520260019536
            path='/dev/dsk/c8t60001FF0123252803700081D1BF1d0s0'
            devid='id1,sd@n60001ff0123252803700081d1bf1/a'
            phys_path='/scsi_vhci/disk@g60001ff0123252803700081d1bf1:a'
            whole_disk=1
            DTL=166
            create_txg=22
        children[1]
            type='disk'
            id=1
            guid=7241121769141495862
            path='/dev/dsk/c8t60001FF010DC50C53600081D1BF1d0s0'
            devid='id1,sd@n60001ff010dc50c53600081d1bf1/a'
            phys_path='/scsi_vhci/disk@g60001ff010dc50c53600081d1bf1:a'
            whole_disk=1
            DTL=165
            create_txg=22
        children[2]
            type='disk'
            id=2
            guid=2777230007222012140
            path='/dev/dsk/c8t60001FF0123252793500081D1BF1d0s0'
            devid='id1,sd@n60001ff0123252793500081d1bf1/a'
            phys_path='/scsi_vhci/disk@g60001ff0123252793500081d1bf1:a'
            whole_disk=1
            DTL=164
            create_txg=22
        children[3]
            type='disk'
            id=3
            guid=5525323314985659974
            path='/dev/dsk/c8t60001FF010DC50BE3400081D1BF1d0s0'
            devid='id1,sd@n60001ff010dc50be3400081d1bf1/a'
            phys_path='/scsi_vhci/disk@g60001ff010dc50be3400081d1bf1:a'
            whole_disk=1
            DTL=163
Re: [zfs-discuss] Question on ZFS iSCSI
> Disk /dev/zvol/rdsk/pool/dcpool: 4295GB
> Sector size (logical/physical): 512B/512B

Just to check, did you already try:

# zpool import -d /dev/zvol/rdsk/pool/ poolname

?

thanks
Andy.
Re: [zfs-discuss] ZFS, Oracle and Nexenta
> Still I wonder what Gartner means with Oracle monetizing on ZFS..

It simply means that Oracle wants to make money from ZFS (as is normal for technology companies with their own technology). The reason this might cause uncertainty for ZFS is that maintaining, or helping to make better, the open-source version of ZFS may be seen by Oracle as contradictory to their making money from it. That said, what is already open source cannot be un-open-sourced, as others have said...

cheers
Andy.
Re: [zfs-discuss] Monitoring disk seeks
Hi, see the seeksize script on this URL: http://prefetch.net/articles/solaris.dtracetopten.html Not used it but looks neat! cheers Andy.
Re: [zfs-discuss] Solaris vs FreeBSD question
Hi, I am using FreeBSD 8.2 in production with ZFS. Although I have had one issue with it in the past, I would recommend it and I consider it production ready. That said, if you can wait for FreeBSD 8.3 or 9.0 to come out (a few months away) you will get a better system, as these will include ZFS v28 (FreeBSD-RELEASE is currently v15). On the other hand, things can always go wrong; of course RAID is not backup, even with snapshots ;)

cheers
Andy.
Re: [zfs-discuss] ZFS send/recv initial data load
On Feb 16, 2011, at 7:38 AM, whitetr6 at gmail.com wrote:
> My question is about the initial seed of the data. Is it possible to use a
> portable drive to copy the initial zfs filesystem(s) to the remote location
> and then make the subsequent incrementals over the network? If so, what
> would I need to do to make sure it is an exact copy? Thank you

Yes, you can send the initial seed snapshot to a file on a portable disk. For example:

# zfs send tank/volume@seed > /myexternaldrive/zfssnap.data

If the volume of data is too much to fit on a single disk then you can create a new pool spread across the number of disks you require, and make a duplicate of the snapshot onto your new pool. Then from the new pool you can run a new zfs send when connected to your offsite server.

thanks
Andy.
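The full seed-then-incremental workflow might look like the sketch below; the dataset, mount point and host names are hypothetical. The key point for an "exact copy" is that incrementals are anchored on the common @seed snapshot at both ends:

```shell
# At the source site: write the full stream to a file on the portable drive
zfs send tank/data@seed > /mnt/portable/seed.zstream

# At the remote site: restore the stream into the backup pool
zfs receive backup/data < /mnt/portable/seed.zstream

# Later, send only the changes since @seed over the network
zfs send -i @seed tank/data@daily1 | ssh backuphost zfs receive backup/data
```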
Re: [zfs-discuss] Drive i/o anomaly
> It is a 4k sector drive, but I thought zfs recognised those drives and
> didn't need any special configuration...?

4k drives are a big problem for ZFS; much has been posted/written about it. Basically, if the 4k drives report 512-byte blocks, as they almost all do, then ZFS does not detect and configure the pool correctly. If the drive actually reports the real 4k block size, ZFS handles this very nicely. So the problem/fault lies with drives misreporting the real block size, to maintain compatibility with other OS's etc., and not really with ZFS.

cheers
Andy.
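A quick way to see what alignment a pool actually got is to read its ashift from the pool configuration; ashift=9 means 512-byte alignment (bad on a 4k drive), ashift=12 means 4 KB. The pool name here is hypothetical:

```shell
# Dump the cached pool config and pick out the alignment shift per vdev
zdb -C tank | grep ashift
# ashift: 12 on a correctly aligned 4k-sector pool; 9 means 512-byte alignment
```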
Re: [zfs-discuss] zpool scalability and performance
Basically I think yes, you need to add all the vdevs you require in the circumstances you describe. You just have to consider what ZFS is able to do with the disks that you give it.

If you have 4x mirrors to start with, then all writes will be spread across all disks and you will get nice performance using all 8 spindles/disks. If you fill all of these up and then add one other mirror, it's logical that new data will be written only to the free space on the new mirror, and you will get the performance of writing data to a single mirrored vdev.

To handle this you would either have to add sufficient new devices to give you your required performance. Or, if there is a fair amount of data turnaround on your pool, i.e. you are deleting (including from snapshots) old data, then you might get reasonable performance by adding a new mirror at some point before your existing pool is completely full. I.e. data will initially get written and spread across all disks, as there will be free space on all disks, and over time old data will be removed from the other, older vdevs. This would result in reads and writes benefiting from all vdevs most of the time, but it's not going to give you guarantees of that, I guess...

Anyway, that's what occurred to me on the subject! ;)

cheers
Andy.
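Growing a pool of mirrors is one command per new vdev; the device names below are placeholders:

```shell
# Existing pool: four 2-way mirrors. Add a fifth mirror vdev:
zpool add tank mirror c0t8d0 c0t9d0

# ZFS biases new writes toward vdevs with the most free space, so the
# per-vdev write distribution can be watched with:
zpool iostat -v tank 5
```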
Re: [zfs-discuss] ZFS percent busy vs zpool iostat
Quoting Bob Friesenhahn bfrie...@simple.dallas.tx.us:
> What function is the system performing when it is so busy?

The workload of the server is an SMTP mail server, with associated spam and virus scanning, and serving maildir email via POP3 and IMAP.

> Wrong conclusion. I am not sure what the percentages are percentages of
> (total RAM?), but 603MB is a very small ARC.

FreeBSD pre-assigns kernel memory for ZFS, so it is not dynamically shared with the kernel as it is with Solaris. This is the min, max, and actual size of the ARC. ZFS is free to use up to the max (2098.08M) if it decides it wants to. Depending on the workload on this server it will go up to 2098M (I've seen it get to that size on this and other servers); just with its usual daily workload it decides to set this to around 600M. I assume it decides it's not worth using any more RAM.

> The ARC is adaptive so you should not assume that its objective is to try
> to absorb your hard drive. It should not want to cache data which is
> rarely accessed. Regardless, your ARC size may actually be constrained by
> default FreeBSD kernel tunings.

I guess then that ZFS is weighing up how useful it is to use more than 600M and deciding that it isn't that useful? Anyway, I've just forced the min to 1900M, so will see how this goes today.

> The type of drives you are using have very poor seek performance. Higher
> RPM drives would surely help. Stuffing lots more memory in your system and
> adjusting the kernel so that zfs can use a lot more of it is likely to
> help dramatically. Zfs loves memory.

thanks Bob, and also to Matt for your comments...
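On FreeBSD of this era, the ARC bounds being discussed are loader tunables. The values below simply mirror the numbers mentioned in this exchange and are illustrative, not a recommendation:

```shell
# /boot/loader.conf -- takes effect at the next boot
vfs.zfs.arc_min="1900M"
vfs.zfs.arc_max="2098M"
```

The current sizes can then be checked at run time with `sysctl kstat.zfs.misc.arcstats.size`.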
Re: [zfs-discuss] ZFS percent busy vs zpool iostat
OK, I think I have found the biggest issue. The drives are 4k sector drives, and I wasn't aware of that. My fault; I should have checked this. I'd had the disks for ages and they are sub-1TB, so I had the idea that they wouldn't be 4k drives...

I will obviously have to address this, either by creating a pool using 4k-aware zfs commands or by replacing the disks.

Anyway, thanks to all, and to Taemun for getting me to check this...
Re: [zfs-discuss] System crash on zpool attach object_count == usedobjs failed assertion
I've just run zdb against the two pools on my home OpenSolaris box, and now both are showing this failed assertion, with the counts off by one.

# zdb rpool > /dev/null
Assertion failed: object_count == usedobjs (0x18da2 == 0x18da3), file ../zdb.c, line 1460
Abort (core dumped)

# zdb rz2pool > /dev/null
Assertion failed: object_count == usedobjs (0x2ba25 == 0x2ba26), file ../zdb.c, line 1460
Abort (core dumped)

The last time I checked them with zdb, probably a few months back, they were fine. And since the pools otherwise seem to be behaving without problem, I've had no reason to run zdb. 'zpool status' looks fine, and the pools mount without problem. 'zpool scrub' works without problem. I have been upgrading to most of the recent 'dev' versions of OpenSolaris. I wonder if there is some bug in the code that could cause this assertion. Maybe one unusual thing is that I have not yet upgraded the versions of the pools.

# uname -a
SunOS opensolaris 5.11 snv_133 i86pc i386 i86pc

# zpool upgrade
This system is currently running ZFS pool version 22.
The following pools are out of date, and can be upgraded. After being
upgraded, these pools will no longer be accessible by older software versions.
VER POOL
--- -------
 13 rpool
 16 rz2pool

The assertion is being tracked by this bug:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6801840
...but in that report, the counts are not off by one. Unfortunately, there is little indication of any progress being made. Maybe some other 'zfs-discuss' readers would try zdb on their pools, if using a recent dev build, and see if they get a similar problem...

Thanks
Nigel Smith

# mdb core
Loading modules: [ libumem.so.1 libc.so.1 libzpool.so.1 libtopo.so.1 libavl.so.1 libnvpair.so.1 ld.so.1 ]
> ::status
debugging core file of zdb (64-bit) from opensolaris
file: /usr/sbin/amd64/zdb
initial argv: zdb rpool
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=883 uid=0 code=-1
panic message: Assertion failed: object_count == usedobjs (0x18da2 == 0x18da3), file ../zdb.c, line 1460
> $C
fd7fffdff090 libc.so.1`_lwp_kill+0xa()
fd7fffdff0b0 libc.so.1`raise+0x19()
fd7fffdff0f0 libc.so.1`abort+0xd9()
fd7fffdff320 libc.so.1`_assert+0x7d()
fd7fffdff810 dump_dir+0x35a()
fd7fffdff840 dump_one_dir+0x54()
fd7fffdff850 libzpool.so.1`findfunc+0xf()
fd7fffdff940 libzpool.so.1`dmu_objset_find_spa+0x39f()
fd7fffdffa30 libzpool.so.1`dmu_objset_find_spa+0x1d2()
fd7fffdffb20 libzpool.so.1`dmu_objset_find_spa+0x1d2()
fd7fffdffb40 libzpool.so.1`dmu_objset_find+0x2c()
fd7fffdffb70 dump_zpool+0x197()
fd7fffdffc10 main+0xa3d()
fd7fffdffc20 0x406e6c()
Re: [zfs-discuss] System crash on zpool attach object_count == usedobjs failed assertion
Hi Stephen

If your system is crashing while attaching the new device, are you getting a core dump file? If so, it would be interesting to examine the file with mdb, to see the stack backtrace, as this may give a clue to what's going wrong. What storage controller are you using for the disks? And what device driver is the controller using?

Thanks
Nigel Smith
Re: [zfs-discuss] crashed zpool
Hello Carsten

Have you examined the core dump file with mdb's ::stack to see if this gives a clue to what happened?

Regards
Nigel
Re: [zfs-discuss] Help with itadm commands
The iSCSI COMSTAR Port Provider is not installed by default. What release of OpenSolaris are you running?

If pre snv_133, then:
$ pfexec pkg install SUNWiscsit

For snv_133, I think it will be:
$ pfexec pkg install network/iscsi/target

Regards
Nigel Smith
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
Hi Matt

Are you seeing low speeds on writes only, or on both read AND write? Are you seeing low speed just with iSCSI, or also with NFS or CIFS?

> I've tried updating to COMSTAR (although I'm not certain that I'm actually
> using it)

To check, do this:

# svcs -a | grep iscsi

If 'svc:/system/iscsitgt:default' is online, you are using the old, mature 'user mode' iscsi target. If 'svc:/network/iscsi/target:default' is online, then you are using the new 'kernel mode' COMSTAR iscsi target.

For another good way to monitor disk i/o, try:

# iostat -xndz 1

http://docs.sun.com/app/docs/doc/819-2240/iostat-1m?a=view

Don't just assume that your Ethernet/IP/TCP layers are performing to the optimum - check it. I often use 'iperf' or 'netperf' to do this:

http://blogs.sun.com/observatory/entry/netperf

(Iperf is available by installing the SUNWiperf package. A package for netperf is in the contrib repository.)

The last time I checked, the default values used in the OpenSolaris TCP stack are not optimum for Gigabit speed, and need to be adjusted. Here is some advice I found with Google, but there are others:

http://serverfault.com/questions/13190/what-are-good-speeds-for-iscsi-and-nfs-over-1gb-ethernet

BTW, what sort of network card are you using, as this can make a difference.

Regards
Nigel Smith
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
Hi Matt

> Haven't gotten NFS or CIFS to work properly. Maybe I'm just too dumb to
> figure it out, but I'm ending up with permissions errors that don't let me
> do much. All testing so far has been with iSCSI.

So until you can test NFS or CIFS, we don't know if it's a general performance problem, or just an iSCSI problem. To get CIFS working, try this:

http://blogs.sun.com/observatory/entry/accessing_opensolaris_shares_from_windows

> Here's IOStat while doing writes :
> Here's IOStat when doing reads :

You're getting 1000 kr/s & kw/s, so add the iostat 'M' option to display throughput in MegaBytes per second.

> It'll sustain 10-12% gigabit for a few minutes, have a little dip,

I'd still be interested to see the size of the TCP buffers. What does this report:

# ndd /dev/tcp tcp_xmit_hiwat
# ndd /dev/tcp tcp_recv_hiwat
# ndd /dev/tcp tcp_conn_req_max_q
# ndd /dev/tcp tcp_conn_req_max_q0

> Current NIC is an integrated NIC on an Abit Fatality motherboard. Just
> your generic fare gigabit network card. I can't imagine that it would be
> holding me back that much though.

Well, there are sometimes bugs in the device drivers:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913756
http://sigtar.com/2009/02/12/opensolaris-rtl81118168b-issues/

That's why I say don't just assume the network is performing to the optimum. To do a local test, direct to the hard drives, you could try 'dd', with various transfer sizes. Some advice from BenR, here:

http://www.cuddletech.com/blog/pivot/entry.php?id=820

Regards
Nigel Smith
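If those buffers turn out to be at their historically small defaults, they can be raised at run time with ndd. The 1 MB value below is a commonly suggested starting point for gigabit links, not a verified optimum, and the change does not persist across reboots:

```shell
# Raise TCP send and receive buffer high-water marks to 1 MB
ndd -set /dev/tcp tcp_xmit_hiwat 1048576
ndd -set /dev/tcp tcp_recv_hiwat 1048576
```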
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
Another thing you could check, which has been reported to cause a problem, is if network or disk drivers share an interrupt with a slow device, like say a USB device. So try:

# echo ::interrupts -d | mdb -k

...and look for multiple driver names on an INT#.

Regards
Nigel Smith
Re: [zfs-discuss] Idiots Guide to Running a NAS with ZFS/OpenSolaris
Hi Robert Have a look at these links: http://delicious.com/nwsmith/opensolaris-nas Regards Nigel Smith
Re: [zfs-discuss] Disk Issues
I have booted an osol-dev-131 Live CD on a Dell Precision T7500, and the AHCI driver successfully loaded, giving access to the two SATA DVD drives in the machine. (Unfortunately, I did not have the opportunity to attach any hard drives, but I would expect that also to work.)

'scanpci' identified the southbridge as an Intel 82801JI (ICH10 family), vendor 0x8086, device 0x3a22.

AFAIK, as long as the SATA interface reports a PCI ID class-code of 010601, the AHCI device driver should load. The mode of the SATA interface will need to be selected in the BIOS. There are normally three modes: Native IDE, RAID or AHCI. 'scanpci' should report different class-codes depending on the mode selected in the BIOS:

RAID mode should report a class-code of 010400
IDE mode should report a class-code of 0101xx

With OpenSolaris, you can see the class-code in the output from 'prtconf -pv'. If Native IDE is selected, the ICH10 SATA interface should appear as two controllers, the first for ports 0-3 and the second for ports 4 and 5.

Regards
Nigel Smith
Re: [zfs-discuss] Painfully slow RAIDZ2 as fibre channel COMSTAR export
Hi Dave

So which hard drives are connected to which controllers? And what device drivers are those controllers using? The output from 'format', 'cfgadm' and 'prtconf -D' may help us to understand.

Strange that you say there are two hard drives per controller, but three drives are showing high %b. And strange that you have c7, c8, c9, c10 and c11, which looks like FIVE controllers!

Regards
Nigel Smith
[zfs-discuss] ..and now ZFS send dedupe
More ZFS goodness putback before close of play for snv_128. http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010768.html http://hg.genunix.org/onnv-gate.hg/rev/216d8396182e Regards Nigel Smith
Re: [zfs-discuss] ZFS + fsck
On Thu Nov 5 14:38:13 PST 2009, Gary Mills wrote: It would be nice to see this information at: http://hub.opensolaris.org/bin/view/Community+Group+on/126-130 but it hasn't changed since 23 October. Well it seems we have an answer: http://mail.opensolaris.org/pipermail/zfs-discuss/2009-November/033672.html On Mon Nov 9 14:26:54 PST 2009, James C. McPherson wrote: The flag days page has not been updated since the switch to XWiki, it's on my todo list but I don't have an ETA for when it'll be done. Perhaps anyone interested in seeing the flag days page resurrected can petition James to raise the priority on his todo list. Thanks Nigel Smith
Re: [zfs-discuss] marvell88sx2 driver build126
I think you can work out the files for the driver by looking here: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/pkgdefs/SUNWmv88sx/prototype_i386 So the 32-bit driver is: kernel/drv/marvell88sx And the 64-bit driver is: kernel/drv/amd64/marvell88sx It's a pity that the marvell driver is not open source. For the sata drivers that are open source (ahci, nv_sata, si3124) you can see the history of all the changes to the source code of the drivers, all cross-referenced to the bug numbers, using OpenGrok: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/sata/adapters/ Regards Nigel Smith
Re: [zfs-discuss] ZFS + fsck
Hi Robert I think you mean snv_128 not 126 :-) 6667683 need a way to rollback to an uberblock from a previous txg http://bugs.opensolaris.org/view_bug.do?bug_id=6667683 http://hg.genunix.org/onnv-gate.hg/rev/8aac17999e4d Regards Nigel Smith
Re: [zfs-discuss] ZFS + fsck
Hi Gary I will let 'website-discuss' know about this problem. They normally fix issues like that. Those pages always seemed to just update automatically. I guess it's related to the website transition. Thanks Nigel Smith
Re: [zfs-discuss] dedupe is in
Ok, thanks everyone then (but still thanks to Victor for the heads up) :-) On Mon, Nov 2, 2009 at 4:03 PM, Victor Latushkin victor.latush...@sun.com wrote: On 02.11.09 18:38, Ross wrote: Double WOHOO! Thanks Victor! Thanks should go to Tim Haley, Jeff Bonwick and George Wilson ;-)
Re: [zfs-discuss] dedupe is in
ZFS dedup will be in snv_128, but putbacks to snv_128 will not likely close till the end of this week. The OpenSolaris dev repository was updated to snv_126 last Thursday: http://mail.opensolaris.org/pipermail/opensolaris-announce/2009-October/001317.html So it looks like about 5 weeks before the dev repository will be updated to snv_128. Then we'll see if any bugs emerge as we all rush to test it out... Regards Nigel Smith
[zfs-discuss] zfs corrupts grub
This is opensolaris on a Tecra M5 using a 128GB SSD as the boot device. This device is partitioned into two roughly 60GB partitions. I installed opensolaris 2009.06 into the first partition, then did an image update to build 124 from the dev repository. All went well, so then I created a zpool from the second partition, which created fine, and I could add filesystems to that pool. However, when I came to reboot the laptop there was a message (I think from bootadm) about an unrecognised GRUB entry, and the reboot stopped with the word GRUB appearing at the top left of the screen. So it appears that zfs has done something to the grub entry such that I can no longer boot the laptop. Anyone have any ideas how to either recover from this and/or prevent this happening in the future? T
[zfs-discuss] How to map solaris disk devices to physical location for ZFS pool setup
Hi, I'm setting up a ZFS environment running on a Sun x4440 + J4400 arrays (similar to a 7410 environment) and I was trying to figure out the best way to map a disk drive's physical location (tray and slot) to the Solaris device c#t#d#. Do I need to install the CAM software to do this, or is there another way? I would like to understand the mapping from Solaris device to physical drive location so that I can set up my ZFS pool mirrors/raid properly. I'm currently running Solaris Express build 119. Thanks, David
[zfs-discuss] Read about ZFS backup - Still confused
I am just a simple home user. When I was using linux, I backed up my home directory (which contained all my critical data) using tar. I backed up my linux partition using partimage. These backups were put on DVDs. That way I could restore (and have) even if the hard drive completely went belly up. I would like to duplicate this scheme using zfs commands. I know I can copy a snapshot to a DVD, but can I recover using just the snapshot, or does it rely on the zfs file system on my hard drive being ok? Cork
Re: [zfs-discuss] Read about ZFS backup - Still confused
Let me try rephrasing this. I would like the ability to restore so my system mirrors its state at the time when I backed it up, given the old hard drive is now a door stop. Cork
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Adam The 'OpenSolaris Development Release Packaging Repository' has recently been updated to release 121. http://mail.opensolaris.org/pipermail/opensolaris-announce/2009-August/001253.html http://pkg.opensolaris.org/dev/en/index.shtml Just to be totally clear, are you recommending that anyone using raidz, raidz2 or raidz3 should not upgrade to that release? For the people who have already upgraded, presumably the recommendation is that they should revert to a pre-121 BE. Thanks Nigel Smith
Re: [zfs-discuss] ZFS Confusion
Hi Volker, On Fri, Aug 21, 2009 at 5:42 PM, Volker A. Brandtv...@bb-c.de wrote: Can you actually see the literal commands? A bit like MySQL's 'show create table'? Or are you just interpreting the output? Just interpreting the output. Actually you could see the commands on the old server by using zpool history oradata That's awesome - thank you very much! S. -- Stephen Nelson-Smith Technical Director Atalanta Systems Ltd www.atalanta-systems.com
Re: [zfs-discuss] ZFS Confusion
Sorry - didn't realise I'd replied only to you. You can either set the mountpoint property when you create the dataset or do it in a second operation after the create. Either: # zfs create -o mountpoint=/u01 rpool/u01 or: # zfs create rpool/u01 # zfs set mountpoint=/u01 rpool/u01 Got you. I'm not sure about the remote mount. It appears to be a local SMB resource mounted as NFS? I've never seen that before. Ah, that's just a Sharity mount - it's a red herring. u0[1-4] will be the same. Thanks very much, S.
Re: [zfs-discuss] Shrinking a zpool?
Hi Matt Thanks for this update, and the confirmation to the outside world that this problem is being actively worked on with significant resources. But I would like to support Cyril's comment. AFAIK, any updates you are making to bug 4852783 are not available to the outside world via the normal bug URL. It would be useful if we were able to see them. I think it is frustrating for the outside world that it cannot see Sun's internal source code repositories for work in progress, and only sees the code when it is complete and pushed out. And so there is no way to judge what progress is being made, or to actively help with code reviews or testing. Best Regards Nigel Smith
Re: [zfs-discuss] Shrinking a zpool?
Bob Friesenhahn wrote: Sun has placed themselves in the interesting predicament that being open about progress on certain high-profile enterprise features (such as shrink and de-duplication) could cause them to lose sales to a competitor. Perhaps this is a reason why Sun is not nearly as open as we would like them to be. I agree that it is difficult for Sun, at this time, to be more 'open', especially for ZFS, as we still await the resolution of Oracle purchasing Sun, the court case with NetApp over patents, and now the GreenBytes issue! But I would say they are more likely to avoid losing sales by confirming what enhancements they are prioritising. I think people will wait if they know work is being done, and progress being made, although not indefinitely. I guess it depends on the rate of progress of ZFS compared to, say, btrfs. I would say that maybe Sun should have held back on announcing the work on deduplication, as it just seems to have ramped up frustration, now that it seems no more news is forthcoming. It's easy to be wise after the event, and time will tell. Thanks Nigel Smith
Re: [zfs-discuss] Tunable iSCSI timeouts - ZFS over iSCSI fix
Yup, somebody pointed that out to me last week and I can't wait :-) On Wed, Jul 29, 2009 at 7:48 PM, Davedave-...@dubkat.com wrote: Anyone (Ross?) creating ZFS pools over iSCSI connections will want to pay attention to snv_121 which fixes the 3 minute hang after iSCSI disk problems: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=649 Yay!
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
David Magda wrote: This is also (theoretically) why a drive purchased from Sun is more expensive than a drive purchased from your neighbourhood computer shop: Sun (and presumably other manufacturers) takes the time and effort to test things to make sure that when a drive says I've synced the data, it actually has synced the data. This testing is what you're presumably paying for. So how do you test a hard drive to check it does actually sync the data? How would you do it in theory? And in practice? Now say we are talking about a virtual hard drive, rather than a physical hard drive. How would that affect the answer to the above questions? Thanks Nigel
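One practical (if crude) answer to Nigel's question is to time synchronous writes. A 7200 rpm disk cannot complete a committed write much faster than one platter rotation (~8.3 ms), so a long run of sub-millisecond fsync acknowledgements suggests something in the stack is acknowledging from volatile cache. This sketch is my own illustration, not from the thread, and only measures the OS-visible latency:

```python
import os
import tempfile
import time

def fsync_write_latencies(path, n=50, size=4096):
    """Time n synchronous writes to path. Sustained latencies far below
    one platter rotation (~8.3 ms at 7200 rpm) hint that some layer is
    caching rather than committing to stable storage."""
    buf = os.urandom(size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    latencies = []
    try:
        for _ in range(n):
            t0 = time.perf_counter()
            os.write(fd, buf)
            os.fsync(fd)  # ask the OS to push the data to stable storage
            latencies.append(time.perf_counter() - t0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies

lats = fsync_write_latencies(tempfile.mktemp())
print("avg fsync latency: %.3f ms" % (1000 * sum(lats) / len(lats)))
```

For a virtual hard drive the same timing test applies, but the hypervisor's cache policy rather than the platter sets the latency floor, which is part of why the question is harder to answer there.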
[zfs-discuss] ZFS gzip Death Spiral Revisited
I have the following configuration. My storage: 12 LUNs from a Clariion 3x80; each LUN is a whole 6-disk raid-6. My host: Sun t5240 with 32 hardware threads and 16 gig of ram. My zpool: all 12 LUNs from the clariion in a simple pool. My test data: a 1 gig backup file of a ufsdump from /opt on a machine with lots of mixed binary/text data, and a 15 gig file that is already tightly compressed. I wrote some benchmarks and tested; this system is completely idle except for testing. With the 1 gig file I tested record sizes of 8, 16, 32 and 128k, and compression off, on, and gzip. 128k record sizes were fastest. gzip compression was fastest. Using the best of those results, I then ran the torture test with a file almost as large as system memory that was already compressed. The results were the infamous lock up, stutter, can't kill the cp/dd command, oh god, system console is unresponsive too, what has science done?!?! In the past threads I dug up, it seems that people were using wimpier hardware or gzip-9 and running into this. I ran into it with very capable hardware. I do not get this behavior using the default lzjb compression, and I was also able to produce it using weaker gzip-3 compression. Is there a fix for this I am not aware of? Workaround? Etc? gzip compression works wonderfully with the uncompressed smaller 1-4 gig files I am trying. It would be a shame to use the weaker default compression because of this test case.
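The CPU-cost asymmetry behind this is easy to reproduce in userland with zlib, the same deflate algorithm behind ZFS's gzip-N settings: compressible data rewards higher levels, while already-compressed data burns CPU for a ratio near 1.0. This is an illustrative benchmark of the algorithm only, not a reproduction of the in-kernel write path:

```python
import os
import time
import zlib

def bench(data, levels=(1, 3, 6, 9)):
    """Return (level, compression_ratio, seconds) for each gzip level."""
    results = []
    for level in levels:
        t0 = time.perf_counter()
        out = zlib.compress(data, level)
        results.append((level, len(data) / len(out), time.perf_counter() - t0))
    return results

text_like = b"mixed binary/text data from a ufsdump backup image\n" * 8192
pre_compressed = os.urandom(512 * 1024)  # stands in for an already-compressed file

for label, data in (("text-like", text_like), ("pre-compressed", pre_compressed)):
    for level, ratio, secs in bench(data):
        print("%-14s gzip-%d  ratio %6.2f  %7.2f ms" % (label, level, ratio, secs * 1000))
```

On the incompressible input every level returns a ratio of roughly 1.0, so all the CPU spent at gzip-3 or gzip-9 is wasted, which is consistent with the torture-test file being the one that triggers the stall.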
[zfs-discuss] Comstar production-ready?
Hi, I recommended a ZFS-based archive solution to a client needing to have a network-based archive of 15TB of data in a remote datacentre. I based this on an X2200 + J4400, Solaris 10 + rsync. This was enthusiastically received, to the extent that the client is now requesting that their live system (15TB data on cheap SAN and Linux LVM) be replaced with a ZFS-based system. The catch is that they're not ready to move their production systems off Linux - so web, db and app layer will all still be on RHEL 5. As I see it, if they want to benefit from ZFS at the storage layer, the obvious solution would be a NAS system, such as a 7210, or something built from a JBOD and a head node that does something similar. The 7210 is out of budget - and I'm not quite sure how it presents its storage - is it NFS/CIFS? If so, presumably it would be relatively easy to build something equivalent, but without the (awesome) interface. The interesting alternative is to set up Comstar on SXCE, create zpools and volumes, and make these available either over a fibre infrastructure, or iSCSI. I'm quite excited by this as a solution, but I'm not sure if it's really production ready. What other options are there, and what advice/experience can you share? Thanks, S. -- Stephen Nelson-Smith Technical Director Atalanta Systems Ltd www.atalanta-systems.com
Re: [zfs-discuss] ZFS: unreliable for professional usage?
Hey guys, I'll let this die in a sec, but I just wanted to say that I've gone and read the on-disk document again this morning, and to be honest Richard, without the description you just wrote, I really wouldn't have known that uberblocks are in a 128-entry circular queue that's 4x redundant. Please understand that I'm not asking for answers to these notes; this post is purely to illustrate to you ZFS guys that, much as I appreciate having the ZFS docs available, they are very tough going for anybody who isn't a ZFS developer. I consider myself well above average in IT ability, and I've really spent quite a lot of time in the past year reading around ZFS, but even so I would definitely have come to the wrong conclusion regarding uberblocks. Richard's post I can understand really easily, but in the on-disk format docs that information is spread over 7 pages of really quite technical detail, and to be honest, for a user like myself it raises as many questions as it answers: On page 6 I learn that labels are stored on each vdev, as well as each disk. So there will be a label on the pool, mirror (or raid group), and disk. I know the disk ones are at the start and end of the disk, and it sounds like the mirror vdev is in the same place, but where is the root vdev label? The example given doesn't mention its location at all. Then, on page 7 it sounds like the entire label is overwritten whenever on-disk data is updated - any time on-disk data is overwritten, there is potential for error. To me, it sounds like it's not a 128-entry queue, but just a group of 4 labels, all of which are overwritten as data goes to disk. Then finally, on page 12 the uberblock is mentioned (although as an aside, the first time I read these docs I had no idea what the uberblock actually was). It does say that only one uberblock is active at a time, but with it being part of the label I'd just assume these were overwritten as a group.
And that's why I'll often throw ideas out - I can either rely on my own limited knowledge of ZFS to say if it will work, or I can take advantage of the excellent community we have here, and post the idea for all to see. It's a quick way for good ideas to be improved upon, and bad ideas consigned to the bin. I've done it before in my rather lengthy 'zfs availability' thread. My thoughts there were thrashed out nicely, with some quite superb additions (namely the concept of lop-sided mirrors, which I think are a great idea). Ross PS. I've also found why I thought you had to search for these blocks; it was after reading this thread where somebody used mdb to search a corrupt pool to try to recover data: http://opensolaris.org/jive/message.jspa?messageID=318009 On Fri, Feb 13, 2009 at 11:09 PM, Richard Elling richard.ell...@gmail.com wrote: Tim wrote: On Fri, Feb 13, 2009 at 4:21 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us mailto:bfrie...@simple.dallas.tx.us wrote: On Fri, 13 Feb 2009, Ross Smith wrote: However, I've just had another idea. Since the uberblocks are pretty vital in recovering a pool, and I believe it's a fair bit of work to search the disk to find them, might it be a good idea to allow ZFS to store uberblock locations elsewhere for recovery purposes? Perhaps it is best to leave decisions on these issues to the ZFS designers who know how things work. Previous descriptions from people who do know how things work didn't make it sound very difficult to find the last 20 uberblocks. It sounded like they were at known points for any given pool. Those folks have surely tired of this discussion by now and are working on actual code rather than reading idle discussion between several people who don't know the details of how things work. People who don't know how things work often aren't tied down by the baggage of knowing how things work. Which leads to creative solutions those who are weighed down didn't think of.
I don't think it hurts in the least to throw out some ideas. If they aren't valid, it's not hard to ignore them and move on. It surely isn't a waste of anyone's time to spend 5 minutes reading a response and weighing if the idea is valid or not. OTOH, anyone who followed this discussion the last few times, has looked at the on-disk format documents, or reviewed the source code would know that the uberblocks are kept in an 128-entry circular queue which is 4x redundant with 2 copies each at the beginning and end of the vdev. Other metadata, by default, is 2x redundant and spatially diverse. Clearly, the failure mode being hashed out here has resulted in the defeat of those protections. The only real question is how fast Jeff can roll out the feature to allow reverting to previous uberblocks. The procedure for doing this by hand has long been known, and was posted on this forum -- though it is tedious. -- richard
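For readers who, like Ross, found the on-disk format document heavy going, Richard's description can be sketched numerically. Per the on-disk specification, each vdev carries four 256 KiB labels (two at the front, two at the back), and the top 128 KiB of each label holds the 128-slot uberblock array. The sketch below follows that layout; treating slot selection as simply txg modulo 128 is a simplification (the real slot stride depends on the uberblock size):

```python
LABEL_SIZE = 256 * 1024       # each of the 4 redundant labels is 256 KiB
UB_ARRAY_OFFSET = 128 * 1024  # the uberblock array fills the label's top half
UB_SLOT_SIZE = 1024           # 128 slots of 1 KiB each
UB_SLOTS = 128

def label_offsets(vdev_size):
    """Byte offsets of the four labels: two at the front of the vdev
    (L0, L1) and two at the back (L2, L3)."""
    return [0, LABEL_SIZE,
            vdev_size - 2 * LABEL_SIZE, vdev_size - LABEL_SIZE]

def uberblock_offset(vdev_size, label, txg):
    """Offset of the uberblock slot a given txg lands in. The array is a
    circular queue: a slot is only rewritten after 128 more txg commits,
    so recent history survives even when the newest write is bad."""
    slot = txg % UB_SLOTS
    return label_offsets(vdev_size)[label] + UB_ARRAY_OFFSET + slot * UB_SLOT_SIZE
```

This is also why only one uberblock is "active" (the valid one with the highest txg) while the label as a whole is not rewritten wholesale on every transaction, the distinction Ross found hard to extract from the docs.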
Re: [zfs-discuss] ZFS: unreliable for professional usage?
On Fri, Feb 13, 2009 at 7:41 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Fri, 13 Feb 2009, Ross wrote: Something like that will have people praising ZFS' ability to safeguard their data, and the way it recovers even after system crashes or when hardware has gone wrong. You could even have a common causes of this are... message, or a link to an online help article if you wanted people to be really impressed. I see a career in politics for you. Barring an operating system implementation bug, the type of problem you are talking about is due to improperly working hardware. Irreversibly reverting to a previous checkpoint may or may not obtain the correct data. Perhaps it will produce a bunch of checksum errors. Yes, the root cause is improperly working hardware (or an OS bug like 6424510), but with ZFS being a copy on write system, when errors occur with a recent write, for the vast majority of the pools out there you still have huge amounts of data that is still perfectly valid and should be accessible. Unless I'm misunderstanding something, reverting to a previous checkpoint gets you back to a state where ZFS knows it's good (or at least where ZFS can verify whether it's good or not). You have to consider that even with improperly working hardware, ZFS has been checksumming data, so if that hardware has been working for any length of time, you *know* that the data on it is good. Yes, if you have databases or files there that were mid-write, they will almost certainly be corrupted. But at least your filesystem is back, and it's in as good a state as it's going to be given that in order for your pool to be in this position, your hardware went wrong mid-write. And as an added bonus, if you're using ZFS snapshots, now your pool is accessible, you have a bunch of backups available so you can probably roll corrupted files back to working versions. For me, that is about as good as you can get in terms of handling a sudden hardware failure. 
Everything that is known to be saved to disk is there, you can verify (with absolute certainty) whether data is ok or not, and you have backup copies of damaged files. In the old days you'd need to be reverting to tape backups for both of these, with potentially hours of downtime before you even know where you are. Achieving that in a few seconds (or minutes) is a massive step forwards. There are already people praising ZFS' ability to safeguard their data, and the way it recovers even after system crashes or when hardware has gone wrong. Yes there are, but the majority of these are praising the ability of ZFS checksums to detect bad data, and to repair it when you have redundancy in your pool. I've not seen that many cases of people praising ZFS' recovery ability - uberblock problems seem to have a nasty habit of leaving you with tons of good, checksummed data on a pool that you can't get to, and while many hardware problems are dealt with, others can hang your entire pool. Bob == Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS: unreliable for professional usage?
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Fri, 13 Feb 2009, Ross Smith wrote: You have to consider that even with improperly working hardware, ZFS has been checksumming data, so if that hardware has been working for any length of time, you *know* that the data on it is good. You only know this if the data has previously been read. Assume that the device temporarily stops physically writing, but otherwise responds normally to ZFS. Then the device starts writing again (including a recent uberblock), but with a large gap in the writes. Then the system loses power, or crashes. What happens then? Well in that case you're screwed, but if ZFS is known to handle even corrupted pools automatically, when that happens the immediate response on the forums is going to be something really bad has happened to your hardware, followed by troubleshooting to find out what. Instead of the response now, where we all know there's every chance the data is ok, and it just can't be gotten to without zdb. Also, that's a pretty extreme situation, since you'd need a device that is being written to but not read from to fail in this exact way. It also needs to have no scrubbing being run, so the problem has remained undetected. However, even in that situation, if we assume that it happened and that these recovery tools are available, ZFS will either report that your pool is seriously corrupted, indicating a major hardware problem (and ZFS can now state this with some confidence), or ZFS will be able to open a previous uberblock, mount your pool and begin a scrub, at which point all your missing writes will be found too and reported. And then you can go back to your snapshots. :-D
Re: [zfs-discuss] ZFS: unreliable for professional usage?
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Fri, 13 Feb 2009, Ross Smith wrote: You have to consider that even with improperly working hardware, ZFS has been checksumming data, so if that hardware has been working for any length of time, you *know* that the data on it is good. You only know this if the data has previously been read. Assume that the device temporarily stops physically writing, but otherwise responds normally to ZFS. Then the device starts writing again (including a recent uberblock), but with a large gap in the writes. Then the system loses power, or crashes. What happens then? Hey Bob, Thinking about this a bit more, you've given me an idea: Would it be worth ZFS occasionally reading previous uberblocks from the pool, just to check they are there and working ok? I wonder if you could do this after a few uberblocks have been written. It would seem to be a good way of catching devices that aren't writing correctly early on, as well as a way of guaranteeing that previous uberblocks are available to roll back to should a write go wrong. I wonder what the upper limit for this kind of write failure is going to be. I've seen 30 second delays mentioned in this thread. How often are uberblocks written? Is there any guarantee that we'll always have more than 30 seconds' worth of uberblocks on a drive? Should ZFS be set so that it keeps either a given number of uberblocks, or 5 minutes' worth of uberblocks, whichever is the larger? Ross
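On Ross's question of how often uberblocks are written: one goes out per transaction group commit, and the interval is tunable (txg_time; around 5 seconds was a common default in this era, later raised to 30). Under the simplifying assumption of one uberblock slot consumed per txg, the history window of the 128-slot ring is easy to estimate:

```python
def uberblock_history_seconds(txg_interval_s, slots=128):
    """Wall-clock span covered by the 128-slot uberblock ring, assuming
    one slot is consumed per txg commit at a fixed interval."""
    return slots * txg_interval_s

# At a 5 s txg interval the ring spans ~10.7 minutes; at 30 s, 64 minutes.
print(uberblock_history_seconds(5), uberblock_history_seconds(30))
```

So even at the fastest interval the ring comfortably exceeds the 30-second delays mentioned in the thread, though an idle pool commits txgs less often and would cover a longer wall-clock span.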
Re: [zfs-discuss] ZFS: unreliable for professional usage?
You don't, but that's why I was wondering about time limits. You have to have a cut-off somewhere, but if you're checking the last few minutes of uberblocks, that really should cope with a lot. It seems like a simple enough thing to implement, and if a pool still gets corrupted with these checks in place, you can absolutely, positively blame it on the hardware. :D However, I've just had another idea. Since the uberblocks are pretty vital in recovering a pool, and I believe it's a fair bit of work to search the disk to find them, might it be a good idea to allow ZFS to store uberblock locations elsewhere for recovery purposes? This could be as simple as a USB stick plugged into the server, a separate drive, or a network server. I guess even the ZIL device would work if it's separate hardware. But knowing the locations of the uberblocks would save yet more time should recovery be needed. On Fri, Feb 13, 2009 at 8:59 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Fri, 13 Feb 2009, Ross Smith wrote: Thinking about this a bit more, you've given me an idea: Would it be worth ZFS occasionally reading previous uberblocks from the pool, just to check they are there and working ok? That sounds like a good idea. However, how do you know for sure that the data returned is not returned from a volatile cache? If the hardware is ignoring cache flush requests, then any data returned may be from a volatile cache. Bob
Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Heh, yeah, I've thought the same kind of thing in the past. The problem is that the argument doesn't really work for system admins. As far as I'm concerned, the 7000 series is a new hardware platform, with relatively untested drivers, running a software solution that I know is prone to locking up when hardware faults are handled badly by drivers. Fair enough, that actual solution is out of our price range, but I would still be very dubious about purchasing it. At the very least I'd be waiting a year for other people to work the kinks out of the drivers. Which is a shame, because ZFS has so many other great features it's easily our first choice for a storage platform. The one and only concern we have is its reliability. We have snv_106 running as a test platform now. If I felt I could trust ZFS 100% I'd roll it out tomorrow. On Thu, Feb 12, 2009 at 4:25 PM, Tim t...@tcsac.net wrote: On Thu, Feb 12, 2009 at 9:25 AM, Ross myxi...@googlemail.com wrote: This sounds like exactly the kind of problem I've been shouting about for 6 months or more. I posted a huge thread on availability on these forums because I had concerns over exactly this kind of hanging. ZFS doesn't trust hardware or drivers when it comes to your data - everything is checksummed. However, when it comes to seeing whether devices are responding, and checking for faults, it blindly trusts whatever the hardware or driver tells it. Unfortunately, that means ZFS is vulnerable to any unexpected bug or error in the storage chain. I've encountered at least two hang conditions myself (and I'm not exactly a heavy user), and I've seen several others on the forums, including a few on x4500's. Now, I do accept that errors like this will be few and far between, but they still mean you have the risk that a badly handled error condition can hang your entire server, instead of just one drive. Solaris can handle things like CPUs or memory going faulty, for crying out loud.
Its raid storage system had better be able to handle a disk failing. Sun seem to be taking the approach that these errors should be dealt with in the driver layer. And while that's technically correct, a reliable storage system had damn well better be able to keep the server limping along while we wait for patches to the storage drivers. ZFS absolutely needs an error handling layer between the volume manager and the devices. It needs to time out items that are not responding, and it needs to drop bad devices if they could cause problems elsewhere. And yes, I'm repeating myself, but I can't understand why this is not being acted on. Right now the error checking appears to be such that if an unexpected, or badly handled, error condition occurs in the driver stack, the pool or server hangs, whereas the expected behavior would be for just one drive to fail. The absolute worst case scenario should be that an entire controller has to be taken offline (and I would hope that the controllers in an x4500 would be running separate instances of the driver software). Not one of those conditions should be fatal; good storage designs cope with them all, and good error handling at the ZFS layer is absolutely vital when you have projects like Comstar introducing more and more types of storage device for ZFS to work with. Each extra type of storage introduces yet more software into the equation, and increases the risk of finding faults like this. While they will be rare, they should be expected, and ZFS should be designed to handle them. I'd imagine for the exact same reason short-stroking/right-sizing isn't a concern. We don't have this problem in the 7000 series, perhaps you should buy one of those. ;) --Tim
Re: [zfs-discuss] ZFS: unreliable for professional usage?
That would be the ideal, but really I'd settle for just improved error handling and recovery for now. In the longer term, disabling write caching by default for USB or Firewire drives might be nice. On Thu, Feb 12, 2009 at 8:35 PM, Gary Mills mi...@cc.umanitoba.ca wrote: On Thu, Feb 12, 2009 at 11:53:40AM -0500, Greg Palmer wrote: Ross wrote: I can also state with confidence that very, very few of the 100 staff working here will even be aware that it's possible to unmount a USB volume in windows. They will all just pull the plug when their work is saved, and since they all come to me when they have problems, I think I can safely say that pulling USB devices really doesn't tend to corrupt filesystems in Windows. Everybody I know just waits for the light on the device to go out. The key here is that Windows does not cache writes to the USB drive unless you go in and specifically enable them. It caches reads but not writes. If you enable them you will lose data if you pull the stick out before all the data is written. This is the type of safety measure that needs to be implemented in ZFS if it is to support the average user instead of just the IT professionals. That implies that ZFS will have to detect removable devices and treat them differently than fixed devices. It might have to be an option that can be enabled for higher performance with reduced data security. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Data loss bug - sidelined??
I can check on Monday, but the system will probably panic... which doesn't really help :-) Am I right in thinking failmode=wait is still the default? If so, that should be how it's set as this testing was done on a clean install of snv_106. From what I've seen, I don't think this is a problem with the zfs failmode. It's more of an issue of what happens in the period *before* zfs realises there's a problem and applies the failmode. This time there was just a window of a couple of minutes while commands would continue. In the past I've managed to stretch it out to hours. To me the biggest problems are: - ZFS accepting writes that don't happen (from both before and after the drive is removed) - No logging or warning of this in zpool status I appreciate that if you're using cache, some data loss is pretty much inevitable when a pool fails, but that should be a few seconds worth of data at worst, not minutes or hours worth. Also, if a pool fails completely and there's data in the cache that hasn't been committed to disk, it would be great if Solaris could respond by: - immediately dumping the cache to any (all?) working storage - prompting the user to fix the pool, or save the cache before powering down the system Ross On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling richard.ell...@gmail.com wrote: Ross, this is a pretty good description of what I would expect when failmode=continue. What happens when failmode=panic? -- richard Ross wrote: Ok, it's still happening in snv_106: I plugged a USB drive into a freshly installed system, and created a single disk zpool on it: # zpool create usbtest c1t0d0 I opened the (nautilus?) file manager in gnome, and copied the /etc/X11 folder to it. I then copied the /etc/apache folder to it, and at 4:05pm, disconnected the drive. At this point there are *no* warnings on screen, or any indication that there is a problem. To check that the pool was still working, I created duplicates of the two folders on that drive. 
That worked without any errors, although the drive was physically removed. 4:07pm I ran zpool status, the pool is actually showing as unavailable, so at least that has happened faster than my last test. The folder is still open in gnome, however any attempt to copy files to or from it just hangs the file transfer operation window. 4:09pm /usbtest is still visible in gnome Also, I can still open a console and use the folder: # cd usbtest # ls X11X11 (copy) apache apache (copy) I also tried: # mv X11 X11-test That hung, but I saw the X11 folder disappear from the graphical file manager, so the system still believes something is working with this pool. The main GUI is actually a little messed up now. The gnome file manager window looking at the /usbtest folder has hung. Also, right-clicking the desktop to open a new terminal hangs, leaving the right-click menu on screen. The main menu still works though, and I can still open a new terminal. 4:19pm Commands such as ls are finally hanging on the pool. At this point I tried to reboot, but it appears that isn't working. I used system monitor to kill everything I had running and tried again, but that didn't help. I had to physically power off the system to reboot. After the reboot, as expected, /usbtest still exists (even though the drive is disconnected). I removed that folder and connected the drive. ZFS detects the insertion and automounts the drive, but I find that although the pool is showing as online, and the filesystem shows as mounted at /usbtest. But the /usbtest directory doesn't exist. 
I had to export and import the pool to get it available, but as expected, I've lost data: # cd usbtest # ls X11 even worse, zfs is completely unaware of this: # zpool status -v usbtest pool: usbtest state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM usbtest ONLINE 0 0 0 c1t0d0 ONLINE 0 0 0 errors: No known data errors So in summary, there are a good few problems here, many of which I've already reported as bugs: 1. ZFS still accepts read and write operations for a faulted pool, causing data loss that isn't necessarily reported by zpool status. 2. Even after writes start to hang, it's still possible to continue reading data from a faulted pool. 3. A faulted pool causes unwanted side effects in the GUI, making the system hard to use, and impossible to reboot. 4. After a hard reset, ZFS does not recover cleanly. Unused mountpoints are left behind. 5. Automatic mounting of pools doesn't seem to work reliably. 6. zpool status doesn't inform of any problems mounting the pool. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
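As an aside for anyone repeating this test: the clean way to detach a single-disk USB pool is to export it first, and the failmode property is what governs I/O behaviour once ZFS does fault the pool. A rough sketch against the pool created above (requires a live system with the pool imported; property values are per the zpool man page):

```shell
# failmode governs I/O once ZFS declares the pool faulted:
#   wait (default) - block until the device returns
#   continue       - return EIO on new writes
#   panic          - panic the host
zpool get failmode usbtest
zpool set failmode=continue usbtest

# The clean way to detach the drive: export flushes outstanding
# writes and unmounts the pool's filesystems before the cable is pulled
sync
zpool export usbtest
```

None of this excuses the cache behaviour described above, but it avoids triggering it in routine use.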
Re: [zfs-discuss] Data loss bug - sidelined??
Something to do with cache was my first thought. It seems to be able to read and write from the cache quite happily for some time, regardless of whether the pool is live. If you're reading or writing large amounts of data, zfs starts experiencing IO faults and offlines the pool pretty quickly. If you're just working with small datasets, or viewing files that you've recently opened, it seems you can stretch it out for quite a while. But yes, it seems that it doesn't enter failmode until the cache is full. I would expect it to hit this within 5 seconds (since I believe that is how often the cache should be writing). On Fri, Feb 6, 2009 at 7:04 PM, Brent Jones br...@servuhome.net wrote: On Fri, Feb 6, 2009 at 10:50 AM, Ross Smith myxi...@googlemail.com wrote: I can check on Monday, but the system will probably panic... which doesn't really help :-) Am I right in thinking failmode=wait is still the default? If so, that should be how it's set as this testing was done on a clean install of snv_106. From what I've seen, I don't think this is a problem with the zfs failmode. It's more of an issue of what happens in the period *before* zfs realises there's a problem and applies the failmode. This time there was just a window of a couple of minutes while commands would continue. In the past I've managed to stretch it out to hours. To me the biggest problems are: - ZFS accepting writes that don't happen (from both before and after the drive is removed) - No logging or warning of this in zpool status I appreciate that if you're using cache, some data loss is pretty much inevitable when a pool fails, but that should be a few seconds worth of data at worst, not minutes or hours worth. Also, if a pool fails completely and there's data in the cache that hasn't been committed to disk, it would be great if Solaris could respond by: - immediately dumping the cache to any (all?) 
working storage - prompting the user to fix the pool, or save the cache before powering down the system Ross On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling richard.ell...@gmail.com wrote: Ross, this is a pretty good description of what I would expect when failmode=continue. What happens when failmode=panic? -- richard Ross wrote: Ok, it's still happening in snv_106: I plugged a USB drive into a freshly installed system, and created a single disk zpool on it: # zpool create usbtest c1t0d0 I opened the (nautilus?) file manager in gnome, and copied the /etc/X11 folder to it. I then copied the /etc/apache folder to it, and at 4:05pm, disconnected the drive. At this point there are *no* warnings on screen, or any indication that there is a problem. To check that the pool was still working, I created duplicates of the two folders on that drive. That worked without any errors, although the drive was physically removed. 4:07pm I ran zpool status, the pool is actually showing as unavailable, so at least that has happened faster than my last test. The folder is still open in gnome, however any attempt to copy files to or from it just hangs the file transfer operation window. 4:09pm /usbtest is still visible in gnome Also, I can still open a console and use the folder: # cd usbtest # ls X11X11 (copy) apache apache (copy) I also tried: # mv X11 X11-test That hung, but I saw the X11 folder disappear from the graphical file manager, so the system still believes something is working with this pool. The main GUI is actually a little messed up now. The gnome file manager window looking at the /usbtest folder has hung. Also, right-clicking the desktop to open a new terminal hangs, leaving the right-click menu on screen. The main menu still works though, and I can still open a new terminal. 4:19pm Commands such as ls are finally hanging on the pool. At this point I tried to reboot, but it appears that isn't working. 
I used system monitor to kill everything I had running and tried again, but that didn't help. I had to physically power off the system to reboot. After the reboot, as expected, /usbtest still exists (even though the drive is disconnected). I removed that folder and connected the drive. ZFS detects the insertion and automounts the drive, but I find that although the pool is showing as online, and the filesystem shows as mounted at /usbtest. But the /usbtest directory doesn't exist. I had to export and import the pool to get it available, but as expected, I've lost data: # cd usbtest # ls X11 even worse, zfs is completely unaware of this: # zpool status -v usbtest pool: usbtest state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM usbtest ONLINE 0 0 0 c1t0d0ONLINE 0 0 0 errors: No known data errors So in summary, there are a good few problems here, many of which I've already reported as bugs: 1. ZFS
Re: [zfs-discuss] Any way to set casesensitivity=mixed on the main pool?
It's not intuitive because when you know that -o sets options, an error message saying that it's not a valid property makes you think that it's not possible to do what you're trying. Documented and intuitive are very different things. I do appreciate that the details are there in the manuals, but for items like this where it's very easy to pick the wrong one, it helps if the commands can work with you. The difference between -o and -O is pretty subtle, I just think that extra sentence in the error message could save a lot of frustration when people get mixed up. Ross On Wed, Feb 4, 2009 at 11:14 AM, Darren J Moffat darr...@opensolaris.org wrote: Ross wrote: Good god. Talk about non intuitive. Thanks Darren! Why isn't that intuitive ? It is even documented in the man page. zpool create [-fn] [-o property=value] ... [-O file-system- property=value] ... [-m mountpoint] [-R root] pool vdev ... Is it possible for me to suggest a quick change to the zpool error message in solaris? Should I file that as an RFE? I'm just wondering if the error message could be changed to something like: property 'casesensitivity' is not a valid pool property. Did you mean to use -O? It's just a simple change, but it makes it obvious that it can be done, instead of giving the impression that it's not possible. Feel free to log the RFE in defect.opensolaris.org. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
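For the archive, the distinction under discussion fits in two lines (pool and device names are placeholders):

```shell
# Lower-case -o sets *pool* properties
zpool create -o autoreplace=on tank c1t0d0

# Capital -O sets *file-system* properties on the pool's root dataset,
# which is where casesensitivity must be set at creation time
zpool create -O casesensitivity=mixed tank c1t0d0
```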
Re: [zfs-discuss] SSD drives in Sun Fire X4540 or X4500 for dedicated ZIL device
That's my understanding too. One (STEC?) drive as a write cache, basically a write-optimised SSD. And cheaper, larger, read-optimised SSDs for the read cache. I thought it was an odd strategy until I read into SSDs a little more and realised you really do have to think about your usage cases with these. SSDs are very definitely not all alike. On Fri, Jan 23, 2009 at 4:33 PM, Greg Mason gma...@msu.edu wrote: If I'm not mistaken (and somebody please correct me if I'm wrong), the Sun 7000 series storage appliances (the Fishworks boxes) use enterprise SSDs, with DRAM caching. One such product is made by STEC. My understanding is that the Sun appliances use one SSD for the ZIL, and one as a read cache. For the 7210 (which is basically a Sun Fire X4540), that gives you 46 disks and 2 SSDs. -Greg Bob Friesenhahn wrote: On Thu, 22 Jan 2009, Ross wrote: However, now I've written that, Sun use SATA (SAS?) SSDs in their high-end Fishworks storage, so I guess it definitely works for some use cases. But the fishworks (Fishworks is a development team, not a product) write cache device is not based on FLASH. It is based on DRAM. The difference is like night and day. Apparently there can also be a read cache which is based on FLASH. Bob == Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
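The split described above maps directly onto two zpool subcommands: a dedicated log (slog) device for the ZIL, and cache devices for the L2ARC (pool and device names are illustrative):

```shell
# Dedicated ZIL: small, write-optimised SSD added as a log vdev
zpool add tank log c2t0d0

# L2ARC read cache: larger, cheaper, read-optimised SSD as a cache vdev
zpool add tank cache c2t1d0
```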
[zfs-discuss] Verbose Information from zfs send -v snapshot
What does the 'verbose information' reported by zfs send -v snapshot contain? Also, on Solaris 10u6 I don't get any output at all - is this a bug? Regards, Nick -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
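One thing worth checking before calling it a bug: the -v output is written to stderr, not stdout, so it is easy to miss when the stream is redirected. A minimal sketch (dataset and snapshot names are illustrative; exactly what -v prints varies between releases):

```shell
# The snapshot stream goes to stdout; any -v diagnostics go to stderr
zfs send -v tank/home@monday > /backup/home-monday.zfs

# Or stream straight into a receive; -v on recv also reports progress
zfs send -v tank/home@monday | zfs recv -v tank/backup/home
```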
Re: [zfs-discuss] zfs list improvements?
Hmm... that's a tough one. To me, it's a trade off either way, using a -r parameter to specify the depth for zfs list feels more intuitive than adding extra commands to modify the -r behaviour, but I can see your point. But then, using -c or -d means there's an optional parameter for zfs list that you don't have in the other commands anyway. And would you have to use -c or -d with -r, or would they work on their own, providing two ways to achieve very similar functionality. Also, now you've mentioned that you want to keep things consistent among all the commands, keeping -c and -d free becomes more important to me. You don't know if you might want to use these for another command later on. It sounds to me that whichever way you implement it there's going to be some potential for confusion, but personally I'd stick with using -r. It leaves you with a single syntax for viewing children. The -r on the other commands can be modified to give an error message if they don't support this extra parameter, and it leaves both -c and -d free to use later on. Ross On Fri, Jan 9, 2009 at 7:16 PM, Richard Morris - Sun Microsystems - Burlington United States richard.mor...@sun.com wrote: On 01/09/09 01:44, Ross wrote: Can I ask why we need to use -c or -d at all? We already have -r to recursively list children, can't we add an optional depth parameter to that? You then have: zfs list : shows current level (essentially -r 0) zfs list -r : shows all levels (infinite recursion) zfs list -r 2 : shows 2 levels of children An optional depth argument to -r has already been suggested: http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/054241.html However, other zfs subcommands such as destroy, get, rename, and snapshot also provide -r options without optional depth arguments. And its probably good to keep the zfs subcommand option syntax consistent. 
On the other hand, if all of the zfs subcommands were modified to accept an optional depth argument to -r, then this would not be an issue. But, for example, the top level(s) of datasets cannot be destroyed if that would leave orphaned datasets. BTW, when no dataset is specified, zfs list is the same as zfs list -r (infinite recursion). When a dataset is specified then it shows only the current level. Does anyone have any non-theoretical situations where a depth option other than 1 or 2 would be used? Are scripts being used to work around this problem? -- Rich ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
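As for the scripted workarounds Rich asks about: until a depth option lands, one crude approach is to post-filter the recursive listing by counting path components (a sketch only; it assumes plain dataset names and no embedded newlines):

```shell
# List datasets under tank at most two levels deep:
# -H drops the header, -o name prints only the dataset name,
# and awk keeps names with at most three '/'-separated components
zfs list -rH -o name tank | awk -F/ 'NF <= 3'
```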
[zfs-discuss] zfs destroy is taking a long time...
I was wondering if anyone has any experience with how long a zfs destroy of about 40 TB should take? So far, it has been about an hour... Is there any good way to tell if it is working or if it is hung? Doing a zfs list just hangs. If you do a more specific zfs list, then it is okay... zfs list pool/another-fs Thanks, David -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
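One rough way to tell a long-running destroy apart from a hang is to watch whether the pool is still doing I/O; this only indicates activity, not progress, but a truly hung pool usually goes quiet:

```shell
# Print pool-wide I/O statistics every 5 seconds; steady read/write
# activity suggests the destroy is still walking and freeing blocks
zpool iostat pool 5
```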
Re: [zfs-discuss] zfs destroy is taking a long time...
A few more details: The system is a Sun x4600 running Solaris 10 Update 4. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs destroy is taking a long time...
On Thu, 2009-01-08 at 13:26 -0500, Brian H. Nelson wrote: David Smith wrote: I was wondering if anyone has any experience with how long a zfs destroy of about 40 TB should take? So far, it has been about an hour... Is there any good way to tell if it is working or if it is hung? Doing a zfs list just hangs. If you do a more specific zfs list, then it is okay... zfs list pool/another-fs Thanks, David I can't speak to something like 40 TB, but I can share a related story (on Solaris 10u5). A couple days ago, I tried to zfs destroy a clone of a snapshot of a 191 GB zvol. It didn't complete right away, but the machine appeared to continue working on it, so I decided to let it go overnight (it was near the end of the day). Well, by about 4:00 am the next day, the machine had completely run out of memory and hung. When I came in, I forced a sync from the PROM to get it back up. While it was booting, it stopped during (I think) the zfs initialization part, where it ran the disks for about 10 minutes before continuing. When the machine was back up, everything appeared to be ok. The clone was still there, although usage had changed to zero. I ended up patching the machine up to the latest u6 kernel + zfs patch (13-01 + 139579-01). After that, the zfs destroy went off without a hitch. I turned up bug 6606810 'zfs destroy volume is taking hours to complete' which is supposed to be fixed by 139579-01. I don't know if that was the cause of my issue or not. I've got a 2GB kernel dump if anyone is interested in looking. -Brian Brian, Thanks for the reply. I'll take a look at the 139579-01 patch. Perhaps a Sun engineer will also comment about this issue being fixed with patches, etc. David ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.
On Fri, Dec 19, 2008 at 6:47 PM, Richard Elling richard.ell...@sun.com wrote: Ross wrote: Well, I really like the idea of an automatic service to manage send/receives to backup devices, so if you guys don't mind, I'm going to share some other ideas for features I think would be useful. cool. One of the first is that you need some kind of capacity management and snapshot deletion. Eventually backup media are going to fill and you need to either prompt the user to remove snapshots, or even better, you need to manage the media automatically and remove old snapshots to make space for new ones. I've implemented something like this for a project I'm working on. Consider this a research project at this time, though I hope to leverage some of the things we learn as we scale up, out, and refine the operating procedures. Way cool :D There is a failure mode lurking here. Suppose you take two sets of snapshots: local and remote. You want to do an incremental send, for efficiency. So you look at the set of snapshots on both machines and find the latest, common snapshot. You will then send the list of incrementals from the latest, common through the latest snapshot. On the remote machine, if there are any other snapshots not in the list you are sending and newer than the latest, common snapshot, then the send/recv will fail. In practice, this can bite you if you use the zfs-auto-snapshot feature, which will automatically destroy older snapshots as it goes (e.g. the default policy for frequent snapshots is to take one every 15 minutes and keep 4). If you never have an interruption in your snapshot schedule, you can merrily cruise along and not worry about this. But if there is an interruption (for maintenance, perhaps) and a snapshot is destroyed on the sender, then you also must make sure it gets destroyed on the receiver. I just polished that code yesterday, and it seems to work fine... though it makes folks a little nervous. 
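The scheme Richard describes - find the latest common snapshot, then replicate everything from there through the newest - can be sketched with an incremental send (host, pool, and snapshot names are made up):

```shell
# Latest snapshot common to both sides: tank/data@2008-12-18
# Newest local snapshot:                tank/data@2008-12-19
# -I sends all intermediate snapshots between the two;
# -F on the receiving side rolls back/destroys anything newer than the
#  common snapshot, which is exactly the failure mode discussed here
zfs send -I tank/data@2008-12-18 tank/data@2008-12-19 | \
    ssh backuphost zfs recv -F backup/data
```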
Anyone with an operations orientation will recognize that there needs to be a good process wrapped around this, but I haven't worked through all of the scenarios on the receiver yet. Very true. In this context I think this would be fine. You would want a warning to pop up saying that a snapshot has been deleted locally and will have to be overwritten on the backup, but I think that would be ok. If necessary you could have a help page explaining why - essentially this is a copy of your pool, not just a backup of your files, and to work it needs an accurate copy of your snapshots. If you wanted to be really fancy, you could have an option for the user to view the affected files, but I think that's probably over complicating things. I don't suppose there's any way the remote snapshot can be cloned / separated from the pool just in case somebody wanted to retain access to the files within it? I'm thinking that a setup like time slider would work well, where you specify how many of each age of snapshot to keep. But I would want to be able to specify different intervals for different devices. eg. I might want just the latest one or two snapshots on a USB disk so I can take my files around with me. On a removable drive however I'd be more interested in preserving a lot of daily / weekly backups. I might even have an archive drive that I just store monthly snapshots on. What would be really good would be a GUI that can estimate how much space is going to be taken up for any configuration. You could use the existing snapshots on disk as a guide, and take an average size for each interval, giving you average sizes for hourly, daily, weekly, monthly, etc... ha ha, I almost blew coffee out my nose ;-) I'm sure that once the forward time-slider functionality is implemented, it will be much easier to manage your storage utilization :-) So, why am I giggling? My wife just remembered that she hadn't taken her photos off the camera lately... 
8 GByte SD cards are the vehicle of evil destined to wreck your capacity planning :-) Haha, that's a great image, but I've got some food for thought even with this. If you think about it, even though 8GB sounds a lot, it's barely over 1% of a 500GB drive, so it's not an unmanageable blip as far as storage goes. Also, if you're using the default settings for Tim's backups, you'll be taking snapshots every 15 minutes, hour, day, week and month. Now, when you start you're not going to have any sensible averages for your monthly snapshot sizes, but you're very rapidly going to get a set of figures for your 15 minute snapshots. What I would suggest is to use those to extrapolate forwards to give very rough estimates of usage early on, with warnings as to how rough these are. In time these estimates will improve in accuracy, and your 8GB photo 'blip' should be relatively easily incorporated. What you could maybe do is have a high and low usage estimate shown in the GUI. Early on these will be quite a
Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.
Absolutely. The tool shouldn't need to know that the backup disk is accessed via USB, or whatever. The GUI should, however, present devices intelligently, not as cXtYdZ! Yup, and that's easily achieved by simply prompting for a user friendly name as devices are attached. Now you could store that locally, but it would be relatively easy to drop an XML configuration file on the device too, allowing the same friendly name to be shown wherever it's connected. And this is sounding more and more like something I was thinking of developing myself. A proper Sun version would be much better though (not least before I've never developed anything for Solaris!). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.
On Thu, Dec 18, 2008 at 7:11 PM, Nicolas Williams nicolas.willi...@sun.com wrote: On Thu, Dec 18, 2008 at 07:05:44PM +, Ross Smith wrote: Absolutely. The tool shouldn't need to know that the backup disk is accessed via USB, or whatever. The GUI should, however, present devices intelligently, not as cXtYdZ! Yup, and that's easily achieved by simply prompting for a user friendly name as devices are attached. Now you could store that locally, but it would be relatively easy to drop an XML configuration file on the device too, allowing the same friendly name to be shown wherever it's connected. I was thinking more something like: - find all disk devices and slices that have ZFS pools on them - show users the devices and pool names (and UUIDs and device paths in case of conflicts).. I was thinking that device pool names are too variable, you need to be reading serial numbers or ID's from the device and link to that. - let the user pick one. - in the case that the user wants to initialize a drive to be a backup you need something more complex. - one possibility is to tell the user when to attach the desired backup device, in which case the GUI can detect the addition and then it knows that that's the device to use (but be careful to check that the user also owns the device so that you don't pick the wrong one on multi-seat systems) - another is to be much smarter about mapping topology to physical slots and present a picture to the user that makes sense to the user, so the user can click on the device they want. This is much harder. I was actually thinking of a resident service. Tim's autobackup script was capable of firing off backups when it detected the insertion of a USB drive, and if you've got something sitting there monitoring drive insertions you could have it prompt the user when new drives are detected, asking if they should be used for backups. Of course, you'll need some settings for this so it's not annoying if people don't want to use it. 
A simple tick box on that pop up dialog allowing people to say don't ask me again would probably do. You'd then need a second way to assign drives if the user changed their mind. I'm thinking this would be to load the software and select a drive. Mapping to physical slots would be tricky, I think you'd be better with a simple view that simply names the type of interface, the drive size, and shows any current disk labels. It would be relatively easy then to recognise the 80GB USB drive you've just connected. Also, because you're formatting these drives as ZFS, you're not restricted to just storing your backups on them. You can create a root pool (to contain the XML files, etc), and the backups can then be saved to a filesystem within that. That means the drive then functions as both a removable drive, and as a full backup for your system. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
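The layout suggested at the end - a pool on the external drive with a child file system for the received backups, leaving the top level free for metadata and ordinary files - is simple to set up (pool, dataset, and file names here are placeholders, not an agreed design):

```shell
# One pool on the external drive; backups live in a child dataset,
# so the rest of the pool stays usable as a normal removable drive
zpool create usbbackup c1t0d0
zfs create usbbackup/backups

# e.g. a friendly-name config file could sit at the pool's top level:
#   /usbbackup/backup-config.xml
```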
Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.
Of course, you'll need some settings for this so it's not annoying if people don't want to use it. A simple tick box on that pop up dialog allowing people to say don't ask me again would probably do. I would like something better than that. Don't ask me again sucks when much, much later you want to be asked and you don't know how to get the system to ask you. Only if your UI design doesn't make it easy to discover how to add devices another way, or turn this setting back on. My thinking is that this actually won't be the primary way of adding devices. It's simply there for ease of use for end users, as an easy way for them to discover that they can use external drives to back up their system. Once you have a backup drive configured, most of the time you're not going to want to be prompted for other devices. Users will generally set up a single external drive for backups, and won't want prompting every time they insert a USB thumb drive, a digital camera, phone, etc. So you need that initial prompt to make the feature discoverable, and then an easy and obvious way to configure backup devices later. You'd then need a second way to assign drives if the user changed their mind. I'm thinking this would be to load the software and select a drive. Mapping to physical slots would be tricky, I think you'd be better with a simple view that simply names the type of interface, the drive size, and shows any current disk labels. It would be relatively easy then to recognise the 80GB USB drive you've just connected. Right, so do as I suggested: tell the user to remove the device if it's plugged in, then plug it in again. That way you can know unambiguously (unless the user is doing this with more than one device at a time). That's horrible from a user's point of view though. Possibly worth having as a last resort, but I'd rather just let the user pick the device. This does have potential as a help me find my device feature though. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.
I was thinking more something like: - find all disk devices and slices that have ZFS pools on them - show users the devices and pool names (and UUIDs and device paths in case of conflicts).. I was thinking that device pool names are too variable, you need to be reading serial numbers or ID's from the device and link to that. Device names are, but there's no harm in showing them if there's something else that's less variable. Pool names are not very variable at all. I was thinking of something a little different. Don't worry about devices, because you don't send to a device (rather, send to a pool). So a simple list of source file systems and a list of destinations would do. I suppose you could work up something with pictures and arrows, like Nautilus, but that might just be more confusing than useful. True, but if this is an end user service, you want something that can create the filesystem for them on their devices. An advanced mode that lets you pick any destination filesystem would be good for network admins, but for end users they're just going to want to point this at their USB drive. But that is the easy part. The hard part is dealing with the plethora of failure modes... -- richard Heh, my response to this is who cares? :-D This is a high level service, it's purely concerned with backup succeeded or backup failed, possibly with an overdue for backup prompt if you want to help the user manage the backups. Any other failure modes can be dealt with by the lower level services or by the user. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Split responsibility for data with ZFS
Forgive me for not understanding the details, but couldn't you also work backwards through the blocks with ZFS and attempt to recreate the uberblock? So if you lost the uberblock, could you (memory and time allowing) start scanning the disk, looking for orphan blocks that aren't referenced anywhere else, and piece together the top of the tree? Or roll back to a previous uberblock (or a snapshot uberblock), and then look to see what blocks are on the disk but not referenced anywhere. Is there any way to intelligently work out where those blocks would be linked by looking at how they interact with the known data?

Of course, rolling back to a previous uberblock would still be a massive step forward, and something I think would do much to improve the perception of ZFS as a tool to reliably store data. You cannot overstate the difference to the end user between a file system that on boot says: "Sorry, can't read your data pool." and one that says: "Whoops, the uberblock and all the backups are borked. Would you like to roll back to a backup uberblock, or leave the filesystem offline to repair manually?" As much as anything else, a simple statement explaining *why* a pool is inaccessible, and saying just how badly things have gone wrong, helps tons. Being able to recover anything after that is just the icing on the cake, especially if it can be done automatically.

Ross

PS. Sorry for the duplicate Casper, I forgot to cc the list.

On Mon, Dec 15, 2008 at 10:30 AM, casper@sun.com wrote: I think the problem for me is not that there's a risk of data loss if a pool becomes corrupt, but that there are no recovery tools available. With UFS, people expect that if the worst happens, fsck will be able to recover their data in most cases. Except, of course, that fsck lies. It fixes the metadata, and the quality of the rest is unknown.
Anyone using UFS knows that UFS file corruption is common; specifically, when using a UFS root and the system panics when trying to install a device driver, there's a good chance that some files in /etc are corrupt. Some were application problems (some code used fsync(fileno(fp)); fclose(fp); which doesn't guarantee anything).

With ZFS you have no such tools, yet Victor has on at least two occasions shown that it's quite possible to recover pools that were completely unusable (I believe by making use of old / backup copies of the uberblock).

True; and certainly ZFS should be able to backtrack. But it's much more likely to happen automatically than via a recovery tool. See, fsck could only be written because specific corruptions, and the patterns they have, are known. With ZFS, you can only roll back to a certain uberblock, and the pattern will be a surprise.

Casper
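The "roll back to a backup uberblock" idea reduces to: enumerate the uberblock copies kept in the vdev labels (`zdb -ul <device>` prints them on a real pool), discard any that fail validation, and use the one with the highest surviving transaction group. A toy sketch with invented (txg, validity) records standing in for that:

```shell
#!/bin/sh
# Toy illustration: "txg validity" records standing in for the uberblock
# copies a recovery tool could read from the vdev labels. Numbers invented.
uberblocks() {
  printf '1041 ok\n'
  printf '1043 bad\n'   # say the most recent uberblock failed its checksum
  printf '1042 ok\n'
}

# Discard invalid copies and roll back to the highest surviving txg.
best=$(uberblocks | awk '$2 == "ok" { print $1 }' | sort -n | tail -1)
echo "roll back to txg $best"
```

The hard part Casper describes remains: unlike fsck, the tool cannot know which blocks written after that txg are garbage and which are salvageable.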
Re: [zfs-discuss] Split responsibility for data with ZFS
I'm not sure I follow how that can happen, I thought ZFS writes were designed to be atomic? They either commit properly on disk or they don't? On Mon, Dec 15, 2008 at 6:34 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Mon, 15 Dec 2008, Ross wrote: My concern is that ZFS has all this information on disk, it has the ability to know exactly what is and isn't corrupted, and it should (at least for a system with snapshots) have many, many potential uberblocks to try. It should be far, far better than UFS at recovering from these things, but for a certain class of faults, when it hits a problem it just stops dead. While ZFS knows if a data block is retrieved correctly from disk, a correctly retrieved data block does not indicate that the pool isn't corrupted. A block written in the wrong order is a form of corruption. Bob == Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] cannot mount ZFS volume
Ahhh... I missed the difference between a volume and a FS. That was it... thanks.
[zfs-discuss] cannot mount ZFS volume
When I create a volume I am unable to mount it locally. I'm pretty sure it has something to do with the other volumes in the same ZFS pool being shared out as iSCSI LUNs. For some reason ZFS thinks the base volume is iSCSI. Is there a flag that I am missing? Thanks in advance for the help.

[EMAIL PROTECTED]:~# zpool list
NAME       SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
datapool   464G   196G   268G   42%  ONLINE  -
rpool     48.8G  4.33G  44.4G    8%  ONLINE  -
[EMAIL PROTECTED]:~# zfs create -V 2g datapool/share
[EMAIL PROTECTED]:~# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
datapool           352G   105G    18K  /datapool
datapool/backup    200G   207G  97.7G  -
datapool/datavol   150G   156G  98.3G  -
datapool/share       2G   107G    16K  -
[EMAIL PROTECTED]:~# zfs mount datapool/share
cannot open 'datapool/share': operation not applicable to datasets of this type
[EMAIL PROTECTED]:~# zfs share datapool/share
cannot share 'datapool/share': 'shareiscsi' property not set
set 'shareiscsi' property or use iscsitadm(1M) to share this volume
[EMAIL PROTECTED]:~# zfs get shareiscsi datapool
NAME      PROPERTY    VALUE  SOURCE
datapool  shareiscsi  off    local
[EMAIL PROTECTED]:~# zfs get shareiscsi datapool/share
NAME            PROPERTY    VALUE  SOURCE
datapool/share  shareiscsi  off    inherited from datapool
[EMAIL PROTECTED]:~# zfs set sharenfs=on datapool/share
cannot set property for 'datapool/share': 'sharenfs' does not apply to datasets of this type
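The error messages above are the clue: a dataset created with `zfs create -V` is a volume (a zvol block device, for iSCSI or raw use), not a mountable filesystem, so `zfs mount` and `sharenfs` don't apply to it; creating the dataset without `-V` gives a mountable filesystem. A toy sketch of that distinction, keyed off the value the dataset's `type` property reports (the `advise` helper is hypothetical, not a zfs(1M) command):

```shell
#!/bin/sh
# Hypothetical helper: given a dataset's type (as `zfs get -H -o value type`
# would report it), say which mount/share path applies to it.
advise() {
  case "$1" in
    filesystem) echo "mountable: use zfs mount / sharenfs" ;;
    volume)     echo "block device: use shareiscsi or iscsitadm, not mount" ;;
    *)          echo "unknown type: $1" ;;
  esac
}

advise volume       # what `zfs create -V 2g datapool/share` produced
advise filesystem   # what `zfs create datapool/share` (no -V) would produce
```
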
Re: [zfs-discuss] zfs not yet suitable for HA applications?
Hi Dan, replying in line: On Fri, Dec 5, 2008 at 9:19 PM, David Anderson [EMAIL PROTECTED] wrote: Trying to keep this in the spotlight. Apologies for the lengthy post.

Heh, don't apologise, you should see some of my posts... o_0

I'd really like to see features as described by Ross in his summary of the Availability: ZFS needs to handle disk removal / driver failure better (http://www.opensolaris.org/jive/thread.jspa?messageID=274031#274031 ). I'd like to have these/similar features as well. Have there already been internal discussions regarding adding this type of functionality to ZFS itself, and was there approval, disapproval or no decision? Unfortunately my situation has put me in urgent need to find workarounds in the meantime.

My setup: I have two iSCSI target nodes, each with six drives exported via iscsi (Storage Nodes). I have a ZFS Node that logs into each target from both Storage Nodes and creates a mirrored Zpool with one drive from each Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).

My problem: If a Storage Node crashes completely, is disconnected from the network, iscsitgt core dumps, a drive is pulled, or a drive has a problem accessing data (read retries), then my ZFS Node hangs while ZFS waits patiently for the layers below to report a problem and time out the devices. This can lead to a roughly 3-minute or longer halt when reading OR writing to the Zpool on the ZFS node. While this is acceptable in certain situations, I have a case where my availability demand is more severe.

My goal: figure out how to have the zpool pause for NO LONGER than 30 seconds (roughly within a typical HTTP request timeout) and then issue reads/writes to the good devices in the zpool/mirrors while the other side comes back online or is fixed.

My ideas: 1. In the case of the iscsi targets disappearing (iscsitgt core dump, Storage Node crash, Storage Node disconnected from network), I need to lower the iSCSI login retry/timeout values.
Am I correct in assuming the only way to accomplish this is to recompile the iSCSI initiator? If so, can someone help point me in the right direction (I have never compiled ONNV sources - do I need to do this or can I just recompile the iscsi initiator)?

I believe it's possible to just recompile the initiator and install the new driver. I have some *very* rough notes that were sent to me about a year ago, but I've no experience compiling anything in Solaris, so I don't know how useful they will be. I'll try to dig them out in case they're useful.

1.a. I'm not sure in what initiator session states iscsi_sess_max_delay is applicable - only for the initial login, or also in the case of reconnect? Ross, if you still have your test boxes available, can you please try setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system, reboot and try failing your iscsi vdevs again? I can't find a case where this was tested for quick failover.

Will gladly have a go at this on Monday.

1.b. I would much prefer to have bug 649 addressed and fixed rather than having to resort to recompiling the iscsi initiator (if iscsi_sess_max_delay doesn't work). This seems like a trivial feature to implement. How can I sponsor development?

2. In the case of the iscsi target being reachable, but the physical disk having problems reading/writing data (retryable events that take roughly 60 seconds to time out), should I change the iscsi_rx_max_window tunable with mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently in the thread referenced above (with value 15), which resulted in a 60 second hang. How did you offline the iscsi vol to test this failure? Unless iscsi uses a multiple of the value for retries, then maybe the way you failed the disk caused the iscsi system to follow a different failure path? Unfortunately I don't know of a way to introduce read/write retries to a disk while the disk is still reachable and presented via iscsitgt, so I'm not sure how to test this.
So far I've just been shutting down the Solaris box hosting the iSCSI target. Next step will involve pulling some virtual cables. Unfortunately I don't think I've got a physical box handy to test drive failures right now, but my previous testing (of simply pulling drives) showed that it can be hit and miss as to how well ZFS detects these types of 'failure'. Like you I don't know yet how to simulate failures, so I'm doing simple tests right now, offlining entire drives or computers. Unfortunately I've found more than enough problems with just those tests to keep me busy. 2.a With the fix of http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set sd_retry_count along with sd_io_time to cause I/O failure when a command takes longer than sd_retry_count * sd_io_time. Can (or should) these tunables be set on the imported iscsi disks in the ZFS Node, or can/should they be applied only to the local disk on
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Yeah, thanks Maurice, I just saw that one this afternoon. I guess you can't reboot with iscsi full stop... o_0 And I've seen the iscsi bug before (I was just too lazy to look it up lol), I've been complaining about that since February. In fact it's been a bad week for iscsi here, I've managed to crash the iscsi client twice in the last couple of days too (full kernel dump crashes), so I'll be filing a bug report on that tomorrow morning when I get back to the office. Ross On Wed, Dec 3, 2008 at 7:39 PM, Maurice Volaski [EMAIL PROTECTED] wrote: 2. With iscsi, you can't reboot with sendtargets enabled, static discovery still seems to be the order of the day. I'm seeing this problem with static discovery: http://bugs.opensolaris.org/view_bug.do?bug_id=6775008. 4. iSCSI still has a 3 minute timeout, during which time your pool will hang, no matter how many redundant drives you have available. This is CR 649, http://bugs.opensolaris.org/view_bug.do?bug_id=649, which is separate from the boot time timeout, though, and also one that Sun so far has been unable to fix! -- Maurice Volaski, [EMAIL PROTECTED] Computing Support, Rose F. Kennedy Center Albert Einstein College of Medicine of Yeshiva University
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hey folks, I've just followed up on this, testing iSCSI with a raided pool, and it still appears to be struggling when a device goes offline.

I don't see how this could work except for mirrored pools. Would that carry enough market to be worthwhile? -- richard

I have to admit, I've not tested this with a raided pool, but since all ZFS commands hung when my iSCSI device went offline, I assumed that you would get the same effect of the pool hanging if a raid-z2 pool is waiting for a response from a device. Mirrored pools do work particularly well with this since it gives you the potential to have remote mirrors of your data, but if you had a raid-z2 pool, you still wouldn't want that hanging if a single device failed.

zpool commands hanging is CR 6667208, and has been fixed in b100. http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

I will go and test the raid scenario though on a current build, just to be sure.

Please. -- richard

I've just created a pool using three snv_103 iSCSI targets, with a fourth install of snv_103 collating those targets into a raidz pool, and sharing that out over CIFS. To test the server, while transferring files from a Windows workstation, I powered down one of the three iSCSI targets. It took a few minutes to shut down, but once that happened the Windows copy halted with the error: "The specified network name is no longer available." At this point, the zfs admin tools still work fine (which is a huge improvement, well done!), but zpool status still reports that all three devices are online. A minute later, I can open the share again, and start another copy. Thirty seconds after that, zpool status finally reports that the iscsi device is offline. So it looks like we have the same problems with that 3 minute delay, with zpool status reporting wrong information, and the CIFS service having problems too. At this point I restarted the iSCSI target, but had problems bringing it back online.
It appears there's a bug in the initiator, but it's easily worked around: http://www.opensolaris.org/jive/thread.jspa?messageID=312981#312981 What was great was that as soon as the iSCSI initiator reconnected, ZFS started resilvering. What might not be so great is the fact that all three devices are showing that they've been resilvered:

# zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h2m with 0 errors on Tue Dec 2 11:04:10 2008
config:

        NAME                                   STATE   READ WRITE CKSUM
        iscsipool                              ONLINE     0     0     0
          raidz1                               ONLINE     0     0     0
            c2t600144F04933FF6C5056967AC800d0  ONLINE     0     0     0  179K resilvered
            c2t600144F04934FAB35056964D9500d0  ONLINE     5 9.88K     0  311M resilvered
            c2t600144F04934119E50569675FF00d0  ONLINE     0     0     0  179K resilvered

errors: No known data errors

It's proving a little hard to know exactly what's happening when, since I've only got a few seconds to log times, and there are delays with each step.
However, I ran another test using robocopy and was able to observe the behaviour a little more closely:

Test 2: Using robocopy for the transfer, and iostat plus zpool status on the server

10:46:30 - iSCSI server shutdown started
10:52:20 - all drives still online according to zpool status
10:53:30 - robocopy error - The specified network name is no longer available
         - zpool status shows all three drives as online
         - zpool iostat appears to have hung, taking much longer than the 30s specified to return a result
         - robocopy is now retrying the file, but appears to have hung
10:54:30 - robocopy, CIFS and iostat all start working again, pretty much simultaneously
         - zpool status now shows the drive as offline

I could probably do with using DTrace to get a better look at this, but I haven't learnt that yet. My guess as to what's happening would be:

- iSCSI target goes offline
- ZFS will not be notified for 3 minutes, but I/O to that device is essentially hung
- CIFS times out (I suspect this is on the client side with around a 30s timeout, but I can't find the timeout documented anywhere)
- zpool iostat is now waiting; I may be wrong but this doesn't appear to have benefited from the changes to zpool status
- After 3 minutes, the iSCSI drive goes offline. The pool carries on with the remaining two drives, CIFS carries on working, iostat carries on working. zpool status however is still out of date.
- zpool status eventually catches up, and reports that the drive has gone
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi Richard, Thanks, I'll give that a try. I think I just had a kernel dump while trying to boot this system back up though; I don't think it likes it if the iscsi targets aren't available during boot. Again, that rings a bell, so I'll go see if that's another known bug.

Changing that setting on the fly didn't seem to help; if anything, things are worse this time around. I changed the timeout to 15 seconds, but didn't restart any services:

# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:            180
# echo iscsi_rx_max_window/W0t15 | mdb -kw
iscsi_rx_max_window:            0xb4            =       0xf
# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:            15

After making those changes, and repeating the test, offlining an iscsi volume hung all the commands running on the pool. I had three ssh sessions open, running the following:

# zpool iostat -v iscsipool 10 100
# format < /dev/null
# time zpool status

They hung for what felt like a minute or so. After that, the CIFS copy timed out. After the CIFS copy timed out, I tried immediately restarting it. It took a few more seconds, but restarted no problem. Within a few seconds of that restarting, iostat recovered, and format returned its result too. Around 30 seconds later, zpool status reported two drives, paused again, then showed the status of the third:

# time zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec 2 16:39:21 2008
config:

        NAME                                   STATE   READ WRITE CKSUM
        iscsipool                              ONLINE     0     0     0
          raidz1                               ONLINE     0     0     0
            c2t600144F04933FF6C5056967AC800d0  ONLINE     0     0     0  15K resilvered
            c2t600144F04934FAB35056964D9500d0  ONLINE     0     0     0  15K resilvered
            c2t600144F04934119E50569675FF00d0  ONLINE     0   200     0  24K resilvered

errors: No known data errors

real    3m51.774s
user    0m0.015s
sys     0m0.100s

Repeating that a few seconds later gives:

# time zpool status
  pool: iscsipool
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec 2 16:39:21 2008
config:

        NAME                                   STATE    READ WRITE CKSUM
        iscsipool                              DEGRADED    0     0     0
          raidz1                               DEGRADED    0     0     0
            c2t600144F04933FF6C5056967AC800d0  ONLINE      0     0     0  15K resilvered
            c2t600144F04934FAB35056964D9500d0  ONLINE      0     0     0  15K resilvered
            c2t600144F04934119E50569675FF00d0  UNAVAIL     3 5.80K     0  cannot open

errors: No known data errors

real    0m0.272s
user    0m0.029s
sys     0m0.169s

On Tue, Dec 2, 2008 at 3:58 PM, Richard Elling [EMAIL PROTECTED] wrote: .. iSCSI timeout is set to 180 seconds in the client code. The only way to change is to recompile it, or use mdb. Since you have this test rig setup, and I don't, do you want to experiment with this timeout? The variable is actually called iscsi_rx_max_window so if you do

echo iscsi_rx_max_window/D | mdb -k

you should see 180. Change it using something like:

echo iscsi_rx_max_window/W0t30 | mdb -kw

to set it to 30 seconds. -- richard
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote: Ross wrote: Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: The idea is really a two-stage RFE, since just the first part would have benefits. The key is to improve ZFS availability, without affecting its flexibility, bringing it on par with traditional raid controllers.

A. Track response times, allowing for lopsided mirrors, and better failure detection.

I've never seen a study which shows, categorically, that disk or network failures are preceded by significant latency changes. How do we get better failure detection from such measurements?

Not preceded by as such, but a disk or network failure will certainly cause significant latency changes. If the hardware is down, there's going to be a sudden, and very large, change in latency. Sure, FMA will catch most cases, but we've already shown that there are some cases where it doesn't work too well (and I would argue that's always going to be possible when you are relying on so many different types of driver). This is there to ensure that ZFS can handle *all* cases.

Many people have requested this since it would facilitate remote live mirrors. At a minimum, something like VxVM's preferred plex should be reasonably easy to implement.

B. Use response times to time out devices, dropping them to an interim failure mode while waiting for the official result from the driver. This would prevent redundant pools hanging when waiting for a single device.

I don't see how this could work except for mirrored pools. Would that carry enough market to be worthwhile? -- richard

I have to admit, I've not tested this with a raided pool, but since all ZFS commands hung when my iSCSI device went offline, I assumed that you would get the same effect of the pool hanging if a raid-z2 pool is waiting for a response from a device.
Mirrored pools do work particularly well with this since it gives you the potential to have remote mirrors of your data, but if you had a raid-z2 pool, you still wouldn't want that hanging if a single device failed. I will go and test the raid scenario though on a current build, just to be sure.
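A toy sketch of point (A) above, with invented latency numbers: track an average service time per mirror side and steer reads at the faster one, much as a statically configured VxVM preferred plex does:

```shell
#!/bin/sh
# Invented per-side average service times in ms; a real implementation would
# maintain these from observed I/O completion times on each mirror side.
local_ms=12       # local disk
remote_ms=180     # remote iSCSI mirror

if [ "$local_ms" -le "$remote_ms" ]; then
  preferred="local side"
else
  preferred="remote side"
fi
echo "reads go to the $preferred; writes still go to both"
```

The same running averages are what point (B) would compare against a timeout to drop a silent device into an interim failure mode.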
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
Hey Jeff, Good to hear there's work going on to address this. What did you guys think of my idea of ZFS supporting a "waiting for a response" status for disks as an interim solution that allows the pool to continue operation while it's waiting for FMA or the driver to fault the drive? I do appreciate that it's hard to come up with a definitive "it's dead, Jim" answer, and I agree that long term the FMA approach will pay dividends. But I still feel this is a good short-term solution, and one that would also complement your long-term plans.

My justification for this is that it seems to me that you can split disk behaviour into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with FMA doing the extra work to fault drives), so it's just the second that needs immediate attention, and for the life of me I can't think of any situation that a simple timeout wouldn't catch.

Personally I'd love to see two parameters, allowing this behaviour to be turned on if desired, and allowing timeouts to be configured:

- zfs-auto-device-timeout
- zfs-auto-device-timeout-fail-delay

The first sets whether to use this feature, and configures the maximum time ZFS will wait for a response from a device before putting it in a "waiting" status. The second would be optional and is the maximum time ZFS will wait before faulting a device (at which point it's replaced by a hot spare).

The reason I think this will work well with the FMA work is that you can implement this now and have a real improvement in ZFS availability. Then, as the other work starts bringing better modeling for drive timeouts, the parameters can be either removed, or set automatically by ZFS.
Long term I guess there's also the potential to remove the second setting if you felt FMA etc. ever got reliable enough, but personally I would always want to have the final fail delay set. I'd maybe set it to a long value such as 1-2 minutes to give FMA etc. a fair chance to find the fault. But I'd be much happier knowing that the system will *always* be able to replace a faulty device within a minute or two, no matter what the FMA system finds. The key thing is that you're not faulting devices early, so FMA is still vital. The idea is purely to let ZFS keep the pool active by removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single disk, and I would imagine that FMA also has the same limitation - it's only going to be looking at a single item and trying to determine whether it's faulty or not. Because of that, FMA is going to be designed to be very careful to avoid false positives, and will likely take its time to reach an answer in some situations. ZFS however has the benefit of knowing more about the pool, and in the vast majority of situations it should be possible for ZFS to read or write from other devices while it's waiting for an 'official' result from any one faulty component.

Ross

On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote: I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. And there are additional changes that were made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000) product line that will make things even better once the FishWorks team has a chance to catch its breath and integrate those changes into nevada. And then we've got further improvements in the pipeline.
The reason this is all so much harder than it sounds is that we're trying to provide increasingly optimal behavior given a collection of devices whose failure modes are largely ill-defined. (Is the disk dead or just slow? Gone or just temporarily disconnected? Does this burst of bad sectors indicate catastrophic failure, or just localized media errors?) The disks' SMART data is notoriously unreliable, BTW. So there's a lot of work underway to model the physical topology of the hardware, gather telemetry from the devices, the enclosures, the environmental sensors etc, so that we can generate an accurate FMA fault diagnosis and then tell ZFS to take appropriate action. We have some of this today; it's just a lot of work to complete it. Oh, and regarding the original post -- as several readers correctly surmised, we weren't faking anything, we just didn't want to wait for all the device timeouts. Because the disks were on USB, which is a hotplug-capable bus, unplugging the dead disk generated an interrupt that bypassed the timeout. We could have waited it out, but 60 seconds is an eternity on stage. Jeff On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote: But that's exactly the problem Richard: AFAIK. Can you state that absolutely,
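The "waiting" proposal above amounts to racing each device probe against a deadline. A sketch, with `timeout(1)` standing in for ZFS-internal bookkeeping, a `sleep` playing the part of a hung disk, and the tunable name taken from the proposal (it is not a real ZFS property):

```shell
#!/bin/sh
# zfs-auto-device-timeout stand-in, shortened to 1s so the demo is quick.
ZFS_AUTO_DEVICE_TIMEOUT=1

state=online
# A probe that takes 3 seconds simulates a device that has stopped responding.
if ! timeout "$ZFS_AUTO_DEVICE_TIMEOUT" sh -c 'sleep 3'; then
  state=waiting   # interim status: the pool keeps serving from other devices
fi
echo "device state: $state"
```

The key property of the design is visible even in the toy: the decision to stop waiting is made from the pool's side after a bounded delay, without pre-empting whatever verdict the driver and FMA eventually deliver.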
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
PS. I think this also gives you a chance at making the whole problem much simpler. Instead of the hard question of "is this faulty?", you're just trying to say "is it working right now?". In fact, I'm now wondering if the "waiting for a response" flag wouldn't be better as "possibly faulty". That way you could use it with checksum errors too, possibly with settings as simple as errors per minute or error percentage. As with the timeouts, you could have it off by default (or provide sensible defaults), and let administrators tweak it for their particular needs. Imagine a pool with the following settings:

- zfs-auto-device-timeout = 5s
- zfs-auto-device-checksum-fail-limit-epm = 20
- zfs-auto-device-checksum-fail-limit-percent = 10
- zfs-auto-device-fail-delay = 120s

That would allow the pool to flag a device as "possibly faulty" regardless of the type of fault, and take immediate proactive action to safeguard data (generally long before the device is actually faulted). A device triggering any of these flags would be enough for ZFS to start reading from (or writing to) other devices first, and should you get multiple failures, or problems on a non-redundant pool, you always just revert back to ZFS' current behaviour.

Ross

On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote: I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. And there are additional changes that were made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000) product line that will make things even better once the FishWorks team has a chance to catch its breath and integrate those changes into nevada. And then we've got further improvements in the pipeline. The reason this is all so much harder than it sounds is that we're trying to provide increasingly optimal behavior given a collection of devices whose failure modes are largely ill-defined. (Is the disk dead or just slow?
Gone or just temporarily disconnected? Does this burst of bad sectors indicate catastrophic failure, or just localized media errors?) The disks' SMART data is notoriously unreliable, BTW. So there's a lot of work underway to model the physical topology of the hardware, gather telemetry from the devices, the enclosures, the environmental sensors etc, so that we can generate an accurate FMA fault diagnosis and then tell ZFS to take appropriate action. We have some of this today; it's just a lot of work to complete it. Oh, and regarding the original post -- as several readers correctly surmised, we weren't faking anything, we just didn't want to wait for all the device timeouts. Because the disks were on USB, which is a hotplug-capable bus, unplugging the dead disk generated an interrupt that bypassed the timeout. We could have waited it out, but 60 seconds is an eternity on stage. Jeff On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote: But that's exactly the problem Richard: AFAIK. Can you state that absolutely, categorically, there is no failure mode out there (caused by hardware faults, or bad drivers) that won't lock a drive up for hours? You can't, obviously, which is why we keep saying that ZFS should have this kind of timeout feature. For once I agree with Miles, I think he's written a really good writeup of the problem here. My simple view on it would be this: Drives are only aware of themselves as an individual entity. Their job is to save restore data to themselves, and drivers are written to minimise any chance of data loss. So when a drive starts to fail, it makes complete sense for the driver and hardware to be very, very thorough about trying to read or write that data, and to only fail as a last resort. I'm not at all surprised that drives take 30 seconds to timeout, nor that they could slow a pool for hours. That's their job. 
They know nothing else about the storage, they just have to do their level best to do as they're told, and will only fail if they absolutely can't store the data. The raid controller on the other hand (Netapp / ZFS, etc) knows all about the pool. It knows if you have half a dozen good drives online, it knows if there are hot spares available, and it *should* also know how quickly the drives under its care usually respond to requests. ZFS is perfectly placed to spot when a drive is starting to fail, and to take the appropriate action to safeguard your data. It has far more information available than a single drive ever will, and should be designed accordingly. Expecting the firmware and drivers of individual drives to control the failure modes of your redundant pool is just crazy imo. You're throwing away some of the biggest benefits of using multiple drives in the first place.
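The checksum-error thresholds floated earlier in the thread reduce to two comparisons per device. A sketch with invented counters; the limit names echo the hypothetical settings, none of which exist in ZFS:

```shell
#!/bin/sh
# Invented counters a monitor might accumulate for one device over a minute.
cksum_errors=25
total_ios=200
epm_limit=20    # cf. the hypothetical zfs-auto-device-checksum-fail-limit-epm
pct_limit=10    # cf. the hypothetical ...-checksum-fail-limit-percent

state=online
# Trip the "possibly faulty" flag if either threshold is exceeded.
if [ "$cksum_errors" -gt "$epm_limit" ]; then state="possibly faulty"; fi
if [ $((cksum_errors * 100 / total_ios)) -gt "$pct_limit" ]; then state="possibly faulty"; fi
echo "device state: $state"
```

This is exactly the kind of pool-level judgment the paragraph above argues for: no single drive's firmware can compute an error *rate* relative to its siblings, but the layer that sees all the devices can.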
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
No, I count that as doesn't return data ok, but my post wasn't very clear at all on that. Even for a write, the disk will return something to indicate that the action has completed, so that can also be covered by just those two scenarios, and right now ZFS can lock the whole pool up if it's waiting for that response.

My idea is simply to allow the pool to continue operation while waiting for the drive to fault, even if that's a faulty write. It just means that the rest of the operations (reads and writes) can keep working for the minute (or three) it takes for FMA and the rest of the chain to flag a device as faulty. For write operations, the data can be safely committed to the rest of the pool, with just the outstanding writes for the drive left waiting. Then as soon as the device is faulted, the hot spare can kick in, and the outstanding writes can be quickly written to the spare. For single-parity or non-redundant volumes there's some benefit in this. For dual-parity pools there's a massive benefit, as your pool stays available and your data is still well protected.

Ross

On Tue, Nov 25, 2008 at 10:44 AM, [EMAIL PROTECTED] wrote: My justification for this is that it seems to me that you can split disk behavior into two states: - returns data ok - doesn't return data ok

I think you're missing won't write. There's clearly a difference between reading data (which you can fix by retrying the read against a different part of the redundant data) and writing data: the data which can't be written must be kept until the drive is faulted. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
Hmm, true. The idea doesn't work so well if you have a lot of writes, so there needs to be some thought as to how you handle that. Just thinking aloud, could the missing writes be written to the log file on the rest of the pool? Or temporarily stored somewhere else in the pool? Would it be an option to allow up to a certain amount of writes to be cached in this way while waiting for FMA, and only suspend writes once that cache is full? With a large SSD slog device, would it be possible to just stream all writes to the log? As a further enhancement, might it be possible to commit writes to the working drives, and just leave the writes for the bad drive(s) in the slog (potentially saving a lot of space)?

For pools without log devices, I suspect that you would probably need the administrator to specify the behavior, as I can see several options depending on the RAID level and that pool's priorities for data availability / integrity.

Drive fault write cache settings:

default - pool waits for the device; no writes occur until the device or a spare comes online.

slog - writes are cached to the slog device until it is full, then the pool reverts to the default behavior (could this be the default when slog devices are present?).

pool - writes are cached to the pool itself, up to a set maximum, and are written to the device or spare as soon as possible. This assumes a single-parity pool with the other devices available. If the upper limit is reached, or another device goes faulty, the pool reverts to the default behavior.

Storing directly to the rest of the pool would probably want to be off by default on single-parity pools, but I would imagine that it could be on by default on dual-parity pools. Would that be enough to allow writes to continue in most circumstances while the pool waits for FMA?

Ross

On Tue, Nov 25, 2008 at 10:55 AM, [EMAIL PROTECTED] wrote: My idea is simply to allow the pool to continue operation while waiting for the drive to fault, even if that's a faulty write.
It just means that the rest of the operations (reads and writes) can keep working for the minute (or three) it takes for FMA and the rest of the chain to flag a device as faulty. Except when you're writing a lot; 3 minutes can cause a 20GB backlog for a single disk. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
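Casper's backlog figure is easy to sanity-check with a little arithmetic (the ~110 MB/s sustained write rate below is an assumption about a then-typical disk, not a number from the thread):

```shell
# If a disk that normally absorbs ~110 MB/s is unresponsive for 180
# seconds, the writes that would have gone to it pile up elsewhere:
rate_mb=110   # assumed sustained write rate, MB/s
secs=180      # three minutes waiting for FMA to fault the drive
echo "$(( rate_mb * secs / 1024 )) GB"   # prints "19 GB"
```

Which lands right at Casper's "20GB backlog for a single disk" estimate.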
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
The shortcomings of timeouts have been discussed on this list before. How do you tell the difference between a drive that is dead and a path that is just highly loaded? A path that is dead is either returning bad data, or isn't returning anything. A highly loaded path is, by definition, reading/writing lots of data. I think you're assuming that these are file-level timeouts, when this would actually need to be at a much lower level.

Sounds good - devil, meet details, etc. Yup, I imagine there are going to be a few details to iron out, many of which will need looking at by somebody a lot more technical than myself. Despite that, I still think this is a discussion worth having. So far I don't think I've seen any situation where this would make things worse than they are now, and I can think of plenty of cases where it would be a huge improvement. Of course, it also probably means a huge amount of work to implement. I'm just hoping that it's not prohibitively difficult, and that the ZFS team see the benefits as being worth it.
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
I disagree, Bob. I think this is a very different function to that which FMA provides. As far as I know, FMA doesn't have access to the big picture of pool configuration that ZFS has, so why shouldn't ZFS use that information to increase the reliability of the pool, while still using FMA to handle device failures? The flip side of the argument is that ZFS already checks the data returned by the hardware. You might as well say that FMA should deal with that too, since it's responsible for all hardware failures. The role of ZFS is to manage the pool; availability should be part and parcel of that.

On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: On Tue, 25 Nov 2008, Ross Smith wrote: Good to hear there's work going on to address this. What did you guys think to my idea of ZFS supporting a waiting for a response status for disks as an interim solution that allows the pool to continue operation while it's waiting for FMA or the driver to fault the drive?

A stable and sane system never comes with two brains. It is wrong to put this sort of logic into ZFS when ZFS is already depending on FMA to make the decisions and Solaris already has an infrastructure to handle faults. The more appropriate solution is that this feature should be in FMA.

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Help recovering zfs filesystem
FYI, here is the link to the 'labelfix' utility. It's an attachment to one of Jeff Bonwick's posts in this thread: http://www.opensolaris.org/jive/thread.jspa?messageID=229969 or here: http://mail.opensolaris.org/pipermail/zfs-discuss/2008-May/047267.html http://mail.opensolaris.org/pipermail/zfs-discuss/2008-May/047270.html Regards Nigel Smith
Re: [zfs-discuss] questions on zfs send,receive,backups
Snapshots are not replacements for traditional backup/restore features. If you need the latter, use what is currently available on the market. -- richard

I'd actually say snapshots do a better job in some circumstances. Certainly they're being used that way by the desktop team: http://blogs.sun.com/erwann/entry/zfs_on_the_desktop_zfs

None of this is stuff I'm after personally, btw. This was just my attempt to interpret the request of the OP. Although having said that, the ability to restore single files as fast as you can restore a whole snapshot would be a nice feature. Is that something that would be possible? Say you had a ZFS filesystem containing a 20GB file, with a recent snapshot. Is it technically feasible to restore that file by itself in the same way a whole filesystem is rolled back with 'zfs rollback'? If the file still existed, would this be a case of redirecting the file's top-level block (dnode?) to the one from the snapshot? If the file had been deleted, could you just copy that one block? Is it that simple, or is there a level of interaction between files and snapshots that I've missed? (I've glanced through the tech specs, but I'm a long way from fully understanding them.)

Ross
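For comparison, reverting an entire dataset really is a near-instant, pointer-only operation, which is what motivates the per-file question above (the pool and snapshot names here are made up for illustration):

```shell
# Revert the whole filesystem to the named snapshot; only block
# pointers change, so this completes quickly regardless of data size.
# Any snapshots newer than the target must be destroyed first
# (zfs rollback -r does this).
zfs rollback tank/vm@before-upgrade
```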
Re: [zfs-discuss] questions on zfs send,receive,backups
If the file still existed, would this be a case of redirecting the file's top-level block (dnode?) to the one from the snapshot? If the file had been deleted, could you just copy that one block? Is it that simple, or is there a level of interaction between files and snapshots that I've missed? (I've glanced through the tech specs, but I'm a long way from fully understanding them.)

It is as simple as a cp, or drag-n-drop in Nautilus. The snapshot is read-only, so there is no need to cp, as long as you don't want to modify it or destroy the snapshot. -- richard

But that's missing the point here, which was that we want to restore this file without having to copy the entire thing back. Doing a cp or a drag-n-drop creates a new copy of the file, taking time to restore and allocating extra blocks. Not a problem for small files, but not ideal if you're, say, using ZFS to store virtual machines and want to roll back a single 20GB file from a 400GB filesystem. My question was whether it's technically feasible to roll back a single file using the approach used for restoring snapshots, making it an almost instantaneous operation? I.e.: if a snapshot exists that contains the file you want, you know that all the relevant blocks are already on disk. You don't want to copy all of the blocks, but since ZFS follows a tree structure, couldn't you restore the file by just restoring the one master block for that file? I'm just thinking that if it's technically feasible, I might raise an RFE for this.
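The per-file copy Richard describes can be done straight out of the hidden .zfs directory (filesystem, snapshot, and file names below are hypothetical); the catch, as Ross notes, is that it rewrites every block rather than re-linking the ones already on disk:

```shell
# Snapshots are exposed read-only under <mountpoint>/.zfs/snapshot/<name>,
# so a single file can be copied back without any send/receive step.
# For a 20GB file, this still copies all 20GB of blocks.
cp /tank/vm/.zfs/snapshot/nightly/disk.img /tank/vm/disk.img
```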
Re: [zfs-discuss] questions on zfs send,receive,backups
Hi Darren, That's storing a dump of a snapshot on external media, but files within it are not directly accessible. The work Tim et al. are doing is actually putting a live ZFS filesystem on external media and sending snapshots to it. A live ZFS filesystem is far more useful (and reliable) than a dump, and having the ability to restore individual files from that would be even better. It still doesn't help the OP, but I think that's what he was after.

Ross

On Mon, Nov 3, 2008 at 9:55 AM, Darren J Moffat [EMAIL PROTECTED] wrote: Ross wrote: Ok, I see where you're coming from now, but what you're talking about isn't zfs send / receive. If I'm interpreting correctly, you're talking about a couple of features, neither of which is in ZFS yet, and I'd need the input of more technical people to know if they are possible. 1. The ability to restore individual files from a snapshot, in the same way an entire snapshot is restored - simply using the blocks that are already stored. 2. The ability to store (and restore from) snapshots on external media.

What makes you say this doesn't work? Exactly what do you mean here, because this will work: $ zfs send [EMAIL PROTECTED] | dd of=/dev/tape Sure, it might not be useful, and I don't think that is what you mean here, so can you expand on store snapshots on external media. -- Darren J Moffat
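Darren's tape dump can be read back, but only as a whole stream, which is exactly Ross's point about it being less useful than a live filesystem (the pool/dataset names are illustrative):

```shell
# The dump on tape is an opaque send stream: to retrieve even one
# file from it, the entire stream must be received back into a
# (possibly temporary) dataset first.
dd if=/dev/tape | zfs receive tank/restored
```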
Re: [zfs-discuss] diagnosing read performance problem
Hi Matt, Well this time you have filtered out the SSH traffic on port 22 successfully. But I'm still only seeing half of the conversation! I see packets sent from client to server, that is, from source 10.194.217.12 to destination 10.194.217.3. So a different client IP this time.

And the duplicate ACK packets (often long bursts) are back in this capture. I've looked at these a little more carefully this time, and I now notice it's using the 'TCP selective acknowledgement' feature (SACK) on those packets. Now this is not something I've come across before, so I need to do some googling! SACK is defined in RFC 2018: http://www.ietf.org/rfc/rfc2018.txt I found this explanation of when SACK is used: http://thenetworkguy.typepad.com/nau/2007/10/one-of-the-most.html http://thenetworkguy.typepad.com/nau/2007/10/tcp-selective-a.html This seems to indicate these 'SACK' packets are triggered as a result of lost packets - in this case, it must be the packets sent back from your server to the client during your video playback. Of course, I'm not seeing ANY of those packets in this capture, because none were captured from server to client! I'm still not sure why you cannot seem to capture these packets!

Oh, by the way, I probably should advise you to run... # netstat -i ...on the OpenSolaris box, to see if any errors are being counted on the network interface.

Are you still seeing the link going up/down in '/var/adm/messages'? You are never going to do any good while that is happening. I think you need to try a different network card in the server. Regards Nigel Smith
Re: [zfs-discuss] HELP! SNV_97, 98, 99 zfs with iscsitadm and VMWare!
Hi Tano, Great to hear that you've now got this working!! I understand you are using a Broadcom network card; from your previous posts I can see you are using the 'bnx' driver. I will raise this as a bug, but first please would you run '/usr/X11/bin/scanpci' to identify the exact 'vendor id' and 'device id' for the Broadcom network chipset, and report that back here.

I must admit that this is the first I have heard of 'I/OAT DMA', so I did some googling on it, and found this link: http://opensolaris.org/os/community/arc/caselog/2008/257/onepager/ To quote from that ARC case: All new Sun Intel based platforms have Intel I/OAT (I/O Acceleration Technology) hardware. The first such hardware is an on-systemboard asynchronous DMA engine code named Crystal Beach. Through a set of RFEs Solaris will use this hardware to implement TCP receive side zero CPU copy via a socket. Ok, so I think that makes some sense in the context of the problem we were seeing. It's referring to how the network adaptor transfers the data it has received out of its buffer and onto the rest of the operating system.

I've just looked to see if I can find the source code for the bnx driver, but I cannot find it. Digging deeper, we find on this page: http://www.opensolaris.org/os/about/no_source/ ...on the 'ON' tab, that: Components for which there are currently no plans to release source bnx driver (B) Broadcom NetXtreme II Gigabit Ethernet driver So the bnx driver is closed source :-( Regards Nigel Smith
Re: [zfs-discuss] diagnosing read performance problem
Hi Matt, Can you just confirm whether that Ethernet capture file that you made available was done on the client or on the server. I'm beginning to suspect you did it on the client. You can get a capture file on the server (OpenSolaris) using the 'snoop' command, as per one of my previous emails. You can still view the capture file with WireShark, as it supports the 'snoop' file format. Normally it would not be too important where the capture was obtained, but here, where something strange is happening, it could be critical to understanding what is going wrong and where. It would be interesting to do two separate captures - one on the client and one on the server, at the same time - as this would show if the switch was causing disruption. Try to have the clocks on the client and server synchronised as closely as possible. Thanks Nigel Smith
Re: [zfs-discuss] diagnosing read performance problem
Hi Matt, In your previous capture (which you have now confirmed was done on the Windows client), all those 'Bad TCP checksum' packets sent by the client are explained, because you must be doing hardware TCP checksum offloading on the client network adaptor. WireShark will capture the packets before that hardware calculation is done, so the checksums all appear to be wrong, as they have not yet been calculated! http://wiki.wireshark.org/TCP_checksum_offload http://www.wireshark.org/docs/wsug_html_chunked/ChAdvChecksums.html

Ok, so let's look at the new capture, 'snoop'ed on the OpenSolaris box. I was surprised how small that snoop capture file was - only 753400 bytes after unzipping. I soon realized why... The strange thing is that I'm only seeing half of the conversation! I see packets sent from client to server, that is, from source 10.194.217.10 to destination 10.194.217.3. I can also see some packets from source 10.194.217.5 (your AD domain controller) to destination 10.194.217.3. But you've not captured anything transmitted from your OpenSolaris server - source 10.194.217.3. (I checked, and I did not have any filters applied in WireShark that would cause the missing half!) Strange! I'm not sure how you did that. The half of the conversation that I can see looks fine - there does not seem to be any problem. I'm not seeing any duplicate ACKs from the client in this capture. (So again somewhat strange, unless you've fixed the problem!) I'm assuming you're using a single network card in the Solaris server, but maybe you had better just confirm that.

Regarding not capturing SSH traffic and only capturing traffic from (and hopefully to) the client, try this: # snoop -o test.cap -d rtls0 host 10.194.217.10 and not port 22

Regarding those 'link down', 'link up' messages in '/var/adm/messages': I can tie up some of those events with your snoop capture file, but it just shows that no packets are being received while the link is down, which is exactly what you would expect.
But dropping the link for a second will surely disrupt your video playback! If the switch is ok, and the cable from the switch is ok, then it does now point towards the network card in the OpenSolaris box. Maybe it's as simple as a bad mechanical connection on the cable socket. BTW, just run '/usr/X11/bin/scanpci' and identify the 'vendor id' and 'device id' for the network card, just in case it turns out to be a driver bug. Regards Nigel Smith
Re: [zfs-discuss] diagnosing read performance problem
Hi Matt. Ok, got the capture and successfully 'unzipped' it. (Sorry, I guess I'm using old software to do this!) I see 12840 packets. The capture is a TCP conversation between two hosts using the SMB (aka CIFS) protocol. 10.194.217.10 is the client - presumably Windows? 10.194.217.3 is the server - presumably the OpenSolaris CIFS server?

Using WireShark, menu 'Statistics > Endpoints' shows: the client has transmitted 4849 packets, and the server has transmitted 7991 packets.

Menu 'Analyze > Expert Info > Composite': The 'Errors' tab shows 4849 packets with a 'Bad TCP checksum' error - these are all transmitted by the client. (Apply a filter of 'ip.src_host == 10.194.217.10' to confirm this.) The 'Notes' tab shows numerous 'Duplicate ACKs'. For example, for 60 different ACK packets, the exact same packet was re-transmitted 7 times! Packet #3718 was duplicated 17 times. Packet #8215 was duplicated 16 times. Packet #6421 was duplicated 15 times, etc. These bursts of duplicate ACK packets are all coming from the client side.

This certainly looks strange to me - I've not seen anything like this before. It's not going to help the speed to unnecessarily duplicate packets like that, and these bursts are often closely followed by a short delay, ~0.2 seconds. And as far as I can see, it looks to point towards the client as the source of the problem. If you are seeing the same problem with other client PCs, then I guess we need to suspect the switch that connects them. Ok, that's my thoughts and conclusions for now. Maybe you could get some more snoop captures with other clients, and with a different switch, and do a similar analysis. Regards Nigel Smith
Re: [zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
Hi Eugene, I'm delighted to hear you got your files back! I've seen a few posts to this forum where people have made some change to the hardware, and then found that the ZFS pool has gone. And often you never hear any more from them, so you assume they could not recover it. Thanks for reporting back your interesting story. I wonder how many other people have been caught out with this 'Host Protected Area' (HPA) and never worked out that this was the cause...

Maybe one moral of this story is to make a note of your hard drive and partition sizes now, while you have a working system. If you're using Solaris, maybe try 'prtvtoc': http://docs.sun.com/app/docs/doc/819-2240/prtvtoc-1m?a=view (Unless someone knows a better way?) Thanks Nigel Smith

# prtvtoc /dev/rdsk/c1t1d0
* /dev/rdsk/c1t1d0 partition map
*
* Dimensions:
*     512 bytes/sector
*     1465149168 sectors
*     1465149101 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector    Count     Sector
*           34       222       255
*
*                          First       Sector      Last
* Partition  Tag  Flags    Sector      Count       Sector      Mount Directory
       0      4    00             256  1465132495  1465132750
       8     11    00      1465132751       16384  1465149134
Re: [zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
...check out that link that Eugene provided. It was a GigaByte GA-G31M-S2L motherboard. http://www.gigabyte.com.tw/Products/Motherboard/Products_Spec.aspx?ProductID=2693 Some more info on 'Host Protected Area' (HPA), relating to OpenSolaris here: http://opensolaris.org/os/community/arc/caselog/2007/660/onepager/ http://bugs.opensolaris.org/view_bug.do?bug_id=5044205 Regards Nigel Smith -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import problem
Hi Terry Please could you post back to this forum the output from # zdb -l /dev/rdsk/... ... for each of the 5 drives in your raidz2. (maybe best as an attachment) Are you seeing labels with the error 'failed to unpack'? What is the reported 'status' of your zpool? (You have not provided a 'zpool status') Thanks Nigel Smith -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
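A loop along these lines collects the label output for all five drives in one go (the device names are placeholders; substitute the real ones from 'zpool status' or 'format'):

```shell
# Dump the four ZFS labels from each member disk of the raidz2;
# 'failed to unpack label' in the output indicates a damaged or
# missing label on that device.
for d in c2t0d0s0 c2t1d0s0 c2t2d0s0 c2t3d0s0 c2t4d0s0; do
  echo "=== $d ==="
  zdb -l "/dev/rdsk/$d"
done
```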
Re: [zfs-discuss] zpool import: all devices online but: insufficient replicas
Hi Kristof Please could you post back to this forum the output from # zdb -l /dev/rdsk/... ... for each of the storage devices in your pool, while it is in a working condition on Server1. (Maybe best as an attachment) Then do the same again with the pool on Server2. What is the reported 'status' of your zpool on Server2? (You have not provided a 'zpool status') Thanks Nigel Smith -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
Hi Miles, I think you make some very good points in your comments. It would be nice to get some positive feedback on these from Sun. And my thought also, on (quickly) looking at that ARC case, was: does this not also need to be factored into the SATA framework? I really miss not having 'smartctl' (fully) working with PATA and SATA drives on x86 Solaris. I've checked PSARC 2007/660, and it was closed approved fast-track 11/28/2007. I did a quick search, but I could not find any code committed to 'onnv-gate' that references this case. Regards Nigel Smith