Re: [zfs-discuss] zfs/sol10u8 less stable than in sol10u5?
We recently patched our X4500 from Sol10 U6 to Sol10 U8 and have not noticed anything like what you're seeing. We do not have any SSD devices installed. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] building zpools on device aliases
Cindy: Thanks for your reply. These units are located at a remote site 300 km away, so you're right about the main issue being able to map the OS and/or ZFS device to a physical disk. The use of alias devices was one way we thought to make this mapping more intuitive, although of course we'd always attempt to double-check via the JBOD LEDs. I know that there are big improvements in OpenSolaris related to device enumeration (as per Eric Schrock's blog), but we're running Solaris 10 U6, for which:
- fmtopo -V doesn't (yet) produce any output for disks
- fmdump doesn't produce any human-readable disk IDs, only GUIDs, which then have to be correlated via zdb -C

Sean

Date: Tue, 17 Nov 2009 16:18:52 -0700
From: Cindy Swearingen cindy.swearin...@sun.com
Subject: Re: [zfs-discuss] building zpools on device aliases
To: sean walmsley s...@fpp.nuclearsafetysolutions.com
Cc: zfs-discuss@opensolaris.org

Hi Sean,

I sympathize with your intentions, but providing pseudo-names for these disks might cause more confusion than actual help. The c4t5... name isn't so bad. I've seen worse. :-) Here are the issues with using the aliases:
- If a device fails on a J4200, an LED will indicate which disk has failed but will not identify the alias name.
- To prevent confusion with drive failures, you will need to map your aliases to the disk names, possibly pulling out the disks and relabeling them with the alias name. You might be able to use /usr/lib/fm/fmd/fmtopo -V | grep disks to do these mappings online.
- We don't know whether fmdump will indicate the dev alias or the real disk name if a disk has problems.
- We don't know what else might fail.

Mapping the alias names to real disk names might have to happen under duress, like when a disk fails. The physical disk ID on the disk will be included in the expanded name, not the alias name. The hardest part is getting the right disks into the pool. After that, it gets easier.
ZFS does a good job of identifying the devices in a pool and will identify which disk is having problems, as will the fmdump command. The LEDs on the disks themselves also help disk replacements. The wheels are turning to make device administration easier. We're just not there yet.

Cindy

On 11/16/09 19:17, sean walmsley wrote:
We have a number of Sun J4200 SAS JBOD arrays which we have multipathed using Sun's MPxIO facility. While this is great for reliability, it results in the /dev/dsk device IDs changing from cXtYd0 to something virtually unreadable like c4t5000C5000B21AC63d0s3. Since the entries in /dev/{rdsk,dsk} are simply symbolic links anyway, would there be any problem with adding alias links to /devices there and building our zpools on them? We've tried this and it seems to work fine, producing a zpool status similar to the following:

...
        NAME        STATE     READ WRITE CKSUM
        vol01       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            top00   ONLINE       0     0     0
            bot00   ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            top01   ONLINE       0     0     0
            bot01   ONLINE       0     0     0
...

Here our aliases are topnn and botnn to denote the disks in the top and bottom JBODs. The obvious question is: what happens if the alias link disappears? We've tested this, and ZFS seems to handle it quite nicely by finding the normal /dev/dsk link and simply working with that (although it's more difficult to get ZFS to use the alias again once it is recreated). If anyone can think of anything really nasty that we've missed, we'd appreciate knowing about it. Alternatively, if there is a better supported means of having ZFS display human-readable device IDs, we're all ears :-) Perhaps an MPxIO RFE for vanity device names would be in order?

= Sean Walmsley   sean at fpp . nuclearsafetysolutions dot com
Nuclear Safety Solutions Ltd.   416-592-4608 (V)   416-592-5528 (F)
700 University Ave M/S H04 J19, Toronto, Ontario, M5G 1X6, CANADA
[zfs-discuss] building zpools on device aliases
We have a number of Sun J4200 SAS JBOD arrays which we have multipathed using Sun's MPxIO facility. While this is great for reliability, it results in the /dev/dsk device IDs changing from cXtYd0 to something virtually unreadable like c4t5000C5000B21AC63d0s3. Since the entries in /dev/{rdsk,dsk} are simply symbolic links anyway, would there be any problem with adding alias links to /devices there and building our zpools on them? We've tried this and it seems to work fine, producing a zpool status similar to the following:

...
        NAME        STATE     READ WRITE CKSUM
        vol01       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            top00   ONLINE       0     0     0
            bot00   ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            top01   ONLINE       0     0     0
            bot01   ONLINE       0     0     0
...

Here our aliases are topnn and botnn to denote the disks in the top and bottom JBODs. The obvious question is: what happens if the alias link disappears? We've tested this, and ZFS seems to handle it quite nicely by finding the normal /dev/dsk link and simply working with that (although it's more difficult to get ZFS to use the alias again once it is recreated). If anyone can think of anything really nasty that we've missed, we'd appreciate knowing about it. Alternatively, if there is a better supported means of having ZFS display human-readable device IDs, we're all ears :-) Perhaps an MPxIO RFE for vanity device names would be in order?
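The alias scheme described above boils down to maintaining extra symlinks next to the ones devfsadm already manages. A minimal runnable sketch of the idea, using a temporary directory as a stand-in for /dev/dsk and made-up device/alias names (creating real links there requires root):

```shell
# Stand-in directory for /dev/dsk so the sketch runs unprivileged.
devdir=$(mktemp -d)

# Pretend this is the unwieldy MPxIO name devfsadm created.
touch "$devdir/c4t5000C5000B21AC63d0s0"

# Add a human-readable alias link beside it ("top disk 00" in the
# naming scheme from the post).
ln -s "$devdir/c4t5000C5000B21AC63d0s0" "$devdir/top00"

# The alias resolves to the same node the real link points at.
ls -l "$devdir/top00"

# On a real system the pool would then be built on the aliases, e.g.:
#   zpool create vol01 mirror top00 bot00
```

On a live system the alias would point at the same /devices node the real /dev/dsk entry resolves to, not at the /dev/dsk link itself.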
[zfs-discuss] cryptic vdev name from fmdump
This morning we got a fault management message from one of our production servers stating that a fault in one of our pools had been detected and fixed. Looking into the error using fmdump gives:

fmdump -v -u 90ea244e-1ea9-4bd6-d2be-e4e7a021f006
TIME                 UUID                                 SUNW-MSG-ID
Oct 22 09:29:05.3448 90ea244e-1ea9-4bd6-d2be-e4e7a021f006 FMD-8000-4M Repaired
  100%  fault.fs.zfs.device
        Problem in: zfs://pool=vol02/vdev=179e471c0732582
           Affects: zfs://pool=vol02/vdev=179e471c0732582
               FRU: -
          Location: -

My question is: how do I relate the vdev name above (179e471c0732582) to an actual drive? I've checked these IDs against the device IDs (cXtYdZ - obviously no match) and against all of the disk serial numbers. I've also tried all of the zpool list and zpool status options with no luck. I'm sure I'm missing something obvious here, but if anyone can point me in the right direction I'd appreciate it!
Re: [zfs-discuss] cryptic vdev name from fmdump
Thanks for this information. We have a weekly scrub schedule, but I ran another just to be sure :-) It completed with 0 errors. Running fmdump -eV gives:

TIME                 CLASS
fmdump: /var/fm/fmd/errlog is empty

Dumping the fault log (no -e) does give some output, but again there are no human-readable identifiers:

... (some stuff omitted)
(start fault-list[0])
nvlist version: 0
        version = 0x0
        class = fault.fs.zfs.device
        certainty = 0x64
        asru = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x4fcdc2c9d60a5810
                vdev = 0x179e471c0732582
        (end asru)
        resource = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x4fcdc2c9d60a5810
                vdev = 0x179e471c0732582
        (end resource)
(end fault-list[0])

So, I'm still stumped.
Re: [zfs-discuss] cryptic vdev name from fmdump
Eric and Richard - thanks for your responses. I tried both:

echo ::spa -c | mdb -k
zdb -C    (not much of a man page for this one!)

and was able to match the pool ID from the log (hex 4fcdc2c9d60a5810) with both outputs. As Richard pointed out, I needed to convert the hex value to decimal to get a match with the zdb output. In neither case, however, was I able to get a match with the disk vdev ID from the fmdump output.

It turns out that a disk in this machine was replaced about a month ago, and sure enough the vdev that was complaining at the time was the 0x179e471c0732582 vdev that is now missing. What's confusing is that the fmd message I posted about is dated Oct 22, whereas the original error and replacement happened back in September. An fmadm faulty on the machine currently doesn't return any issues.

After physically replacing the bad drive and issuing the zpool replace command, I think that we probably issued the fmadm repair <uuid> command, in line with what Sun has asked us to do in the past. In our experience, if you don't do this then fmd will re-issue duplicate complaints about hardware failures after every reboot until you do. In this case, perhaps a repair wasn't really the appropriate command, since we actually replaced the drive. Would an fmadm flush have been better? Perhaps a clean reboot is in order?

So, it looks like the root problem here is that fmd is confused rather than there being a real issue with ZFS. Despite this, we're happy to know that we can now match vdevs against physical devices using either the mdb trick or zdb. We've followed Eric's work on ZFS device enumeration for the Fishworks project with great interest - hopefully this will eventually get extended to the fmdump output as suggested.

Sean Walmsley
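For anyone else doing this matching: the conversion Richard mentioned is just hex to decimal, which the shell's printf can do directly. A small sketch using the GUID values from this thread (the commented zdb command obviously needs the live system):

```shell
# GUIDs as fmdump printed them (hex, from the fault log in this thread).
pool_hex=4fcdc2c9d60a5810
vdev_hex=179e471c0732582

# zdb -C prints pool/vdev GUIDs in decimal, so convert before comparing.
pool_dec=$(printf '%u' "0x$pool_hex")
vdev_dec=$(printf '%u' "0x$vdev_hex")
echo "pool guid: $pool_dec"
echo "vdev guid: $vdev_dec"

# On the live system, look for the decimal values in the config, e.g.:
#   zdb -C | grep "$pool_dec"
```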
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Sun X4500 (thumper) with 16 GB of memory, running Solaris 10 U6 with patches current to the end of Feb 2009. Current ARC size is ~6 GB. ZFS filesystem created in a ~3.2 TB pool consisting of 7 sets of mirrored 500 GB SATA drives. I used 4000 8 MB files for a total of 32 GB.

run 1: ~140 MB/s average according to zpool iostat
    real    4m1.11s
    user    0m10.44s
    sys     0m50.76s

run 2: ~37 MB/s average according to zpool iostat
    real    13m53.43s
    user    0m10.62s
    sys     0m55.80s

A zfs unmount followed by a mount of the filesystem returned the performance to the run 1 case:
    real    3m58.16s
    user    0m11.54s
    sys     0m51.95s

In summary, the second-run performance drops to about 30% of the original run.
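For reference, the test above amounts to timing a sequential re-read of the same file set twice. A scaled-down, runnable sketch of its shape (4 x 1 MB here instead of the 4000 x 8 MB used above; the file count and sizes are placeholders, not the original benchmark):

```shell
# Create a small file set to re-read (stand-in for the 4000 x 8 MB set).
dir=$(mktemp -d)
for i in 1 2 3 4; do
  dd if=/dev/zero of="$dir/file$i" bs=1024k count=1 2>/dev/null
done

# "run 1" and "run 2": time the same sequential re-read twice.  On the
# X4500, comparing the two timings is what exposed the slowdown.
time cat "$dir"/file* > /dev/null
time cat "$dir"/file* > /dev/null
```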
Re: [zfs-discuss] questions regarding RFE 6334757 and CR 6322205 disk write cache. thanks (case 11356581)
1) Turning on write caching is potentially dangerous because the disk will indicate that data has been written (to cache) before it has actually been written to non-volatile storage (disk). Since the factory has no way of knowing how you'll use your T5140, I'm guessing that they set the disk write caches off by default.

2) Since ZFS knows about disk caches and ensures that it issues synchronous writes where required, it is safe to turn on write caching when the *ENTIRE* disk is used for ZFS. Accordingly, ZFS will attempt to turn on a disk's write cache whenever you add the *ENTIRE* disk to a zpool. If you add only a disk slice to a zpool, ZFS will not try to turn on write caching, since it doesn't know whether other portions of the disk will be used for applications which are not write-cache safe.

zpool create pool01 c0t0d0     <- ZFS will try to turn on the disk write cache (entire disk in use)
zpool create pool02 c0t0d0s1   <- ZFS will not try to turn on the disk write cache (only one slice in use)

To avoid future disk replacement problems (e.g. if the replacement disk is slightly smaller), we generally create a single disk slice that takes up almost the entire disk and then build our pools on these slices. ZFS doesn't turn on the write cache in this case, but since we know that the disk is only being used for ZFS we can (and do!) safely turn on the write cache manually.

3) You can change the write (and read) cache settings using the cache submenu of the format -e command. If you disable the write cache where it could safely be enabled, you will only reduce the performance of the system. If you enable the write cache where it should not be enabled, you run the risk of data loss and/or corruption in the event of a power loss.

4) I wouldn't assume any particular setting for FRU parts, although I believe that Sun parts generally ship with the write caches disabled. Better to check explicitly using format -e.
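For point 3, format -e is interactive, but it can also be driven from a command file, which makes the cache check scriptable across many disks. A hedged sketch - the -d/-f flags and the cache/write_cache menu item names are assumptions from memory of format(1M), so verify them on your release; the real invocation also needs root and an otherwise idle disk:

```shell
# Build a command file that walks into the cache menu and displays the
# write cache state.  Menu item names are assumed, not verified.
cmdfile=$(mktemp)
cat > "$cmdfile" <<'EOF'
cache
write_cache
display
EOF
cat "$cmdfile"

# Real invocation on a live system (root, disk idle), hypothetically:
#   format -e -d c0t0d0 -f "$cmdfile"
```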
Re: [zfs-discuss] sharing UFS root and ZFS pool
Some additional information: I should have noted that the client could not see the thumper1 shares via the automounter. I've played around with this setup a bit more, and it appears that I can manually mount both filesystems (e.g. on /tmp/troot and /tmp/tpool), so the ZFS and UFS volumes are being shared properly; it's just the automounter that doesn't want to deal with both at once. Does the automounter have issues with picking up UFS and ZFS volumes at the same time?
[zfs-discuss] sharing UFS root and ZFS pool
I have a server thumper1 which exports its root (UFS) filesystem to one specific server hoss via /etc/dfs/dfstab so that we can back up various system files. When I added a ZFS pool mypool to this system, I shared it to hoss and several other machines using the ZFS sharenfs property. Prior to adding the pool, hoss could see thumper1's / directory. After adding the ZFS pool, hoss could only see the thumper1:/mypool (i.e. the ZFS share) directory. Using unshare/share and zfs unshare/zfs share in various orders, I can get hoss to see *EITHER* thumper1's UFS root directory *OR* the ZFS mypool directory, but never both at the same time. On thumper1, running the share command shows either / or /mypool shared, but never both at the same time. This is a bit of a surprise to me, since sharing / and /mypool at the same time via dfstab would not be an issue if both filesystems were UFS (we've done this in the past with no problems). Any suggestions on how to get this working would be much appreciated.
Re: [zfs-discuss] ZFS Administration
I haven't used it myself, but the following blog describes an automatic snapshot facility: http://blogs.sun.com/timf/entry/zfs_automatic_snapshots_0_10 I agree that it would be nice to have this type of functionality built into the base product, however.
Re: [zfs-discuss] ZFS needs a viable backup mechanism
We mostly rely on AMANDA, but for a simple, compressed, encrypted, tape-spanning alternative backup (intended for disaster recovery) we use:

tar cf - files | lzf (quick compression utility) | ssl (to encrypt) | mbuffer (which writes to tape and looks after tape changes)

Recovery is exactly the opposite, i.e.:

mbuffer | ssl | lzf | tar xf -

The mbuffer utility (http://www.maier-komor.de/mbuffer.html) has the ability to call a script to change tapes, so if you have the mtx utility you're in business. Mbuffer's other great advantage is that it buffers reads and/or writes, so you can make sure that your tape is writing in decent-sized chunks (i.e. isn't shoe-shining) even if your system can't keep up with your shiny new LTO-4 drive :-) We did have a problem with mbuffer not automatically detecting EOT on our drives, but since we're compressing as part of the pipeline rather than in the drive itself, we just told mbuffer to swap tapes at ~98% of the physical tape size. Since mbuffer doesn't care where your data stream comes from, I would think that you could easily do something like:

zfs send | mbuffer (writes to tape and looks after tape changes)

and

mbuffer (reads from tape and looks after tape changes) | zfs receive
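The pipeline shape above can be exercised end-to-end without a tape drive. In this sketch gzip stands in for lzf, openssl enc stands in for the ssl step, and a plain file stands in for mbuffer-writing-to-tape - all hypothetical substitutions to keep the example self-contained, not the poster's exact tools:

```shell
# Sample data, a throwaway key file, and a file standing in for the tape.
src=$(mktemp -d); printf 'payload\n' > "$src/f"
key=$(mktemp);    printf 'secret\n'  > "$key"
tape=$(mktemp)

# Backup: tar | compress | encrypt | "tape".
( cd "$src" && tar cf - f ) \
  | gzip -c \
  | openssl enc -aes-256-cbc -pbkdf2 -pass "file:$key" > "$tape"

# Recovery is exactly the same pipeline reversed.
out=$(mktemp -d)
openssl enc -d -aes-256-cbc -pbkdf2 -pass "file:$key" < "$tape" \
  | gzip -dc \
  | ( cd "$out" && tar xf - )

cat "$out/f"
```

On a real system the final redirect to "$tape" would be replaced by mbuffer writing to the tape device, with its volume-change hook handling tape swaps.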