Re: [zfs-discuss] Raidz2 slow read speed (under 5MB/s)
Nevermind this, I destroyed the raid volume, then checked each hard drive one by one, and when I put it back together, the problem fixed itself. I'm now getting 30-60MB/s read and write, which is still slow as heck, but works well for my application. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Raidz2 slow read speed (under 5MB/s)
Hello all, I'm building a file server (or just storage that I intend to access by Workgroup from primarily Windows machines) using ZFS raidz2 and OpenIndiana 148. I will be using this to stream Blu-ray movies and other media, so I will be happy if I get just 20MB/s reads, which seems like a pretty low bar considering some people are getting 100+. This is my first time with OI, and with RAID for that matter, so I hope you guys have a little patience for a noob. :)

I figured out how to set up the vdevs and smbshare after some trial and error, and got my Windows box to see the share. Transferring a 40GB file to the share yields 55-80MB/s; not earth-shattering, but satisfactory IMO. The problem is that when I transfer the same file back to the Windows box, it goes at less than 5MB/s. To isolate the problem, I then copied a 1GB file to the share and moved that 1GB file from the raidz2 drives to the root drive (SSD). That was also less than 5MB/s. The same file, once again, copied from the root drive to the raidz2 was fast, maybe 70-100MB/s. The problem here, as far as I can tell, is either some setting within ZFS or the HBA controller. Or maybe the drives... but even the timing issue with WD Green drives shouldn't create that much disparity.

I've attached iostat output from when activity is idle, from copying from raidz2 to root (read), and, for comparison, from copying to raidz2 from root (write). Please note the intermittent idling in all disks (except one?) when the file is copied from the raidz2 volume to anywhere else. I have no idea what that's about, but the drives will drop to 0 every couple of seconds, and repeat.

My system is as follows:
- 10x WD20EARS (bad idea? I only found out after I bought them) in raidz2 config
- 32GB SSD as root drive for the OS install
- Supermicro USAS-L8i HBA card (1068E chipset, I believe)
- 6GB RAM
- 500 watt power supply
- AMD Athlon II X2 260 CPU

Here's my zpool:

  pool: rpool
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c2t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: solaris
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        solaris     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0

errors: No known data errors

I'd greatly appreciate it if someone could give me some leads on where the problem might be. I've spent the past 2 days on this, and it's very frustrating since I would actually be very happy getting even 10MB/s read. Regards.

Attachments: Idle-iostat, Read-iostat, Write-iostat
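One way to isolate a slow member disk (a common cause of raidz reads collapsing while writes stay fast) is to time a raw sequential read from each drive in turn; a single drive far slower than its siblings drags down the whole vdev. A rough sketch, assuming Solaris-style /dev/rdsk paths and using the device names from the zpool listing above (both are assumptions; adjust to your system):

```shell
# Hypothetical sketch: time a 256MB raw read from each pool member disk.
# Device names are taken from the zpool output above and may differ on your box.
bench_read() {
  start=$(date +%s)
  dd if="$1" of=/dev/null bs=1024k count=256 2>/dev/null
  end=$(date +%s)
  secs=$(( end - start ))
  [ "$secs" -lt 1 ] && secs=1            # avoid divide-by-zero on very fast reads
  echo "$1: ~$(( 256 / secs )) MB/s"
}

for d in c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t4d0 c2t5d0; do
  [ -e "/dev/rdsk/${d}p0" ] && bench_read "/dev/rdsk/${d}p0"
done
```

A drive reporting a fraction of the others' rate is the one to suspect; that matches how this thread eventually resolved (checking each drive one by one).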
Re: [zfs-discuss] Raidz2 slow read speed (under 5MB/s)
Do you mean that OI148 might have a bug that Solaris 11 Express might solve? I will download the Solaris 11 Express LiveUSB and give it a shot.
[zfs-discuss] ZFS receive checksum mismatch
Hey all, New to ZFS, I made a critical error when migrating data and configuring zpools according to needs - I stored a snapshot stream to a file using zfs send -R [filesystem]@[snapshot] > [stream_file]. When I attempted to receive the stream onto the newly configured pool, I ended up with a checksum mismatch and thought I had lost my data.

After googling the issue and finding nil, I downloaded FreeBSD 9-CURRENT (development), installed it, and recompiled the kernel with one modification to /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c: comment out the following lines (1439-1440 at the time of writing):

    if (!ZIO_CHECKSUM_EQUAL(drre.drr_checksum, pcksum))
        ra.err = ECKSUM;

Once recompiled and booted on the new kernel, I executed zfs receive -v [filesystem] < [stream_file]. Once received, I scrubbed the zpool, which corrected a couple of checksum errors, and proceeded to finish setting up my NAS.

Hopefully this might help someone else if they're stupid enough to make the same mistake I did... Note: changing this section of the ZFS kernel code should not be used for anything other than special cases where you need to bypass the data integrity checks for recovery purposes.

-Johnny Walker
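For context on what that commented-out check compares: send-stream records carry a running fletcher-4 checksum, computed over little-endian 32-bit words with four 64-bit accumulators. A rough Python sketch of the algorithm (my own illustration, not the actual ZFS source, and it ignores the per-record framing details):

```python
import struct

def fletcher4(data: bytes):
    """Running fletcher-4 over little-endian 32-bit words, as used to
    protect ZFS send streams. Accumulators wrap at 64 bits."""
    a = b = c = d = 0
    mask = (1 << 64) - 1
    # Pad to a multiple of 4 bytes for this sketch; real streams are aligned.
    data = data + b"\x00" * (-len(data) % 4)
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return (a, b, c, d)

print(fletcher4(struct.pack("<2I", 1, 2)))  # → (3, 4, 5, 6)
```

Because the checksum runs over the whole serialized stream, a single flipped bit anywhere in the stored file trips the mismatch at receive time; there is no per-file granularity to fall back on.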
[zfs-discuss] ZFS receive checksum mismatch
> New to ZFS, I made a critical error when migrating data and configuring zpools according to needs - I stored a snapshot stream to a file using zfs send -R [filesystem]@[snapshot] > [stream_file].

Why is this a critical error? I thought you were supposed to be able to save the output from zfs send to a file (just as with tar or ufsdump you can save the output to a file or a stream)?

Well yes, you can save the stream to a file, but it is intended for immediate use with zfs receive. Since the stream is not an image but a serialization of objects, normal data recovery methods do not apply in the event of corruption.

> When I attempted to receive the stream onto the newly configured pool, I ended up with a checksum mismatch and thought I had lost my data.

Was the cause of the checksum mismatch just that the stream data was stored as a file? That does not seem right to me.

I really can't say for sure what caused the corruption, but I think it may have been related to a dying power supply. For more information, check out:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storing_ZFS_Snapshot_Streams_.28zfs_send.2Freceive.29
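If you do store a send stream as a file despite the advice above, recording an end-to-end digest at save time at least lets you detect corruption before you try to receive it. A hedged sketch; the zfs commands are shown commented out, a dummy file stands in for the stream, and sha256sum is the GNU tool (on Solaris, `digest -a sha256` plays the same role):

```shell
tmp=$(mktemp -d)
# Real usage would be:  zfs send -R tank/fs@snap > "$tmp/fs.zstream"
printf 'pretend-this-is-a-send-stream' > "$tmp/fs.zstream"   # stand-in data

sha256sum "$tmp/fs.zstream" > "$tmp/fs.zstream.sha256"       # record at save time

# Later, before 'zfs receive', verify the stored copy first:
( cd "$tmp" && sha256sum -c fs.zstream.sha256 ) && echo "stream intact"
# zfs receive -v tank2/fs < "$tmp/fs.zstream"
```

This only tells you the file changed since it was written; it cannot repair anything, which is exactly why piping send straight into receive is preferred.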
Re: [zfs-discuss] Using multiple logs on single SSD devices
On Aug 2, 2010, at 8:18 PM, Edward Ned Harvey wrote:

> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jonathan Loran
>
> Because you're at pool v15, it does not matter if the log device fails while you're running, or you're offline and trying to come online, or whatever. Simply put, if the log device fails, unmirrored, and the pool version is less than 19, the pool is lost. There are supposedly techniques to recover, so it's not necessarily a data-unrecoverable-by-any-means situation, but you certainly couldn't recover without a server crash, or at least a shutdown. And it would certainly be a nightmare, at best. The system will not fall back to ZIL in the main pool. That was a feature created in v19.

Yes, after sending my query yesterday, I found the ZFS best practices guide, which I hadn't read for a long time - many updates with regard to SSD devices (many by you, Ed, no?). I also found the long thread on this list about SSD best practices, which I somehow missed in my first pass. After reading this, I became much more nervous. My previous assumption when I added the log was based on the IOP rate I saw to the ZIL and the number of IOPs an Intel X25-E could take, and it looked like the drive should last a few years, at least. But of course, that assumes no other failure modes. Given the high price of failure, now that I know the system will suddenly go south, I realized that action needed to be taken ASAP to mirror the log.

> I'm afraid it's too late for that, unless you're willing to destroy and recreate your pool. You cannot remove the existing log device. You cannot shrink it. You cannot replace it with a smaller one. The only things you can do right now are:
>
> (a) Start mirroring that log device with another device of the same size or larger. or
>
> (b) Buy another SSD which is larger than the first. Create a slice on the 2nd which is equal to the size of the first. Mirror the first onto the slice of the 2nd. After the resilver, detach the first drive, and replace it with another one of the larger drives. Slice the 3rd drive just like the 2nd, and mirror the 2nd drive's slice onto it. Now you've got a mirrored, sliced log device, without any downtime, but you had to buy two 2x-larger drives in order to do it. or
>
> (c) Destroy and recreate your whole pool, but learn from your mistake. This time, slice each SSD, and mirror the slices to form the log device.
>
> BTW, ask me how I know this in such detail? It's cuz I made the same mistake last year.
>
> There was one interesting possibility we considered, but didn't actually implement: We are running a stripe of mirrors. We considered breaking the mirrors and creating a new pool out of the other half, using the SSD properly sliced, then using zfs send to replicate all the snapshots over to the new pool, up to a very recent time. Then we'd be able to make a very short service window: shut down briefly, send that one final snapshot to the new pool, destroy the old pool, rename the new pool to take the old name, and bring the system back up again - instead of scheduling a long service window. As soon as the system is up again, start mirroring and resilvering (er... initial silvering), and of course, slice the SSD before attaching the mirror. Naturally there is some risk, running un-mirrored long enough to send the snaps... and so forth. Anyway, just an option to consider.

Destroying this pool is very much off the table. It holds home directories for our whole lab, about 375 of them. If I take the system offline, then no one works until it's back up. You could say this machine is mission critical. The host has been very reliable; everyone is now spoiled by how it never goes down, and I'm very proud of that fact. The only way I could recreate the pool would be through some clever means like you give, or perhaps using AVS to replicate one side of the mirror, so that everything could be done through a quick reboot.

One other idea I had was using a sparse zvol for the log, but I think eventually the sparse volume would fill up beyond its physical capacity. On top of that, this would mean we would have a log that is a zvol from another zpool, which I think could cause a boot race condition.

I think the real solution to my immediate problem is this: bite the bullet, and add storage to the existing pool. It won't be as clean as I'd like, and it would disturb my nicely balanced mirror stripe with new large empty vdevs, which I fear could impact performance down the road when the original stripe fills up and all writes go to the new vdevs. Perhaps by the time that happens, the feature to rebalance the pool will be available, if that's even being worked on. Maybe that's wishful thinking. At any rate, if I don't have to add another pool, I can mirror the logs I have: problem solved.

Finally, I'm told by my SE that ZFS
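For the record, option (b) above boils down to a short command sequence. A hypothetical sketch with made-up device names (c3t0d0 = existing log SSD, c3t1d0/c3t2d0 = the new larger SSDs, each with a slice s0 sized to match the original); it is printed rather than executed, since the real thing reshapes a live pool:

```shell
# Hypothetical command sequence for option (b); pool and device names are invented.
plan='
zpool attach tank c3t0d0 c3t1d0s0    # mirror the existing log onto a slice of SSD #2
zpool status tank                    # wait until the resilver completes
zpool detach tank c3t0d0             # retire the original, unsliced SSD
zpool attach tank c3t1d0s0 c3t2d0s0  # re-mirror onto a matching slice of SSD #3
'
echo "$plan"
```

The key property, as Ed notes, is that attach/detach on a log mirror needs no downtime; only the slice sizes have to line up.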
[zfs-discuss] modified mdb and zdb
Hi, I would really appreciate it if any of you could help me get the modified mdb and zdb (in any version of OpenSolaris) for digital forensic research purposes. Thank you. Jonathan Cifuentes
[zfs-discuss] Migrating ZFS/data pool to new pool on the same system
Can anyone confirm my action plan is the proper way to do this? The reason I'm doing this is I want to create 2x raidz2 pools instead of expanding my current 2x raidz1 pool. So I'll create a 1x raidz2 vdev, migrate my current 2x raidz1 pool over, destroy that pool, and then add it as a 1x raidz2 vdev to the new pool. I'm running b130, sharing both with CIFS and iSCSI (not COMSTAR), with multiple descendant file systems. Other than a couple of VirtualBox machines that use the pool for storage (I'll shut them down), nothing on the server should be messing with the pool. As I understand it, the old way of doing iSCSI is going away, so I should plan on COMSTAR. I'm also thinking I should just unshare the CIFS shares to prevent any of my computers from writing to them.

So, migrating from pool1 to pool2:

0. Turn off auto-snapshots.
1. Create snapshot - zfs snapshot -r pool1@snap1
2. Send/receive - zfs send -R pool1@snap1 | zfs receive -F -d test2
3. Unshare CIFS and remove iSCSI targets. For the iSCSI targets, it seems like I can't re-use them for COMSTAR, and the reservations aren't carried over for block devices? I may just destroy them beforehand; nothing important on them.
4. Create new snapshot - zfs snapshot -r pool1@snap2
5. Send incremental stream - zfs send -R -i snap1 pool1@snap2 | zfs receive -F -d test2

Repeat steps 4 and 5 as necessary.

6. Offline pool1... if I don't plan on destroying it right away.

Other than zfs list, is there anything I should check to make sure I received all the data on the new pool?
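On that last question: comparing the recursive snapshot lists on both pools is a quick sanity check, since zfs receive -F -d should have recreated every dataset and snapshot. A sketch with the zfs commands commented out and stand-in listings so it can be shown end to end (pool names taken from the plan above):

```shell
# Real usage (pool names from the migration plan above):
# zfs list -H -r -t snapshot -o name pool1 | sed 's|^pool1|POOL|' > /tmp/old.list
# zfs list -H -r -t snapshot -o name test2 | sed 's|^test2|POOL|' > /tmp/new.list
printf 'POOL@snap1\nPOOL@snap2\n' > /tmp/old.list   # stand-in data for the sketch
printf 'POOL@snap1\nPOOL@snap2\n' > /tmp/new.list
diff /tmp/old.list /tmp/new.list && echo "snapshot lists match"
```

Beyond that, comparing `zfs list -r -o name,used` on both pools and spot-checking checksums of a few large files gives reasonable confidence before destroying the source.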
[zfs-discuss] Replaced drive in zpool, was fine, now degraded - ohno
I just started replacing drives in this zpool (to increase storage). I pulled the first drive and replaced it with a new drive, and all was well. It resilvered with 0 errors. This was 5 days ago. Just today I was looking around and noticed that my pool was degraded (I see now that this occurred last night). Sure enough, there are 12 read errors on the new drive. I'm on snv_111b. I attempted to get smartmontools working, but it doesn't seem to want to work as these are all SATA drives. fmdump indicates that the read errors occurred within about 10 minutes of one another. Is it safe to say this drive is bad, or is there anything else I can do about this? Thanks, Jon

$ zpool status MyStorage
  pool: MyStorage
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
 scrub: scrub completed after 8h7m with 0 errors on Sun Apr 11 13:07:40 2010
config:

        NAME         STATE     READ WRITE CKSUM
        MyStorage    DEGRADED     0     0     0
          raidz1     DEGRADED     0     0     0
            c5t0d0   ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
            c6t1d0   ONLINE       0     0     0
            c7t1d0   FAULTED     12     0     0  too many errors

errors: No known data errors

$ fmdump
TIME                 UUID                                 SUNW-MSG-ID
Apr 09 16:08:04.4660 1f07d23f-a4ba-cbbb-8713-d003d9771079 ZFS-8000-D3
Apr 13 22:29:02.8063 e26c7e32-e5dd-cd9c-cd26-d5715049aad8 ZFS-8000-FD

That first log entry is the original drive being replaced. The second is the read errors on the new drive.
Re: [zfs-discuss] Replaced drive in zpool, was fine, now degraded - ohno
I just ran 'iostat -En'. This is what was reported for the drive in question (all other drives showed 0 errors across the board; all drives indicated Illegal Request errors but 0 for Predictive Failure Analysis):

c7t1d0 Soft Errors: 0 Hard Errors: 36 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0002 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 36 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 126 Predictive Failure Analysis: 0
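When watching counters like these across many drives, a few lines of scripting beat eyeballing raw iostat -En output. A rough sketch that scrapes the counters shown above (the field layout is assumed from this post's sample; it is not a stable, documented interface):

```python
import re

SAMPLE = """c7t1d0 Soft Errors: 0 Hard Errors: 36 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0002 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 36 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 126 Predictive Failure Analysis: 0"""

def parse_iostat_errors(text):
    """Collect per-device error counters from 'iostat -En' style output."""
    devices, current = {}, None
    header = re.compile(
        r"(\w+)\s+Soft Errors: (\d+)\s+Hard Errors: (\d+)\s+Transport Errors: (\d+)")
    for line in text.splitlines():
        m = header.match(line)
        if m:
            current = m.group(1)
            devices[current] = {"soft": int(m.group(2)),
                               "hard": int(m.group(3)),
                               "transport": int(m.group(4))}
        elif current:
            # Pick up the detail counters on the follow-on lines.
            for field in re.finditer(r"(Media Error|Illegal Request): (\d+)", line):
                devices[current][field.group(1)] = int(field.group(2))
    return devices

print(parse_iostat_errors(SAMPLE))
```

Run daily and diffed, this makes a drive whose Hard Error or Media Error counts are climbing (like c7t1d0 here) stand out immediately.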
Re: [zfs-discuss] Replaced drive in zpool, was fine, now degraded - ohno
Yeah:

$ smartctl -d sat,12 -i /dev/rdsk/c5t0d0
smartctl 5.39.1 2010-01-28 r3054 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

I'm thinking that between 111 and 132 (mentioned in the post) something changed.
Re: [zfs-discuss] Replaced drive in zpool, was fine, now degraded - ohno
> Do worry about media errors. Though this is the most common HDD error, it is also the cause of data loss. Fortunately, ZFS detected this and repaired it for you.

Right. I assume you do recommend swapping the faulted drive out, though?

> Other file systems may not be so gracious. -- richard

As we are all too aware, I'm sure :)
[zfs-discuss] Couple Questions about replacing a drive in a zpool
First, a little background: I'm running b130, and I have a zpool with two raidz1 vdevs (each 4 drives, all WD RE4-GPs). They're in a Norco-4220 case (home server), which just consists of SAS backplanes (AOC-USAS-L8i HBA, SFF-8087 to the backplane, SATA drives). A couple of the drives are showing a number of hard/transport/media errors (weekly scrubs are fine), and I'm guessing that can explain some of the slower throughput I've been seeing lately (the errors are increasing more rapidly). I do have a hot spare for the zpool, and I'm thinking of doing an advance replacement (RMA) one drive at a time to minimize risk/downtime.

Here's the first question: since the current drive is working, when I choose to replace it with the hot spare, does the current drive cease to have data written to it? E.g., if the hot spare fails, does a resilver need to occur on the current drive?

Second question: is there a way to power down the current drive while the system is running? I mean by command line, which I would think would be more graceful than my current plan of pulling the drive. The reason I ask is I'm only semi-confident I know the physical layout of the (logical) drive ordering.

Last question: is there another way to see HD serial numbers? iostat doesn't show them. I'm willing to power down the system and pull the slots I think the drives are in to retrieve the serials; it would just be nice not to, since my home network depends on a lot of the services (AD/DHCP/DNS) that run on the server. Though if I do have to power it down, it's not the end of the world.

Example iostat -En output:

c8t6d0 Soft Errors: 0 Hard Errors: 160 Transport Errors: 213
Vendor: ATA Product: WDC WD2002FYPS-0 Revision: 5G04 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 50 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
Re: [zfs-discuss] Couple Questions about replacing a drive in a zpool
Someone pointed me to a thread and to try cfgadm -v to get the serial number, but it doesn't work for me. Sounds like it would be for SATA only?

However, through that thread I found smartmontools (smartctl), and I was able to use it to determine the serial numbers. Fantastic!
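That per-drive dance can be scripted. A hedged sketch, assuming the same '-d sat,12' workaround applies on every port and Solaris-style /dev/rdsk device names (both are guesses for any given box):

```shell
# Hypothetical loop: print the serial number of every disk smartctl can reach.
# '-d sat,12' is the SAT workaround mentioned upthread; adjust per controller.
found=0
for d in /dev/rdsk/c*t*d0; do
  [ -e "$d" ] || continue
  found=1
  echo "== $d =="
  smartctl -d sat,12 -i "$d" | grep -i 'serial'
done
[ "$found" -eq 1 ] || echo "no /dev/rdsk devices present on this system"
```

Mapping each serial to a bay before a drive dies (label the chassis!) avoids exactly the pull-the-wrong-disk worry raised in the question.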
Re: [zfs-discuss] This is the scrub that never ends...
On Sep 9, 2009, at 9:29 PM, Bill Sommerfeld wrote:

> On Wed, 2009-09-09 at 21:30 +0000, Will Murnane wrote:
>> Some hours later, here I am again: scrub: scrub in progress for 18h24m, 100.00% done, 0h0m to go. Any suggestions?
>
> Let it run for another day. A pool on a build server I manage takes about 75-100 hours to scrub, but typically starts reporting 100.00% done, 0h0m to go at about the 50-60 hour point. I suspect the combination of frequent time-based snapshots and a pretty active set of users causes the progress estimate to be off..

out of curiosity - do you have a lot of small files in the filesystem? zdb -s pool might be interesting to observe too

--- .je

(oh, and thanks for the subject line .. now i've had this song stuck in my head for a couple days :P)
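When the built-in estimate pins at 100.00% like this, logging your own elapsed-time numbers from zpool status is about all you can do. A small sketch that parses the progress line (the format is taken from the quote above and varies between builds):

```python
import re

def scrub_progress(line):
    """Parse (elapsed minutes, percent done) from a zpool scrub status line."""
    m = re.search(r"in progress for (?:(\d+)h)?(\d+)m.*?([\d.]+)% done", line)
    if not m:
        return None
    elapsed = int(m.group(1) or 0) * 60 + int(m.group(2))
    return elapsed, float(m.group(3))

line = "scrub: scrub in progress for 18h24m, 100.00% done, 0h0m to go"
print(scrub_progress(line))  # → (1104, 100.0)
```

Sampling this every hour at least shows whether elapsed time is still advancing normally while the percentage sits at 100, which distinguishes a stuck scrub from a merely bad estimate.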
Re: [zfs-discuss] Books on File Systems and File System Programming
On Aug 14, 2009, at 11:14 AM, Peter Schow wrote:

> On Thu, Aug 13, 2009 at 05:02:46PM -0600, Louis-Frédéric Feuillette wrote:
>> I saw this question on another mailing list, and I too would like to know. And I have a couple questions of my own.
>> == Paraphrased from other list ==
>> Does anyone have any recommendations for books on File Systems and/or File Systems Programming?
>> == end ==
>
> Going back ten years, but still a good tutorial: Practical File System Design with the Be File System by Dominic Giampaolo
> http://www.nobius.org/~dbg/practical-file-system-design.pdf

I think he's still at Apple now, working on Spotlight .. his fs-kit is good study too: http://www.nobius.org/~dbg/fs-kit-0.4.tgz

for understanding the vnode/vfs interface - you might want to take a look at:
- Solaris Internals (2nd edition) - chapter 14
- Zadok's FiST paper: http://www.fsl.cs.sunysb.edu/docs/zadok-thesis-proposal/

UFS:
- Solaris Internals (2nd edition) - chapter 15

HFS+:
- Amit Singh's Mac OS X Internals - chapter 11 (see http://osxbook.com/)

then opensolaris src of course for:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/
http://opensolaris.org/os/community/zfs/source/
http://opensolaris.org/os/project/samqfs/sourcecode/
http://opensolaris.org/os/project/ext3/
Re: [zfs-discuss] Understanding SAS/SATA Backplanes and Connectivity
> We have a SC846E1 at work; it's the 24-disk, 4U version of the 826E1. It's working quite nicely as a SATA JBOD enclosure. We'll probably be buying another in the coming year to have more capacity.
>
> Good to hear. What HBA(s) are you using against it?

I've got one too, and it works great. I use the LSI SAS 3442E, which also gives you an external SAS port. You don't need a fancy HBA with onboard RAID. Configure it to IT mode.
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 11:57 AM, Bob Friesenhahn wrote:

> This brings me to the absurd conclusion that the system must be rebooted immediately prior to each use.

see Phil's later email .. an export/import of the pool or a remount of the filesystem should clear the page cache - with mmap'd files you essentially have them both in the page cache and also in the ARC .. then invalidations in the page cache are going to have effects on dirty data in the cache

> /etc/system tunables are currently:
>
> set zfs:zfs_arc_max = 0x28000
> set zfs:zfs_write_limit_override = 0xea60
> set zfs:zfs_vdev_max_pending = 5

if you're on x86 - i'd also increase maxphys to 128K .. we still have a 56KB default value in there, which is still a bad thing (IMO)

--- .je
Re: [zfs-discuss] cannot mount '/tank/home': directory is not empty
i've seen a problem where periodically a 'zfs mount -a' and sometimes a 'zpool import pool' can create what appears to be a race condition on nested mounts .. that is .. let's say that i have:

FS          mountpoint
pool        /export
pool/fs1    /export/home
pool/fs2    /export/home/bob
pool/fs3    /export/home/bob/stuff

if pool is imported (or a mount -a is done) and somehow pool/fs3 mounts first - then it will create /export/home and /export/home/bob, and pool/fs1 and pool/fs2 will fail to mount .. this seems to be happening on more recent builds, but not predictably - so i'm still trying to track down what's going on

On Jun 10, 2009, at 1:01 PM, Richard Elling wrote:

> Something is bothering me about this thread. It seems to me that if the system provides an error message such as cannot mount '/tank/home': directory is not empty, then the first plan of action should be to look and see what is there, no? The issue of overlaying mounts has existed for about 30 years, and invariably one discovers that events which lead to different data in overlapping directories are the result of some sort of procedural issue. Perhaps, once again, ZFS is a singing canary?
> -- richard
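A workaround while the race is being tracked down is to mount the datasets explicitly, ordered so that every parent mountpoint comes before its children. A sketch of the ordering logic, using the example hierarchy above (in real use, each name would then be passed to 'zfs mount <name>' in this order):

```python
def mount_order(fs):
    """Order datasets so every parent mountpoint mounts before its children,
    avoiding the nested-mount race: sort by mountpoint depth, then name."""
    return sorted(fs, key=lambda name: (fs[name].count("/"), fs[name]))

fs = {
    "pool":     "/export",
    "pool/fs1": "/export/home",
    "pool/fs2": "/export/home/bob",
    "pool/fs3": "/export/home/bob/stuff",
}
print(mount_order(fs))  # → ['pool', 'pool/fs1', 'pool/fs2', 'pool/fs3']
```

With this ordering, pool/fs3 can never mount before pool/fs1 and pool/fs2 exist as mounted filesystems, so the empty placeholder directories that break 'zfs mount -a' are never created.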
[zfs-discuss] Does zpool clear delete corrupted files
Hi list, First off:

# cat /etc/release
Solaris 10 6/06 s10x_u2wos_09a X86
Copyright 2006 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 09 June 2006

Here's an (almost) disaster scenario that came to life over the past week. We have a very large zpool containing over 30TB, composed (foolishly) of three concatenated iSCSI SAN devices. There's no redundancy in this pool at the ZFS level. We are actually in the process of migrating this to an x4540 + J4500 setup, but since the x4540 is part of the existing pool, we need to mirror it, then detach it so we can build out the replacement storage.

What happened was that some time after I had attached the mirror to the x4540, the scsi_vhci/network connection went south, and the server panicked. In the 2.5 years this system has been up, this has never happened before. When we got the thing glued back together, it immediately started resilvering from the beginning, and reported about 1.9 million data errors. The list from zpool status -v gave over 883k bad files. This is a small percentage of the total number of files in this volume, which is over 80 million (about 1%).

My question is this: when we clear the pool with zpool clear, what happens to all of the bad files? Are they deleted from the pool, or do the error counters just get reset, leaving the bad files intact? I'm going to perform a full backup of this guy (not so easy on my budget), and I would rather only get the good files.

Thanks, Jon

Jonathan Loran - IT Manager - Space Sciences Laboratory, UC Berkeley - (510) 643-5146 - jlo...@ssl.berkeley.edu
Re: [zfs-discuss] Does zpool clear delete corrupted files
Kinda scary then. Better make sure we delete all the bad files before I back it up. What's odd is we've checked a few hundred files, and most of them don't seem to have any corruption. I'm thinking what's wrong is that the metadata for these files is corrupted somehow, yet we can read them just fine. I wish I could tell which ones are really bad, so we wouldn't have to recreate them unnecessarily. They are mirrored in various places, or can be recreated via reprocessing, but recreating/restoring that many files is no easy task.

Thanks, Jon

On Jun 1, 2009, at 2:41 PM, Paul Choi wrote:

> zpool clear just clears the list of errors (and # of checksum errors) from its stats. It does not modify the filesystem in any manner. You run zpool clear to make the zpool forget that it ever had any issues.
> -Paul

Jonathan Loran - IT Manager - Space Sciences Laboratory, UC Berkeley - (510) 643-5146 - jlo...@ssl.berkeley.edu
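One way to act on a damaged-file list that large is to scrape the paths out of zpool status -v and hand them to the backup tool as an exclude list. A sketch against made-up sample output (the exact header wording is assumed, and real output for 883k files would of course be huge):

```python
def bad_files(status_output):
    """Collect file paths listed after the 'Permanent errors' header in
    'zpool status -v' output."""
    files, capture = [], False
    for line in status_output.splitlines():
        if "Permanent errors have been detected" in line:
            capture = True
        elif capture and line.strip().startswith("/"):
            files.append(line.strip())
    return files

SAMPLE = """errors: Permanent errors have been detected in the following files:

        /tank/data/a.fits
        /tank/data/b.fits
"""
print(bad_files(SAMPLE))  # → ['/tank/data/a.fits', '/tank/data/b.fits']
```

The resulting list can be fed to, e.g., tar or rsync exclude files, so the full backup only picks up files ZFS still considers good.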
Re: [zfs-discuss] Data size grew.. with compression on
OpenSolaris Forums wrote:

> if you rsync data to zfs over existing files, you need to take something more into account: if you have a snapshot of your files and rsync the same files again, you need to use the --inplace rsync option, otherwise completely new blocks will be allocated for the new files. that's because rsync will write an entirely new file and rename it over the old one.

ZFS will allocate new blocks either way; check http://all-unix.blogspot.com/2007/03/zfs-cow-and-relate-features.html for more information about how Copy-On-Write works.

Jonathan
Re: [zfs-discuss] Data size grew.. with compression on
Daniel Rock wrote: Jonathan schrieb: OpenSolaris Forums wrote: if you have a snapshot of your files and rsync the same files again, you need to use the --inplace rsync option, otherwise completely new blocks will be allocated for the new files. that's because rsync will write an entirely new file and rename it over the old one. ZFS will allocate new blocks either way No it won't. --inplace doesn't rewrite blocks identical on source and target but only blocks which have been changed. I use rsync to synchronize a directory with a few large files (each up to 32 GB). Data normally gets appended to one file until it reaches the size limit of 32 GB. Before I used --inplace a snapshot needed on average ~16 GB. Now with --inplace it is just a few kBytes. It appears I may have misread the initial post. I don't really know how I misread it, but I think I missed the snapshot portion of the message and got confused. I understand the interaction between snapshots, rsync, and --inplace being discussed now. My apologies, Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can this be done?
Michael Shadle wrote: On Sat, Mar 28, 2009 at 1:37 AM, Peter Tribble peter.trib...@gmail.com wrote: zpool add tank raidz1 disk_1 disk_2 disk_3 ... (The syntax is just like creating a pool, only with add instead of create.) so I can add individual disks to the existing tank zpool anytime i want? Using the command above that Peter gave you would get you a result similar to this:

        NAME        STATE     READ WRITE CKSUM
        storage2    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad14    ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da3     ONLINE       0     0     0

The actual setup is a RAIDZ1 of 1.5TB drives and a RAIDZ1 of 500GB drives with the data striped across the two RAIDZs. In your case it would be 7 drives in each RAIDZ based on what you said before, but I don't have *that* much money for my home file server. so essentially you're telling me to keep it at raidz1 (not raidz2, which many people start stressing once you get up to a certain number of disks, like 8 or so)? This really depends on how valuable your data is. Richard Elling has a lot of great information about MTTDL here: http://blogs.sun.com/relling/tags/mttdl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.
On Mar 6, 2009, at 8:58 AM, Andrew Gabriel wrote: Jim Dunham wrote: ZFS, the filesystem, is always on-disk consistent, and ZFS does maintain filesystem consistency through coordination between the ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for SNDR, ZFS caches a lot of an application's filesystem data in the ZIL, therefore the data is in memory, not written to disk, so SNDR does not know this data exists. ZIL flushes to disk can be seconds behind the actual application writes completing, and if SNDR is running asynchronously, these replicated writes to the SNDR secondary can be additional seconds behind the actual application writes. Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no 'supported' way to get ZFS to empty the ZIL to disk on demand. I'm wondering if you really meant ZIL here, or ARC? In either case, creating a snapshot should get both flushed to disk, I think? (If you don't actually need a snapshot, simply destroy it immediately afterwards.) not sure if there's another way to trigger a full flush or lockfs, but to make sure you do have all transactions that may not have been flushed from the ARC you could just unmount the filesystem or export the zpool .. with the latter you wouldn't have to worry about the -f on the import --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
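For what it's worth, the flush options discussed above look like this in command form -- a hedged sketch with hypothetical dataset/pool names, echoing Andrew's snapshot trick and .je's unmount/export suggestions rather than any officially supported interface:

```shell
# Andrew's trick: creating (then discarding) a snapshot forces the
# pending transaction group to commit to disk
zfs snapshot tank/data@sndr-flush
zfs destroy tank/data@sndr-flush

# .je's heavier-handed alternatives: unmount the filesystem, or
# export/import the whole pool
zfs unmount tank/data
zpool export tank
zpool import tank
```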
Re: [zfs-discuss] replace same sized disk fails with too small error
not quite .. it's 16KB at the front and 8MB at the back of the disk (16384 sectors) for the Solaris EFI - so you need to zero out both of these of course. since these drives are 1TB, i find it's easier to format to SMI (vtoc) .. with format -e (choose SMI, label, save, validate - then choose EFI) but to Casper's point - you might want to make sure that fdisk is using the whole disk .. you should probably reinitialize the fdisk sectors either with the fdisk command or run fdisk from format (delete the partition, create a new partition using 100% of the disk, blah, blah) .. finally - glancing at the format output - there appears to be a mix of labels on these disks as you've got a mix of c#d# entries and c#t#d# entries so i might suspect fdisk might not be consistent across the various disks here .. also noticed that you dumped the vtoc for c3d0 and c4d0, but you're replacing c2d1 (of unknown size/layout) with c1d1 (never dumped in your emails) .. so while this has been an animated (slightly trollish) discussion on right-sizing (odd - I've typically only seen that term as an ONTAPism) with some short-stroking digs .. it's a little unclear what the c1d1s0 slice looks like here or what the cylinder count is - i agree it should be the same - but it would be nice to see from my armchair here On Jan 22, 2009, at 3:32 AM, Dale Sears wrote: Would this work? (to get rid of an EFI label). dd if=/dev/zero of=/dev/dsk/thedisk bs=1024k count=1 Then use format. format might complain that the disk is not labeled. You can then label the disk. Dale Antonius wrote: can you recommend a walk-through for this process, or a bit more of a description? I'm not quite sure how I'd use that utility to repair the EFI label ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
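To make the two wipe regions concrete, here is a hedged sketch run against a scratch disk-image file, so it is safe to execute anywhere. Against a real drive you would aim the same dd invocations at the raw device (as root, destroying every label and all data on it); the 8 MiB figure is the 16384 512-byte sectors mentioned above.

```shell
cd "$(mktemp -d)"

# 64 MiB file standing in for the disk, filled with non-zero data
dd if=/dev/urandom of=fakedisk.img bs=1048576 count=64 2>/dev/null

# 1) zero 16 KiB at the front (primary label / VTOC area)
dd if=/dev/zero of=fakedisk.img bs=1024 count=16 conv=notrunc 2>/dev/null

# 2) zero 8 MiB (16384 x 512-byte sectors) at the back (backup EFI label)
size=$(wc -c < fakedisk.img)
seek=$(( (size - 8388608) / 1048576 ))
dd if=/dev/zero of=fakedisk.img bs=1048576 seek="$seek" count=8 conv=notrunc 2>/dev/null
```

conv=notrunc is what keeps dd from truncating the file (or device node) when seeking to the tail.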
Re: [zfs-discuss] Can I create ZPOOL with missing disks?
Tomas Ögren wrote: On 15 January, 2009 - Jim Klimov sent me these 1,3K bytes: Is it possible to create a (degraded) zpool with placeholders specified instead of actual disks (parity or mirrors)? This is possible in linux mdadm (missing keyword), so I kinda hoped this can be done in Solaris, but didn't manage to. Usecase scenario: I have a single server (or home workstation) with 4 HDD bays, sold with 2 drives. Initially the system was set up with a ZFS mirror for data slices. Now we got 2 more drives and want to replace the mirror with a larger RAIDZ2 set (say I don't want a RAID10 which is trivial to make). Technically I think that it should be possible to force creation of a degraded raidz2 array with two actual drives and two missing drives. Then I'd copy data from the old mirror pool to the new degraded raidz2 pool (zfs send | zfs recv), destroy the mirror pool and attach its two drives to repair the raidz2 pool. While obviously not an enterprise approach, this is useful while expanding home systems when I don't have a spare tape backup to dump my files on it and restore afterwards. I think it's an (intended?) limitation in zpool command itself, since the kernel can very well live with degraded pools. You can fake it.. [snip command set] Summary: yes, that actually works and I've done it, but it's very slow! I essentially did this myself when I migrated a 4x2-way mirror pool to a 2x4-disk raidz setup (4x 500GB and 4x 1.5TB). I can say from experience that it works, but since I used 2 sparse files to simulate 2 disks on a single physical disk, performance sucked and it took a long time to do the migration. IIRC it took over 2 days to transfer 2TB of data. I used rsync; at the time I either didn't know about or forgot about zfs send/receive, which would probably work better. It took a couple more days to verify that everything transferred correctly with no bit rot (rsync -c). 
I think Sun avoids making things like this too easy because, from a business standpoint, it's easier just to spend the money on enough hardware to do it properly without the chance of data loss and the extended downtime. Doesn't invest the time in it may be a better phrase than avoids, though. I doubt Sun actually goes out of their way to make things harder for people. Hope that helps, Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
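For the record, the fake-it approach behind the [snip command set] above goes roughly like this. An untested sketch with hypothetical device names, matching Jim's 4-bay scenario; as noted in the thread, expect it to be slow if the placeholder files live on a disk that is also in use:

```shell
# two sparse placeholder files, sized like the real disks
mkfile -n 1t /var/tmp/fake0 /var/tmp/fake1

# 4-way raidz2 from the two new real disks plus the two placeholders
zpool create newpool raidz2 c1t2d0 c1t3d0 /var/tmp/fake0 /var/tmp/fake1

# offline both placeholders right away -- raidz2 tolerates two missing
zpool offline newpool /var/tmp/fake0
zpool offline newpool /var/tmp/fake1

# copy the data over, then cannibalize the old mirror
zfs snapshot -r oldpool/data@move
zfs send oldpool/data@move | zfs receive newpool/data
zpool destroy oldpool
zpool replace newpool /var/tmp/fake0 c1t0d0
zpool replace newpool /var/tmp/fake1 c1t1d0
```

Until both replace operations resilver, the pool has zero remaining redundancy, which is the real risk of this trick.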
Re: [zfs-discuss] Drive Checksum error
Glaser, David wrote: Hi all, [snipped] So, is there a way to see if it is a bad disk, or just zfs being a pain? Should I reset the checksum error counter and re-run the scrub? You could try using smartctl to query the disk directly, although I don't recall if it works on the x4500. Normally 1 error is not a big deal. Clearing the errors and re-running the scrub would not hurt anything and if you get errors again then it may be worth checking the disk further. Perhaps swapping it with a known good drive to make sure the disk is the problem and not the cable. If you start seeing hundreds of errors be sure to check things like the cable. I had a SATA cable come loose on a home ZFS fileserver and scrub was throwing 100's of errors even though the drive itself was fine, I don't want to think about what could have happened with UFS... Hope that helps, Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Inexpensive ZFS home server
David Evans wrote: For anyone looking for a cheap home ZFS server... Dell is having a sale on their PowerEdge SC440 for $199 (regular $598) through 11/12/2008. http://www.dell.com/content/products/productdetails.aspx/pedge_sc440?c=uscs=04l=ens=bsd Its got Dual Core Intel® Pentium®E2180, 2.0GHz, 1MB Cache, 800MHz FSB and you can upgrade the memory (ECC too) to 2gb for 19$ bucks. @$199, I just ordered 2. dce I don't think the Pentium E2180 has the lanes to use ECC RAM. I'm also not confident the system board for this machine would make use of ECC memory either, which is not good from a ZFS perspective. How many SATA plugs are there on the MB in this guy? Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on Fit-PC Slim?
On 6 Nov 2008, at 04:09, Vincent Fox wrote: According to the slides I have seen, a ZFS filesystem even on a single disk can handle massive amounts of sector failure before it becomes unusable. I seem to recall it said 1/8th of the disk? So even on a single disk the redundancy in the metadata is valuable. And if I don't have really very much data I can set copies=2 so I have better protection for the data as well. My goal is a compact low-powered and low-maintenance widget. Eliminating the chance of fsck is always a good thing now that I have tasted ZFS. In my personal experience, disks are more likely to fail completely than suffer from small sector failures. But don't get me wrong, provided you have a good backup strategy and can afford the downtime of replacing the disk and restoring, then ZFS is still a great filesystem to use for a single disk. Don't be put off. Many of the people on this list are running multi-terabyte enterprise solutions and are unable to think in terms of non-redundant, small numbers of gigabytes :-) I'm going to try and see if Nevada will even install when it arrives, and report back. Perhaps BSD is another option. If not I will fall back to Ubuntu. I have FreeBSD and ZFS working fine(*) on a 1.8GHz VIA C7 (32bit) processor. Admittedly this is with 2GB of RAM, but I set aside 1GB for ARC and the machine is still showing 750MB free at the moment, so I'm sure it could run with 256MB of ARC in under 512MB. 1.8GHz is a fair bit faster than the Geode in the Fit-PC, but the C7 scales back to 900MHz and my machine still runs acceptably at that speed (although I wouldn't want to buildworld with it). I say, give it a go and see what happens. I'm sure I can still dimly recall a time when 500MHz/512MB was a kick-ass system... Jonathan (*) This machine can sustain 110MB/s off of the 4-disk RAIDZ1 set, which is substantially more than I can get over my 100Mb network. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] ZFS Success Stories
We have 135 TB capacity with about 75 TB in use on zfs based storage. zfs use started about 2 years ago, and has grown from there. This spans 9 SAN appliances, with 5 head nodes, and 2 more recent servers running zfs on JBOD with vdevs made up of raidz2. So far, the experience has been very positive. Never lost a bit of data. We scrub weekly, and I've started sleeping better at night. I have also read the horror stories, but we aren't seeing them here. We did have some performance issues, especially involving the SAN storage on more heavily used systems, but enabling the cache on the SAN devices without pushing fsync through to disk basically fixed that. Your zfs layout can profoundly affect performance, which is a downside. It's best to test your setup under an approximately realistic workload to balance capacity with performance before deploying. BTW, most of our zfs deployment is on Solaris 10{u4,u5}, but two large servers are on OpenSolaris snv_86. The OpenSolaris servers seem to be considerably faster, and more feature-rich, without any reliability issues, so far. Jon gm_sjo wrote: Hi all, I have built out an 8TB SAN at home using OpenSolaris + ZFS. I have yet to put it into 'production' as a lot of the issues raised on this mailing list are putting me off trusting my data onto the platform right now. Throughout time, I have stored my personal data on NetWare and now NT and this solution has been 100% reliable for the last 12 years. Never a single problem (nor have I had any issues with NTFS with the tens of thousands of spindles i've worked with over the years). I appreciate 99% of the time people only comment if they have a problem, which is why I think it'd be nice for some people who have successfully implemented ZFS, including making various use of the features (recovery, replacing disks, etc), could just reply to this post with a sentence or paragraph detailing how great it is for them. 
Not necessarily interested in very small implementations of one/two disks that haven't changed config since the first day it was installed, but more aimed towards setups that are 'organic' and have changed/been_administered over time (to show functionality of the tools, resilience of the platform, etc.).. .. Of course though, I guess a lot of people who may have never had a problem wouldn't even be signed up on this list! :-) Thanks! ___ storage-discuss mailing list [EMAIL PROTECTED] http://mail.opensolaris.org/mailman/listinfo/storage-discuss -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] [Fwd: Another ZFS question]
Hi Please see the query below. Appreciate any help. Rgds jonathan Original Message Would you mind helping me ask your tech guy whether there will be repercussions when I try to run this command in view of the situation below: # zpool add -f zhome raidz c6t6006016056AC1A00C8FB7A6346F8DB11d0 c6t6006016056AC1A00D034FA5246F8DB11d0 --- bash-3.00# zpool status

  pool: zhome
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress, 3.05% done, 13h52m to go
config:

        NAME                                       STATE   READ WRITE CKSUM
        zhome                                      ONLINE     0     0 2.20K
          raidz1                                   ONLINE     0     0 2.20K
            c6t60060160A16D1B003E5B94CAEC46DC11d0  ONLINE     0     0     0
            c6t60060160A16D1B005A7106AEEC46DC11d0  ONLINE     0     0     0
            c6t60060160A16D1B007AC27FB9EC46DC11d0  ONLINE     0     0     0
            c6t60060160A16D1B003870C8A5EC46DC11d0  ONLINE     0     0     0

errors: 1 data errors, use '-v' for a list

  pool: zhome2
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE   READ WRITE CKSUM
        zhome2                                     ONLINE     0     0     0
          c6t6006016056AC1A004069225BE146DC11d0    ONLINE     0     0     0
          c6t6006016056AC1A008253EF629235DC11d0    ONLINE     0     0     0

errors: No known data errors

bash-3.00# zpool add -n zhome raidz c6t6006016056AC1A00C8FB7A6346F8DB11d0 c6t6006016056AC1A00D034FA5246F8DB11d0 c6t6006016056AC1A007860F66946F8DB11d0
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses 4-way raidz and new vdev uses 3-way raidz

Thanks and regards, Andre ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
Ross Becker wrote: Okay, after doing some testing, it appears that the issue is on the ZFS side. I fiddled around a while with options on the areca card, and never got any better performance results than my first test. So, my best out of the raidz2 is 42 mb/s write and 43 mb/s read. I also tried turning off crc's (not how I'd run production, but for testing), and got no performance gain. After fiddling with options, I destroyed my zfs zpool, and tried some single-drive bits. I simply used newfs to create filesystems on single drives, mounted them, and ran some single-drive bonnie++ tests. On a single drive, I got 50 mb/sec write, 70 mb/sec read. I also tested two benchmarks on two drives simultaneously, and on each of the tests, the result dropped by about 2mb/sec, so I got a combined 96 mb/sec write, 136 mb/sec read with two separate UFS filesystems on two separate disks. So next steps? --ross Raidz(2) vdevs can sustain the max iops of a single drive in the vdev. I'm curious what zpool iostat would say while bonnie++ is running its "Writing intelligently" test. The throughput sounds very low to me, but the clue here is that the single-drive speed is in line with the raidz2 vdev, so if a single drive is being limited by iops, not by raw throughput, then this IO result makes sense. For fun, you should make two vdevs out of two raidz to see if you get twice the throughput, more or less. I'll bet the answer is yes. Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs-auto-snapshot default schedules
On 25 Sep 2008, at 14:40, Ross wrote: For a default setup, I would have thought a years worth of data would be enough, something like: Given that this can presumably be configured to suit everyone's particular data retention plan, for a default setup, what was originally proposed seems obvious and sensible to me. Going slightly off-topic: All this auto-snapshot stuff is ace, but what's really missing, in my view, is some easy way to actually determine where the version of the file you want is. I typically find myself futzing about with diff across a dozen mounted snapshots trying to figure out when the last good version is. It would be great if there was some way to know if a snapshot contains blocks for a particular file, i.e., that snapshot contains an earlier version of the file than the next snapshot / now. If you could do that and make ls support it with an additional flag/column, it'd be a real time-saver. The current mechanism is especially hard as the auto-mount dirs can only be found at the top of the filesystem so you have to work with long path names. An fs trick to make .snapshot dirs of symbolic links appear automagically would rock, i.e., % cd /foo/bar/baz % ls -l .snapshot [...] nightly.0 - /foo/.zfs/snapshot/nightly.0/bar/baz % diff {,.snapshot/nightly.0/}importantfile Yes, I know this last command can just be written as: % diff /foo/{,.zfs/snapshot/nightly.0}/bar/baz/importantfile but this requires me to a) type more; and b) remember where the top of the filesystem is in order to split the path. This is obviously more of a pain if the path is 7 items deep, and the split means you can't just use $PWD. [My choice of .snapshot/nightly.0 is a deliberate nod to the competition ;-)] Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
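A crude version of the "which snapshots actually contain a different version of this file?" helper can be scripted today: walk the snapshot directories in order and report only those where the file's content changed relative to its neighbour. The demo below is runnable anywhere because it uses plain directories under ./snaps to stand in for /pool/.zfs/snapshot (every path here is hypothetical); on real ZFS you would point SNAPROOT at the filesystem's .zfs/snapshot directory.

```shell
cd "$(mktemp -d)"

# stand-ins for /pool/.zfs/snapshot/<name>/bar/baz/importantfile
for s in nightly.0 nightly.1 nightly.2; do mkdir -p "snaps/$s/bar/baz"; done
printf 'new\n' > snaps/nightly.0/bar/baz/importantfile   # most recent
printf 'old\n' > snaps/nightly.1/bar/baz/importantfile
printf 'old\n' > snaps/nightly.2/bar/baz/importantfile   # unchanged twin

SNAPROOT=snaps
REL=bar/baz/importantfile

prev=
for snap in "$SNAPROOT"/*/; do
    f="$snap$REL"
    [ -f "$f" ] || continue
    # report only snapshots whose copy differs from the previous one seen
    if [ -z "$prev" ] || ! cmp -s "$prev" "$f"; then
        echo "distinct version: $f"
    fi
    prev="$f"
done
```

This needs one cmp per adjacent pair instead of hand-diffing a dozen mounted snapshots, though it still suffers from the long-path problem described above.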
Re: [zfs-discuss] zfs-auto-snapshot default schedules
On 25 Sep 2008, at 17:14, Darren J Moffat wrote: Chris Gerhard has a zfs_versions script that might help: http://blogs.sun.com/chrisg/entry/that_there_is Ah. Cool. I will have to try this out. Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote: What is a ``failure rate for a time interval''?

    Failure rate = failures / unit time
    Failure rate for a time interval = (failures / unit time) * time

For example, if we have a failure rate:

    Fr = 46% failures/month

Then the expectation value of failures in one year:

    Fe = 46% failures/month * 12 months = 5.52 failures

Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
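The same unit analysis as a runnable one-liner -- just a sketch of the arithmetic, and note that Fe is an expectation value (it can exceed 1), not a probability:

```shell
# expected failures = rate (failures/month) x interval (months)
rate=0.46
months=12
awk -v r="$rate" -v t="$months" 'BEGIN { printf "Fe = %.2f failures\n", r * t }'
```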
Re: [zfs-discuss] corrupt zfs stream? checksum mismatch
Hi Richard, Thanks for the detailed reply, and the work behind the scenes filing the CRs. I've bookmarked both, and will keep a keen eye on them for status changes. As Miles put it, I'll have to put these dumps into storage for possible future use. I do dearly hope that I'll be able to recover most of that data in the future, but for the most important bits (documents/spreadsheets), I'll have to rebuild them by way of some rather intensive data entry based on hard copies, now. Not fun. I do have a working [zfs send dump!] backup from October, so it's not a total loss of my livelihood, but it'll be a life lesson alright. With CR 6736794, I wonder if some extra notes could be added around the checksumming side of the code? The wording that has been used doesn't quite match my scenario, but I certainly agree with the functionality that has been requested there. I have a 50GB zfs send dump and zfs receive is failing (and rolling back) around the 20GB mark. While the exact cause and nature of my issue remains unknown, I very much expect that the vast majority of my zfs send dump is in fact intact, including data beyond that 20GB checksum error point. I.e., there is a problem around the 20GB mark, but I expect that the remaining 30GB contains good data, or at the very least, *mostly* good data. The CR appears to be only requesting that zfs receive stop at the 20GB mark, but {new feature} allows the failed restore attempt to be mountable, in an unknown/known-bad state. I'd much prefer that zfs receive continue on error too, thus giving it the full 50GB to process and attempt to repair, rather than only the data up until the point that it encountered its first problem. Without knowing much about the actual on-disk format, metadata and structures I can't be sure, but the fs is going to have a much better chance at recovering when there is more data available across the entire length of the fs, right? 
I know from my linux days that the ext2/3 superblocks were distributed across the full disk, so the more of the disk that it can attempt to read, the better the chance that it'll find more correct metadata to use in an attempt to repair the FS. And of course the second benefit of reading more of the data stream past an error is that more user data will at least have a chance of being recovered. If it stops halfway, it has _no_ chance of recovering that data, so I favor my odds of letting it go on to at least try :) Or is that an entirely new CR itself? Jonathan This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] corrupt zfs stream? checksum mismatch
Hi Mattias Miles. To test the version mismatch theory, I setup a snv_91 VM (using virtualbox) on my snv_95 desktop, and tried the zfs receive again. Unfortunately the symptoms are exactly the same: around the ~20GB mark, the justhome.zfs stream still bombs out with the checksum error. I didn't realise that the zfs stream format wasn't backward compatible at the time that I made the backup, but having performed the above test, this doesn't actually appear to be my problem. I wish it were - that I could have dealt with! :( So far we've established that in this case: *Version mismatches aren't causing the problem. *Receiving across the network isn't the issue (because I have the exact same issue restoring the stream directly on my file server). *All that's left was the initial send, and since zfs guarantees end to end data integrity, it should have been able to deal with any network possible randomness in the middle (zfs on both ends) - or at absolute worst, the zfs send command should have failed, if it encountered errors. Seems fair, no? So, is there a major bug here, or at least an oversight in the zfs send part of the code? Does zfs send not do checksumming, or, verification after sending? I'm not sure how else to interpret this data. Today to add some more datapoints, I repeated a zfs send to the same nfs server from the same desktop, though this time I'm using zfs root with snv_95. Same hardware, same network, same commands, but this time I didn't have any issues with the zfs receive. ?!?!?!?! 
Miles: zfs receive -nv works ok: # zfs receive -vn rpool/test /net/supernova/Z/backup/angelous/justhome.zfs would receive full stream of faith/[EMAIL PROTECTED] into rpool/[EMAIL PROTECTED] Where it gets interesting is with my recursive zfs dump: bash-3.2# zfs receive -nvF -d rpool/test /net/supernova/Z/backup/angelous/pre-zfsroot.zfs would receive full stream of [EMAIL PROTECTED] into rpool/[EMAIL PROTECTED] would receive full stream of faith/[EMAIL PROTECTED] into rpool/test/[EMAIL PROTECTED] would receive full stream of faith/[EMAIL PROTECTED] into rpool/test/[EMAIL PROTECTED] would receive full stream of faith/[EMAIL PROTECTED] into rpool/test/[EMAIL PROTECTED] [EMAIL PROTECTED] is actually empty. faith/[EMAIL PROTECTED] bombs out around 2GB in, but I'm not really too worried about that fs. faith/[EMAIL PROTECTED] is also another fs that I can live without. faith/[EMAIL PROTECTED] is the one that we're after. It would seem that my justhome.zfs dump (containing only faith/[EMAIL PROTECTED]) isn't going to work, but is there some way to recover the /home fs from the pre-zfsroot.zfs dump? Since there seems to be a problem with the first fs (faith/virtualmachines), I need to find a way to skip restoring that zfs, so it can focus on the faith/home fs. How can this be achieved with zfs receive? Jonathan This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] corrupt zfs stream? checksum mismatch
Thanks for the information, I'm learning quite a lot from all this. It seems to me that zfs send *should* be doing some kind of verification, since some work has clearly been put into zfs so that zfs's can be dumped into files/pipes. It's a great feature to have, and I can't believe that this was purely for zfs send | zfs receive scenarios. A common example used all over the place is zfs send | ssh $host. In these examples is ssh guaranteeing the data delivery somehow? If not, there need to be some serious asterisks in these guides! Looking at this at a level that I do understand, it's going via TCP, which checksums packets. Then again, I was using nfs over TCP, and look where I am today. So much for that! As I google these subjects more and more, I fear that I'm hitting the conceptual mental block that many before me have hit also. zfs send is not zfsdump, even though it sure looks the same, and it's not clearly stated that you may end up in a situation like the one I'm in today if you don't somehow test your backups. As you've rightly pointed out, it's done now and even if I did manage to reproduce this again, that won't help my data locked away in these 2 .zfs files, so focusing on the hopeful: is there anything I can do to recover my data from these zfs dumps? Anything at all :) If the problem is just that zfs receive is checksumming the data on the way in, can I disable this somehow within zfs? Can I globally disable checksumming in the kernel module? mdb something or other? I read this thread where someone did successfully manage to recover data from a damaged zfs, which fills me with some hope: http://www.opensolaris.org/jive/thread.jspa?messageID=220125 It's way over my head, but if anyone can tell me the mdb commands I'm happy to try them, even if they do kill my cat. I don't really have anything to lose with a copy of the data, and I'll do it all in a VM anyway. 
Thanks, Jonathan This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] corrupt zfs stream? checksum mismatch
Hi folks, Perhaps I was a little verbose in my first post, putting a few people off. Does anyone else have any ideas on this one? I can't be the first person to have had a problem with a zfs backup stream. Is there nothing that can be done to recover at least some of the stream? As another helpful chap pointed out, if tar encounters an error in the bitstream it just moves on until it finds usable data again. Can zfs not do something similar? I'll take whatever i can get! Jonathan This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman wrote: # /usr/X11/bin/scanpci | /usr/sfw/bin/ggrep -A1 vendor 0x11ab device 0x6081 pci bus 0x0001 cardnum 0x01 function 0x00: vendor 0x11ab device 0x6081 Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller But it claims resolved for our version: SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc Perhaps I should see if there are any recommended patches for Sol 10 5/08? Jorgen, For Sol 10, you need to get the IDR patch for the Marvell controllers. Given the crummy support you're getting, you may have problems getting it. (Can anyone on this list help Jorgen?) From recent posts on this list, I don't think there's an official patch yet, but if so, get that instead. This should greatly improve matters for you. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] corrupt zfs stream? checksum mismatch
Hi Folks, I'm in the very unsettling position of fearing that I've lost all of my data via a zfs send/receive operation, despite ZFS's legendary integrity. The error that I'm getting on restore is: receiving full stream of faith/[EMAIL PROTECTED] into Z/faith/[EMAIL PROTECTED] cannot receive: invalid stream (checksum mismatch) Background: I was running snv_91, and decided to upgrade to snv_95, converting to the much-awaited zfs-root in the process. On snv_91, I was using zfs for /opt, /export/home, and a couple of other file systems under /export. I expected that converting to zfs root would require completely formatting my disk, so I needed to back up all of my critical data to a remote host beforehand. My main file server is running snv_71, using an 8-disk raid-Z, with plenty of space available via nfs, so I directed a zfs send across nfs to it. So it was zfs -> nfs -> zfs (raid-z). I don't remember the exact commands used, but I started off with a zfs snapshot -r, and then did a zfs send [EMAIL PROTECTED] > /my/nfs/server/backup.zfs This sent each of the filesystems across and redirected them into the one, single backup file. I wasn't all that confident that this was a wise move, as I didn't know how I was going to get just one fs (rather than all) extracted again at a later time using zfs receive (I'm open to answers on that one still!). So, I decided to *also* send just the snapshot of my home directory, which contains all of my vital information. A bit of extra peace of mind, eh? 2 backups are better than one. I then installed snv_95 from dvd, using zfs-root, destroying my previous zpool on the disk in the process. Here I am now, trying to restore my vital data that I backed up onto the nfs server, but it's not working! # cat justhome.zfs | zfs receive -v Z/faith/home receiving full stream of faith/[EMAIL PROTECTED] into Z/faith/[EMAIL PROTECTED] cannot receive: invalid stream (checksum mismatch) I just don't understand what's going on here. 
I started off restoring across nfs to my desktop with the standard options. I've tried disabling checksumming on the parent zfs fs, to ensure that checksumming wouldn't be used during the restore. I still got the checksum mismatch error. Next I tried restoring the zfs backup internally within the nfs server, making it all local disk traffic, on the off chance that it was the network on my new build that was somehow broken. No dice, same error, with or without checksumming on the parent fs. I've also tried my other backup file, but that's also having the same problem. In all I've tried about 8 combinations, and I'm breaking out in a sweat at the possibility of having lost all of my data. The zfs backup that included all file systems bombs out fairly early, on a small fs that was only a few GB. The zfs backup that included just my home fs gets around 20GB of the way through before failing with the same error (and deleting the partial zfs fs). I don't recall how big the original home fs was, perhaps 30-40GB, so it's a fair way through. What's causing this error, and if this situation is as dire as I'm fearing (please tell me it's not so!), why can't I at least have the 20GB of data that it can restore before it bombs out with that checksum error? Thanks for any help with this! Jonathan This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] The best motherboard for a home ZFS fileserver
Miles Nordin wrote: s == Steve [EMAIL PROTECTED] writes: s http://www.newegg.com/Product/Product.aspx?Item=N82E16813128354 no ECC: http://en.wikipedia.org/wiki/List_of_Intel_chipsets#Core_2_Chipsets This MB will take these: http://www.intel.com/products/processor/xeon3000/index.htm which do support ECC. Now I'm not sure, but I suspect that this Gigabyte MB doesn't have the ECC lanes. It's a lot more cash, but the following MB is on the HCL, and I have one in service working just swell: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182105 It has the plus (or minus, I suppose) of four PCI-X slots to plug in the AOC-SAT2-MV8 cards. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
From a reporting perspective, yes, zpool status should not hang, and should report an error if a drive goes away, or is in any way behaving badly. No arguments there. From the data integrity perspective, the only event zfs needs to know about is when a bad drive is replaced, such that a resilver is triggered. If a drive is suddenly gone, but it is only one component of a redundant set, your data should still be fine. Now, if enough drives go away to break the redundancy, that's a different story altogether. Jon Ross Smith wrote: I agree that device drivers should perform the bulk of the fault monitoring, however I disagree that this absolves ZFS of any responsibility for checking for errors. The primary goal of ZFS is to be a filesystem and maintain data integrity, and that entails both reading and writing data to the devices. It is no good having checksumming when reading data if you are losing huge amounts of data when a disk fails. I'm not saying that ZFS should be monitoring disks and drivers to ensure they are working, just that if ZFS attempts to write data and doesn't get the response it's expecting, an error should be logged against the device regardless of what the driver says. If ZFS is really about end-to-end data integrity, then you do need to consider the possibility of a faulty driver. Now I don't know what the root cause of this error is, but I suspect it will be either a bad response from the SATA driver, or something within ZFS that is not working correctly. Either way, however, I believe ZFS should have caught this. It's similar to the iSCSI problem I posted a few months back where the ZFS pool hangs for 3 minutes when a device is disconnected. There's absolutely no need for the entire pool to hang when the other half of the mirror is working fine. ZFS is often compared to hardware raid controllers, but so far its ability to handle problems is falling short. 
Ross Date: Wed, 30 Jul 2008 09:48:34 -0500 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] CC: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed On Wed, 30 Jul 2008, Ross wrote: Imagine you had a raid-z array and pulled a drive as I'm doing here. Because ZFS isn't aware of the removal it keeps writing to that drive as if it's valid. That means ZFS still believes the array is online when in fact it should be degraded. If any other drive now fails, ZFS will consider the status degraded instead of faulted, and will continue writing data. The problem is, ZFS is writing some of that data to a drive which doesn't exist, meaning all that data will be lost on reboot. While I do believe that device drivers, or the fault system, should notify ZFS when a device fails (and ZFS should appropriately react), I don't think that ZFS should be responsible for fault monitoring. ZFS is in a rather poor position for device fault monitoring, and if it attempts to do so then it will be slow and may misbehave in other ways. The software which communicates with the device (i.e. the device driver) is in the best position to monitor the device. The primary goal of ZFS is to be able to correctly read data which was successfully committed to disk. There are programming interfaces (e.g. fsync(), msync()) which may be used to ensure that data is committed to disk, and which should return an error if there is a problem. If you were performing your tests over an NFS mount then the results should be considerably different since NFS requests that its data be committed to disk. Bob -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
this information, but wherever that is, zpool status should be reporting the error and directing the admin to the log file. I would probably say this could be safely stored on the system drive. Would it be possible to have a number of possible places to store this log? What I'm thinking is that if the system drive is unavailable, ZFS could try each pool in turn and attempt to store the log there. In fact e-mail alerts or external error logging would be a great addition to ZFS. Surely it makes sense that filesystem errors would be better off being stored and handled externally? Ross Date: Mon, 28 Jul 2008 12:28:34 -0700 From: [EMAIL PROTECTED] Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed To: [EMAIL PROTECTED] I'm trying to reproduce and will let you know what I find. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
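The external alerting Ross asks for can be approximated today with a cron job. The sketch below is an assumption-laden example (the admin address is a placeholder, and mailx may need local mail configuration); it leans on `zpool status -x`, which prints "all pools are healthy" when there is nothing to report:

```shell
# Hedged sketch of an external zpool alerting job, suitable for cron.
# ADMIN is a placeholder address; adjust mailx/logger to local taste.
ADMIN=admin@example.com

status=$(zpool status -x)
if [ "$status" != "all pools are healthy" ]; then
    # Pool trouble: record it in syslog and mail the full status out,
    # so the alert survives even if the pool holding logs is the problem.
    logger -p daemon.err "zpool alert: $status"
    zpool status | mailx -s "zpool alert on $(hostname)" "$ADMIN"
fi
```

Running this every few minutes from cron gives the "stored and handled externally" behaviour without waiting for new ZFS features.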
Re: [zfs-discuss] Announcement: The Unofficial Unsupported Python ZFS API
On 14 Jul 2008, at 16:07, Will Murnane wrote: As long as I'm composing an email, I might as well mention that I had forgotten to mention Swig as a dependency (d'oh!). I now have a mention of it on the page, and a spec file that can be built using pkgtool. If you tried this before and gave up because of a missing package, please give it another shot. Not related to the actual API itself, but just thought I'd note that all the cool kids are using ctypes these days to bind Python to foreign libraries. http://docs.python.org/lib/module-ctypes.html This has the advantage of requiring no other libraries and no compile phase at all. Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Largest (in number of files) ZFS instance tested
On Jul 11, 2008, at 4:59 PM, Bob Friesenhahn wrote: Has anyone tested a ZFS file system with at least 100 million + files? What were the performance characteristics? I think that there are more issues with file fragmentation over a long period of time than the sheer number of files. actually it's a similar problem .. with a maximum blocksize of 128KB and the COW nature of the filesytem you get indirect block pointers pretty quickly on a large ZFS filesystem as the size of your tree grows .. in this case a large constantly modified file (eg: /u01/data/ *.dbf) is going to behave over time like a lot of random access to files spread across the filesystem .. the only real difference is that you won't walk it every time someone does a getdirent() or an lstat64() so ultimately the question could be framed as what's the maximum manageable tree size you can get to with ZFS while keeping in mind that there's no real re-layout tool (by design) .. the number i'm working with until i hear otherwise is probably about 20M, but in the relativistic sense - it *really* does depend on how balanced your tree is and what your churn rate is .. we know on QFS we can go up to 100M, but i trust the tree layout a little better there, can separate the metadata out if i need to and have planned on it, and know that we've got some tools to relayout the metadata or dump/restore for a tape backed archive jonathan (oh and btw - i believe this question is a query for field data .. architect != crash test dummy .. but some days it does feel like it) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
Tim Spriggs wrote: Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our dataset: HiRISE has a large set of spacecraft data (images) that could potentially have large amounts of redundancy, or not. Also, other up-and-coming missions have a large data volume that have a lot of duplicate image info and a small budget; with d11p in OpenSolaris there is a good business case to invest in Sun/OpenSolaris rather than buy the cheaper storage (+ linux?) that can simply hold everything as is. If someone feels like coding a tool up that basically makes a file of checksums and counts how many times a particular checksum gets hit over a dataset, I would be willing to run it and provide feedback. :) -Tim Me too. Our data profile is just like Tim's: Terabytes of satellite data. I'm going to guess that the d11p ratio won't be fantastic for us. I sure would like to measure it though. Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
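A rough version of the tool Tim describes can be put together from standard userland commands, with no ZFS internals involved. The sketch below is my own construction (the function name and the 128K chunk size are choices, not anything from ZFS itself, though 128K matches the default recordsize): it splits every file into fixed-size chunks, checksums each chunk, and counts how many checksums repeat.

```shell
# Estimate block-level duplication in a directory tree by chunking
# every file into 128K pieces, hashing each piece with md5sum, and
# counting repeated hashes. Prints total chunks, duplicate chunks,
# and the implied dedup ratio (total / unique).
dedup_estimate() {
    dir=$1
    tmp=$(mktemp -d)
    find "$dir" -type f | while read -r f; do
        # Split this file into 128K chunk files and hash each chunk.
        split -b 131072 "$f" "$tmp/chunk."
        md5sum "$tmp"/chunk.* 2>/dev/null | awk '{print $1}'
        rm -f "$tmp"/chunk.*
    done | sort | uniq -c | awk '
        { blocks += $1; if ($1 > 1) saved += $1 - 1 }
        END { if (blocks > 0)
                  printf "blocks=%d duplicates=%d ratio=%.2f\n",
                         blocks, saved, blocks / (blocks - saved) }'
    rm -rf "$tmp"
}

# Example (hypothetical path): dedup_estimate /export/images
```

Running it over a representative subset of the image archive would give a ballpark ratio before committing to any dedup product; it is slow and ignores compression, but it answers the "is there anything to gain?" question.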
Re: [zfs-discuss] ZFS deduplication
Justin Stringfellow wrote: Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our Check out the following blog: http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool Unfortunately we are on Solaris 10 :( Can I get a zdb for zfs V4 that will dump those checksums? Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
Moore, Joe wrote: On ZFS, sequential files are rarely sequential anyway. The SPA tries to keep blocks nearby, but when dealing with snapshotted sequential files being rewritten, there is no way to keep everything in order. In some cases, a d11p system could actually speed up data reads and writes. If you are repeatedly accessing duplicate data, then you are more likely to hit your ARC, and not have to go to disk. With your data d11p'd, the ARC can hold a significantly higher percentage of your data set, just like the disks. For a d11p ARC, I would expire based upon block reference count. If a block has few references, it should expire first, and vice versa: blocks with many references should be the last out. With all the savings on disks, think how much RAM you could buy ;) Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
Neil Perrin wrote: Mertol, Yes, dedup is certainly on our list and has been actively discussed recently, so there's hope and some forward progress. It would be interesting to see where it fits into our customers' priorities for ZFS. We have a long laundry list of projects. In addition there are bug fixes and performance changes that customers are demanding. Neil. I want to cast my vote for getting dedup on ZFS. One place we currently use ZFS is as nearline storage for backup data. I have a 16TB server that provides a file store for an EMC Networker server. I'm seeing a compressratio of 1.73, which is mighty impressive, since we also use native EMC compression during the backups. But with dedup, we should see way more. Here at UCB SSL, we have demoed and investigated various dedup products, hardware and software, but they are all steep on the ROI curve. I would be very excited to see block-level ZFS deduplication roll out. Especially since we already have the infrastructure in place using Solaris/ZFS. Cheers, Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cannot delete errored file
Ben Middleton wrote: Hi, Quick update: I left memtest running over night - 39 passes, no errors. I also attempted to force the BIOS to run the memory at 800MHz 5-5-5-15 as suggested - but the machine became very unstable - long boot times; PCI-Express failure of Yukon network card on booting etc. I've switched it back to Auto speed/timing for now. I'll just hope that it was a one-off glitch that corrupted the pool. I'm going to rebuild the pool this weekend. Thanks for all the suggestions. Ben, Haven't read this whole thread, and this has been brought up before, but make sure your power supply is running clean. I can't tell you how many times I've seen very strange and intermittent system errors occur from a flaky power supply. Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SATA controller suggestion
On 9 Jun 2008, at 14:59, Thomas Maier-Komor wrote: time gdd if=/dev/zero bs=1048576 count=10240 of=/data/video/x real 0m13.503s user 0m0.016s sys 0m8.981s Are you sure gdd doesn't create a sparse file? One would presumably expect it to be instantaneous if it was creating a sparse file. It's not a compressed filesystem, though, is it? /dev/zero tends to be fairly compressible ;-) I think, as someone else pointed out, running zpool iostat at the same time might be the best way to see what's really happening. Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
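The sparse-file question is easy to settle from the shell: compare a file's logical size with what the filesystem actually allocated. A plain `dd if=/dev/zero` with `count=` writes real zero blocks, so the result is not sparse (though a compressing filesystem can still store it in almost no space); only a seek-past-the-end write produces holes. A small demonstration with throwaway files under /tmp:

```shell
# Dense file: 10MB of real zero blocks, all physically written.
dd if=/dev/zero of=/tmp/dense.dat bs=1048576 count=10 2>/dev/null

# Sparse file: same 10MB logical size, but nothing actually written --
# seek past the end and write zero blocks of data.
dd if=/dev/zero of=/tmp/sparse.dat bs=1048576 seek=10 count=0 2>/dev/null

# Compare logical size (wc -c) with allocated size (du); a large gap
# between the two means the file is sparse.
for f in /tmp/dense.dat /tmp/sparse.dat; do
    logical=$(wc -c < "$f")
    allocated_kb=$(du -k "$f" | awk '{print $1}')
    echo "$f logical=${logical} allocated=${allocated_kb}KB"
done
```

If gdd were silently creating a sparse file, `du` on the output would show far less than the 10GB the timing suggests was written.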
Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore
On 30 May 2008, at 15:49, J.P. King wrote: For _my_ purposes I'd be happy with zfs send/receive, if only it was guaranteed to be compatible between versions. I agree that the inability to extract single files is an irritation - I am not sure why this is anything more than an implementation detail, but I haven't gone into it in depth. I would presume it is because zfs send/receive works at the block level, below the ZFS POSIX layer - i.e., below the filesystem level. I would guess that a stream is simply a list of the blocks that were modified between the two snapshots, suitable for re-playing on another pool. This means that the stream may not contain your entire file. An interesting point regarding this is that send/receive will be optimal in the case of small modifications to very large files, such as database files or large log files. The actual modified/appended blocks would be sent rather than the whole changed file. This may be an important point depending on your file modification patterns. Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore
On 29 May 2008, at 15:51, Thomas Maier-Komor wrote: I very strongly disagree. The closest ZFS equivalent to ufsdump is 'zfs send'. 'zfs send', like ufsdump, has intimate awareness of the actual on-disk layout and is an integrated part of the filesystem implementation. star is a userland archiver. The man page for zfs states the following for send: The format of the stream is evolving. No backwards compatibility is guaranteed. You may not be able to receive your streams on future versions of ZFS. I think this should be taken into account when considering 'zfs send' for backup purposes... Presumably, if one is backing up to another disk, one could zfs receive to a pool on that disk. That way you get simple file-based access, full history (although it could be collapsed by deleting older snapshots as necessary), and no worries about stream format changes. Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore
On 29 May 2008, at 17:52, Chris Siebenmann wrote: The first issue alone makes 'zfs send' completely unsuitable for the purposes that we currently use ufsdump. I don't believe that we've lost a complete filesystem in years, but we restore accidentally deleted files all the time. (And snapshots are not the answer, as it is common that a user doesn't notice the problem until well after the fact.) ('zfs send' to live disks is not the answer, because we cannot afford the space, heat, power, disks, enclosures, and servers to spin as many disks as we have tape space, especially if we want the fault isolation that separate tapes give us. Most especially if we have to build a second, physically separate machine room in another building to put the backups in.) However, the original poster did say they were wanting to back up to another disk and said they wanted something lightweight/cheap/easy. zfs send/receive would seem to fit the bill in that case. Let's answer the question rather than getting into an argument about whether zfs send/receive is suitable for an enterprise archival solution. Using snapshots is a useful practice as it costs fairly little in terms of disk space and provides immediate access to fairly recent, accidentally deleted files. If one is using snapshots, sending the streams to the backup pool is a simple procedure. One can then keep as many snapshots on the backup pool as necessary to provide the amount of history required. All of the files are kept in identical form on the backup pool for easy browsing when something needs to be restored. In the event of a catastrophic failure of the primary pool, one can quickly move the backup disk to the primary system and import it as the new primary pool. It's a bit-perfect incremental backup strategy that requires no additional tools. Jonathan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
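The strategy Jonathan describes fits in a handful of commands. This is a sketch with hypothetical pool and dataset names (tank for the primary, backup for the backup disk); exact behaviour depends on your ZFS version:

```shell
# Day 1: snapshot the dataset and seed the backup pool with a full stream.
zfs snapshot tank/home@mon
zfs send tank/home@mon | zfs receive backup/home

# Day 2 onward: send only the blocks changed since the previous snapshot.
zfs snapshot tank/home@tue
zfs send -i @mon tank/home@tue | zfs receive backup/home

# Restores are just file copies out of the backup pool's snapshots.
# Disaster recovery: move the backup disk to the primary machine and
# import its pool directly.
zpool import backup
```

Old snapshots on the backup pool can be destroyed as they age out, which sets the retention window without any extra tooling.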
Re: [zfs-discuss] Inconsistencies with scrub and zdb
Jonathan Loran wrote: Since no one has responded to my thread, I have a question: Is zdb suitable to run on a live pool? Or should it only be run on an exported or destroyed pool? In fact, I see that it has been asked before on this forum, but is there a user's guide to zdb? Answering myself: I finally looked at the zdb source code, and I see the results running on a live pool are not consistent, hence the -L option. OK, so I'm going to trust the scrub to tell me if there are errors, and as far as I can tell, my pools are clean now. But it was scary creating the mirror from a pool with checksum errors. I think there could be some more verbosity about what is going on, or to give the user some options when checksum errors are found in the process of silvering up a mirror for the first time. Just a comment. Thanks, Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Video streaming and prefetch
Hi all, I'm new to this list and ZFS, so forgive me if I'm re-hashing an old topic. I'm also using ZFS on FreeBSD, not Solaris, so forgive me for being a heretic ;-) I recently set up a home NAS box and decided that ZFS is the only sensible way to manage 4TB of disks. The primary use of the box is to serve my telly (actually a Mac mini). This is using afp (via netatalk) to serve space to the telly for storing and retrieving video. The video tends to be 2-4GB files that are read/written sequentially at a rate in the region of 800KB/s. Unfortunately, the performance has been very choppy. The video software assumes it's talking to fast local storage and thus makes little attempt to buffer. I spent a long time trying to figure out the network problem before determining that the problem is actually in reading from the FS. This is a pretty cheap box, but it can still sustain 110MB/s off the array with low-millisecond access times. So there really is no excuse for not being able to serve up 800KB/s in an even fashion. After some experimentation I have determined that the problem is prefetching. Given this thing is mostly serving sequentially at a low, even rate it ought to be perfect territory for prefetching. I spent the weekend reading the ZFS code (bank holiday fun eh?) and running some experiments and think the problem is in the interaction between the prefetching code and the running processes. (Warning: some of the following is speculation on observed behaviour and may be rubbish.) The behaviour I see is the file streaming stalling whenever the prefetch code decides to read some more blocks. The dmu_zfetch code is all run as part of the read() operation. When this finds itself getting close to running out of prefetched blocks it queues up requests for more blocks - 256 of them. At 128KB per block, that's 32MB of data it requests. At this point it should be asynchronous and the caller should get back control and be able to process the data it just read. 
However, my NAS box is a uniprocessor and the issue thread is higher priority than user processes. So, in fact, it immediately begins issuing the physical reads to the disks. Given that modern disks tend to prefetch into their own caches anyway, some of these reads are likely to be served up instantly. This causes interrupts back into the kernel to deal with the data. This queues up the interrupt threads, which are also higher priority than user processes. These consume a not-insubstantial amount of CPU time to gather, checksum and load the blocks into the ARC. During which time, the disks have located the other blocks and started serving them up. So what I seem to get is a small perfect storm of interrupt processing. This delays the user process for a few hundred milliseconds. Even though the originally requested block was *in* the cache! To add insult to injury, the user process in this case, when it finally regains the CPU and returns the data to the caller, then sleeps for a couple of hundred milliseconds. So prefetching, instead of evening-out reading and reducing jitter, has produced the worst case performance of compressing all of the jitter into one massive lump every 40 seconds (32MB / 800K). I get reasonably even performance if I disable prefetching or if I reduce the zfetch_block_cap to 16-32 blocks instead of 256. Other than just taking this opportunity to rant, I'm wondering if anyone else has seen similar problems and found a way around them? Also, to any ZFS developers: why does the prefetching logic follow the same path as a regular async read? Surely these ought to be way down the priority list? My immediate thought after a weekend of reading the code was to re-write it to use a low priority prefetch thread and have all of the dmu_zfetch() logic in that instead of in-line with the original dbuf_read(). Jonathan PS: Hi Darren! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Inconsistencies with scrub and zdb
Since no one has responded to my thread, I have a question: Is zdb suitable to run on a live pool? Or should it only be run on an exported or destroyed pool? In fact, I see that it has been asked before on this forum, but is there a user's guide to zdb? Thanks, Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Inconsistencies with scrub and zdb
Hi List, First of all: S10u4 120011-14 So I have a weird situation. Earlier this week, I finally mirrored up two iSCSI-based pools. I had been wanting to do this for some time, because the availability of the data in these pools is important. One pool mirrored just fine, but the other pool is another story. First lesson (I think) is you should scrub your pools, at least those backed by a SAN, before mirroring them. The problem pool was scrubbed about two weeks before I mirrored it, and it was clean. I assumed, wrongly, that there were no checksum errors in the time that elapsed. Well guess again. When I mirrored this guy, the source mirror had two checksum errors. Interestingly, the target inherited these errors, and so now both sides of the mirror showed two checksum errors in the counters. I don't know if this was real, or if the zpool attach operation just incremented the counters on the second half of the mirror. My next mistake was to assume the counters were in error on the second mirror, and so I zeroed out the counters with zpool clear. OK, so now I scrub the pool, and no checksum errors were found on either side of the mirror. Huh?!? What about those two checksum errors on the first mirror? OK, so I run zdb on the pool, and it finds scads of errors: Traversing all blocks to verify checksums and verify nothing leaked ... zdb_blkptr_cb: Got error 50 reading 33, 727252, 0, 4a -- skipping-- ... and then tons of: Error counts: errno count 50 123 leaked space: vdev 0, offset 0x4deaed800, size 2048 ... OK, this is odd, so I scrub the pool again, and this time it found 4 checksum errors on the initial mirror, but none on the other mirror. That makes some sense (though I don't know what changed), so I break the mirror, taking off the original side that has the checksum errs. I then scrub the pool, no errors found. That's good, but just to be sure, I run zdb on it, and it finds tons of the same errors as it found on the original side of the mirror. Argh! 
In the meantime, I ran 4 passes of format - analyze - compare on the initial half of the mirror that had the checksum errors, and it's totally clean hardware-wise. So my questions are these: 1) Does zdb leaked space mean trouble with the pool? 2) Is it possible that the errors got injected to the new half of the mirror when I attached it? For now, I'm going to assume that the new half of the mirror is OK, hardware-wise. 3) I'm running a scrub and zdb on the other pool that lives on these SAN boxes, cause I want to see if they come up with the same problems. If not, what would be going on with this crazy pool? 4) Can I recover from this without copying the whole pool to new storage? If not, it will be painful for us. We will have to reboot 350 servers and workstations on stale file handles, interrupting 100's of production processes. My user base is losing faith in my team. Oh sage ones, please advise. Thanks in advance. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] share zfs hierarchy over nfs
Bob Friesenhahn wrote: So for Linux, I think that you will also need to figure out an indirect-map incantation which works for its own broken automounter. Make sure that you read all available documentation for the Linux automounter so you know which parts don't actually work. Au contraire, Bob. I'm not going to boost Linux, but in this department, they've tried to do it right. If you use Linux autofs V4 or higher, you can use Sun-style maps (except there are no direct maps in V4; you need V5 for direct maps). For our home directories, which use an indirect map, we just use the Solaris map, thus: auto_home: *zfs-server:/home/ Sorry to be so off (ZFS) topic. Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS - Implementation Successes and Failures
Dominic Kay wrote: Hi Firstly apologies for the spam if you got this email via multiple aliases. I'm trying to document a number of common scenarios where ZFS is used as part of the solution such as email server, home server, RDBMS and so forth but taken from real implementations where things worked and equally importantly threw up things that needed to be avoided (even if that was the whole of ZFS!). I'm not looking to replace the Best Practices or Evil Tuning guides but to take a slightly different slant. If you have been involved in a ZFS implementation small or large and would like to discuss it either in confidence or as a referenceable case study that can be written up, I'd be grateful if you'd make contact. -- Dominic Kay http://blogs.sun.com/dom For all the storage under my management, we are deploying ZFS going forward. There have been issues, to be sure, though none of them were showstoppers. I agree with other posters that the way the z* commands lock up on a failed device is really not good, and it would be nice to be able to remove devices from a zpool. There have been other performance issues that are more the fault of our SAN nodes than ZFS. But the ease of management, the unlimited nature (volume size to number of file systems) of everything ZFS, built-in snapshots, and the confidence we get in our data make ZFS a winner. The way we've deployed ZFS has been to map iSCSI devices from our SAN. I know this isn't an ideal way to deploy ZFS, but SANs do offer flexibility that direct-attached drives do not. Performance is now sufficient for our needs, but it wasn't at first. We do everything here on the cheap, we have to. After all, this is University research ;) Anyway, we buy commodity x86 servers, and use software iSCSI. Most of our iSCSI nodes run Open-E iSCSI-R3. The latest version is actually quite quick, which wasn't always the case. I am experimenting using ZFS on the iSCSI target, but haven't finished validating that yet. 
I've also rebuilt an older 24-disk SATA chassis with the following parts:

  Motherboard: Supermicro PDSME+
  Processor: Intel Xeon X3210 Kentsfield 2.13GHz, 2 x 4MB L2 Cache, LGA 775, Quad-Core
  Disk controllers x3: Supermicro AOC-SAT2-MV8 8-Port SATA
  Hard disks x24: WD 1TB RE2, GP
  RAM: Crucial, 4x2GB unbuffered ECC PC2-5300 (8GB total)
  New power supplies...

The PDSME+ MB was on the Solaris HCL, and it has four PCI-X slots, so using three of the Supermicro MV8s is no problem. This is obviously a standalone system, but it will be for nearline backup data, and doesn't have the same expansion requirements as our other servers. The thing about this guy is how smokin' fast it is. I've set it up on snv b86, with 4 x 6-drive raidz2 stripes, and I'm seeing up to 450MB/sec write and 900MB/sec read speeds. We can't get data into it anywhere near that quick, but the potential is awesome. And it was really cheap for this amount of storage. Our total storage on ZFS now is at 103TB: some user home directories, some software distribution, and a whole lot of scientific data. I compress almost everything, since our bandwidth tends to be SAN-pinched, not pinched at the head nodes, so we can afford it. I sleep at night, and the users don't see problems. I'm a happy camper. Cheers, Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
Bob Friesenhahn wrote: The problem here is that by putting the data away from your machine, you lose the chance to scrub it on a regular basis, i.e. there is always the risk of silent corruption. Running a scrub is pointless since the media is not writeable. :-) But that's the point. You can't correct silent errors on write-once media because you can't write the repair. I think it makes more sense to save a checksum of the entire CD/DVD/etc. media separately, so you can check the validity of your data that way, instead of using ZFS on WORM media. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
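Jon's separate-checksum idea can be sketched in a few lines. This is a hypothetical illustration, not code from the thread: the function names are invented, and in practice you'd hash the ISO image (or the raw device) before burning and keep the digest on writable storage elsewhere.

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream the file through SHA-256 so large disc images use constant memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def verify(path, recorded_digest):
    """True if the media still matches the digest saved at burn time."""
    return sha256_of(path) == recorded_digest
```

Unlike ZFS on the WORM media itself, a mismatch here can't be repaired in place, but it does tell you which mirrored disc to trust.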
Re: [zfs-discuss] ZFS for write-only media?
Bob Friesenhahn wrote: On Tue, 22 Apr 2008, Jonathan Loran wrote: But that's the point. You can't correct silent errors on write once media because you can't write the repair. Yes, you can correct the error (at time of read) due to having both redundant media, and redundant blocks. That is a normal function of ZFS. It's just not possible to correct the failed block on the media by re-writing it or moving its data to a new location. I suppose with ditto blocks, this has some merit. Someone needs to characterize how errors propagate on different types of WORM media. Perhaps this has already been done. In my experience, when DVD-Rs go south, they really go bad all at once. Not a lot of small bit errors. But a full analysis would be good. Probably it would make the most sense to write mirrored WORM disks with different technology to hedge your bets. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
Luke Scharf wrote: Maurice Volaski wrote: Perhaps providing the computations rather than the conclusions would be more persuasive on a technical list ; 2 16-disk SATA arrays in RAID 5 2 16-disk SATA arrays in RAID 6 1 9-disk SATA array in RAID 5. 4 drive failures over 5 years. Of course, YMMV, especially if you drive drunk :-) My mileage does vary! On a 4 year old 84 disk array (with 12 RAID 5s), I replace one drive every couple of weeks (on average). This array lives in a proper machine-room with good power and cooling. The array stays active, though. -Luke I basically agree with this. We have about 150TB in mostly RAID 5 configurations, ranging from 8 to 16 disks per volume. We also replace bad drives about every week or three, but in six years, have never lost an array. I think our secret is this: on our 3ware controllers we run a verify at a minimum of three times a week. The verify will read the whole array (data and parity), find bad blocks and move them if necessary to good media. Because of this, we've never had a rebuild trigger a secondary failure. knock wood. Our server room has conditioned power and cooling as well. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
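The 3ware verify that Jon credits for catching bad blocks before a rebuild has a direct ZFS analog in zpool scrub, which also reads all data and redundancy and repairs anything that fails its checksum from a good copy. A hypothetical root crontab fragment (pool name assumed) matching the three-times-a-week cadence:

```
# Scrub the pool Mon/Wed/Fri at 02:00; check results with `zpool status tank`
0 2 * * 1,3,5 /usr/sbin/zpool scrub tank
```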
Re: [zfs-discuss] OpenSolaris ZFS NAS Setup
Chris Siebenmann wrote: | What you're saying is independent of the iqn id? Yes. SCSI objects (including iSCSI ones) respond to specific SCSI INQUIRY commands with various 'VPD' pages that contain information about the drive/object, including serial number info. Some Googling turns up: http://wikis.sun.com/display/StorageDev/Solaris+OS+Disk+Driver+Device+Identifier+Generation http://www.bustrace.com/bustrace6/sas.htm Since you're using Linux IET as the target, you want to set the 'ScsiId' and 'ScsiSN' Lun parameters to unique (and different) values. (You can use sdparm, http://sg.torque.net/sg/sdparm.html, on Solaris to see exactly what you're currently reporting in the VPD data for each disk.) - cks CC-ing the list, 'cause this is of general interest. Chris, indeed the older version of Open-E iSCSI I was using for my tests has no unique VPD identifiers whatsoever, so this could confuse the initiator:

prudhoe # sdparm -6 -i /devices/iscsi/[EMAIL PROTECTED],0:wd,raw
/devices/iscsi/[EMAIL PROTECTED],0:wd,raw: IET VIRTUAL-DISK 0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: IET
      vendor specific:

Whereas the new version of Open-E iSCSI (called iSCSI-R3) does. 
These are two LUNs from the system I will be doing a ZFS mirror on, running the new Open-E iSCSI-R3 on the target:

apollo # sdparm -i /devices/scsi_vhci/[EMAIL PROTECTED]:wd,raw
/devices/scsi_vhci/[EMAIL PROTECTED]:wd,raw: iSCSI DISK 0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: iSCSI
      vendor specific: XBD3Qzf9pzqYrsdz
apollo # sdparm -i /devices/scsi_vhci/[EMAIL PROTECTED]:wd,raw
/devices/scsi_vhci/[EMAIL PROTECTED]:wd,raw: iSCSI DISK 0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: iSCSI
      vendor specific: ZknC2lbWA5y3M7v6

Open-E iSCSI-R3 generates a unique vendor-specific serial number, so the ZFS mirror will most likely fail and recover more cleanly. Thanks for the pointers. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS volume export to USB-2 or Firewire?
On Apr 9, 2008, at 11:46 AM, Bob Friesenhahn wrote: On Wed, 9 Apr 2008, Ross wrote: Well the first problem is that USB cables are directional, and you don't have the port you need on any standard motherboard. Thanks for that info. I did not know that. Adding iSCSI support to ZFS is relatively easy since Solaris already supported TCP/IP and iSCSI. Adding USB support is much more difficult and isn't likely to happen since afaik the hardware to do it just doesn't exist. I don't believe that Firewire is directional, but presumably the Firewire support in Solaris only expects to support certain types of devices. My workstation has Firewire but most systems won't have it. It seemed really cool to be able to put your laptop next to your Solaris workstation and just plug it in via USB or Firewire so it can be used as a removable storage device. Or Solaris could be used on appropriate hardware to create a more reliable portable storage device. Apparently this is not to be and it will be necessary to deal with iSCSI instead. I have never used iSCSI so I don't know how difficult it is to use as temporary removable storage under Windows or OS-X. i'm not so sure what you're really after, but i'm guessing one of two things: 1) a global filesystem? if so - ZFS will never be globally accessible from 2 hosts at the same time without an interposer layer such as NFS or Lustre .. zvols could be exported to multiple hosts via iSCSI or FC-target but that's only 1/2 the story .. 2) an easy way to export volumes? agree - there should be some sort of semantics that would signal a filesystem is removable and trap on USB events when the media is unplugged .. of course you'll have problems with uncommitted transactions that would have to roll back on the next plug, or somehow be query-able iSCSI will get you block/character device level sharing from a zvol (pseudo device) or the equivalent of a blob filestore .. 
you'd have to format it with a filesystem, but that filesystem could be a global one (eg: QFS) and you could multi-host natively that way. --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris ZFS NAS Setup
Just to report back to the list... Sorry for the lengthy post. So I've tested the iSCSI-based ZFS mirror on Sol 10u4, and it does more or less work as expected. If I unplug one side of the mirror - unplug or power down one of the iSCSI targets - I/O to the zpool stops for a while, perhaps a minute, and then things free up again. zpool commands seem to get unworkably slow, and error messages fly by on the console like fire ants running from a flood. Worst of all, after plugging the faulted mirror back in (before removing the mirror from the pool), it's very hard to bring the faulted device back online:

prudhoe # zpool status
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Apr 8 16:34:08 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c2t1d0  FAULTED      0 2.88K     0  corrupted data
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

Comment: why are there now two instances of c2t1d0??

prudhoe # zpool replace test c2t2d0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c2t1d0s0 is part of active ZFS pool test. Please see zpool(1M).
prudhoe # zpool replace -f test c2t2d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c2t1d0s0 is part of active ZFS pool test. Please see zpool(1M).
prudhoe # zpool remove test c2t2d0
cannot remove c2t2d0: no such device in pool
prudhoe # zpool offline test c2t2d0
cannot offline c2t2d0: no such device in pool
prudhoe # zpool online test c2t2d0
cannot online c2t2d0: no such device in pool

OK, get more drastic:

prudhoe # zpool clear test
prudhoe # zpool status
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or invalid. 
Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Apr 8 16:34:08 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c2t1d0  FAULTED      0     0     0  corrupted data
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

Frustration setting in. The error counts are zero, but still two instances of c2t1d0 listed...

prudhoe # zpool export test
prudhoe # zpool import test
prudhoe # zpool list
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
test  12.9G  9.54G  3.34G   74%  ONLINE  -
prudhoe # zpool status
  pool: test
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 1.11% done, 0h20m to go
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

Finally resilvering with the right devices. The thing I really don't like here is that the pool had to be exported and then imported to make this work. For an NFS server, this is not really acceptable. Now I know this is ol' Solaris 10u4, but still, I'm surprised I needed to export/import the pool to get it working correctly again. Anyone know what I did wrong? Is there a canonical way to online the previously faulted device? Anyway, it looks like for now I can get some sort of HA out of this iSCSI mirror. The other pluses are that the pool can self-heal, and reads will be spread across both units. Cheers, Jon --- P.S. Playing with this more before sending this message: if you detach the faulted mirror before putting it back online, it all works well. 
Hope that nothing bounces on your network when you have a failure: unplug one iSCSI mirror, then:

prudhoe # zpool status -v
  pool: test
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: scrub completed with 0 errors on Wed Apr 9 14:18:45 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror
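For what it's worth, the detach-first route from Jon's P.S. can be written out as a short command sketch. The device and pool names are taken from his transcript, but this is untested guesswork for that Solaris 10u4 setup, not a confirmed canonical procedure:

```
# Drop the faulted half, then re-attach it as a fresh mirror side so it
# resilvers cleanly instead of fighting its stale label:
zpool detach test c2t2d0
zpool attach test c2t1d0 c2t2d0    # resilver starts automatically
zpool status test
```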
Re: [zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
kristof wrote: If you have a mirrored iSCSI zpool, it will NOT panic when 1 of the submirrors is unavailable. zpool status will hang for some time, but after I think 300 seconds it will put the device on unavailable. The panic was the default in the past, and it only occurs if all devices are unavailable. Since I think b77 there is a new zpool property, failmode, which you can set to prevent a panic:

failmode=wait | continue | panic
    Controls the system behavior in the event of catastrophic pool failure. This condition is typically a result of a loss of connectivity to the underlying storage device(s) or a failure of all devices within the pool. The behavior of such an event is determined as follows:

    wait      Blocks all I/O access until the device connectivity is recovered and the errors are cleared. This is the default behavior.

    continue  Returns EIO to any new write I/O requests but allows reads to any of the remaining healthy devices. Any write requests that have yet to be committed to disk would be blocked.

    panic     Prints out a message to the console and generates a system crash dump.

This is encouraging, but one problem: our system is on Solaris 10 U4. Will this guy be immune to panics when one side of the mirror goes down? Seriously, I'm tempted to upgrade this box to OS b8? However, there are a lot of dependencies which we need to worry about in doing that - for example, will all our off-the-shelf software run with OpenSolaris? More things to test. Thanks, Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
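On a build new enough to carry the property kristof quotes, setting and checking it is a one-liner (pool name assumed):

```
zpool set failmode=continue test
zpool get failmode test
```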
Re: [zfs-discuss] OpenSolaris ZFS NAS Setup
This guy seems to have had lots of fun with iSCSI :) http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html This is scaring the heck out of me. I have a project to create a zpool mirror out of two iSCSI targets, and if the failure of one of them will panic my system, that will be totally unacceptable. What's the point of having an HA mirror if one side can't fail without busting the host? Is it really true that, as the guy at the above link states (please read the link, sorry), when one iSCSI mirror goes offline, the initiator system will panic? Or even worse, not boot itself cleanly after such a panic? How could this be? Anyone else with experience with iSCSI-based ZFS mirrors? Thanks, Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Backup-ing up ZFS configurations
Bob Friesenhahn wrote: On Tue, 25 Mar 2008, Robert Milkowski wrote: As I wrote before - it's not only about RAID config - what if you have hundreds of file systems, with some share{nfs|iscsi|cifs} enabled with specific parameters, then specific file system options, etc. Some zfs-related configuration is done using non-ZFS commands. For example, a filesystem devoted to a user is typically chowned to that user and that user's group. I assume that owner, group, and any ACLs associated with a filesystem would be preserved so that they are part of the pool re-creation commands? When creating ZFS filesystems, the step of creating the pool is separate from the steps of creating the filesystems. Obviously these steps need to either be separate, or separable, so that a similar filesystem layout can be created with different hardware. Correct me if I'm not interpreting this discussion properly, but aren't we discussing reconstruction of the container (zpool/zfs file systems and settings), not the data therein? Modes, ACLs, extended attributes, and ownership of the data should all come over with a zfs receive, or the backup recovery of your choice. I believe I could write a trivial shell script to take the listings of: # zpool list pool and # zfs list -r -t filesystem,volume -o all pool to recreate the whole pool, and all the necessary properties. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
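The "trivial shell script" Jon has in mind can be sketched, here in Python purely for illustration. It assumes input shaped like `zfs get -H -o name,property,value -s local all -r pool` output (tab-separated, locally-set properties only, which is exactly what needs replaying on a rebuilt pool); the function name is invented:

```python
def recreate_commands(zfs_get_output):
    """Turn tab-separated `zfs get -H -o name,property,value -s local` style
    output into the zfs commands that would rebuild the layout."""
    datasets = {}
    for line in zfs_get_output.strip().splitlines():
        name, prop, value = line.split("\t")
        datasets.setdefault(name, []).append((prop, value))
    cmds = []
    for name in sorted(datasets):   # lexical sort puts parents before children
        if "/" in name:             # the pool itself comes from `zpool create`
            cmds.append(f"zfs create {name}")
        for prop, value in datasets[name]:
            cmds.append(f"zfs set {prop}={value} {name}")
    return cmds
```

It deliberately skips the `zpool create` step, which has to be reissued by hand against the new hardware, matching Bob's point that pool creation and filesystem creation are separable.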
Re: [zfs-discuss] ZFS I/O algorithms
On Mar 20, 2008, at 11:07 AM, Bob Friesenhahn wrote: On Thu, 20 Mar 2008, Mario Goebbels wrote: Similarly, read block size does not make a significant difference to the sequential read speed. Last time I did a simple bench using dd, supplying the record size as blocksize to it instead of no blocksize parameter bumped the mirror pool speed from 90MB/s to 130MB/s. Indeed. However, as an interesting twist to things, in my own benchmark runs I see two behaviors. When the file size is smaller than the amount of RAM the ARC can reasonably grow to, the write block size does make a clear difference. When the file size is larger than RAM, the write block size no longer makes much difference and sometimes larger block sizes actually go slower. in that case .. try fixing the ARC size .. the dynamic resizing on the ARC can be less than optimal IMHO --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Mar 20, 2008, at 2:00 PM, Bob Friesenhahn wrote: On Thu, 20 Mar 2008, Jonathan Edwards wrote: in that case .. try fixing the ARC size .. the dynamic resizing on the ARC can be less than optimal IMHO Is a 16GB ARC size not considered to be enough? ;-) I was only describing the behavior that I observed. It seems to me that when large files are written very quickly, that when the file becomes bigger than the ARC, that what is contained in the ARC is mostly stale and does not help much any more. If the file is smaller than the ARC, then there is likely to be more useful caching. sure i got that - it's not the size of the arc in this case since caching is going to be a lost cause.. but explicitly setting a zfs_arc_max should result in fewer calls to arc_shrink() when you hit memory pressure between the application's page buffer competing with the arc in other words, as soon as the arc is 50% full of dirty pages (8GB) it'll start evicting pages .. you can't avoid that .. but what you can avoid is the additional weight of constantly growing and shrinking the cache as it tries to keep up with your constantly changing blocks in a large file --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
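.je's advice to pin the ARC can be expressed as an /etc/system entry. A hedged example capping the ARC at 8 GiB (the value is site-specific, given in bytes, and takes effect at the next boot):

```
* Cap the ZFS ARC so it stops growing and shrinking under memory pressure
set zfs:zfs_arc_max = 0x200000000
```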
Re: [zfs-discuss] zfs backups to tape
On Mar 14, 2008, at 3:28 PM, Bill Shannon wrote: What's the best way to backup a zfs filesystem to tape, where the size of the filesystem is larger than what can fit on a single tape? ufsdump handles this quite nicely. Is there a similar backup program for zfs? Or a general tape management program that can take data from a stream and split it across tapes reliably with appropriate headers to ease tape management and restore? for now you could send snapshots to files and a file hierarchy on a SAM-QFS archive .. then you've got all the feature functionality there to be able to proactively back up the snapshots and possibly segment them if they're big enough (non-shared-qfs - might make sense if you've got multiple drives you want to take advantage of) .. I believe the goal is to provide this sort of functionality through a DMAPI HSM with ADM at some point in the near future: http://opensolaris.org/os/project/adm/ --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs backups to tape
Carson Gaspar wrote: Bob Friesenhahn wrote: On Fri, 14 Mar 2008, Bill Shannon wrote: What's the best way to backup a zfs filesystem to tape, where the size of the filesystem is larger than what can fit on a single tape? ufsdump handles this quite nicely. Is there a similar backup program for zfs? Or a general tape management program that can take data from Previously it was suggested on this list to use a special version of tar called 'star' (ftp://ftp.berlios.de/pub/star). Suggested by the rather biased (and extremely opinionated) author of 'star'. Who, by the way, never out-and-out admitted that star does _not_ support ZFS ACLs (which it doesn't). Sadly I don't know of any non-commercial backup solution for ZFS that supports ACLs. That is simply not true. Legato (EMC) Networker 7.4 does a perfect job of capturing the ZFS ACLs. Just to make sure, I just performed a test recover of a directory where we use a complicated set of NFSv4-style ACLs, and they were preserved exactly. Even rsync doesn't support them, due to Sun's choice to use their own unique ACL API. I commend Sun's choice of NFSv4 ACLs. This is the only way to ensure CIFS compatibility, and it is the way the industry will be moving. Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs backups to tape
Robert Milkowski wrote: Hello Jonathan, Friday, March 14, 2008, 9:48:47 PM, you wrote: Carson Gaspar wrote: Bob Friesenhahn wrote: On Fri, 14 Mar 2008, Bill Shannon wrote: What's the best way to backup a zfs filesystem to tape, where the size of the filesystem is larger than what can fit on a single tape? ufsdump handles this quite nicely. Is there a similar backup program for zfs? Or a general tape management program that can take data from Previously it was suggested on this list to use a special version of tar called 'star' (ftp://ftp.berlios.de/pub/star). Suggested by the rather biased (and extremely opinionated) author of 'star'. Who, by the way, never out-and-out admitted that star does _not_ support ZFS ACLs (which it doesn't). Sadly I don't know of any non-commercial backup solution for ZFS that supports A That is simply not true. Legato (EMC) Networker 7.4 does a perfect job of capturing the ZFS ACLs. Just to make sure, I just performed a test recover of a directory where we use a complicated set of NFSv4-style ACLs, and they were preserved exactly. If you look closely you'll see he wrote non-commercial backup solution. Unless I'm missing something, Legato and NetBackup (another poster's suggestion) are commercial. Right you are. I read his post wrong. Networker and NetBackup are very pricey commercial packages. Thanks Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mirroring to a smaller disk
Patrick Bachmann wrote: Jonathan, On Tue, Mar 04, 2008 at 12:37:33AM -0800, Jonathan Loran wrote: I'm not sure I follow how this would work. The keyword here is thin provisioning. The sparse zvol only uses as much space as the actual data needs. So, if you use a sparse zvol, you may mirror to a smaller disk, iff you use no more space than is physically available to the sparse zvol. I do have tons of space on the old array. It's only 15% utilized, hence my original comment. How does my data get into the /test/old zvol (zpool foo)? What would I end up with? There's no zvol on foo. After detaching /test/old, you may reconfigure your old array. At that point, foo is on a zvol on the pool bar. How to get the data over depends on how your reconfiguration of the old array impacts the pool and vdev size. If it gets smaller, you cannot attach it to the pool where your data currently resides and have to go the send|receive route... Putting the zpool on a zvol permanently might not be something you want, as this creates some overhead, which I can't quantify, and you mentioned some performance issues you're already experiencing. Well, there's the rub. I will be reconfiguring the old array identical to the new one. It will be smaller. It's always something, isn't it. I have to say though, this is very slick, and I can see this sparse zvol trick will be handy in the future. Thanks! Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
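The sparse-zvol trick under discussion looks roughly like this in commands, with invented names and sizes: `-s` reserves no space at creation, so the zvol can be declared as large as the disk it will mirror even though it lives on a partly used array.

```
# Sparse (thin-provisioned) zvol on the old array, sized to match the new disk:
zfs create -s -V 2T oldpool/mirrorvol
# Attach it as the second half of the mirror; only live data gets resilvered:
zpool attach bar c3t0d0 /dev/zvol/dsk/oldpool/mirrorvol
```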
Re: [zfs-discuss] periodic ZFS disk accesses
On Mar 1, 2008, at 3:41 AM, Bill Shannon wrote: Running just plain iosnoop shows accesses to lots of files, but none on my zfs disk. Using iosnoop -d c1t1d0 or iosnoop -m /export/home/shannon shows nothing at all. I tried /usr/demo/dtrace/iosnoop.d too, still nothing. hi Bill this came up sometime last year .. io:::start won't work since ZFS doesn't call bdev_strategy() directly .. you'll want to use something more like zfs_read:entry, zfs_write:entry and zfs_putpage or zfs_getpage for mmap'd ZFS files here's one i hacked from our discussion back then to track some timings on files:

cat zfs_iotime.d
#!/usr/sbin/dtrace -s
#pragma D option quiet

zfs_write:entry, zfs_read:entry, zfs_putpage:entry, zfs_getpage:entry
{
        self->ts = timestamp;
        self->filepath = args[0]->v_path;
}

zfs_write:return, zfs_read:return, zfs_putpage:return, zfs_getpage:return
/self->ts && self->filepath/
{
        printf("%s on %s took %d nsecs\n", probefunc,
            stringof(self->filepath), timestamp - self->ts);
        self->ts = 0;
        self->filepath = 0;
}

--- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Does a mirror increase read performance
Quick question: If I create a ZFS mirrored pool, will the read performance get a boost? In other words, will the data/parity be read round robin between the disks, or do both mirrored sets of data and parity get read off of both disks? The latter case would have a CPU expense, so I would think you would see a slow down. Thanks, Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does a mirror increase read performance
Roch Bourbonnais wrote: On Feb 28, 2008, at 20:14, Jonathan Loran wrote: Quick question: If I create a ZFS mirrored pool, will the read performance get a boost? In other words, will the data/parity be read round-robin between the disks, or do both mirrored sets of data and parity get read off of both disks? The latter case would have a CPU expense, so I would think you would see a slowdown. 2 disks mirrored together can read data faster than a single disk. So to service a read, only one side of the mirror is read. Raid-Z parity is only read in the presence of checksum errors. That's what I suspected, but I'm glad to get the final word on this. BTW, I guess I should have said checksums instead of parity. My bad. Thanks, Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does a mirror increase read performance
Roch Bourbonnais wrote: On Feb 28, 2008, at 21:00, Jonathan Loran wrote: Roch Bourbonnais wrote: On Feb 28, 2008, at 20:14, Jonathan Loran wrote: Quick question: If I create a ZFS mirrored pool, will the read performance get a boost? In other words, will the data/parity be read round-robin between the disks, or do both mirrored sets of data and parity get read off of both disks? The latter case would have a CPU expense, so I would think you would see a slowdown. 2 disks mirrored together can read data faster than a single disk. So to service a read, only one side of the mirror is read. Raid-Z parity is only read in the presence of checksum errors. That's what I suspected, but I'm glad to get the final word on this. BTW, I guess I should have said checksums instead of parity. My bad. OK. The checksum is a different story and is stored within the metadata block pointing to the data block. So given that to reach the data block we've already had to read the metadata block, checksum validation is never the source of an I/O. I really need to read those ZFS internals docs (in all my spare time ;) Thanks, Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Feb 27, 2008, at 8:36 AM, Uwe Dippel wrote: As much as ZFS is revolutionary, it is far away from being the 'ultimate file system', if it doesn't know how to handle event-driven snapshots (I don't like the word), backups, versioning. As long as a high-level system utility needs to be invoked by a scheduler for these features (CDP), and - this is relevant - *ZFS does not support these functionalities essentially different from FAT or UFS*, the days of ZFS are counted. Sooner or later, and I bet it is sooner, someone will design a file system (hardware, software, Cairo) to which the tasks of retiring files, as well as creating versions of modified files, can be passed down, together with the file handles. meh .. don't believe all the marketing hype you hear - it's good at what it's good at, and is a constant WIP for many of the other features that people would like to hear .. but the one ring to rule them all - not quite yet .. as for the CDP issue - i believe the event driving would really have to happen below ZFS at the vnode or znode layer .. keep in mind that with the ZPL we're still dealing with 30+ year old structures and methods (which is fine btw) in the VFS/Vnode layers .. a couple of areas i would look at (that i haven't seen mentioned in this discussion) might be: - fop_vnevent .. or the equivalent (if we have one yet) for a znode - filesystem - door interface for event handling - auditing if you look at what some of the other vendors (eg: apple/timemachine) are doing - it's essentially a tally of file change events that get dumped into a database and rolled up at some point .. if you plan on taking more immediate action on the file changes then i believe that you'll run into latency (race) issues for synchronous semantics anyhow - just a thought from another who is constantly learning (being corrected, learning some more, more correction, etc ..) 
--- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
David Magda wrote: On Feb 24, 2008, at 01:49, Jonathan Loran wrote: In some circles, CDP is big business. It would be a great ZFS offering. ZFS doesn't have it built in, but AVS may be an option in some cases: http://opensolaris.org/os/project/avs/ Point-in-time copy (as AVS offers) is not the same thing as CDP. When you snapshot data as in point-in-time copies, you predict the future, knowing the time slice at which your data will be needed. Continuous data protection is based on the premise that you don't have a clue ahead of time which point in time you want to recover to. Essentially, for CDP, you need to save every storage block that has ever been written, so you can put them back in place if you so desire. Anyone else on the list think it is worthwhile adding CDP to the ZFS list of capabilities? It causes space management issues, but it's an interesting, useful idea. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
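The CDP premise Jon states, keep every block ever written so any past moment can be reconstructed, can be illustrated with a toy append-only journal. This is purely conceptual, with invented names; it is nothing like a real CDP product or anything in ZFS:

```python
class BlockJournal:
    """Toy continuous-data-protection journal: every write is appended with
    a timestamp, and a device image can be rebuilt as of any past moment."""

    def __init__(self):
        self.log = []  # (ts, block_no, data), appended in timestamp order

    def write(self, ts, block_no, data):
        self.log.append((ts, block_no, data))

    def image_at(self, ts):
        """Replay writes up to and including ts; later writes are ignored."""
        image = {}
        for when, block_no, data in self.log:
            if when > ts:
                break
            image[block_no] = data
        return image
```

A real implementation would also need space reclamation for the ever-growing log, which is exactly the space-management issue Jon raises.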
Re: [zfs-discuss] Which DTrace provider to use
Marion Hakanson wrote: [EMAIL PROTECTED] said: It's not that old. It's a Supermicro system with a 3ware 9650SE-8LP. Open-E iSCSI-R3 DOM module. The system is plenty fast. I can pretty handily pull 120MB/sec from it, and write at over 100MB/sec. It falls apart more on random I/O. The server/initiator side is a T2000 with Solaris 10u4. It never sees over 25% CPU, ever. Oh yeah, and two 1Gb network links to the SAN . . . My opinion is, if when the array got really loaded up, everything slowed down evenly, users wouldn't mind or notice much. But when every 20th or so read/write gets delayed by 10s of seconds, the users start to line up at my door. Hmm, I have no experience with iSCSI yet. But the behavior of our T2000 file/NFS server connected via a 2Gbit fiber channel SAN is exactly as you describe when our HDS SATA array gets behind. Access to other ZFS pools remains unaffected, but any access to the busy pool just hangs. Some Oracle apps on NFS clients die due to excessive delays. In our case, this old HDS array's SATA shelves have a very limited queue depth (four per RAID controller) in the back-end loop, plus every write is hit with the added overhead of an in-array read-back verification. Maybe your iSCSI situation injects enough latency at higher loads to cause something like our FC queue limitations. The iSCSI array has 2GB RAM as a cache. Writes to cache complete very fast. I'm not sure, but would love to get some metering going on this guy to find out whether it's really the reads that cause the issue. It seems like, but I'm not totally sure yet, that heavy random read loads are when things break down. I'll pass on anything I find to the list, 'cause I'm sure there are a lot of folks with ZFS on a SAN. The flexibility of having the SAN is still seductive, even though the benefits to ZFS performance from direct attached storage are pulling us the other way. 
Jon
Re: [zfs-discuss] Which DTrace provider to use
Hi Brendan, I have been using iopending, though I'm not sure how to interpret it. Is it true the column on the left is how deep in the queue requests are, and the histogram represents how many requests there are at each queue depth? Then I would guess if there are lots of requests at high queue depth, that's bad. Once in a while, I see some pretty long queues, but they only last a second, and then things even right out again. I'll try your disktime.d script below and the other checks you recommend. May have more questions to follow. Thanks! Jon Brendan Gregg - Sun Microsystems wrote: G'Day Jon, For disk layer metrics, you could try Disk/iopending from the DTraceToolkit to check how saturated the disks become with requests (which answers that question with much higher definition than iostat). I'd also run disktime.d, which should be in the next DTraceToolkit release (it's pretty obvious), and is included below. disktime.d measures disk delta times - time from request to completion.

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=16m

BEGIN
{
        trace("Tracing... Hit Ctrl-C to end.\n");
}

io:::start
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        this->delta = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
        @[args[1]->dev_statname] = quantize(this->delta);
        /* clear the dynamic variable so it doesn't leak */
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

The iopattern script will also give you a measure of random vs sequential I/O - which would be interesting to see. ... For latencies in ZFS (such as ZIO pipeline latencies), we don't have a stable provider yet. It is possible to write fbt based scripts to do this - but they'll only work on a particular version of Solaris. fsinfo would be a good provider to hit up for the VFS layer. I'd also check syscall latencies - it might be too obvious, but it can be worth checking (eg, if you discover those long latencies are only on the open syscall)... 
Brendan

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  [EMAIL PROTECTED]
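On Jon's question above about reading iopending output: the left column is the (power-of-two bucketed) number of outstanding requests, and the counts are how many samples landed at each depth, so many samples at high depths do indicate a saturated device. As a rough illustration of the aggregation itself, here is a small Python sketch (my own approximation for explanation, not DTrace source) of the power-of-two bucketing that quantize() performs, where each value falls in the bucket [2^k, 2^(k+1)):

```python
# Sketch of what DTrace's quantize() aggregation does: each value is
# dropped into a power-of-two bucket, and the report shows a count per
# bucket. For disktime.d the values are I/O latencies in nanoseconds;
# for iopending they are sampled queue depths.

from collections import Counter

def quantize(values):
    buckets = Counter()
    for v in values:
        if v <= 0:
            buckets[0] += 1
            continue
        b = 1
        while b * 2 <= v:   # find the largest power of two <= v
            b *= 2
        buckets[b] += 1     # v falls in bucket [b, 2b)
    return dict(sorted(buckets.items()))
```

So a latency of 7 and a latency of 4 land in the same `4` bucket, which is why outliers show up as a long tail of sparse high buckets rather than being absorbed into an average - exactly the distribution-vs-average distinction Jon was after.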
Re: [zfs-discuss] Which DTrace provider to use
[EMAIL PROTECTED] wrote: On Tue, Feb 12, 2008 at 10:21:44PM -0800, Jonathan Loran wrote: Thanks for any help anyone can offer. I have faced a similar problem (although not exactly the same) and was going to monitor the disk queue with dtrace, but couldn't find any docs/urls about it. Finally I asked Chris Gerhard for help. He partially answered via his blog: http://blogs.sun.com/chrisg/entry/latency_bubble_in_your_io Maybe it helps you. Regards przemol This is perfect. Thank you. Jon
Re: [zfs-discuss] Which DTrace provider to use
Marion Hakanson wrote: [EMAIL PROTECTED] said: ... I know, I know, I should have gone with a JBOD setup, but it's too late for that in this iteration of this server. When we set this up, I had the gear already, and it's not in my budget to get new stuff right now. What kind of array are you seeing this problem with? It sounds very much like our experience here with a 3-yr-old HDS ATA array. It's not that old. It's a Supermicro system with a 3ware 9650SE-8LP. Open-E iSCSI-R3 DOM module. The system is plenty fast. I can pretty handily pull 120MB/sec from it, and write at over 100MB/sec. It falls apart more on random I/O. The server/initiator side is a T2000 with Solaris 10u4. It never sees over 25% CPU, ever. Oh yeah, and two 1Gb network links to the SAN When the crunch came here, I didn't know enough dtrace to help, but I threw the following into crontab to run every five minutes (24x7), and it at least collected the info I needed to see what LUN/filesystem was busying things out. Way crude, but effective enough:

/bin/ksh -c "date; mpstat 2 20; iostat -xn 2 20; \
  fsstat $(zfs list -H -o mountpoint -t filesystem | egrep '^/') 2 20; \
  vmstat 2 20" >> /var/tmp/iostats.log 2>&1 < /dev/null

A quick scan using egrep could pull out trouble spots; e.g. the following would identify iostat lines that showed 90-100% busy:

egrep '^Sun |^Mon |^Tue |^Wed |^Thu |^Fri |^Sat | 1[0-9][0-9] c6| 9[0-9] c6' \
  /var/tmp/iostats.log

Yeah, I have some traditional *stat utilities running. If I capture more than a second at a time, things look good. I was hoping to get a real distribution of service times, to catch the outliers that don't get absorbed into the average. Hence why I wanted to use dtrace. My opinion is, if when the array got really loaded up, everything slowed down evenly, users wouldn't mind or notice much. But when every 20th or so read/write gets delayed by 10s of seconds, the users start to line up at my door. Thanks for the tips. 
Jon
[zfs-discuss] Which DTrace provider to use
Hi List, I'm wondering if one of you expert DTrace gurus can help me. I want to write a DTrace script to print out a histogram of how long IO requests sit in the service queue. I can output the results with the quantize method. I'm not sure which provider I should be using for this. Does anyone know? I can easily adapt one of the DTrace Toolkit routines for this, if I can find the provider. I'll also throw out the problem I'm trying to meter. We are using ZFS on a large SAN array (4TB). The pool on this array serves up a lot of users (250 home file systems/directories) and also /usr/local and other OTS software. It works fine most of the time, but then gets overloaded during busy periods. I'm going to reconfigure the array to help with this, but I sure would love to have some metrics to know how big a difference my tweaks are making. Basically, the problem users experience when the load shoots up is huge latencies. An ls on a non-cached directory, which usually is instantaneous, will take 20, 30, 40 seconds or more. Then when the storage array catches up, things get better. My clients are not happy campers. I know, I know, I should have gone with a JBOD setup, but it's too late for that in this iteration of this server. When we set this up, I had the gear already, and it's not in my budget to get new stuff right now. Thanks for any help anyone can offer. Jon
Re: [zfs-discuss] OpenSolaris, ZFS and Hardware RAID,
Anton B. Rang wrote: Careful here. If your workload is unpredictable, RAID 6 (and RAID 5, for that matter) will break down under highly randomized write loads. Oh? What precisely do you mean by break down? RAID 5's write performance is well-understood and it's used successfully in many installations for random write loads. Clearly if you need the very highest performance from a given amount of hardware, RAID 1 will perform better for random writes, but RAID 5 can be quite good. (RAID 6 is slightly worse, since a random write requires access to 3 disks instead of 2.) There are certainly bad implementations out there, but in general RAID 5 is a reasonable choice for many random-access workloads. (For those who haven't been paying attention, note that RAIDZ and RAIDZ2 are closer to RAID 3 in implementation and performance than to RAID 5; neither is a good choice for random-write workloads.) In my testing, if you have a lot of IO queues spread widely across your array, you do better with RAID 1 or 10. RAIDZ and RAIDZ2 are much worse, yes. If you add large transfers on top of this, which happen in multi-purpose pools, small reads can get starved out. The throughput curve (IO rate vs. queues*size) with RAID 5/6 flattens out a lot faster than with RAID 10. The scoop is this. On multipurpose pools, zfs often takes the place of many individual file systems. Those had the advantage of separation of IO, and some tuning was also available to each file system. My experience, or should I say theory, is that RAID 5/6 hardware accelerated arrays work pretty well with more predictable IO patterns. Sometimes even great. I use RAID 5/6 a lot for these. Don't get me wrong, I love zfs, I ain't going back. Don't start flaming me, I just think we have to be aware of the limitations and engineer our storage carefully. I made the mistake recently of putting too much faith in hardware RAID 6, and as our user load grew, the performance went through the floor faster than I thought it would. 
My 2 cents. Jon
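The disagreement above can be made concrete with the classic small-write-penalty arithmetic: a small random write to RAID 5 costs four disk I/Os (read old data, read old parity, write new data, write new parity), RAID 6 costs six (a second parity strip must also be read and rewritten), while RAID 1/10 costs two. A sketch of the resulting idealized IOPS (a deliberately simplified model that ignores write-back caching, full-stripe writes, and the read-ahead trickery mentioned above):

```python
# Idealized small-random-write IOPS for common RAID levels.
# write_penalty = disk I/Os consumed per host write:
#   RAID 1/10 -> 2 (mirror both sides)
#   RAID 5    -> 4 (read data, read parity, write data, write parity)
#   RAID 6    -> 6 (as RAID 5, plus read/write of the second parity)

WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

def random_write_iops(level, n_disks, iops_per_disk):
    """Aggregate host-visible random-write IOPS for an array."""
    total_backend_iops = n_disks * iops_per_disk
    return total_backend_iops // WRITE_PENALTY[level]
```

For an 8-disk array of 150-IOPS drives this gives 600 host writes/sec for RAID 10 but only 300 for RAID 5 and 200 for RAID 6, which is one way to see why the throughput curve flattens so much faster under random write load, and why streaming (full-stripe) workloads escape the penalty.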
Re: [zfs-discuss] OpenSolaris, ZFS and Hardware RAID, a recipe for success?
Richard Elling wrote: Nick wrote: Using the RAID card's capability for RAID6 sounds attractive? Assuming the card works well with Solaris, this sounds like a reasonable solution. Careful here. If your workload is unpredictable, RAID 6 (and RAID 5, for that matter) will break down under highly randomized write loads. There's a lot of trickery done with hardware RAID cards that can do some read-ahead caching magic, improving the read / parity-calc / write cycle, but you can't beat the laws of physics. If you do *know* you'll be streaming more than writing small random runs of blocks, RAID 6 hardware can work. But with transaction-like loads, performance will suck. Jon
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
This is true, but I think it's the testing bit that worries me. It's hard to lab out and fully test an equivalent setup that has 350 active clients pounding on it, to test usability and stability. One of our boxes has a boatload of special software running and various tweaks that would also need to be validated. In other words, upgrades have tended to be painful. We don't really have any Open Solaris experience yet, and we've more or less trusted Sun to wring out the issues to minimize the problems and make these upgrades smoother. Of course, the irony is that the requirement for this very stability is why we haven't seen the features in the ZFS code we need in Solaris 10. Thanks, Jon Mike Gerdts wrote: On Jan 30, 2008 2:27 PM, Jonathan Loran [EMAIL PROTECTED] wrote: Before ranting any more, I'll do the test of disabling the ZIL. We may have to build out these systems with Open Solaris, but that will be hard as they are in production. I would have to install the new OS on test systems and swap out the drives during scheduled down time. Ouch. Live upgrade can be very helpful here, either for upgrading or applying a flash archive. Once you are comfortable that Nevada performs like you want, you could prep the new OS on alternate slices or broken mirrors. Activating the updated OS should take only a few seconds longer than a standard init 6. Failback is similarly easy. I can't remember the last time I swapped physical drives to minimize the outage during an upgrade.
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
Guanghui Wang wrote: I don't know when U5 or U6 will be coming, so I just set zfs_nocacheflush=1 in /etc/system, and the performance speeds up like zil_disable=1, and that's safer for the filesystem. The separate log (slog) feature is not in U4; NFS performance on zfs will be too slow when you do not set zfs_nocacheflush=1 in your /etc/system file. Yeah, on one of my systems, I was able to set zfs_nocacheflush=1, but the other machine that's suffering isn't patched up enough to use it. I have to schedule down time to patch it up. I don't have hard numbers yet, but the seat of the pants impression is that stopping cache flushes has helped. On our SAN arrays, I thought the settings I chose would have had them ignore cache flushing, but apparently not. Thanks everyone for the help. I still look forward to using fast SSD for the ZIL when it comes to Solaris 10 U? as a preferred method. Jon
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
Neil Perrin wrote: Roch - PAE wrote: Jonathan Loran writes: Is it true that Solaris 10 u4 does not have any of the nice ZIL controls that exist in the various recent Open Solaris flavors? I would like to move my ZIL to solid state storage, but I fear I can't do it until I have another update. Heck, I would be happy to just be able to turn the ZIL off to see how my NFS on ZFS performance is affected before spending the $'s. Anyone know when we will see this in Solaris 10? You can certainly turn it off with any release (Jim's link). It's true that S10u4 does not have the Separate Intent Log to allow using an SSD for ZIL blocks. I believe S10U5 will have that feature. Don't think we can live with this. Thanks Unfortunately it will not. A lot of ZFS fixes and features that had existed for a while will not be in U5 (for reasons I can't go into here). They should be in S10U6... Neil. I feel like we're being hung out to dry here. I've got 70TB on 9 various Solaris 10 u4 servers, with different data sets. All of these are NFS servers. Two servers have a ton of small files, with a lot of read and write updating, and NFS performance on these is abysmal. ZFS is installed on SAN arrays (my first mistake). I will test by disabling the ZIL, but if it turns out the ZIL needs to be on a separate device, we're hosed. Before ranting any more, I'll do the test of disabling the ZIL. We may have to build out these systems with Open Solaris, but that will be hard as they are in production. I would have to install the new OS on test systems and swap out the drives during scheduled down time. Ouch. Jon
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
Vincent Fox wrote: Are you already running with zfs_nocacheflush=1? We have SAN arrays with dual battery-backed controllers for the cache, so we definitely have this set on all our production systems. It makes a big difference for us. No, we're not using zfs_nocacheflush=1, but our SAN arrays are set to cache all writebacks, so it shouldn't be needed. I may test this, if I get the chance to reboot one of the servers, but I'll bet the storage arrays are working correctly. As I said before, I don't see the catastrophe in disabling the ZIL though. No catastrophe, just a potential mess. We actually run our production Cyrus mail servers using failover servers, so our downtime is typically just the small interval to switch active/idle nodes anyhow. We did this mainly for patching purposes. Wish we could afford such replication. Poor EDU environment here, I'm afraid. But we toyed with the idea of running OpenSolaris on them, then just upgrading the idle node to a new OpenSolaris image every month using Jumpstart and switching to it. Anything goes wrong, switch back to the other node. What we ended up doing, for political reasons, was putting the squeeze on our Sun reps and getting a 10u4 kernel spin patch with... what did they call it? Oh yeah, a big wad of ZFS fixes. So this ends up being a huge PITA because for the next 6 months to a year we are tied to getting any kernel patches through this other channel rather than the usual way. But it does work for us, so there you are. Mmmm, for us, Open Solaris may be easier. I mainly was after stability, to be honest. Our ongoing experience with bleeding edge Linux is painful at times, and on our big iron, I want them to just work. But if they're so slow, they're not really working right, are they? Sigh... Given my choice I'd go with OpenSolaris, but that's a hard sell for datacenter management types. 
I think it's no big deal in a production shop with good JumpStart and CFengine setups, where any host should be rebuildable from scratch in a matter of hours. Good luck. True, I'll think about that going forward. Thanks, Jon This message posted from opensolaris.org
[zfs-discuss] ZIL controls in Solaris 10 U4?
Is it true that Solaris 10 u4 does not have any of the nice ZIL controls that exist in the various recent Open Solaris flavors? I would like to move my ZIL to solid state storage, but I fear I can't do it until I have another update. Heck, I would be happy to just be able to turn the ZIL off to see how my NFS on ZFS performance is affected before spending the $'s. Anyone know when we will see this in Solaris 10? Thanks, Jon
Re: [zfs-discuss] Issue fixing ZFS corruption
Jeff Bonwick wrote: The Silicon Image 3114 controller is known to corrupt data. Google for silicon image 3114 corruption to get a flavor. I'd suggest getting your data onto different h/w, quickly. I'll second this; the 3114 is a piece of junk if you value your data. I bought a 4 port LSI SAS card (yes, a bit pricey) and have had 0 problems since, and hot swap actually works. I never tried hot swap with the 3114 I had; I'd just never seen it actually working before, so I was quite pleasantly surprised. Jonathan
Re: [zfs-discuss] Issue fixing ZFS corruption
Bertrand Sirodot wrote: Hi, if I want to stay with SATA and not go to SAS, do you have a recommendation on which SATA controller is actually supported by Solaris? SAS controllers do support SATA drives (not the other way around, though). I'm running SATA drives on mine without a problem. As far as which ones are supported by Solaris, someone else will have to answer, as I actually use ZFS on FreeBSD. SATA controllers are usually less expensive than SAS controllers, of course. Jonathan
Re: [zfs-discuss] hardware for zfs home storage
Alex, I imagine that you've spent/will spend dozens or perhaps hundreds of hours ripping your MP3s. Don't even think about skipping backups. Budget in the cost of backups, preferably off-site backups, even something you can carry to work and lock in your desk. Buy a four-drive USB enclosure and 4 1TB drives. That would work. As reliable as zfs is, there's no technological fix for natural disasters, or human error. My 2 cents. Jon Alex wrote: Hi, I'm sure this has been asked many times, and though a quick search didn't reveal anything illuminating, I'll post regardless. I am looking to make a storage system available on my home network. I need storage space in the order of terabytes as I have a growing iTunes collection and tons of MP3s that I converted from vinyl. At this time I am unsure of the growth rate, but I suppose it isn't unreasonable to look for 4TB usable storage. Since I will not be backing this up, I think I want RAIDZ2. Since this is for home use, I don't want to spend an inordinate amount of money. I did look at the cheaper STK arrays, but they're more than what I want to pay, so I am thinking that puts me in the white-box market. Power consumption would be nice to keep low also. I don't really care if it's external or internal disks. Even though I don't want to get completely skinned over the money, I also don't want to buy something that is unreliable. I am very interested as to your thoughts and experiences on this. E.g. what to buy, what to stay away from. Thanks in advance!
Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Joerg Schilling wrote: Carsten Bormann [EMAIL PROTECTED] wrote: On Dec 29 2007, at 08:33, Jonathan Loran wrote: We snapshot the file as it exists at the time of the mv in the old file system until all referring file handles are closed, then destroy the single file snap. I know, not easy to implement, but that is the correct behavior, I believe. Exactly. Note that apart from open descriptors, there may be other links to the file on the old FS; it has to be clear whether writes to the file in the new FS change the file in the old FS or not. I'd rather say they shouldn't. Yes, this would be different from the normal rename(2) semantics with respect to multiply linked files. And yes, the semantics of link(2) should also be consistent with this. This is an interesting problem. Your proposal would imply that a file may have different identities in different filesystems: - different st_dev - different st_ino - different link count This cannot be implemented with a single inode's data anymore. Well, it is not impossible, as my WOFS (mentioned before) implements hardlinks via inode-relative symlinks. In order to allow this, a file would need a storage-pool-global serial number that allows matching the different inode sets for the file. Jörg At first, as I mentioned in my earlier email, I was thinking we needed to emulate the cross-fs rename/link/etc behavior as it is currently implemented, where a file appears to actually be copied. But now I'm not so sure. In Unixland, the ideal has always been to have the whole file system, kit and caboodle, singly rooted at /. Heck, even devices are in the file system. Of course, reality required that, programmatically, we needed to be aware of what file system our cwd is in. At a minimum, it's returned in our various stat structs (st_dev). I can see I'm getting long winded, but I'm thinking: what is the value of having different behavior for a cross-zfs file move within the same pool than for a move between directories? 
I'm not addressing the previous discussion about how to treat file handles, etc, but more about sharing open file blocks, linked across zfs boundaries, before and after such a mv. I think the test is this: can we find a scenario where something would break if we did share the file blocks across zfs boundaries after such a mv? For every example I've been able to think of, if I ask the question: what if I moved the file from one directory to the other, instead of across zfs boundaries, would it have been different? The answer has been no. Comments please. Jon
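For readers following the thread, the semantics under debate come from rename(2) refusing to cross filesystem boundaries: it fails with EXDEV, and mv(1) then falls back to copy-plus-unlink, which is why a cross-fs move today produces a new inode (new st_dev/st_ino) and why an open descriptor keeps the old blocks alive. A sketch of that fallback (simplified; real mv also preserves ownership, ACLs, and handles directories):

```python
import errno
import os
import shutil

def mv(src, dst):
    """Emulate mv(1)'s core behavior: try an atomic rename first; on
    EXDEV (paths on different filesystems) fall back to copy + unlink."""
    try:
        os.rename(src, dst)      # same filesystem: atomic, inode preserved
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)   # new inode and new st_dev on the target fs
        os.unlink(src)           # old blocks persist while open fds remain
```

The thread's question is whether ZFS could do better than this copy within a single pool, sharing the blocks instead, since the pool already owns the storage on both sides.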
Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Joerg Schilling wrote: Jonathan Edwards [EMAIL PROTECTED] wrote: since in the current implementation a mv between filesystems would have to assign new st_ino values (fsids in NFS should also be different), all you should need to do is assign new block pointers in the new side of the filesystem .. that would also be handy for cp as well If the rename would keep the blocks from the old file for the new name, then the new file would inherit the identity of the old file. If you did implement the rename in a way that would cause new values for st_dev/st_ino to be returned from a fstat(2) call, then this could confuse programs. If you instead set st_nlink for the open file to 0, then this would be OK from the viewpoint of the old file but not OK from the view of the whole system. How would you implement writes into the open fd from the old name? Jörg A more concise way of putting what I'm saying: a traditional mv between two filesystems will create two copies of the data if the source file is open. At a minimum, this will have to be emulated, or things will break. Since zfs file systems are really different Unix file systems, we have to deal with the semantics. It's not just a path change, as in a directory mv. Jon