Re: [zfs-discuss] Sun Storage 7320 system
On Thu, May 26, 2011 at 8:13 AM, Gustav wrote:
> Hi All,
>
> Can someone please give some advice on the following?
>
> We are installing a 7320 with 2 x 18 GB Write Accelerators, 20 x 1 TB disks and 96 GB of RAM.
>
> Postgres will be running on an Oracle x6270 device with 96 GB of RAM installed and two quad-core CPUs, with a local WAL on 4 hard drives, and a 7320 LUN via 8 Gb FC.
>
> I am going to configure the 7320 as Mirrored, with the following options available to me (read and write cache enabled):
> Double parity RAID
> Mirrored
> Single parity RAID, narrow stripes
> Striped
> Triple mirrored
>
> What does the above mean in real terms, what is optimum for a database with high writes (PostgreSQL 9, or any performance tips for Postgres on a 7320), and are there any comments that can help us improve performance?

Hello,

I would advise against creating many small pools of disks unless you have very different capacity/performance/reliability requirements. Even then, try to limit the number of pools as much as you can. I know older firmware releases limited you to two pools; I believe that limitation has since been removed, but I still advise against multiple pools.

Check the documentation in the "Help" link within the appliance; it's usually very detailed. Although the 7320 is an appliance and comes with its own documentation, I think you'd benefit from reading the ZFS Best Practices Guide (http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide).

You're probably looking for maximum performance with availability, so that narrows it down to a mirrored pool, unless your PostgreSQL workload is so specific that raidz would fit, but beware of the performance hit.

Regards,

-- 
Giovanni Tirloni
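P.S. For reference, outside the appliance BUI the mirrored profile corresponds roughly to the layout below. This is only a sketch with made-up device names (on a 7320 you select the profile through the management interface instead), and the 8K recordsize is a common PostgreSQL suggestion rather than anything specific to the appliance:

  # zpool create dbpool \
      mirror c0t0d0 c0t1d0 \
      mirror c0t2d0 c0t3d0 \
      log mirror c0t4d0 c0t5d0                  # the write accelerators as a mirrored slog
  # zfs create -o recordsize=8k dbpool/pgdata   # match the PostgreSQL 8K page size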
Re: [zfs-discuss] Reboots when importing old rpool
On Tue, May 17, 2011 at 6:38 PM, Brandon High wrote:
> On Tue, May 17, 2011 at 11:10 AM, Hung-ShengTsao (Lao Tsao) Ph.D. wrote:
>>
>> maybe do
>> zpool import -R /a rpool
>
> 'zpool import -N' may work as well.

It looks like a crash dump is in order. The system shouldn't panic just because it can't import a pool.

Try booting with the kernel debugger on (add "-kv" to the GRUB kernel line). Take a look at dumpadm.

-- 
Giovanni Tirloni
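P.S. A quick sketch of the procedure, assuming the usual OpenSolaris defaults (paths may differ on your setup):

  1. At the GRUB menu, press 'e' to edit the boot entry, select the kernel$ line, press 'e' again and append " -kv", then boot it with 'b'.

  2. Before reproducing the panic, confirm that crash dumps are configured:

     # dumpadm
     # dumpadm -d /dev/zvol/dsk/rpool/dump    # if no dedicated dump device is set

  3. After the panic, savecore writes the dump under /var/crash/<hostname>, and it can be inspected with mdb (e.g. mdb unix.0 vmcore.0).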
Re: [zfs-discuss] ZFS File / Folder Management
On Mon, May 16, 2011 at 8:54 PM, MasterCATZ wrote:
>
> Is it possible for ZFS to keep data that belongs together on the same pool? (Or would these questions be more related to RAID-Z?)
>
> That way, if there is a failure, only the data on the pool that failed needs to be replaced. (Or, if one pool fails, does that mean all the other pools fail as well, without a way to recover data?)
>
> I am wanting to be able to expand my array over time by adding either 4- or 8-HDD pools. Most of the data will probably never be deleted.
>
> But say I have 1 gig remaining on the first pool and I add an 8 gig file: does this mean the data will then be put onto pool 1 and pool 2 (1 gig on pool 1, 7 gig on pool 2)? Or would ZFS be able to put it onto the 2nd pool instead of splitting it?
>
> The other scenario would be folder structure: would ZFS be able to understand that data contained in a folder tree belongs together and store it on a dedicated pool? If so it would be great, or else you would be spending forever replacing data from backup if something does go wrong.
>
> Sorry if this goes in the wrong spot; I could not find "» OpenSolaris Forums » zfs » discuss" in the drop-down menu.

You can create a single pool and grow it as needed. From that pool, you create filesystems.

If you want to create multiple pools (due to redundancy/performance requirements being different), ZFS will keep them separated. And again, you will create filesystems/datasets from each one independently.

http://download.oracle.com/docs/cd/E19963-01/html/821-1448/index.html
http://download.oracle.com/docs/cd/E18752_01/html/819-5461/index.html

-- 
Giovanni Tirloni
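P.S. A rough sketch of that workflow (device names and vdev widths are just placeholders):

  # create one pool with a single raidz2 vdev
  # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0

  # carve filesystems out of it for data that belongs together
  # zfs create tank/media
  # zfs create tank/backups

  # later, grow the pool by adding another full-width vdev
  # zpool add tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

Keep in mind that ZFS stripes writes across all vdevs in a pool, so a large file is not confined to a single vdev and losing any whole vdev loses the pool; separate pools are the only way to get the strict isolation you described.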
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness wrote:
>
> Actually I have seen resilvers take a very long time (weeks) on solaris/raidz2 when I almost never see a hardware raid controller take more than a day or two. In one case I thrashed the disks absolutely as hard as I could (hardware controller) and finally was able to get the rebuild to take almost 1 week. Here is an example of one right now:
>
>   pool: raid3060
>  state: ONLINE
> status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
> config:
>

Resilver has been a problem with RAIDZ volumes for a while. I've routinely seen it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All disks are maxed out on IOPS while still reading 1-2MB/s, and there are rarely any writes. I've written about it before here (and provided data).

My only guess is that fragmentation is a real problem in a scrub/resilver situation, but whenever the conversation turns to pointing out weaknesses in ZFS we start seeing "that is not a problem" comments. With the 7000-series appliance I've heard that the 900hr estimated resilver time was "normal" and "everything is working as expected".

I can't help but think there is some walled-garden syndrome floating around.

-- 
Giovanni Tirloni
Re: [zfs-discuss] Quick zfs send -i performance questions
On Wed, May 4, 2011 at 9:04 PM, Brandon High wrote:
> On Wed, May 4, 2011 at 2:25 PM, Giovanni Tirloni wrote:
>> The problem we've started seeing is that a zfs send -i is taking hours to send a very small amount of data (eg. 20GB in 6 hours) while a full zfs send transfers everything faster than the incremental (40-70MB/s). Sometimes we just give up on sending the incremental and send a full altogether.
>
> Does the send complete faster if you just pipe to /dev/null? I've observed that if recv stalls, it'll pause the send, and the two go back and forth stepping on each other's toes. Unfortunately, send and recv tend to pause with each individual snapshot they are working on.
>
> Putting something like mbuffer (http://www.maier-komor.de/mbuffer.html) in the middle can help smooth it out and speed things up tremendously. It prevents the send from pausing when the recv stalls, and allows the recv to continue working when the send is stalled. You will have to fiddle with the buffer size and other options to tune it for your use.

We've done various tests piping it to /dev/null and then transferring the files to the destination. What seems to stall is the recv, because it doesn't complete (through mbuffer, ssh, locally, etc). The zfs send always completes at the same rate.

Mbuffer is being used but doesn't seem to help. When things start to stall, the in/out buffers quickly fill up and nothing gets sent, probably because the mbuffer on the other side can't receive any more data until the zfs recv gives it some air to breathe.

What I find curious is that it only happens with incrementals. Full sends go as fast as possible (monitored with mbuffer). I was just wondering if other people have seen this, whether there is a bug (b111 is quite old), etc.

-- 
Giovanni Tirloni
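P.S. For reference, this is roughly the pipeline in use; hostnames, buffer sizes and the port are placeholders:

  # receiving side
  receiver# mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/vms

  # sending side
  sender# zfs send -i tank/vms@snap1 tank/vms@snap2 | mbuffer -s 128k -m 1G -O receiver:9090

The -m buffer absorbs short stalls on either side, but once the receiver stops consuming for long enough both buffers fill up and the sender blocks anyway, which matches what we see with the incrementals.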
Re: [zfs-discuss] Quick zfs send -i performance questions
On Tue, May 3, 2011 at 11:42 PM, Peter Jeremy <peter.jer...@alcatel-lucent.com> wrote:
> - Is the source pool heavily fragmented with lots of small files?

Peter,

We have some servers holding Xen VMs, and the setup was created around a default VM from which the others are cloned, so the space savings are quite good.

The problem we've started seeing is that a zfs send -i is taking hours to send a very small amount of data (e.g. 20GB in 6 hours) while a full zfs send transfers everything faster than the incremental (40-70MB/s). Sometimes we just give up on sending the incremental and send a full stream altogether.

I'm wondering if it has to do with fragmentation too. Has anyone experienced this? OpenSolaris b111.

As a data point, we also have servers holding VMware VMs (not cloned) and there is no problem. Does anyone know what's special about Xen's cloned VMs? Sparse files, maybe?

Thanks,

-- 
Giovanni Tirloni
Re: [zfs-discuss] A resilver record?
    0.3    0.1    2.3   1  16 c4t10d0

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  113.0   23.0 3226.9   28.0  0.0  1.2    0.1    8.9   2  40 c4t1d0
  159.0    0.0 3286.9    0.0  0.0  0.6    0.1    3.9   2  24 c4t8d0
  176.0    0.0 3545.9    0.0  0.0  0.5    0.1    3.0   2  26 c4t10d0

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  147.4   34.4 3888.9   52.1  0.0  1.5    0.2    8.3   3  43 c4t1d0
  181.7    0.0 3515.1    0.0  0.0  0.6    0.1    3.1   2  24 c4t8d0
  193.5    0.0 3489.9    0.0  0.0  0.6    0.2    3.3   4  22 c4t10d0

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  151.2   33.9 3792.7   42.7  0.0  1.5    0.1    7.9   1  36 c4t1d0
  197.5    0.0 3856.9    0.0  0.0  0.4    0.1    2.3   2  19 c4t8d0
  164.6    0.0 3928.1    0.0  0.0  0.7    0.1    4.2   1  24 c4t10d0

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  171.0   90.0 4426.3  121.5  0.0  1.3    0.1    4.9   3  51 c4t1d0
  184.0    0.0 4426.8    0.0  0.0  0.7    0.1    4.0   2  30 c4t8d0
  195.0    0.0 4430.3    0.0  0.0  0.7    0.1    3.7   2  32 c4t10d0
  ^C

Anyone else with over 600 hours of resilver time? :-)

Thank you,

Giovanni Tirloni
(gtirl...@sysdroid.com)
Re: [zfs-discuss] Resilver misleading output
On Tue, Dec 14, 2010 at 6:34 AM, Bruno Sousa wrote:
> Hello everyone,
>
> I have a pool consisting of 28 1TB SATA disks configured in 15*2 vdevs raid1 (2 disks per mirror), 2 SSDs in mirror for the ZIL and 3 SSDs for L2ARC, and recently I added two more disks.
>
> For some reason the resilver process kicked in, and the system is noticeably slower, but I'm clueless as to what I should do, because zpool status says that the resilver process has finished.
>
> This system is running OpenSolaris snv_134, has 32GB of memory, and here's the zpool output:
>
> zpool status -xv vol0
>   pool: vol0
>  state: ONLINE
> status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 13h24m, 100.00% done, 0h0m to go
> config:
>
> zpool iostat snip
>
>           mirror-12                 ONLINE       0     0     0
>             c8t5000C5001A11A4AEd0   ONLINE       0     0     0
>             c8t5000C5001A10CFB7d0   ONLINE       0     0     0  1.71G resilvered
>           mirror-13                 ONLINE       0     0     0
>             c8t5000C5001A0F621Dd0   ONLINE       0     0     0
>             c8t5000C50019EB3E2Ed0   ONLINE       0     0     0
>           mirror-14                 ONLINE       0     0     0
>             c8t5000C5001A0F543Dd0   ONLINE       0     0     0
>             c8t5000C5001A105D8Cd0   ONLINE       0     0     0
>           mirror-15                 ONLINE       0     0     0
>             c8t5000C5001A0FEB16d0   ONLINE       0     0     0
>             c8t5000C50019C1D460d0   ONLINE       0     0     0  4.06G resilvered
>
> Any idea for this type of situation?

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6899970

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Supermicro AOC-USAS2-L8i
On Fri, Oct 15, 2010 at 5:18 PM, Maurice Volaski <maurice.vola...@einstein.yu.edu> wrote:
>> The mpt_sas driver supports it. We've had LSI 2004 and 2008 controllers hang for quite some time when used with SuperMicro chassis and Intel X25-E SSDs (OSOL b134 and b147). It seems to be a firmware issue that isn't fixed with the last update.
>
> Do you mean to include all the PCIe cards, not just the AOC-USAS2-L8i, and when it's directly connected and not through the backplane? Prior reports here seem to be implicating the card only when it was connected to the backplane.

I only tested the LSI 2004/2008 HBAs connected to the backplane (both 3Gb/s and 6Gb/s). The MegaRAID 8888ELP, when connected to the same backplane, doesn't exhibit that behavior.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Supermicro AOC-USAS2-L8i
On Tue, Oct 12, 2010 at 9:30 AM, Alexander Lesle wrote:
> Hello guys,
>
> I want to build a new NAS and I am searching for a controller. At Supermicro I found this new one with the LSI 2008 controller.
>
> http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=I
>
> Who can confirm that this card runs under OSOL build134 or Solaris 10? Why this card? Because it supports 6.0 Gb/s SATA.

The mpt_sas driver supports it. We've had LSI 2004 and 2008 controllers hang for quite some time when used with SuperMicro chassis and Intel X25-E SSDs (OSOL b134 and b147). It seems to be a firmware issue that isn't fixed with the last update.

While running any heavy workload you'll see as many as $zfs_vdev_max_pending operations stuck on each SSD at random. Others have reported success with them though, YMMV.

LSI says the boards are not supported under Solaris and refuses to investigate it.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
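P.S. For anyone who wants to check or lower that queue depth, this is the sort of thing I mean (the value 10 is only illustrative; test before changing production systems):

  # echo zfs_vdev_max_pending/D | mdb -k        # current per-vdev queue depth
  # echo zfs_vdev_max_pending/W0t10 | mdb -kw   # lower it on the running system

  # or persistently, in /etc/system:
  set zfs:zfs_vdev_max_pending = 10

With a shorter queue each SSD has fewer outstanding commands to get stuck on, which makes the hangs less painful, but it doesn't fix the underlying firmware/driver issue.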
Re: [zfs-discuss] non-ECC Systems and ZFS for home users
On Thu, Sep 23, 2010 at 1:08 PM, Dick Hoogendijk wrote:
> And about what SUN systems are you thinking for 'home use'?
>
> The likelihood of memory failures might be much higher than becoming a millionaire, but in the years past I have never had one. And my home systems are rather cheap. Mind you, not the cheapest, but rather cheap. I do buy good memory though. So, to me, with a good backup I feel rather safe using ZFS. I also had it running for quite some time on a 32-bit machine and that also worked out fine.

We have correctable memory errors on ECC systems on a monthly basis. It's not a question of if they'll happen but how often.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Replacing a disk never completes
On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller wrote:
> I have an X4540 running b134 where I'm replacing 500GB disks with 2TB disks (Seagate Constellation) and the pool seems sick now. The pool has four raidz2 vdevs (8+2) where the first set of 10 disks was replaced a few months ago. I replaced two disks in the second set (c2t0d0, c3t0d0) a couple of weeks ago, but have been unable to get the third disk to finish replacing (c4t0d0).
>
> I have tried the resilver for c4t0d0 four times now and the pool also comes up with checksum errors and a permanent error (:<0x0>). The first resilver was from 'zpool replace', which came up with checksum errors. I cleared the errors, which triggered the second resilver (same result). I then did a 'zpool scrub' which started the third resilver and also identified three permanent errors (the two additional were in files in snapshots which I then destroyed). I then did a 'zpool clear' and then another scrub which started the fourth resilver attempt. This last attempt identified another file with errors in a snapshot that I have now destroyed.
>
> Any ideas how to get this disk finished being replaced without rebuilding the pool and restoring from backup? The pool is working, but is reporting as degraded and with checksum errors.
>
> [...]

Try to run a `zpool clear pool2` and see if it clears the errors. If not, you may have to detach `c4t0d0s0/o`. I believe it's a bug that was fixed in recent builds.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
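P.S. A short sketch of the sequence I mean (pool/device names taken from your description; double-check against your own zpool status output first):

  # zpool clear pool2
  # zpool status -v pool2

  # if the 'replacing' vdev is still stuck, detach the old half explicitly:
  # zpool detach pool2 c4t0d0s0/o

The trailing /o denotes the old device inside the replacing vdev, so detaching it leaves the new c4t0d0 in place and lets the replacement complete.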
Re: [zfs-discuss] what is zfs doing during a log resilver?
On Thu, Sep 2, 2010 at 10:18 AM, Jeff Bacon wrote:
> So, when you add a log device to a pool, it initiates a resilver.
>
> What is it actually doing, though? Isn't the slog a copy of the in-memory intent log? Wouldn't it just simply replicate the data that's in the other log, checked against what's in RAM? And presumably there isn't that much data in the slog, so there isn't that much to check?
>
> Or is it just doing a generic resilver for the sake of argument because you changed something?

Good question. Here it takes a little over 1 hour to resilver a 32GB SSD in a mirror. I've always wondered what exactly it was doing, since it's supposed to hold only about 30 seconds' worth of data. It also generates lots of checksum errors.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Replaced pool device shows up in zpool status
On Mon, Aug 16, 2010 at 11:47 AM, Mark J Musante wrote:
> On Mon, 16 Aug 2010, Matthias Appel wrote:
>>
>> Can anybody tell me how to get rid of c1t3d0 and heal my zpool?
>
> Can you do a "zpool detach performance c1t3d0/o"? If that works, then "zpool replace performance c1t3d0 c1t0d0" should replace the bad disk with the new hot spare. Once the resilver completes, do a "zpool detach performance c1t3d0" to remove the bad disk and promote the hot spare to a full member of the pool.
>
> Or, if that doesn't work, try the same thing with c1t3d0 and c1t3d0/o swapped around.

Recently fixed in b147:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=67825

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] autoreplace not kicking in
On Wed, Aug 11, 2010 at 4:06 PM, Cindy Swearingen wrote:
> Hi Giovanni,
>
> The spare behavior and the autoreplace property behavior are separate but they should work pretty well in recent builds.
>
> You should not need to perform a zpool replace operation if the autoreplace property is set. If autoreplace is set and a replacement disk is inserted into the same physical location of the removed failed disk, then a new disk label is applied to the replacement disk and ZFS should recognize it.

That's what I'm having to do in b111. I will try to simulate the same situation in b134.

> Let the replacement disk resilver from the spare. When the resilver completes, the spare should detach automatically. We saw this happen on a disk replacement last week on a system running a recent Nevada build.
>
> If the spare doesn't detach after the resilver is complete, then just detach it manually.

Yes, that's working as expected (the spare detaches after the resilver).

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
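P.S. For completeness, the property in question can be checked and set like this (pool name is just an example):

  # zpool get autoreplace tank
  # zpool set autoreplace=on tank

With it off, a disk inserted into the slot of a failed device has to be activated manually with zpool replace; with it on, ZFS is supposed to label and resilver the new disk automatically, which is the part that isn't kicking in for me on b111.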
[zfs-discuss] autoreplace not kicking in
Hello,

In OpenSolaris b111, with autoreplace=on and a pool without spares, ZFS is not kicking off the resilver after a faulty disk is replaced and shows up with the same device name, even after waiting several minutes.

The solution is to do a manual `zpool replace`, which returns the following:

  # zpool replace tank c3t17d0
  invalid vdev specification
  use '-f' to override the following errors:
  /dev/dsk/c3t17d0s0 is part of active ZFS pool tank. Please see zpool(1M).

... and resilvering starts immediately. It looks like the `zpool replace` kicked in the autoreplace function.

Since b111 is a little old, there is a chance this has already been reported and fixed. Does anyone know anything about it?

Also, if autoreplace is on and the pool has spares, when a disk fails the spare is automatically used (works fine), but when the faulty disk is replaced... nothing really happens. Was the autoreplace code supposed to replace the faulty disk and release the spare when the resilver is done?

Thank you,

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Upgrading 2009.06 to something current
On Sun, Aug 1, 2010 at 2:57 PM, David Dyer-Bennet wrote:
> What's a good choice for a decently stable upgrade? I'm unable to run backups because ZFS send/receive won't do full-pool replication reliably, it hangs better than 2/3 of the time, and people here have told me later versions (later than 111b) fix this. I was originally waiting for the "spring" release, but okay, I've kind of given up on that. This is a home "production" server; it's got all my photos on it. And the backup isn't as current as I'd like, and I'm having trouble getting a better backup. (I'll do *something* before I risk the upgrade; maybe brute force, rsync to an external drive, to at least give me a clean copy of the current state; I can live without ACLs.)
>
> I find various blogs with instructions for how to do such an upgrade, and they don't agree, and each one has posts from people for whom it didn't work, too. Is there any kind of consensus on what the best way to do this is?

You've got to point pkg to pkg.opensolaris.org/dev and then choose one of the development builds. If you run a `pkg image-update` right away, the latest bits you'll get are from build 134, which people have reported works OK.

If you want to try something in between b111 and b134, see the following instructions:

http://blogs.sun.com/observatory/entry/updating_to_a_specific_build

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
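P.S. The rough sequence is below; the build number in the last command is only an example, so check the linked blog post for the exact syntax on your release:

  # pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
  # pkg image-update -v            # goes straight to the latest dev build (b134 at the moment)

  # or, to stop at a specific intermediate build instead:
  # pkg install entire@0.5.11-0.129

Either way a new boot environment is created, so you can fall back to the old one from GRUB or with beadm if the new build misbehaves.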
Re: [zfs-discuss] Increase resilver priority
On Fri, Jul 23, 2010 at 12:50 PM, Bill Sommerfeld wrote:
> On 07/23/10 02:31, Giovanni Tirloni wrote:
>>
>> We've seen some resilvers on idle servers that are taking ages. Is it possible to speed up resilver operations somehow?
>>
>> Eg. iostat shows <5MB/s writes on the replaced disks.
>
> What build of opensolaris are you running? There were some recent improvements (notably the addition of prefetch to the pool traverse used by scrub and resilver) which sped this up significantly for my systems.

b111. Thanks for the heads up regarding these improvements, I'll try that in b134.

> Also: if there are large numbers of snapshots, pools seem to take longer to resilver, particularly when there's a lot of metadata divergence between snapshots. Turning off atime updates (if you and your applications can cope with this) may also help going forward.

There are 7 snapshots and atime is disabled.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Increase resilver priority
On Fri, Jul 23, 2010 at 11:59 AM, Richard Elling wrote:
> On Jul 23, 2010, at 2:31 AM, Giovanni Tirloni wrote:
>>
>> Hello,
>>
>> We've seen some resilvers on idle servers that are taking ages. Is it possible to speed up resilver operations somehow?
>>
>> Eg. iostat shows <5MB/s writes on the replaced disks.
>
> This is lower than I expect, but it may be IOPS bound. What does iostat say about the IOPS and asvc_t?
> -- richard

It seems to have improved a bit.

 scrub: resilver in progress for 7h19m, 75.37% done, 2h23m to go
config:

        NAME              STATE     READ WRITE CKSUM
        storage           DEGRADED     0     1     0
          mirror          DEGRADED     0     0     0
            c3t2d0        ONLINE       0     0     0
            replacing     DEGRADED 1.29M     0     0
              c3t3d0s0/o  FAULTED      0     0     0  corrupted data
              c3t3d0      DEGRADED     0     0 1.29M  too many errors
          mirror          ONLINE       0     0     0
            c3t4d0        ONLINE       0     0     0
            c3t5d0        ONLINE       0     0     0
          mirror          DEGRADED     0     0     0
            c3t6d0        ONLINE       0     0     0
            c3t7d0        REMOVED      0     0     0
          mirror          ONLINE       0     0     0
            c3t8d0        ONLINE       0     0     0
            c3t9d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t10d0       ONLINE       0     0     0
            c3t11d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t12d0       ONLINE       0     0     0
            c3t13d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t14d0       ONLINE       0     0     0
            c3t15d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t16d0       ONLINE       0     0     0
            c3t17d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t18d0       ONLINE       0     0     0
            c3t19d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t20d0       ONLINE       0     0     0
            c3t21d0       ONLINE       0     0     0
        logs              DEGRADED     0     1     0
          mirror          ONLINE       0     0     0
            c3t1d0        ONLINE       0     0     0
            c3t22d0       ONLINE       0     0     0

                 extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  582.2  864.3 68925.2 37511.2  0.0 52.7    0.0   36.4   0 610 c3
    0.0  201.7     0.0   806.7  0.0  0.0    0.0    0.1   0   2 c3t0d0
    0.0  268.2     0.0 10531.2  0.0  0.1    0.0    0.4   0  10 c3t1d0
  144.1    0.0 18375.9     0.0  0.0  9.5    0.0   65.7   0 100 c3t2d0
   79.5  125.2 10109.9 15634.3  0.0 35.0    0.0  171.0   0 100 c3t3d0
   10.9    0.0  1181.3     0.0  0.0  0.1    0.0   13.3   0  10 c3t4d0
   19.9    0.0  2120.6     0.0  0.0  0.3    0.0   15.6   0  19 c3t5d0
   35.8    0.0  3819.5     0.0  0.0  0.6    0.0   18.1   0  28 c3t6d0
    0.0    0.0     0.0     0.0  0.0  0.0    0.0    0.0   0   0 c3t7d0
   22.9    0.0  2506.6     0.0  0.0  0.5    0.0   22.0   0  22 c3t8d0
   15.9    0.0  1639.8     0.0  0.0  0.3    0.0   20.5   0  15 c3t9d0
   23.8    0.0  2889.6     0.0  0.0  0.5    0.0   19.8   0  27 c3t10d0
   21.9    0.0  2558.3     0.0  0.0  0.6    0.0   28.9   0  19 c3t11d0
   32.8    0.0  3151.9     0.0  0.0  1.2    0.0   37.4   0  25 c3t12d0
   25.8    0.0  2707.8     0.0  0.0  0.5    0.0   18.8   0  26 c3t13d0
   19.9    0.0  2281.1     0.0  0.0  0.3    0.0   17.5   0  24 c3t14d0
   23.8    0.0  2782.3     0.0  0.0  0.3    0.0   14.6   0  20 c3t15d0
   18.9    0.0  2249.8     0.0  0.0  0.4    0.0   19.7   0  23 c3t16d0
   21.9    0.0  2519.5     0.0  0.0  0.5    0.0   22.6   0  27 c3t17d0
   12.9    0.0  1653.2     0.0  0.0  0.2    0.0   16.8   0  18 c3t18d0
   26.8    0.0  3262.7     0.0  0.0  0.8    0.0   28.4   0  29 c3t19d0
    9.9    0.0  1271.7     0.0  0.0  0.1    0.0   14.3   0  13 c3t20d0
   14.9    0.0  1843.9     0.0  0.0  0.3    0.0   20.5   0  19 c3t21d0
    0.0  269.2     0.0 10539.0  0.0  0.4    0.0    1.3   0  33 c3t22d0

                 extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  405.1  745.2 51057.2 29893.8  0.0 53.3    0.0   46.4   0 457 c3
    0.0  252.1     0.0  1008.3  0.0  0.0    0.0    0.1   0   2 c3t0d0
    0.0  177.1     0.0  5485.8  0.0  0.0    0.0    0.3   0   4 c3t1d0
  145.0    0.0 18438.0     0.0  0.0 15.1    0.0  104.0   0 100 c3t2d0
   80.0  140.0 10147.8 17925.9  0.0 35.0    0.0  159.0   0 100 c3t3d0
    8.0    0.0  1024.3     0.0  0.0  0.2    0.0   19.9   0  12 c3t4d0
    7.0    0.0   768.3     0.0  0.0  0.1    0.0   15.
[zfs-discuss] Increase resilver priority
Hello,

We've seen some resilvers on idle servers that are taking ages. Is it possible to speed up resilver operations somehow?

Eg. iostat shows <5MB/s writes on the replaced disks.

I'm thinking a small performance degradation would sometimes be better than the increased risk window (where a vdev is degraded).

Thank you,

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?
On Tue, Jul 20, 2010 at 12:59 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Richard Jahnel
>>
>> I've also tried mbuffer, but I get broken pipe errors part way through the transfer.
>
> The standard answer is mbuffer. I think you should ask yourself what's going wrong with mbuffer. You're not, by any chance, sending across a LACP aggregated link, are you? Most people don't notice it, but I sure do, that usually LACP introduces packet errors. Just watch your error counter, and start cramming data through there. So far I've never seen a single implementation that passed this test... Although I'm sure I've just had bad luck.
>
> If you're having some packet loss, that might explain the poor performance of ssh too. Although ssh is known to slow things down in the best of cases... I don't know if the speed you're seeing is reasonable considering.

We have hundreds of servers using LACP and so far have not noticed any increase in the error rate.

Could you share which implementations (OS, switch) you have tested and how it was done? I would like to try to simulate these issues.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
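P.S. For anyone who wants to compare numbers, this is roughly how I watch the error counters on our aggregations (link names and intervals are examples):

  # dladm show-aggr                   # list aggregations and their ports
  # dladm show-link -s -i 5 aggr0     # per-interval stats, including IERRORS/OERRORS
  # netstat -i 5                      # quick view of input/output errors per interface

If LACP were mangling frames I'd expect the error columns to climb while cramming data through the link; so far they stay at zero here.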
Re: [zfs-discuss] corrupt pool?
On Mon, Jul 19, 2010 at 1:42 PM, Wolfraider wrote:
> Our server locked up hard yesterday and we had to hard power it off and back on. The server locked up again on reading the ZFS config (I left it trying to read the zfs config for 24 hours). I went through and removed the drives for the data pool we created and powered on the server and it booted successfully. I removed the pool from the system and reattached the drives and tried to re-import the pool. It has now been trying to import for about 6 hours. Does anyone know how to recover this pool? Running version 134.

Have you enabled compression or deduplication?

Check the disks with `iostat -xCn 1` (look for high asvc_t times) and `iostat -En` (hard and soft errors).

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] carrying on
On Mon, Jul 19, 2010 at 7:12 AM, Joerg Schilling wrote:
> Giovanni Tirloni wrote:
>
>> On Sun, Jul 18, 2010 at 10:19 PM, Miles Nordin wrote:
>>> IMHO it's important we don't get stuck running Nexenta in the same spot we're now stuck with OpenSolaris: with a bunch of CDDL-protected source that few people know how to use in practice because the build procedure is magical and secret. This is why GPL demands you release "all build scripts"!
>>
>> I don't know if the GPL demands that but I think we've all learned a lesson from Oracle/Sun regarding that.
>
> The missing requirement to provide build scripts is a drawback of the CDDL.
>
> ...But believe me that the GPL would not help you here, as the GPL cannot force the original author (in this case Sun/Oracle or whoever) to supply the scripts in question.

I have no doubt that the GPL (or any other license) would not have prevented the current situation. It's more of a strategic/business decision.

>> I hope that if we want to be able to move OpenSolaris to the next level, we can this time avoid falling into the same mouse trap.
>
> This is a community issue.
>
> Do we have people that are willing to help?

Yep! Just need a little guidance in the beginning :)

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] carrying on
On Sun, Jul 18, 2010 at 10:19 PM, Miles Nordin wrote:
> IMHO it's important we don't get stuck running Nexenta in the same spot we're now stuck with OpenSolaris: with a bunch of CDDL-protected source that few people know how to use in practice because the build procedure is magical and secret. This is why GPL demands you release "all build scripts"!

I don't know if the GPL demands that, but I think we've all learned a lesson from Oracle/Sun regarding that.

Releasing source code and expecting people to figure out the rest could be called "open source", but it won't create the kind of collaboration people usually expect. For any "fork" (or whatever people want to call it, there are many shades of gray) to succeed, the release and documentation of the build/testing infrastructure used to create the end product is as important as the main source code itself.

I'm not saying Oracle/Sun should have released everything they used to create the OpenSolaris binary distribution (their product). I'm saying they should have first stopped treating it as a proprietary product and then released those bits to further foster external collaboration.

But now that's all history, and discussing how things could have been done won't change anything. I hope that if we want to be able to move OpenSolaris to the next level, we can this time avoid falling into the same mousetrap.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Lost zpool after reboot
On Sat, Jul 17, 2010 at 3:07 PM, Amit Kulkarni wrote:
> I don't know if the devices are renumbered. How do you know if the devices are changed?
>
> Here is the output of format; the middle one is the boot drive and selections 0 & 2 are the ZFS mirrors:
>
> AVAILABLE DISK SELECTIONS:
>        0. c8t0d0
>           /p...@0,0/pci108e,5...@7/d...@0,0
>        1. c8t1d0
>           /p...@0,0/pci108e,5...@7/d...@1,0
>        2. c9t0d0
>           /p...@0,0/pci108e,5...@8/d...@0,0

It seems that the devices that ZFS is trying to open exist. I wonder why it's failing. Please send the output of:

  zpool status
  zpool import
  zdb -C                      (dump config)
  zdb -l /dev/dsk/c8t0d0s0    (dump label contents)
  zdb -l /dev/dsk/c9t0d0s0    (dump label contents)

Also check /var/adm/messages.

Perhaps with the additional information someone here can help you better. I don't have any experience with Windows 7, so I can't guarantee that it hasn't messed with the disk contents.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Lost zpool after reboot
On Sat, Jul 17, 2010 at 10:55 AM, Amit Kulkarni wrote:
> I did a zpool status and it gave me a ZFS-8000-3C error, saying my pool is unavailable. Since I am able to boot & access the browser, I tried a zpool import without arguments, tried exporting my pool, and did more fiddling. Now I can't get zpool status to show my pool.
>
> vdev_path = /dev/dsk/c9t0d0s0
> vdev_devid = id1,s...@ahitachi_hds7225scsun250g_0719bn9e3k=vfa100r1dn9e3k/a
> parent_guid = 0xb89f3c5a72a22939

Does format(1M) show the devices where they once were?

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] fmadm warnings about media erros
On Sat, Jul 17, 2010 at 10:49 AM, Bob Friesenhahn wrote:
> On Sat, 17 Jul 2010, Bruno Sousa wrote:
>>
>> Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
>> Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
>> Jul 15 12:30:48 storage01 DESC: The command was terminated with a non-recovered error condition that may have been caused by a flaw in the media or an error in the recorded data.
>
> This sounds like a hard error to me. I suggest using 'iostat -xe' to check the hard error counts and check the system log files. If your storage array was undergoing maintenance and had a cable temporarily disconnected or controller rebooted, then it is possible that hard errors could be counted. FMA usually waits until several errors have been reported over a period of time before reporting a fault.

Speaking of that, is there a place where one can see/change these thresholds?

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Legality and the future of zfs...
On Wed, Jul 14, 2010 at 12:57 PM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> When you pay the higher prices for OEM hardware, you're paying for the knowledge of parts availability and compatibility. And a single point vendor who supports the system as a whole, not just one component.
>
> For the record:
>
> I'm not saying this is always worthwhile. Sometimes I buy the enterprise product and triple-platinum support. Sometimes I buy generic black boxes with mfgr warranty on individual components. It depends on your specific needs at the time.
>
> I will say that I am a highly paid senior admin. I only buy the generic black boxes if I have interns or junior (no college level) people available to support them.

Generic != black boxes. Quite the opposite.

Some companies are successfully taking the opposite approach to yours: they use standard parts and a competent staff that knows how to build solutions out of them, without having to pay for GUI-powered systems and a 4-hour on-site part-swapping service.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] [osol-help] ZFS list snapshots incurs large delay
On Tue, Jul 13, 2010 at 2:44 PM, Brent Jones wrote:
> I have been running a pair of X4540's for almost 2 years now, the usual spec (quad core, 64GB RAM, 48x 1TB). I have a pair of mirrored drives for rpool, and a raidz set with 5-6 disks in each vdev for the rest of the disks. I am running snv_132 on both systems.
>
> I noticed an oddity on one particular system: when running a scrub, or a zfs list -t snapshot, the results take forever. Mind you, these are identical systems in hardware and software. The primary system replicates all data sets to the secondary nightly, so there isn't much of a discrepancy in space used.
>
> Primary system:
> # time zfs list -t snapshot | wc -l
>      979
>
> real    1m23.995s
> user    0m0.360s
> sys     0m4.911s
>
> Secondary system:
> # time zfs list -t snapshot | wc -l
>      979
>
> real    0m1.534s
> user    0m0.223s
> sys     0m0.663s
>
> At the time of running both of those, no other activity was happening, load average of .05 or so. Subsequent runs also take just as long on the primary; no matter how many times I run it, it will take about 1 minute and 25 seconds each time, with very little drift (+- 1 second if that).
>
> Both systems are at about 77% used space on the storage pool, no other distinguishing factors that I can discern. Upon a reboot, performance is respectable for a little while, but within days it will sink back to those levels. I suspect a memory leak, but both systems run the same software versions and packages, so I can't envision that.
>
> Would anyone have any ideas what may cause this?

It could be a disk failing and dragging I/O down with it.

Try to check for high asvc_t with `iostat -xCn 1` and for errors in `iostat -En`. Any timeouts or retries in /var/adm/messages?

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] zfs send/recv hanging in 2009.06
On Fri, Jul 9, 2010 at 6:49 PM, BJ Quinn wrote:
> I have a couple of systems running 2009.06 that hang on relatively large zfs send/recv jobs. With the -v option, I see the snapshots coming across, and at some point the process just pauses, IO and CPU usage go to zero, and it takes a hard reboot to get back to normal. The same script running against the same data doesn't hang on 2008.05.

There are issues running concurrent zfs receives in 2009.06. Try to run just one at a time.

Switching to a development build (b134) is probably the answer until we have a new release.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] NexentaStor 3.0.3 vs OpenSolaris - Patches more up to date?
On Tue, Jul 6, 2010 at 4:06 PM, Spandana Goli wrote:
> Release Notes information:
> If there are new features, each release is added to http://www.nexenta.com/corp/documentation/release-notes-support.
>
> If just bug fixes, then the Changelog listing is updated: http://www.nexenta.com/corp/documentation/nexentastor-changelog

Is there a bug tracker where one can objectively list all the bugs (with details) that went into a release? "Many bug fixes" is a bit too general.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] WD caviar/mpt issues
On Wed, Jun 23, 2010 at 2:43 PM, Jeff Bacon wrote:
>>> Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves performance by 30-40% instantly and there are no hangs anymore so I'm guessing it's something related to the mpt_sas driver.
>
> Wait. The mpt_sas driver by default uses scsi_vhci, and scsi_vhci by default does load-balance round-robin. Have you tried setting load-balance="none" in scsi_vhci.conf?

That didn't help.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
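P.S. For reference, this is the change that was suggested and roughly what I tried (stock file path on OpenSolaris; a reboot is needed for it to take effect):

  # /kernel/drv/scsi_vhci.conf
  load-balance="none";

  # bootadm update-archive
  # reboot

It disables the round-robin path selection, but in my case the hangs were still there afterwards.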
Re: [zfs-discuss] WD caviar/mpt issues
On Wed, Jun 23, 2010 at 10:14 AM, Jeff Bacon wrote:
>>> Have I missed any changes/updates in the situation?
>>
>> I've been getting very bad performance out of a LSI 9211-4i card (mpt_sas) with Seagate Constellation 2TB SAS disks, SM SC846E1 and Intel X25-E/M SSDs. Long story short, I/O will hang for over 1 minute at random under heavy load.
>
> Hm. That I haven't seen. Is this hang as in some drive hangs up with iostat busy% at 100 and nothing else happening (can't talk to a disk) or a hang as perceived by applications under load?
>
> What's your read/write mix, and what are you using for CPU/mem? How many drives?

I'm using iozone to get some performance numbers, and I/O hangs when it's doing the writing phase. This pool has:

  18 x 2TB SAS disks as 9 data mirrors
   2 x 32GB X25-E as a log mirror
   1 x 160GB X25-M as cache

iostat shows "2" I/O operations active and the SSDs at 100% busy when it's stuck. There are timeout messages when this happens:

Jun 23 00:05:51 osol-x8-hba scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba    Disconnected command timeout for Target 11
Jun 23 00:05:51 osol-x8-hba scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba    Log info 0x3114 received for target 11.
       scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Jun 23 00:05:51 osol-x8-hba scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba    Log info 0x3114 received for target 11.
       scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Jun 23 00:11:51 osol-x8-hba scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba    Disconnected command timeout for Target 11
Jun 23 00:11:51 osol-x8-hba scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba    Log info 0x3114 received for target 11.
       scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Jun 23 00:11:51 osol-x8-hba scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba    Log info 0x3114 received for target 11.
       scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc

> I wonder if maybe your SSDs are flooding the channel. I have a (many) 847E2 chassis, and I'm considering putting in a second pair of controllers and splitting the drives front/back so it's 24/12 vs all 36 on one pair.

My plan is to use the newest SC846E26 chassis with 2 cables, but right now what I have available for testing is the SC846E1. I like the fact that SM uses the LSI chipsets in their backplanes. It's been a good experience so far.

>>> Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves performance by 30-40% instantly and there are no hangs anymore so I'm guessing it's something related to the mpt_sas driver.
>
> Well, I sorta hate to swap out all of my controllers (bother, not to mention the cost) but it'd be nice to have raidutil/lsiutil back.

As much as I would like to blame faulty hardware for this issue, I only pointed out that using the MegaRAID doesn't show the problem, because that's what I've been using without any issues in this particular setup.

This system will be available to me for quite some time, so if anyone wants all kinds of tests to understand what's happening, I would be happy to provide those.
-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] WD caviar/mpt issues
On Fri, Jun 18, 2010 at 9:53 AM, Jeff Bacon wrote:
> I know that this has been well-discussed already, but it's been a few months - WD Caviars with mpt/mpt_sas generating lots of retryable read errors, spitting out lots of beloved "Log info 3108 received for target" messages, and just generally not working right.
>
> (SM 836EL1 and 836TQ chassis - though I have several variations on theme depending on date of purchase: 836EL2s, 846s and 847s - sol10u8, 1.26/1.29/1.30 LSI firmware on LSI retail 3801 and 3081E controllers. Not that it works any better on the brace of 9211-8is I also tried these drives on.)
>
> Before signing up for the list, I "accidentally" bought a wad of Caviar Black 2TBs. No, they are new enough to not respond to WDTLER.EXE, and yes, they are generally unhappy with my boxen. I have them "working" now, running direct-attach off 3 3081E-Rs with breakout cables in the SC836TQ (passthru backplane) chassis, set up as one pool of 2 6+2 raidz2 vdevs (16 drives total), but they still toss the occasional error and performance is, well, abysmal - zpool scrub runs at about a third the speed of the 1TB Cudas that they share the machine with, in terms of iostat reported ops/sec or bytes/sec. They don't want to work in an expander chassis at all - spin up the drives and connect them and they'll run great for a while, then after about 12 hours they start throwing errors. (Cycling power on the enclosure does seem to reset them to run for another 12 hours, but...)
>
> I've caved in and bought a brace of replacement Cuda XTs, and I am currently going to resign these drives to other lesser purposes (attached to si3132s and ICH10 in a box to be used to store backups, running Windoze). It's kind of a shame, because their single-drive performance is quite good - I've been doing single-drive tests in another chassis against Cudas and Constellations, and they seem quite a bit faster except on random-seek.
>
> Have I missed any changes/updates in the situation?

I've been getting very bad performance out of a LSI 9211-4i card (mpt_sas) with Seagate Constellation 2TB SAS disks, SM SC846E1 and Intel X25-E/M SSDs. Long story short, I/O will hang for over 1 minute at random under heavy load.

Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves performance by 30-40% instantly and there are no hangs anymore, so I'm guessing it's something related to the mpt_sas driver.

I submitted bug #6963321 a few minutes ago (not available yet).

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] zpool export / import discrepancy
On Tue, Jun 15, 2010 at 1:56 PM, Scott Squires wrote:
> Is ZFS dependent on the order of the drives? Will this cause any issue down the road? Thank you all;

No. In your case the logical names changed but ZFS managed to order the disks correctly as they were before.

-- 
Giovanni
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
On Thu, May 27, 2010 at 2:39 AM, Marc Bevand wrote:
> Hi,
>
> Brandon High freaks.com> writes:
>>
>> I only looked at the Megaraid that he mentioned, which has a PCIe 1.0 4x interface, or 1000MB/s.
>
> You mean x8 interface (theoretically plugged into that x4 slot below...)
>
>> The board also has a PCIe 1.0 4x electrical slot, which is 8x physical. If the card was in the PCIe slot furthest from the CPUs, then it was only running 4x.

The tests were done connecting both cards to the PCIe 2.0 x8 slot #6 that connects directly to the Intel 5520 chipset. I totally ignored the differences between PCIe 1.0 and 2.0. My fault.

> If Giovanni had put the Megaraid in this slot, he would have seen an even lower throughput, around 600MB/s:
>
> This slot is provided by the ICH10R which, as you can see on http://www.supermicro.com/manuals/motherboard/5500/MNL-1062.pdf, is connected to the northbridge through a DMI link, an Intel-proprietary PCIe 1.0 x4 link. The ICH10R supports a Max_Payload_Size of only 128 bytes on the DMI link: http://www.intel.com/Assets/PDF/datasheet/320838.pdf
>
> And as per my experience (http://opensolaris.org/jive/thread.jspa?threadID=54481&tstart=45) a 128-byte MPS allows using just about 60% of the theoretical PCIe throughput, that is, for the DMI link: 250MB/s * 4 links * 60% = 600MB/s. Note that the PCIe x4 slot supports a larger, 256-byte MPS but this is irrelevant as the DMI link will be the bottleneck anyway due to the smaller MPS.
>
>>> A single 3Gbps link provides in theory 300MB/s usable after 8b-10b encoding, but practical throughput numbers are closer to 90% of this figure, or 270MB/s. 6 disks per link means that each disk gets allocated 270/6 = 45MB/s.
>>
>> ... except that a SFF-8087 connector contains four 3Gbps connections.
>
> Yes, four 3Gbps links, but 24 disks per SFF-8087 connector. That's still 6 disks per 3Gbps (according to Giovanni, his LSI HBA was connected to the backplane with a single SFF-8087 cable).

Correct. The backplane on the SC846E1 only has one SFF-8087 cable to the HBA.

>> It may depend on how the drives were connected to the expander. You're assuming that all 18 are on 3 channels, in which case moving drives around could help performance a bit.
>
> True, I assumed this and, frankly, this is probably what he did by using adjacent drive bays... A more optimal solution would be to spread the 18 drives in a 5+5+4+4 config so that the 2 most congested 3Gbps links are shared by only 5 drives, instead of 6, which would boost the throughput by 6/5 = 1.2x. Which would change my first overall 810MB/s estimate to 810*1.2 = 972MB/s.

The chassis has 4 columns of 6 disks. The 18 disks I was testing were all on columns #1, #2 and #3. Column #0 still has a pair of SSDs and more disks which I haven't used in this test. I'll try to move things around to make use of the 4 port multipliers and test again.

SuperMicro is going to release a 6Gb/s backplane that uses the LSI SAS2X36 chipset in the near future, I've been told. Good thing this is still a lab experiment.

Thanks very much for the invaluable help!

-- 
Giovanni
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
On Wed, May 26, 2010 at 9:22 PM, Brandon High wrote:
> On Wed, May 26, 2010 at 4:27 PM, Giovanni Tirloni wrote:
>> SuperMicro X8DTi motherboard
>> SuperMicro SC846E1 chassis (3Gb/s backplane)
>> LSI 9211-4i (PCIe x4) connected to the backplane with a SFF-8087 cable (4-lane)
>> 18 x Seagate 1TB SATA 7200rpm
>>
>> I was able to saturate the system at 800MB/s with the 18 disks in RAID-0. Same performance was achieved swapping the 9211-4i for a MegaRAID ELP.
>>
>> I'm guessing the backplane and cable are the bottleneck here.
>
> I'd wager it's the PCIe x4. That's about 1000MB/s raw bandwidth, about 800MB/s after overhead.

Makes perfect sense. I was calculating the bottlenecks using the full-duplex bandwidth, so the one-way bottleneck wasn't apparent.

In any case, the solution is limited externally by the 4 x Gigabit Ethernet NICs, unless we add more, which isn't necessary for our requirements.

Thanks!

-- 
Giovanni
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
On Thu, May 20, 2010 at 2:19 AM, Marc Bevand wrote:
> Deon Cui gmail.com> writes:
>>
>> So I had a bunch of them lying around. We've bought a 16x SAS hotswap case and I've put in an AMD X4 955 BE with an ASUS M4A89GTD Pro as the mobo.
>>
>> In the two 16x PCI-E slots I've put in the 1068E controllers I had lying around. Everything is still being put together and I still haven't even installed opensolaris yet but I'll see if I can get you some numbers on the controllers when I am done.
>
> This is a well-architected config with no bottlenecks on the PCIe links to the 890GX northbridge or on the HT link to the CPU. If you run 16 concurrent dd if=/dev/rdsk/c?d?t?p0 of=/dev/zero bs=1024k and assuming your drives can do ~100MB/s sustained reads at the beginning of the platter, you should literally see an aggregate throughput of ~1.6GB/s...

My setup:

  SuperMicro X8DTi motherboard
  SuperMicro SC846E1 chassis (3Gb/s backplane)
  LSI 9211-4i (PCIe x4) connected to the backplane with a SFF-8087 cable (4-lane)
  18 x Seagate 1TB SATA 7200rpm

I was able to saturate the system at 800MB/s with the 18 disks in RAID-0. The same performance was achieved swapping the 9211-4i for a MegaRAID ELP.

I'm guessing the backplane and cable are the bottleneck here. Any comments?

-- 
Giovanni
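P.S. For anyone repeating the raw-disk sweep, this is roughly what I ran; the device names and count are placeholders, and note the output should go to /dev/null rather than /dev/zero:

  #!/bin/sh
  # one sequential reader per disk, all running concurrently
  for disk in c3t2d0 c3t3d0 c3t4d0 c3t5d0; do
      dd if=/dev/rdsk/${disk}p0 of=/dev/null bs=1024k &
  done
  wait

  # meanwhile, in another terminal:
  # iostat -xnC 5

Summing the kr/s column across the disks gives the aggregate raw figure to compare against the ~800MB/s I saw from the pool.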
Re: [zfs-discuss] ZFS Hard disk buffer at 100%
On Fri, May 7, 2010 at 8:07 AM, Emily Grettel wrote:
> Hi,
>
> I've had my RAIDz volume working well on SNV_131 but it has come to my attention that there have been some read issues with the drives. Previously I thought this was a CIFS problem but I'm noticing that when transferring files or uncompressing some fairly large 7z (1-2GB) files (or even smaller RARs - 200-300MB), occasionally running iostat will give the %b as 100 for a drive or two.

That's the percent of time the disk is busy (transactions in progress) - iostat(1M).

> I have the Western Digital EADS 1TB drives (Green ones) and not the more expensive black or enterprise drives (our sysadmin's fault).
>
> The pool in question spans 4x 1TB drives.
>
> What exactly does this mean? Is it a controller problem, disk problem or cable problem? I've got this on commodity hardware as it's only used for a small business with 4-5 staff accessing our media server. It's using the Intel ICHR SATA controller. I've already changed the cables, swapped out the odd drive that exhibited this issue and the only thing I can think of is to buy an Intel or LSI SATA card.
>
> The scrub sessions take almost a day and a half now (previously at most 12 hours!) but there's also 70% of space being used (file-wise they're chunky MPG files or compressed artwork), but there are no errors reported.
>
> Does anyone have any ideas?

You might be maxing out your drives' I/O capacity. That could happen when ZFS is committing the transactions to disk every 30 seconds, but if %b is constantly high your disks might not be keeping up with the performance requirements.

We've had some servers showing high asvc_t times but it turned out to be a firmware issue in the disk controller. It was very erratic (1-2 drives out of 24 would show that).

If you look in the archives, people have sent a few averaged I/O performance numbers that you could compare to your workload.

-- 
Giovanni
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
On Thu, May 6, 2010 at 1:18 AM, Edward Ned Harvey wrote:
>> From the information I've been reading about the loss of a ZIL device,
>
> What the heck? Didn't I just answer that question? I know I said this is answered in the ZFS Best Practices Guide.
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices
>
> Prior to pool version 19, if you have an unmirrored log device that fails, your whole pool is permanently lost.
> Prior to pool version 19, mirroring the log device is highly recommended.
> In pool version 19 or greater, if an unmirrored log device fails during operation, the system reverts to the default behavior, using blocks from the main storage pool for the ZIL, just as if the log device had been gracefully removed via the "zpool remove" command.

This week I had a bad experience replacing an SSD that was in a hardware RAID-1 volume. While rebuilding, the source SSD failed and the volume was brought offline by the controller.

The server kept working just fine but seemed to have switched from the 30-second commit interval to all writes going directly to the disks. I could confirm this with iostat.

We've had some compatibility issues between LSI MegaRAID cards and a few MTRON SSDs, and I didn't believe the SSD had really died. So I brought it offline and back online and everything started to work.

ZFS showed the log device c3t1d0 as removed. After the RAID-1 volume was back, I replaced that device with itself and a resilver process started. I don't know what it was resilvering against, but it took 2h10min. I should probably have tried a zpool offline/online too.

So I think if a log device fails AND you have to import your pool later (server rebooted, etc.)... then you lose your pool (prior to version 19). Right?

This happened on OpenSolaris 2009.06.

-- 
Giovanni
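P.S. A small sketch of the version dependency being discussed (pool name is an example):

  # zpool upgrade -v | head      # versions this build supports
  # zpool get version tank       # version the pool is actually running

  # on version 19 or later an unmirrored log device can also be removed outright:
  # zpool remove tank c3t1d0

On OpenSolaris 2009.06 the pool is below version 19, so none of this helps once an unmirrored slog has already failed.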
Re: [zfs-discuss] What about this status report
On Sat, Mar 27, 2010 at 6:02 PM, Harry Putnam wrote: > Bob Friesenhahn writes: > > > On Sat, 27 Mar 2010, Harry Putnam wrote: > > > >> What to do with a status report like the one included below? > >> > >> What does it mean to have an unrecoverable error but no data errors? > > > > I think that this summary means that the zfs scrub did not encounter > > any reported read/write errors from the disks, but on one of the > > disks, 7 of the returned blocks had a computed checksum error. This > > could be a problem with the data that the disk previously > > wrote. Perhaps there was an undetected data transfer error, the drive > > firmware glitched, the drive experienced a cache memory glitch, or the > > drive wrote/read data from the wrong track. > > > > If you clear the error information, make sure you keep a record of it > > in case it happens again. > > Thanks. > > So its not a serious matter? Or maybe more of a potentially serious > matter? > Not really. That's exactly the kind of problem ZFS is designed to catch. > > Is there specific documentation somewhere that tells how to read these > status reports? > Your pool is not degraded, so I don't think anything will show up in fmdump. But check 'fmdump -eV' to see the actual error reports that were generated. You might find something there. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving drives around...
On Tue, Mar 23, 2010 at 2:00 PM, Ray Van Dolson wrote: > Kind of a newbie question here -- or I haven't been able to find great > search terms for this... > > Does ZFS recognize zpool members based on drive serial number or some > other unique, drive-associated ID? Or is it based off the drive's > location (c0t0d0, etc). > ZFS makes use of on-disk labels (which carry unique GUIDs) and will detect your drives even if you move them around. You can check that with 'zdb -l /dev/rdsk/cXtXdXs0' > > I'm wondering because I have a zpool set up across a bunch of drives > and I am planning to move those drives to another port on the > controller potentially changing their location -- as well as the > location of my "boot" zpool (two disks). > > Will ZFS detect this and be smart about it or do I need to do something > like a zfs export ahead of time? What about for the root pool? > No need. The same goes for the rpool; you only need to make sure your system will boot from the correct disk. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
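To see the labels in question (the device path below is only an example; point it at one of the pool's disks):

  zdb -l /dev/rdsk/c0t0d0s0 | egrep 'name|guid|path'

The pool and vdev GUIDs printed there are what 'zpool import' matches on; the path stored in the label is just a hint, which is why moving disks to other ports is harmless.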
Re: [zfs-discuss] Proposition of a new zpool property.
On Sat, Mar 20, 2010 at 4:07 PM, Svein Skogen wrote: > We all know that data corruption may happen, even on the most reliable of > hardware. That's why zfs har pool scrubbing. > > Could we introduce a zpool option (as in zpool set ) for > "scrub period", in "number of hours" (with 0 being no automatic scrubbing). > > I see several modern raidcontrollers (such as the LSI Megaraid MFI line) > has such features (called "patrol reads") already built into them. Why > should zfs have the same? Having the zpool automagically handling this > (probably a good thing to default it on 168 hours or one week) would also > mean that the scrubbing feature is independent from cron, and since scrub > already has lower priority than ... actual work, it really shouldn't annoy > anybody (except those having their server under their bed). > > Of course I'm more than willing to stand corrected if someone can tell me > where this is already implemented, or why it's not needed. Proper flames > over this should start with a "warning, flame" header, so I can don my > asbestos longjohns. ;) > That would add unnecessary code to the ZFS layer for something that cron can handle in one line. Someone could hack zfs.c to automatically handle editing the crontab but I don't know if it's worth the effort. Are you worried that cron will fail or is it just an aesthetic requirement ? -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
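For example, the kind of one-liner meant above -- pool name and schedule are arbitrary -- in root's crontab:

  0 3 * * 0 /usr/sbin/zpool scrub tank

That starts a scrub every Sunday at 03:00, and 'zpool status tank' will later show when the last scrub completed and whether it found errors.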
Re: [zfs-discuss] zpool I/O error
On Fri, Mar 19, 2010 at 1:26 PM, Grant Lowe wrote: > Hi all, > > I'm trying to delete a zpool and when I do, I get this error: > > # zpool destroy oradata_fs1 > cannot open 'oradata_fs1': I/O error > # > > The pools I have on this box look like this: > > #zpool list > NAME SIZE USED AVAILCAP HEALTH ALTROOT > oradata_fs1 532G 119K 532G 0% DEGRADED - > rpool 136G 28.6G 107G21% ONLINE - > # > > Why can't I delete this pool? This is on Solaris 10 5/09 s10s_u7. > Please send the result of zpool status. Your devices are probably all offline but that shouldn't stop you from removing it, at least not on OpenSolaris. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] lazy zfs destroy
On Thu, Mar 18, 2010 at 1:19 AM, Chris Paul wrote: > OK I have a very large zfs snapshot I want to destroy. When I do this, the > system nearly freezes during the zfs destroy. This is a Sun Fire X4600 with > 128GB of memory. Now this may be more of a function of the IO device, but > let's say I don't care that this zfs destroy finishes quickly. I actually > don't care, as long as it finishes before I run out of disk space. > > So a suggestion for room for growth for the zfs suite is the ability to > lazily destroy snapshots, such that the destroy goes to sleep if the cpu > idle time falls under a certain percentage. > What build of OpenSolaris are you using ? Is it nearly freezing during the whole process or just at the end ? There was another thread where a similar issue was discussed a week ago. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Scrub not completing?
On Wed, Mar 17, 2010 at 7:09 PM, Bill Sommerfeld wrote: > On 03/17/10 14:03, Ian Collins wrote: > >> I ran a scrub on a Solaris 10 update 8 system yesterday and it is 100% >> done, but not complete: >> >> scrub: scrub in progress for 23h57m, 100.00% done, 0h0m to go >> > > Don't panic. If "zpool iostat" still shows active reads from all disks in > the pool, just step back and let it do its thing until it says the scrub is > complete. > > There's a bug open on this: > > 6899970 scrub/resilver percent complete reporting in zpool status can be > overly optimistic > > scrub/resilver progress reporting compares the number of blocks read so far > to the number of blocks currently allocated in the pool. > > If blocks that have already been visited are freed and new blocks are > allocated, the seen:allocated ratio is no longer an accurate estimate of how > much more work is needed to complete the scrub. > > Before the scrub prefetch code went in, I would routinely see scrubs last > 75 hours which had claimed to be "100.00% done" for over a day. I've routinely seen that happen with resilvers on builds 126/127 on raidz/raidz2. The resilver reports 100% done but stays in progress for as much as 50 hours at times. We just wait and let it do its work. The bugs database doesn't show if developers have added comments about that. Would you have access to check if resilvers were mentioned ? BTW, since this bug only exists in the bug database, does it mean it was filed by a Sun engineer or a customer ? What's the relationship between that and the defect database ? I'm still trying to understand the flow of information here, since both databases seem to be used exclusively for OpenSolaris but one is less open. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems
On Wed, Mar 17, 2010 at 11:23 AM, wrote: > > > >IMHO, what matters is that pretty much everything from the disk controller > >to the CPU and network interface is advertised in power-of-2 terms and > disks > >sit alone using power-of-10. And students are taught that computers work > >with bits and so everything is a power of 2. > > That is simply not true: > >Memory: power of 2(bytes) >Network: power of 10 (bits/s)) >Disk: power of 10 (bytes) >CPU Frequency: power of 10 (cycles/s) >SD/Flash/..: power of 10 (bytes) >Bus speed: power of 10 > > Main memory is the odd one out. > My bad on generalizing that information. Perhaps the software stack dealing with disks should be changed to use power-of-10. Unlikely too. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to reserve space for a file on a zfs filesystem
On Wed, Mar 17, 2010 at 6:43 AM, wensheng liu wrote: > Hi all, > > How to reserve a space on a zfs filesystem? For mkfiel or dd will write > data to the > block, it is time consuming. whiel "mkfile -n" will not really hold the > space. > And zfs's set reservation only work on filesytem, not on file? > > Could anyone provide a solution for this? > Do you mean you want files created with "mkfile -n" to count against the total filesystem usage ? Since they haven't allocated any blocks yet, ZFS would need to know about each sparse file and read its metadata before enforcing the filesystem reservation. I'm not sure that's doable. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
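To illustrate the difference (dataset and file names invented), a reservation on a dedicated dataset is probably the closest thing to a per-file guarantee:

  zfs create tank/images
  zfs set reservation=10G tank/images      # space is set aside at the dataset level
  mkfile -n 10g /tank/images/disk0.img     # sparse: size recorded, no blocks allocated yet

The sparse file costs nothing until it is written to; the reservation is what actually holds the space.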
Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems
On Wed, Mar 17, 2010 at 9:34 AM, David Dyer-Bennet wrote: > On 3/16/2010 23:21, Erik Trimble wrote: > >> On 3/16/2010 8:29 PM, David Dyer-Bennet wrote: >> >>> On 3/16/2010 17:45, Erik Trimble wrote: >>> David Dyer-Bennet wrote: > On Tue, March 16, 2010 14:59, Erik Trimble wrote: > > Has there been a consideration by anyone to do a class-action lawsuit >> for false advertising on this? I know they now have to include the >> "1GB >> = 1,000,000,000 bytes" thing in their specs and somewhere on the box, >> but just because I say "1 L = 0.9 metric liters" somewhere on the box, >> it shouldn't mean that I should be able to avertise in huge letters "2 >> L >> bottle of Coke" on the outside of the package... >> > > I think "giga" is formally defined as a prefix meaning 10^9; that is, > the > definition the disk manufacturers are using is the standard metric one > and > very probably the one most people expect. There are international > standards for these things. > > I'm well aware of the history of power-of-two block and disk sizes in > computers (the first computers I worked with pre-dated that period); > but I > think we need to recognize that this is our own weird local usage of > terminology, and that we can't expect the rest of the world to change > to > our way of doing things. > That's RetConn-ing. The only reason the stupid GiB / GB thing came around in the past couple of years is that the disk drive manufacturers pushed SI to do it. Up until 5 years ago (or so), GigaByte meant a power of 2 to EVERYONE, not just us techies. I would hardly call 40+ years of using the various giga/mega/kilo prefixes as a power of 2 in computer science as non-authoritative. In fact, I would argue that the HD manufacturers don't have a leg to stand on - it's not like they were "outside" the field and used to the "standard" SI notation of powers of 10. Nope. They're inside the industry, used the powers-of-2 for decades, then suddenly decided to "modify" that meaning, as it served their marketing purposes. >>> >>> The SI meaning was first proposed in the 1920s, so far as I can tell. >>> Our entire history of special usage took place while the SI definition was >>> in place. We simply mis-used it. There was at the time no prefix for what >>> we actually wanted (not giga then, but mega), so we borrowed and repurposed >>> mega. >>> >>> Doesn't matter whether the "original" meaning of K/M/G was a >> power-of-10. What matters is internal usage in the industry. And that has >> been consistent with powers-of-2 for 40+ years. There has been NO outside >> understanding that GB = 1 billion bytes until the Storage Industry decided >> it wanted it that way. That's pretty much the definition of distorted >> advertising. >> > > That's simply not true. The first computer I programmed, an IBM 1620, was > routinely referred to as having "20K" of core. That meant 20,000 decimal > digits; not 20,480. The other two memory configurations were similarly > "40K" for 40,000 and "60K" for 60,000. The first computer I was *paid* for > programming, the 1401, had "8K" of core, and that was 8,000 locations, not > 8,192. This was right on 40 years ago (fall of 1969 when I started working > on the 1401). Yes, neither was brand new, but IBM was still leasing them to > customers (it came in configurations of 4k, 8k, 12k, and I think 16k; been a > while!). At this point in history it doesn't matter much who's right or wrong anymore. 
IMHO, what matters is that pretty much everything from the disk controller to the CPU and network interface is advertised in power-of-2 terms and disks sit alone using power-of-10. And students are taught that computers work with bits and so everything is a power of 2. Just last week I had to remind people that a 24-disk JBOD with 1TB disks wouldn't provide 24TB of usable storage, since each disk shows up as roughly 931 "GB" (really GiB) in the OS. It *is* an anomaly and I don't expect it to be fixed. Perhaps some disk vendor could add extra capacity to its drives and advertise a "real 1TB disk" using power-of-2 and show how people are being misled by other vendors that use power-of-10. Highly unlikely, but it would sure earn some respect from the storage community. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
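For reference, the arithmetic behind the 931 figure is easy to check in any shell with integer arithmetic (bash here):

  echo $(( 10**12 / 2**30 ))    # a decimal 1 TB expressed in GiB -> 931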
Re: [zfs-discuss] persistent L2ARC
On Mon, Mar 15, 2010 at 5:39 PM, Abdullah Al-Dahlawi wrote: > Greeting ALL > > > I understand that L2ARC is still under enhancement. Does any one know if > ZFS can be upgrades to include "Persistent L2ARC", ie. L2ARC will not loose > its contents after system reboot ? > There is a bug open for that, but it doesn't seem to have been implemented yet. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6662467 -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hardware Failure Best Practices
On Mon, Mar 8, 2010 at 2:00 PM, Chris Dunbar wrote: > Hello, > > I just found this list and am very excited that you all are here! I have a > homemade ZFS server that serves as our poor man's Thumper (we named it > thumpthis) and provides primarily NFS shares for our VMware environment. As > is often the case, the server has developed a hardware problem mere days > before I am ready to go live with a new replacement server (thumpthat). At > first the problem appeared to be a bad drive, but now I am not so sure. I > would like to sanity check my thought process with this list and see if > anybody has some different ideas. Here is a quick timeline of the trouble: > > 1. I noticed the following when running a routine zpool status: > > > mirrorDEGRADED 0 0 0 >c3t2d0 ONLINE 0 0 0 >c3t3d0 REMOVED 0 368K 0 > > > 2. I determined which drive appeared to be offline by watching drive lights > and then rebooted the server. > > 3. Initially the drive appeared to be fine and ZFS picked it backup and > resilvered the mirror. About 30 minutes later I noticed that the same drive > was again marked REMOVED. > > 4. I shut the server down and replaced the drives with a new, larger disk. > > 5. I ran zpool replace tank c3t3d0 and it happily went to work on the > replacement drive. A few hours later the resilver was complete and all > seemed well. > > 6. The next day, about 12 hours after installing the new drive I found the > same error message (here's the whole pool): > > config: > >NAMESTATE READ WRITE CKSUM >tankDEGRADED 0 0 0 > mirrorONLINE 0 0 0 >c3t0d0 ONLINE 0 0 0 >c3t1d0 ONLINE 0 0 0 > mirrorDEGRADED 0 0 0 >c3t2d0 ONLINE 0 0 0 >c3t3d0 REMOVED 0 370K 0 > mirrorONLINE 0 0 0 >c4t0d0 ONLINE 0 0 0 >c4t1d0 ONLINE 0 0 0 > mirrorONLINE 0 0 0 >c4t2d0 ONLINE 0 0 0 >c4t3d0 ONLINE 0 0 0 > > errors: No known data errors > > This is where I am now. Either my new hard drive is bad (not impossible) or > I am looking at some other hardware failure, possibly the AOC-SAT2-MV8 > controller card. I have a spare controller card (same make and model > purchased at the same time we built the server) and plan to replace that > tonight. Does that seem like the correct course of action? Are there any > steps I can take beforehand to zero in on the problem? Any words of > encouragement or wisdom? > What does `iostat -En` say ? My suggestion is to replace the cable that's connecting the c3t3d0 disk. IMHO, the cable is much more likely to be faulty than a single port on the disk controller. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can you manually trigger spares?
On Mon, Mar 8, 2010 at 3:33 PM, Tim Cook wrote: > Is there a way to manually trigger a hot spare to kick in? Mine doesn't > appear to be doing so. What happened is I exported a pool to reinstall > solaris on this system. When I went to re-import it, one of the drives > refused to come back online. So, the pool imported degraded, but it doesn't > seem to want to use the hot spare... I've tried triggering a scrub to see if > that would give it a kick, but no-go. uts/common/fs/zfs/vdev.c says: /* * If we fail to open a vdev during an import, we mark it as * "not available", which signifies that it was never there to * begin with. Failure to open such a device is not considered * an error. */ If there is no error then the fault management code probably doesn't kick in and autoreplace isn't triggered. > > r...@fserv:~$ zpool status > pool: fserv > state: DEGRADED > status: One or more devices could not be opened. Sufficient replicas exist > for > the pool to continue functioning in a degraded state. > action: Attach the missing device and online it using 'zpool online'. >see: http://www.sun.com/msg/ZFS-8000-2Q > scrub: scrub completed after 3h19m with 0 errors on Mon Mar 8 02:28:08 > 2010 > config: > > NAME STATE READ WRITE CKSUM > fserv DEGRADED 0 0 0 > raidz2-0DEGRADED 0 0 0 > c2t0d0ONLINE 0 0 0 > c2t1d0ONLINE 0 0 0 > c2t2d0ONLINE 0 0 0 > c2t3d0ONLINE 0 0 0 > c2t4d0ONLINE 0 0 0 > c2t5d0ONLINE 0 0 0 > c3t0d0ONLINE 0 0 0 > c3t1d0ONLINE 0 0 0 > c3t2d0ONLINE 0 0 0 > c3t3d0ONLINE 0 0 0 > c3t4d0ONLINE 0 0 0 > 12589257915302950264 UNAVAIL 0 0 0 was > /dev/dsk/c7t5d0s0 > spares > c3t6d0 AVAIL > That crazy device name is guid (you can see that with eg. zdb -l /dev/rdsk/c3t1d0s0) I was able to replicate your situation here. # uname -a SunOS osol-dev 5.11 snv_133 i86pc i386 i86pc Solaris # zpool status tank pool: tank state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM tankONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c6t0d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 cache c6t2d0ONLINE 0 0 0 spares c6t3d0AVAIL errors: No known data errors # zpool export tank # zpool import tank # zpool status tank pool: tank state: DEGRADED status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Attach the missing device and online it using 'zpool online'. see: http://www.sun.com/msg/ZFS-8000-2Q scrub: none requested config: NAME STATE READ WRITE CKSUM tank DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 6462738093222634405 UNAVAIL 0 0 0 was /dev/dsk/c6t0d0s0 c6t1d0 ONLINE 0 0 0 cache c6t2d0 ONLINE 0 0 0 spares c6t3d0 AVAIL errors: No known data errors # zpool get autoreplace tank NAME PROPERTY VALUESOURCE tank autoreplace on local # fmdump -e -t 08Mar2010 TIME CLASS As you can see, no error report was posted. You can try to import the pool again and see if `fmdump -e` lists any errors afterwards. You use the spare with `zpool replace`. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
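Concretely -- using the names from your output, and assuming I remember the hot-spare semantics correctly -- something like:

  zpool replace fserv 12589257915302950264 c3t6d0

should attach the spare in place of the missing disk and start a resilver. Once the failed drive is dealt with, detaching the old GUID makes the spare permanent, while detaching c3t6d0 returns it to the spares list.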
Re: [zfs-discuss] why L2ARC device is used to store files ?
On Fri, Mar 5, 2010 at 7:41 AM, Abdullah Al-Dahlawi wrote: > Hi Geovanni > > I was monitering the ssd cache using zpool iostat -v like you said. the > cache device within the pool was showing a persistent write IOPS during the > ten (1GB) file creation phase by the benchmark. > > The benchmark even gave an insufficient space and terminated which proves > that it was writing on the ssd cache (my HDD is 50GB free space) > The L2ARC cache is not accessible to end user applications. It's only used for reads that miss the ARC and it's managed internally by ZFS. I can't comment on the specifics of how ZFS evicts objects from ARC to L2ARC but that should never give you insufficient space errors. Your data is not getting stored in the cache device. The writes you see on the SSD device are ZFS moving objects from ARC to L2ARC. It has to write data there otherwise there is nothing to read back from later when a read() misses the ARC cache and checks L2ARC. I don't know what your OLTP benchmark does but my advice is to check if it's really writing files in the 'hdd' zpool mount point. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] why L2ARC device is used to store files ?
On Fri, Mar 5, 2010 at 6:46 AM, Abdullah Al-Dahlawi wrote: > Greeting All > > I have create a pool that consists oh a hard disk and a ssd as a cache > > zpool create hdd c11t0d0p3 > zpool add hdd cache c8t0d0p0 - cache device > > I ran an OLTP bench mark to emulate a DMBS > > One I ran the benchmark, the pool started create the database file on the > ssd cache device ??? > > > can any one explain why this happening ? > > is not L2ARC is used to absorb the evicted data from ARC ? > > why it is used this way ??? > > Hello Abdullah, I don't think I understand. How are you seeing files being created on the SSD disk ? You can check device usage with `zpool iostat -v hdd`. Please also send the output of `zpool status hdd`. Thank you, -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Thu, Mar 4, 2010 at 7:28 PM, Ian Collins wrote: > Gary Mills wrote: > >> We have an IMAP e-mail server running on a Solaris 10 10/09 system. >> It uses six ZFS filesystems built on a single zpool with 14 daily >> snapshots. Every day at 11:56, a cron command destroys the oldest >> snapshots and creates new ones, both recursively. For about four >> minutes thereafter, the load average drops and I/O to the disk devices >> drops to almost zero. Then, the load average shoots up to about ten >> times normal and then declines to normal over about four minutes, as >> disk activity resumes. The statistics return to their normal state >> about ten minutes after the cron command runs. >> >> Is it destroying old snapshots or creating new ones that causes this >> dead time? What does each of these procedures do that could affect >> the system? What can I do to make this less visible to users? >> >> >> > I have a couple of Solaris 10 boxes that do something similar (hourly > snaps) and I've never seen any lag in creating and destroying snapshots. > One system with 16 filesystems takes 5 seconds to destroy the 16 oldest > snaps and create 5 recursive new ones. I logged load average on these boxes > and there is a small spike on the hour, but this is down to sending the > snaps, not creating them. > We've seen the behaviour that Gary describes while destroying datasets recursively (>600GB and with 7 snapshots). It seems that close to the end the server stalls for 10-15 minutes and NFS activity stops. For small datasets/snapshots that doesn't happen or is harder to notice. Does ZFS have to do something special when it's done releasing the data blocks at the end of the destroy operation ? -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
On Thu, Mar 4, 2010 at 4:40 PM, zfs ml wrote: > On 3/4/10 9:17 AM, Brent Jones wrote: > >> My rep says "Use dedupe at your own risk at this time". >> >> Guess they've been seeing a lot of issues, and regardless if its >> 'supported' or not, he said not to use it. >> > > So its not a feature, its a bug. They should release some official > statement if they are going to have the sales reps saying that. Either it > works or it doesn't and if it doesn't, then all parts of Oracle should be > saying the same thing, not just after they have your money (oh btw, that > dedup thing...). > > As discussed in a couple other threads, if Oracle wants to treat the > fishworks boxes like closed appliances, then it should "just work" and if it > doesn't then it should be treated like a toaster that doesn't work and they > should take it back. They seem to want to sell them with the benefits of > being closed for them - you shouldn't use the command line, etc but then act > like your unique workload/environment is somehow causing them to break when > they break. If they seal the box and put 5 knobs on the outside, don't blame > the customer when they turn all the knobs to 10 and the box doesn't work. > Take the box back, remove the knobs or fix the guts so all the knobs work as > advertised. It seems they kind of rushed the appliance into the market. We've a few 7410s and replication (with zfs send/receive) doesn't work after shares reach ~1TB (broken pipe error). It's frustrating and we can't do anything because every time we type "shell" in the CLI, it freaks us out with a message saying the warranty will be voided if we continue. I bet that we could work around that bug but we're not allowed and the workarounds provided by Sun haven't worked. Regarding dedup, Oracle is very courageous for including it in the 2010.Q1 release if this comes to be true. But I understand the pressure on then. Every other vendor out there is releasing products with deduplication. Personally, I would just wait 2-3 releases before using it in a black box like the 7000s. The hardware on the other hand is incredible in terms of resilience and performance, no doubt. Which makes me think the pretty interface becomes an annoyance sometimes. Let's wait for 2010.Q1 :) -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Huge difference in reporting disk usage via du and zfs list. Fragmentation?
On Thu, Mar 4, 2010 at 10:52 AM, Holger Isenberg wrote: > Do we have enormous fragmentation here on our X4500 with Solaris 10, ZFS > Version 10? > > What except zfs send/receive can be done to free the fragmented space? > > One ZFS was used for some month to store some large disk images (each > 50GByte large) which are copied there with rsync. This ZFS then reports > 6.39TByte usage with zfs list and only 2TByte usage with du. > > The other ZFS was used for similar sized disk images, this time copied via > NFS as whole files. On this ZFS du and zfs report exactly the same usage of > 3.7TByte. > Please check the ZFS FAQ: http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq There is a question regarding the difference between du, df and zfs list. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mirror Stripe
On Mon, Mar 1, 2010 at 1:16 PM, Tony MacDoodle wrote: > What is the following syntax? > > zpool create tank mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0 spare c1t6d0 > > Is this RAID 0+1 or 1+0? > That's RAID1+0. You are mirroring devices and then striping the mirrors together. AFAIK, RAID0+1 is not supported since a vdev can only be of type disk, mirror or raidz, and all top-level vdevs are striped together. Someone more experienced in ZFS can probably confirm/deny this. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
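As a small illustration (device names hypothetical), growing that RAID1+0 layout later is just a matter of adding another mirror vdev, which ZFS stripes with the existing ones:

  zpool add tank mirror c1t7d0 c1t8d0
  zpool status tank     # each mirror shows up as its own top-level vdev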
Re: [zfs-discuss] device mixed-up while tying to import.
On Sat, Feb 27, 2010 at 6:21 PM, Yariv Graf wrote: > > Hi, > It seems I can't import a single external HDD. > > pool: HD > id: 8012429942861870778 > state: UNAVAIL > status: One or more devices are missing from the system. > action: The pool cannot be imported. Attach the missing > devices and try again. >see: http://www.sun.com/msg/ZFS-8000-6X > config: > > HD UNAVAIL missing device > c16t0d0 ONLINE > You're probably missing the device that was used as a slog (separate log device) in this pool. Try to re-establish that device and import the pool again. Right now ZFS cannot import a pool with a missing log device, but that is being worked on, according to Eric Schrock on Feb 6th. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] copies=2 and dedup
On Sat, Feb 27, 2010 at 10:40 AM, Dick Hoogendijk wrote: > I want zfs on a single drive so I use copies=2 for -some- extra safety. But > I wonder if dedup=on could mean something in this case too? That way the > same blocks would never be written more than twice. Or would that harm the > reliability of the drive and should I just use copies=2? > ZFS will honor copies=2 and keep two physical copies, even with deduplication enabled. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Slowing down "zfs destroy"
Hello, While destroying a dataset, sometimes ZFS kind of hangs the machine. I imagine it's starving all I/O while deleting the blocks, right ? Here logbias=latency, the commit interval is the default (30 seconds) and we have SSDs for logs and cache. Is there a way to "slow down" the destroy a little bit in order to reserve I/O for NFS clients ? Degraded performance isn't as bad as total loss of availability in our case. I was thinking we could set logbias=throughput and decrease the commit interval to 10 seconds to keep it running more smoothly. Here's the pool configuration. Note the two slog devices; they were supposed to be a mirror but got added by mistake. NAME STATE READ WRITE CKSUM trunk ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t4d0 ONLINE 0 0 0 c7t5d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t6d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t8d0 ONLINE 0 0 0 c7t9d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t10d0 ONLINE 0 0 0 c7t11d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t12d0 ONLINE 0 0 0 c7t13d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t14d0 ONLINE 0 0 0 c7t15d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t16d0 ONLINE 0 0 0 c7t17d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t18d0 ONLINE 0 0 0 c7t19d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t20d0 ONLINE 0 0 0 c7t21d0 ONLINE 0 0 0 logs ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 cache c7t22d0 ONLINE 0 0 0 spares c7t3d0 AVAIL Any ideas? Thank you, -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
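To be specific about what I'm considering (untested, and tunable names may vary between builds):

  zfs set logbias=throughput trunk

plus lowering the txg commit interval in /etc/system and rebooting:

  set zfs:zfs_txg_timeout = 10

If anyone knows a better knob for throttling the destroy itself, I'm all ears.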
Re: [zfs-discuss] ZFS replace - many to one
On Thu, Feb 25, 2010 at 12:44 PM, Chad wrote: > I'm looking to migrate a pool from using multiple smaller LUNs to one > larger LUN. I don't see a way to do a zpool replace for multiple to one. > Anybody know how to do this? It needs to be non disruptive. > As others have noted, it doesn't seem possible. You could create a new zpool with this larger LUN and use zfs send/receive to migrate your data. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
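A rough sketch of that migration (pool and device names invented; the final cut-over still needs a brief quiesce, so it's not strictly non-disruptive):

  zpool create newpool c5t0d0          # the new, larger LUN
  zfs snapshot -r oldpool@migrate
  zfs send -R oldpool@migrate | zfs receive -F -d newpool

A last incremental send (zfs send -R -I) during a short maintenance window catches up whatever changed since the first pass.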
Re: [zfs-discuss] [indiana-discuss] future of OpenSolaris
On Thu, Feb 25, 2010 at 9:47 AM, Jacob Ritorto wrote: > It's a kind gesture to say it'll continue to exist and all, but > without commercial support from the manufacturer, it's relegated to > hobbyist curiosity status for us. If I even mentioned using an > unsupported operating system to the higherups here, it'd be considered > absurd. I like free stuff to fool around with in my copious spare > time as much as the next guy, don't get me wrong, but that's not the > issue. For my company, no support contract equals 'Death of > OpenSolaris.' > OpenSolaris is not dying just because there is no support contract available for it, yet. Last time I looked Red Hat didn't offer support contracts for Fedora and that project is doing quite well. So please be a little more realistic and say "For my company, no support contracts for OpenSolaris means that we will not use it in our mission-critical servers". That's much more reasonable than saying the whole project is jeopardized. It's useless to try to decide your strategy right now when things are changing. Wait for some official word from Oracle and then decide what your company is going to do. You can always install Solaris if that makes sense for you. -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problems with sudden zfs capacity loss on snv_79a
On Thu, Feb 18, 2010 at 1:19 AM, Julius Roberts wrote: > Yes snv_79a is old, yes we're working separately on migrating to > snv_111b or later. But i need to solve this problem ASAP to buy me > some more time for that implementation. > > We pull data from a variety of sources onto our zpool called Backups, > then we snapshot them. We keep around 20 or so and then delete them > automatically. We've been doing this for around two years on this > system and it's been absolutely fantastic. Free-space hovers around > 300G. But suddenly something has changed: > > r...@darling(/)$:zfs list | head -1 && zfs list | tail -7 > NAME >USED AVAIL REFER MOUNTPOINT > Backups/natoffice/ons...@20091231_2347_triggeredby_20091231_2330 > 30.1G - 287G - > Backups/natoffice/ons...@20100131_2349_triggeredby_20100131_2330 > 17.7G - 287G - > Backups/natoffice/ons...@20100205_0001_triggeredby_20100204_2330 > 15.9G - 287G - > Backups/natoffice/ons...@20100212_0424_triggeredby_20100211_2330 > 152G - 285G - > Backups/natoffice/ons...@20100216_0430_triggeredby_20100215_2330 > 154G - 287G - > Backups/natoffice/ons...@20100217_0431_triggeredby_20100216_2330 > 154G - 287G - > Backups/natoffice/ons...@20100218_0423_triggeredby_20100217_2330 > 0 - 287G - > > Normally a snapshot shows USED around 15G to 30G. But suddenly, > snapshots of the same filesystem are showing USED ~150G. There are no > corresponding increases in any of the machines we copy data from, nor > has any of that data changed significantly. You can see that the > REFER hasn't changed much at all, this is normal. So we're backing up > the same amount of data, but it now occupies so much more on disk. > That of course means we can't keep nearly as many snapshots, and that > makes us all very nervous. > > Any ideas? > Is it possible that your users are now deleting everything before starting to write the backup data ? -- Giovanni Tirloni sysdroid.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
On Tue, Feb 9, 2010 at 2:04 AM, Thomas Burgess wrote: > > On Mon, Feb 08, 2010 at 09:33:12PM -0500, Thomas Burgess wrote: >> > This is a far cry from an apples to apples comparison though. >> >> As much as I'm no fan of Apple, it's a pity they dropped ZFS because >> that would have brought considerable attention to the opportunity of >> marketing and offering zfs-suitable hardware to the consumer arena. >> Port-multiplier boxes already seem to be targetted most at the Apple >> crowd, even it's only in hope of scoring a better margin. >> >> Otherwise, bad analogies, whether about cars or fruit, don't help. >> >> > It might help people to understand how ridiculous they sound going on and > on about buying a premium storage appliance without any storage. I think > the car analogy was dead on. You don't have to agree with a vendors > practices to understand them. If you have a more fitting analogy, then by > all means lets hear it. > Dell joins the party: http://lists.us.dell.com/pipermail/linux-poweredge/2010-February/041335.html -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
On Tue, Feb 2, 2010 at 9:07 PM, Marc Nicholas wrote: > I believe magical unicorn controllers and drives are both bug-free and > 100% spec compliant. The leprichorns sell them if you're trying to > find them ;) > Well, "perfect" and "bug free" sure don't exist in our industry. The problem is that we see disk firmwares that are stupidly flawed and the revisions that get released aren't making it better. Otherwise people looking for quality would not have to spend extra on third-party reviewed drives from storage vendors. It's all too convenient how the industry is organized. That is, for disk and storage vendors. Not customers. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
On Tue, Feb 2, 2010 at 1:58 PM, Tim Cook wrote: > > It's called spreading the costs around. Would you really rather pay 10x > the price on everything else besides the drives? This is essentially Sun's > way of tiered pricing. Rather than charge you a software fee based on how > much storage you have, they increase the price of the drives. Seems fairly > reasonable to me... it gives a low point of entry for people that don't need > that much storage without using ridiculous capacity based licensing on > software. > Smells like the Razor and Blades business model [1]. I think the industry is in a sad state when you buy enterprise-level drives and they don't work as expected (see that thread about TLER settings on WD enterprise drives) that you have to spend extra on drives that got reviewed by a third-party (Sun/EMC/etc). Just shows how bad the disk vendors are. I would be curious to know how the internal process of testing these drives work at Sun/EMC/etc when they find bugs and performance problems. Do they have access to the firmware's source code to fix it ? Or do they report the bugs back to Seagate/WD and they provide a new firmware for tests ? Do those bugs get fixed in other drives that Seagate/WD sells ? For me it's just hard to objectively point out the differences between Seagate's enterprise drives and the ones provided by Sun except that they were tested more. 1 - http://en.wikipedia.org/wiki/Freebie_marketing -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Mon, Jan 4, 2010 at 3:51 PM, Joerg Schilling wrote: > Giovanni Tirloni wrote: > >> We use Seagate Barracuda ES.2 1TB disks and every time the OS starts >> to bang on a region of the disk with bad blocks (which essentially >> degrades the performance of the whole pool) we get a call from our >> clients complaining about NFS timeouts. They usually last for 5 >> minutes but I've seen it last for a whole hour while the drive is >> slowly dying. Off-lining the faulty disk fixes it. >> >> I'm trying to find out how the disks' firmware is programmed >> (timeouts, retries, etc) but so far nothing in the official docs. In >> this case the disk's retry timeout seem way too high for our needs and >> I believe a timeout limit imposed by the OS would help. > > Did you upgrade the firmware last spring? > > There is a known bug in the firmware that may let them go into alzheimer mode. No, as their "serial number check utility" was not returning any upgrades for the disks I checked... but now I see in the forums that they released some new versions. Thanks for the heads up. I'll give it a try and hopefully we can see some improvement here. -- Giovanni P. Tirloni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Sat, Jan 2, 2010 at 4:07 PM, R.G. Keen wrote: > OK. From the above suppositions, if we had a desktop (infinitely > long retry on fail) disk and a soft-fail error in a sector, then the > disk would effectively hang each time the sector was accessed. > This would lead to > (1) ZFS->SD-> disk read of failing sector > (2) disk does not reply within 60 seconds (default) > (3) disk is reset by SD > (4) operation is retried by SD(?) > (5) disk does not reply within 60 seconds (default) > (6) disk is reset by SD ? > > then what? If I'm reading you correctly, the following string of > events happens: > >> The drivers will retry and fail the I/O. By default, for SATA >> disks using the sd driver, there are 5 retries of 60 seconds. >> After 5 minutes, the I/O will be declared failed and that info >> is passed back up the stack to ZFS, which will start its >> recovery. This is why the T part of N in T doesn't work so >> well for the TLER case. > > Hmmm... actually, it may be just fine for my personal wants. > If I had a desktop drive which went unresponsive for 60 seconds > on an I/O soft error, then the timeout would be five minutes. > at that time, zfs would... check me here... mark the block as > failed, and try to relocate the block on the disk. If that worked > fine, the previous sectors would be marked as unusable, and > work goes on, but with the actions noted in the logs. We use Seagate Barracuda ES.2 1TB disks and every time the OS starts to bang on a region of the disk with bad blocks (which essentially degrades the performance of the whole pool) we get a call from our clients complaining about NFS timeouts. They usually last for 5 minutes but I've seen it last for a whole hour while the drive is slowly dying. Off-lining the faulty disk fixes it. I'm trying to find out how the disks' firmware is programmed (timeouts, retries, etc) but so far nothing in the official docs. In this case the disk's retry timeout seem way too high for our needs and I believe a timeout limit imposed by the OS would help. -- Giovanni P. Tirloni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
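For what it's worth, the knob I've been looking at (sd driver on Solaris 10/OpenSolaris; value untested, adjust with care) goes in /etc/system:

  * shorten the per-command timeout from the default 60 seconds
  set sd:sd_io_time = 30

That should shrink the 5-minute window (5 retries x 60 seconds) at the cost of giving marginal disks less time to recover; the retry count itself seems harder to change globally, as far as I can tell.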