Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Bill Sommerfeld wrote:
> On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
>> A better option would be to not use this to perform FMA diagnosis, but
>> instead work into the mirror child selection code. This has already
>> been alluded to before, but it would be cool to keep track of latency
>> over time, and use this to both a) prefer one drive over another when
>> selecting the child and b) proactively timeout/ignore results from one
>> child and select the other if it's taking longer than some historical
>> standard deviation. This keeps away from diagnosing drives as faulty,
>> but does allow ZFS to make better choices and maintain response times.
>> It shouldn't be hard to keep track of the average and/or standard
>> deviation and use it for selection; proactively timing out the slow I/Os
>> is much trickier.
>
> tcp has to solve essentially the same problem: decide when a response is
> "overdue" based only on the timing of recent successful exchanges in a
> context where it's difficult to make assumptions about "reasonable"
> expected behavior of the underlying network.
>
> it tracks both the smoothed round trip time and the variance, and
> declares a response overdue after (SRTT + K * variance).
>
> I think you'd probably do well to start with something similar to what's
> described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on
> experience.

I think this is a good place to start. In general, we can see 3 orders of
magnitude range for magnetic disk I/Os, and 4 orders of magnitude for power
managed disks. With that range, I don't see the variance being small, at
least for magnetic disks. SSDs will have a much smaller variance, in general.

For lopsided mirrors, such as a magnetic disk mirrored to an SSD, or Bob's
Dallas vs New York paths, we should be able to automatically steer towards
the faster side. However, a comprehensive solution must also deal with
top-level vdev usage, which can be very different from the physical vdevs.
We can use driver-level FMA for the physical vdevs, but ultimately ZFS will
need to be able to make decisions based on the response time across the
top-level vdevs. This can be implemented in two phases, of course.

I've got some lopsided mirror TNF data, so we could fairly easily try some
algorithms... I'll whip it into shape for further analysis.
 -- richard
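As a concrete starting point for trying such algorithms against the TNF data,
here is a minimal userland sketch (plain C; the names, constants and structure
are illustrative only, not actual ZFS code) that keeps a smoothed latency
estimate per mirror child and steers the next read to the currently faster side:

/*
 * Illustrative sketch only -- not actual ZFS code.  Tracks a smoothed
 * latency estimate per mirror child and steers reads to the child with
 * the lowest estimate, which naturally favors the fast side of a
 * lopsided (SSD vs. magnetic, local vs. remote) mirror.
 */
#include <stdio.h>

#define	NCHILDREN	2

typedef struct child_stats {
	double	cs_srt;		/* smoothed response time (ms) */
	int	cs_primed;	/* have we seen an I/O yet? */
} child_stats_t;

static child_stats_t stats[NCHILDREN];

/* Fold a completed read's latency into a weighted moving average. */
static void
record_latency(int child, double latency_ms)
{
	child_stats_t *cs = &stats[child];

	if (!cs->cs_primed) {
		cs->cs_srt = latency_ms;
		cs->cs_primed = 1;
	} else {
		cs->cs_srt = 0.875 * cs->cs_srt + 0.125 * latency_ms;
	}
}

/* Pick the child with the lowest smoothed latency for the next read. */
static int
select_child(void)
{
	int best = 0;

	for (int c = 1; c < NCHILDREN; c++) {
		if (stats[c].cs_primed && (!stats[best].cs_primed ||
		    stats[c].cs_srt < stats[best].cs_srt))
			best = c;
	}
	return (best);
}

int
main(void)
{
	/* Hypothetical latencies: child 0 is an SSD, child 1 a spun-down disk. */
	double ssd[] = { 0.2, 0.3, 0.25 }, disk[] = { 8.0, 2000.0, 12.0 };

	for (int i = 0; i < 3; i++) {
		record_latency(0, ssd[i]);
		record_latency(1, disk[i]);
	}
	printf("next read goes to child %d\n", select_child());
	return (0);
}

In real ZFS this decision would presumably live in the mirror child-selection
path (vdev_mirror.c), and the estimate would also need to decay so that a
device which recovers can earn its share of reads back.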
Re: [zfs-discuss] ARCSTAT Kstat Definitions
G'Day Ben, ARC visibility is important; did you see Neel's arcstat?: http://www.solarisinternals.com/wiki/index.php/Arcstat Try -x for various sizes, and -v for definitions. On Thu, Aug 21, 2008 at 10:23:24AM -0700, Ben Rockwood wrote: > Its a starting point anyway. The key is to try and draw useful conclusions > from the info to answer the torrent of "why is my ARC 30GB???" > > There are several things I'm unclear on whether or not I'm properly > interpreting such as: > > * As you state, the anon pages. Even the comment in code is, to me anyway, a > little vague. I include them because otherwise you look at the hit counters > and wonder where a large chunk of them went. Yes, anon hits doesn't make sense - they are dirty pages and won't have a DVA, and so won't be findable by other threads in arc_read(). I can see why arc_summary.pl thinks they exist - accounting for the discrepancy between arcstats:hits and the sum of the hits from the four ARC lists. Ghost list hits aren't part of arcstats:hits - arcstats:hits are real hits, the ghost hits are an artifact of the ARC algorithm. If you do want to break down arcstats:hits into it's components, use: zfs:0:arcstats:demand_data_hits zfs:0:arcstats:demand_metadata_hits zfs:0:arcstats:prefetch_data_hits zfs:0:arcstats:prefetch_metadata_hits And for a different perspective on the demand hits: zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits Also, arc_summary.pl's reported MRU and MFU sizes aren't actual, these are target sizes. The ARC will try to steer itself towards them, but in at least one case (where the ARC has yet to fill) they can be very different from actual (until arc_adjust() is called to whip them back to size.) > * Prefetch... I want to use the Prefetch Data hit ratio as a judgment call on > the efficiency of prefetch. If the value is very low it might be best to > turn it off. but I'd like to hear that from someone else before I go > saying that. Sounds good to me. > In high latency environments, such as ZFS on iSCSI, prefetch can either > significantly help or hurt, determining which is difficult without some type > of metric as as above. > > * There are several instances (based on dtracing) in which the ARC is > bypassed... for ZIL I understand, in some other cases I need to spend more > time analyzing the DMU (dbuf_*) for why. > > * In answering the "Is having a 30GB ARC good?" question, I want to say that > if MFU is >60% of ARC, and if the hits are mostly MFU that you are deriving > significant benefit from your large ARC but on a system with a 2GB ARC or > a 30GB ARC the overall hit ratio tends to be 99%. Which is nuts, and tends > to reinforce a misinterpretation of anon hits. I wouldn't read *too* much into MRU vs MFU hits. MFU means 2 hits, MRU means 1. > The only way I'm seeing to _really_ understand ARC's efficiency is to look at > the overall number of reads and then how many are intercepted by ARC and how > many actually made it to disk... and why (prefetch or demand). This is > tricky to implement via kstats because you have to pick out and monitor the > zpool disks themselves. This would usually have more to do with the workload than the ARC's efficiency. > I've spent a lot of time in this code (arc.c) and still have a lot of > questions. I really wish there was an "Advanced ZFS Internals" talk coming > up; I simply can't keep spending so much time on this. Maybe you could try forgetting about the kstats for a moment and draw a fantasy arc_summary.pl output. 
Then we can look at adding kstats to make writing that script possible/easy (Mark and I could add the kstats, and Neel could provide the script, for example). Of course, if we do add more kstats, it's not going to help on older rev kernels out there... cheers, Brendan -- Brendan [CA, USA] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
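For anyone following along, the counters Brendan lists above can be pulled
directly with kstat(1M) on a live system, e.g.:

# kstat -p zfs:0:arcstats:hits
# kstat -p zfs:0:arcstats:demand_data_hits
# kstat -p zfs:0:arcstats:demand_metadata_hits
# kstat -p zfs:0:arcstats:prefetch_data_hits
# kstat -p zfs:0:arcstats:prefetch_metadata_hits
# kstat -p zfs:0:arcstats:mru_hits
# kstat -p zfs:0:arcstats:mfu_hits

Per the breakdown above, the four demand/prefetch counters should account for
arcstats:hits (modulo the anon-hit accounting discrepancy Brendan mentions).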
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote: > A better option would be to not use this to perform FMA diagnosis, but > instead work into the mirror child selection code. This has already > been alluded to before, but it would be cool to keep track of latency > over time, and use this to both a) prefer one drive over another when > selecting the child and b) proactively timeout/ignore results from one > child and select the other if it's taking longer than some historical > standard deviation. This keeps away from diagnosing drives as faulty, > but does allow ZFS to make better choices and maintain response times. > It shouldn't be hard to keep track of the average and/or standard > deviation and use it for selection; proactively timing out the slow I/Os > is much trickier. tcp has to solve essentially the same problem: decide when a response is "overdue" based only on the timing of recent successful exchanges in a context where it's difficult to make assumptions about "reasonable" expected behavior of the underlying network. it tracks both the smoothed round trip time and the variance, and declares a response overdue after (SRTT + K * variance). I think you'd probably do well to start with something similar to what's described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on experience. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
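A rough userland sketch of that estimator, using RFC 2988's constants but
applied to per-device I/O latency (the names are illustrative and this is not
taken from any existing driver):

/*
 * SRTT + K * RTTVAR as an "overdue" threshold that adapts to observed
 * behavior, per RFC 2988.  Illustrative sketch only.
 */
#include <math.h>
#include <stdio.h>

#define	ALPHA	0.125	/* gain for the smoothed mean */
#define	BETA	0.25	/* gain for the mean deviation */
#define	K	4.0	/* deviations before an I/O is "overdue" */

typedef struct lat_est {
	double	srtt;	/* smoothed latency */
	double	rttvar;	/* smoothed mean deviation */
	int	primed;
} lat_est_t;

/* Fold one observed I/O latency into the estimator. */
static void
lat_update(lat_est_t *le, double sample)
{
	if (!le->primed) {
		le->srtt = sample;
		le->rttvar = sample / 2.0;
		le->primed = 1;
		return;
	}
	le->rttvar = (1.0 - BETA) * le->rttvar + BETA * fabs(le->srtt - sample);
	le->srtt = (1.0 - ALPHA) * le->srtt + ALPHA * sample;
}

/* An outstanding I/O older than this is a candidate for re-issue elsewhere. */
static double
lat_overdue(const lat_est_t *le)
{
	return (le->srtt + K * le->rttvar);
}

int
main(void)
{
	lat_est_t le = { 0 };
	double samples[] = { 5.0, 6.0, 4.5, 7.0, 5.5 };	/* milliseconds */

	for (int i = 0; i < 5; i++)
		lat_update(&le, samples[i]);
	printf("overdue threshold: %.1f ms\n", lat_overdue(&le));
	return (0);
}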
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote:
> None of the decisions I described its making based on performance
> statistics are ``haywire''---I said it should funnel reads to the
> faster side of the mirror, and do this really quickly and
> unconservatively. What's your issue with that?

From what I understand, this is partially happening now based on average
service time. If I/O is backed up for a device, then the other device is
preferred. However, it is good to keep in mind that if data is never read,
then it is never validated and corrected. It is good for ZFS to read data
sometimes.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
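(A periodic scrub is the usual way to make sure even rarely-read data gets
read and validated; for example, with the pool name "tank" as a placeholder:

# zpool scrub tank
0 3 * * 0 /usr/sbin/zpool scrub tank    <- root crontab entry: weekly, Sundays at 03:00

The scrub walks every allocated block, verifies checksums, and repairs from
redundancy where needed.)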
Re: [zfs-discuss] liveupgrade ufs root -> zfs ?
Hi

I'm not sure that the ZFS pool meets this requirement. I have

# lufslist SXCE_94
Filesystem          fstype  device size  Mounted on    Mount Options
------------------  ------  -----------  ------------  -------------
/dev/dsk/c1t2d0s1   swap    2147880960   -             -
/dev/dsk/c1t2d0s0   ufs     8590202880   /             -
/dev/dsk/c1t2d0s7   ufs     5747496960   /export/home  -

# lufslist SXCE_95
Filesystem          fstype  device size  Mounted on    Mount Options
------------------  ------  -----------  ------------  -------------
/dev/dsk/c1t2d0s1   swap    2147880960   -             -
/dev/dsk/c1t2d0s4   ufs     8590202880   /             -
/dev/dsk/c1t2d0s7   ufs     5747496960   /export/home  -

Is it possible to delete SXCE_94, do a zpool create with /dev/dsk/c1t2d0s0,
and then do a liveupgrade? I have the impression that it's possible, but that
there are some extra steps needed (to specify the ZFS mount point?).

A+
Paul
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote: > > Personally, if a SATA disk wasn't responding to any requests after 2 > seconds I really don't care if an error has been detected, as far as > I'm concerned that disk is faulty. Unless you have power management enabled, or there's a bad region of the disk, or the bus was reset, or... > I do have a question though. From what you're saying, the response > time can't be consistent across all hardware, so you're once again at > the mercy of the storage drivers. Do you know how long does > B_FAILFAST takes to return a response on iSCSI? If that's over 1-2 > seconds I would still consider that too slow I'm afraid. It's main function is how it deals with retryable errors. If the drive responds with a retryable error, or any error at all, it won't attempt to retry again. If you have a device that is taking arbitrarily long to respond to successful commands (or to notice that a command won't succeed), it won't help you. > I understand that Sun in general don't want to add fault management to > ZFS, but I don't see how this particular timeout does anything other > than help ZFS when it's dealing with such a diverse range of media. I > agree that ZFS can't know itself what should be a valid timeout, but > that's exactly why this needs to be an optional administrator set > parameter. The administrator of a storage array who wants to set this > certainly knows what a valid timeout is for them, and these timeouts > are likely to be several orders of magnitude larger than the standard > response times. I would configure very different values for my SATA > drives as for my iSCSI connections, but in each case I would be > happier knowing that ZFS has more of a chance of catching bad drivers > or unexpected scenarios. The main problem with exposing tunables like this is that they have a direct correlation to service actions, and mis-diagnosing failures costs everybody (admin, companies, Sun, etc) lots of time and money. Once you expose such a tunable, it will be impossible to trust any FMA diagnosis, because you won't be able to know whether it was a mistaken tunable. A better option would be to not use this to perform FMA diagnosis, but instead work into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to both a) prefer one drive over another when selecting the child and b) proactively timeout/ignore results from one child and select the other if it's taking longer than some historical standard deviation. This keeps away from diagnosing drives as faulty, but does allow ZFS to make better choices and maintain response times. It shouldn't be hard to keep track of the average and/or standard deviation and use it for selection; proactively timing out the slow I/Os is much trickier. As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable, but any such "best effort RAS" is a little dicey because you have very little visibility into the state of the pool in this scenario - "is my data protected?" becomes a very difficult question to answer. - Eric -- Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk
Many mid-range/high-end RAID controllers work by having a small timeout on individual disk I/O operations. If the disk doesn't respond quickly, they'll issue an I/O to the redundant disk(s) to get the data back to the host in a reasonable time. Often they'll change parameters on the disk to limit how long the disk retries before returning an error for a bad sector (this is standardized for SCSI, I don't recall offhand whether any of this is standardized for ATA). RAID 3 units, e.g. DataDirect, issue I/O to all disks simultaneously and when enough (N-1 or N-2) disks return data, they'll return the data to the host. At least they do that for full stripes. But this strategy works better for sequential I/O, not so good for random I/O, since you're using up extra bandwidth. Host-based RAID/mirroring almost never takes this strategy for two reasons. First, the bottleneck is almost always the channel from disk to host, and you don't want to clog it. [Yes, I know there's more bandwidth there than the sum of the disks, but consider latency.] Second, to read from two disks on a mirror, you'd need two memory buffers. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes: bf> If the system or device is simply overwelmed with work, then bf> you would not want the system to go haywire and make the bf> problems much worse. None of the decisions I described its making based on performance statistics are ``haywire''---I said it should funnel reads to the faster side of the mirror, and do this really quickly and unconservatively. What's your issue with that? bf> You are saying that I can't split my mirrors between a local bf> disk in Dallas and a remote disk in New York accessed via bf> iSCSI? nope, you've misread. I'm saying reads should go to the local disk only, and writes should go to both. See SVM's 'metaparam -r'. I suggested that unlike the SVM feature it should be automatic, because by so being it becomes useful as an availability tool rather than just performance optimisation. The performance-statistic logic should influence read scheduling immediately, and generate events which are fed to FMA, then FMA can mark devices faulty. There's no need for both to make the same decision at the same time. If the events aren't useful for diagnosis, ZFS could not bother generating them, or fmd could ignore them in its diagnosis. I suspect they *would* be useful, though. I'm imagining the read rescheduling would happen very quickly, quicker than one would want a round-trip from FMA, in much less than a second. That's why it would have to compare devices to others in the same vdev, and to themselves over time, rather than use fixed timeouts or punt to haphazard driver and firmware logic. bf>o System waits substantial time for devices to (possibly) bf> recover in order to ensure that subsequently written data has bf> the least chance of being lost. There's no need for the filesystem to *wait* for data to be written, unless you are calling fsync. and maybe not even then if there's a slog. I said clearly that you read only one half of the mirror, but write to both. But you're right that the trick probably won't work perfectly---eventually dead devices need to be faulted. The idea is that normal write caching will buy you orders of magnitude longer time in which to make a better decision before anyone notices. Experience here is that ``waits substantial time'' usually means ``freezes for hours and gets rebooted''. There's no need to be abstract: we know what happens when a drive starts taking 1000x - 2000x longer than usual to respond to commands, and we know that this is THE common online failure mode for drives. That's what started the thread. so, think about this: hanging for an hour trying to write to a broken device may block other writes to devices which are still working, until the patiently-waiting data is eventually lost in the reboot. bf>o System immediately ignores slow devices and switches to bf> non-redundant non-fail-safe non-fault-tolerant bf> may-lose-your-data mode. When system is under intense load, bf> it automatically switches to the may-lose-your-data mode. nobody's proposing a system which silently rocks back and forth between faulted and online. That's not what we have now, and no such system would naturally arise. If FMA marked a drive faulty based on performance statistics, that drive would get retired permanently and hot-spare-replaced. Obviously false positives are bad, just as obviously as freezes/reboots are bad. It's not my idea to use FMA in this way. This is how FMA was pitched, and the excuse for leaving good exception handling out of ZFS for two years. so, where's the beef? 
Re: [zfs-discuss] ARCSTAT Kstat Definitions
On Thu, Aug 21, 2008 at 8:47 PM, Ben Rockwood <[EMAIL PROTECTED]> wrote: > New version is available (v0.2) : > > * Fixes divide by zero, > * includes tuning from /etc/system in output > * if prefetch is disabled I explicitly say so. > * Accounts for jacked anon count. Still need improvement here. > * Added friendly explanations for MRU/MFU & Ghost lists counts. > > Page and examples are updated: cuddletech.com/arc_summary.pl > > Still needs work, but hopefully interest in this will stimulate some improved > understanding of ARC internals. For a bit of light relief (in other words, with pretty graphs) I've hacked up a graphical java version of Ben's script as part of jkstat (updated to 0.24): http://www.petertribble.co.uk/Solaris/jkstat.html Now, this is pretty rough, and chews up a modest amount of CPU, and I'm not sure of the interpretation, but I've basically taken Ben's code and lifted it more or less as is. -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> "es" == Eric Schrock <[EMAIL PROTECTED]> writes: es> I don't think you understand how this works. Imagine two es> I/Os, just with different sd timeouts and retry logic - that's es> B_FAILFAST. It's quite simple, and independent of any es> hardware implementation. AIUI the main timeout to which we should be subject, at least for nearline drives, is about 30 seconds long and is decided by the drive's firmware, not the driver, and can't be negotiated in any way that's independent of the hardware implementation, although sometimes there are dependent ways to negotiate it. The driver could also decide through ``retry logic'' to time out the command sooner, before the drive completes it, but this won't do much good because the drive won't accept a second command until ITS timeout expires. which leads to the second problem, that we're talking about timeouts for individual I/O's, not marking whole devices. A ``fast'' timeout of even 1 second could cause a 100- or 1000-fold decrease in performance, which could end up being equivalent to a freeze depending on the type of load on the filesystem. pgphjTr74byaZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi guys, Bob, my thought was to have this timeout as something that can be optionally set by the administrator on a per pool basis. I'll admit I was mainly thinking about reads and hadn't considered the write scenario, but even having thought about that it's still a feature I'd like. After all, this would be a timeout set by the administrator based on the longest delay they can afford for that storage pool. Personally, if a SATA disk wasn't responding to any requests after 2 seconds I really don't care if an error has been detected, as far as I'm concerned that disk is faulty. I'd be quite happy for the array to drop to a degraded mode based on that and for writes to carry on with the rest of the array. Eric, thanks for the extra details, they're very much appreciated. It's good to hear you're working on this, and I love the idea of doing a B_FAILFAST read on both halves of the mirror. I do have a question though. From what you're saying, the response time can't be consistent across all hardware, so you're once again at the mercy of the storage drivers. Do you know how long does B_FAILFAST takes to return a response on iSCSI? If that's over 1-2 seconds I would still consider that too slow I'm afraid. I understand that Sun in general don't want to add fault management to ZFS, but I don't see how this particular timeout does anything other than help ZFS when it's dealing with such a diverse range of media. I agree that ZFS can't know itself what should be a valid timeout, but that's exactly why this needs to be an optional administrator set parameter. The administrator of a storage array who wants to set this certainly knows what a valid timeout is for them, and these timeouts are likely to be several orders of magnitude larger than the standard response times. I would configure very different values for my SATA drives as for my iSCSI connections, but in each case I would be happier knowing that ZFS has more of a chance of catching bad drivers or unexpected scenarios. I very much doubt hardware raid controllers would wait 3 minutes for a drive to return a response, they will have their own internal timeouts to know when a drive has failed, and while ZFS is dealing with very different hardware I can't help but feel it should have that same approach to management of its drives. However, that said, I'll be more than willing to test the new B_FAILFAST logic on iSCSI once it's released. Just let me know when it's out. Ross > Date: Thu, 28 Aug 2008 11:29:21 -0500 > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > CC: zfs-discuss@opensolaris.org > Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / > driver failure better > > On Thu, 28 Aug 2008, Ross wrote: > > > > I believe ZFS should apply the same tough standards to pool > > availability as it does to data integrity. A bad checksum makes ZFS > > read the data from elsewhere, why shouldn't a timeout do the same > > thing? > > A problem is that for some devices, a five minute timeout is ok. For > others, there must be a problem if the device does not respond in a > second or two. > > If the system or device is simply overwelmed with work, then you would > not want the system to go haywire and make the problems much worse. > > Which of these do you prefer? > >o System waits substantial time for devices to (possibly) recover in > order to ensure that subsequently written data has the least > chance of being lost. 
> >o System immediately ignores slow devices and switches to > non-redundant non-fail-safe non-fault-tolerant may-lose-your-data > mode. When system is under intense load, it automatically > switches to the may-lose-your-data mode. > > Bob > == > Bob Friesenhahn > [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ > _ Get Hotmail on your mobile from Vodafone http://clk.atdmt.com/UKM/go/107571435/direct/01/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote: > > you're right in terms of fixed timeouts, but there's no reason it > can't compare the performance of redundant data sources, and if one > vdev performs an order of magnitude slower than another set of vdevs > with sufficient redundancy, stop issuing reads except scrubs/healing > to the underperformer (issue writes only), and pass an event to FMA. You are saying that I can't split my mirrors between a local disk in Dallas and a remote disk in New York accessed via iSCSI? Why don't you want me to be able to do that? ZFS already backs off from writing to slow vdevs. > ZFS can also compare the performance of a drive to itself over time, > and if the performance suddenly decreases, do the same. While this may be useful for reads, I would hate to disable redundancy just because a device is currently slow. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote: > > you're right in terms of fixed timeouts, but there's no reason it > can't compare the performance of redundant data sources, and if one > vdev performs an order of magnitude slower than another set of vdevs > with sufficient redundancy, stop issuing reads except scrubs/healing > to the underperformer (issue writes only), and pass an event to FMA. Yep, latency would be a useful metric to add to mirroring choices. The current logic is rather naive (round-robin) and could easily be enhanced. Making diagnoses based on this is much trickier, particularly at the ZFS level. A better option would be to leverage the SCSI FMA work going on to do a more intimate diagnosis at the scsa level. Also, the problem you are trying to solve - timing out the first I/O to take a long time - is not captured well by the type of hysteresis you would need to perform in order to do this diagnosis. It certainly can be done, but is much better suited to diagnosising a failing drive over time, not aborting a transaction in response to immediate failure. > This B_FAILFAST architecture captures the situation really poorly. I don't think you understand how this works. Imagine two I/Os, just with different sd timeouts and retry logic - that's B_FAILFAST. It's quite simple, and independent of any hardware implementation. - Eric -- Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> "jl" == Jonathan Loran <[EMAIL PROTECTED]> writes: jl> Fe = 46% failures/month * 12 months = 5.52 failures the original statistic wasn't of this kind. It was ``likelihood a single drive will experience one or more failures within 12 months''. so, you could say, ``If I have a thousand drives, about 4.66 of those drives will silently-corrupt at least once within 12 months.'' It is 0.466% no matter how many drives you have. And it's 4.66 drives, not 4.66 corruptions. The estimated number of corruptions is higher because some drives will corrupt twice, or thousands of times. It's not a BER, so you can't just add it like Richard did. If the original statistic in the paper were of the kind you're talking about, it would be larger than 0.466%. I'm not sure it would capture the situation well, though. I think you'd want to talk about bits of recoverable data after one year, not corruption ``events'', and this is not really measured well by the type of telemetry NetApp has. If it were, though, it would still be the same size number no matter how many drives you had. The 37% I gave was ``one or more within a population of 100 drives silently corrupts within 12 months.'' The 46% Richard gave has no meaning, and doesn't mean what you just said. The only statistic under discussion which (a) gets intimidatingly large as you increase the number of drives, and (b) is a ratio rather than, say, an absolute number of bits, is the one I gave. pgpl2HghkrzU1.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On Thu, Aug 28, 2008 at 12:38 PM, Bob Friesenhahn < [EMAIL PROTECTED]> wrote: > On Thu, 28 Aug 2008, Toby Thain wrote: > > > What goes unremarked here is how the original system has coped > > reliably for decades of (one guesses) geometrically growing load. > > Fantastic engineering from a company which went defunct shortly after > delivering the system. And let this be a lesson to all of you not to write code that is too good. If you can't sell an "update" (patch) every 6 months, you'll be out of business as well :D --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> "es" == Eric Schrock <[EMAIL PROTECTED]> writes: es> Finally, imposing additional timeouts in ZFS is a bad idea. es> [...] As such, it doesn't have the necessary context to know es> what constitutes a reasonable timeout. you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy, stop issuing reads except scrubs/healing to the underperformer (issue writes only), and pass an event to FMA. ZFS can also compare the performance of a drive to itself over time, and if the performance suddenly decreases, do the same. The former case eliminates the need for the mirror policies in SVM, which Ian requested a few hours ago for the situation that half the mirror is a slow iSCSI target for geographic redundancy and half is faster/local. Some care would have to be taken for targets shared by ZFS and some other initiator, but I'm not sure the care would really be that difficult to take, or that the oscillations induced by failing to take it would really be particularly harmful compared to unsupervised contention for a device. The latter notices quickly drives that have been pulled, or for Richard's ``overwhelmingly dominant'' case, for drives which are stalled for 30 seconds pending their report of an unrecovered read. Developing meaningful performance statistics for drives and a tool for displaying them would be useful for itself, not just for stopping freezes and preventing a failing drive from degrading performance a thousandfold. Issuing reads to redundant devices is cheap compared to freezing. The policy with which it's done is highly tunable and should be fun to tune and watch, and the consequence if the policy makes the wrong choice isn't incredibly dire. This B_FAILFAST architecture captures the situation really poorly. First, it's not implementable in any serious way with near-line drives, or really with any drives with which you're not intimately familiar and in control of firmware/release-engineering, and perhaps not with any drives period. I suspect in practice it's more a controller-level feature, about whether or not you'd like to distrust the device's error report and start resetting busses and channels and mucking everything up trying to recover from some kind of ``weirdness''. It's not an answer to the known problem of drives stalling for 30 seconds when they start to fail. First and a half, when it's not implemented, the system degrades to doubling your timeout pointlessly. A driver-level block cache of UNC's would probably have more value toward this speed/read-aggressiveness tradeoff than the whole B_FAILFAST architecture---just cache known unrecoverable read sectors, and refuse to issue further I/O for them until a timeout of 3 - 10 minutes passes. I bet this would speed up most failures tremendously, and without burdening upper layers with retry logic. Second, B_FAILFAST entertains the fantasy that I/O's are independent, while what happens in practice is that the drive hits a UNC on one I/O, and won't entertain any further I/O's no matter what flags the request has on it or how many times you ``reset'' things. Maybe you could try to rescue B_FAILFAST by putting clever statistics into the driver to compare the drive's performance to recent past as I suggested ZFS do, and admit no B_FAILFAST requests to queues of drives that have suddenly slowed down, just fail them immediately without even trying. 
I submit this queueing and statistic collection is actually _better_ managed by ZFS than the driver because ZFS can compare a whole floating-point statistic across a whole vdev, while even a driver which is fancier than we ever dreamed, is still playing poker with only 1 bit of input ``I'll call,'' or ``I'll fold.'' ZFS can see all the cards and get better results while being stupider and requiring less clever poker-guessing than would be required by a hypothetical driver B_FAILFAST implementation that actually worked. pgpqZb7GbAEgk.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
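For what it's worth, the UNC-cache idea above is small enough to sketch. A
rough userland mock-up follows; the names, table size and the 5-minute TTL are
all made up for illustration, and no such driver facility exists today:

/*
 * Remember sectors that recently returned unrecoverable-read errors and
 * fail further reads of them immediately, rather than letting the drive
 * stall for ~30 seconds on every retry.  Illustrative sketch only.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define	UNC_CACHE_SLOTS	64
#define	UNC_TTL_SECS	(5 * 60)	/* forget an entry after ~5 minutes */

typedef struct unc_entry {
	uint64_t	lba;
	time_t		when;
	int		valid;
} unc_entry_t;

static unc_entry_t unc_cache[UNC_CACHE_SLOTS];

/* Record a sector that just failed with an unrecoverable read error. */
static void
unc_remember(uint64_t lba)
{
	unc_entry_t *e = &unc_cache[lba % UNC_CACHE_SLOTS];

	e->lba = lba;
	e->when = time(NULL);
	e->valid = 1;
}

/* Should this read be failed immediately instead of bothering the drive? */
static int
unc_known_bad(uint64_t lba)
{
	unc_entry_t *e = &unc_cache[lba % UNC_CACHE_SLOTS];

	if (!e->valid || e->lba != lba)
		return (0);
	if (time(NULL) - e->when > UNC_TTL_SECS) {
		e->valid = 0;		/* expired; let the drive try again */
		return (0);
	}
	return (1);
}

int
main(void)
{
	unc_remember(123456789ULL);
	printf("LBA 123456789 known bad? %d\n", unc_known_bad(123456789ULL));
	printf("LBA 42 known bad? %d\n", unc_known_bad(42ULL));
	return (0);
}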
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:
> What is a ``failure rate for a time interval''?
>

Failure rate => Failures/unit time
Failure rate for a time interval => (Failures/unit time) * time

For example, if we have a failure rate:

  Fr = 46% failures/month

Then the expectation value of a failure in one year:

  Fe = 46% failures/month * 12 months = 5.52 failures

Jon

--
Jonathan Loran  -  IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  [EMAIL PROTECTED]
AST:7731^29u18e3
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On Aug 28, 2008, at 11:38 AM, Bob Friesenhahn wrote: > The old FORTRAN code > either had to be ported or new code written from scratch. Assuming it WAS written in FORTRAN there is no reason to believe it wouldn't just compile with a modern Fortran compiler. I've often run codes originally written in the sixties without any significant changes (very old codes may have used the frequency statement, toggled front panel lights or sensed toggle switches ... but that's pretty rare). -- Keith H. Bierman [EMAIL PROTECTED] | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 Copyright 2008 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On Thu, 28 Aug 2008, Toby Thain wrote: > > "two 20-year-old redundant mainframe configurations ... that > apparently are hanging on for dear life until reinforcements arrive > in the form of a new, state-of-the-art system this winter." > > And we all know that 'new, state-of-the-art systems' are silver > bullets and good value for money. The problem is that the replacement system is almost certain to be less reliable and cause problems for a while. The old FORTRAN code either had to be ported or new code written from scratch. If they used off the shelf software for the replacement then there is no way that the new system can be supported for 20 years. > What goes unremarked here is how the original system has coped > reliably for decades of (one guesses) geometrically growing load. Fantastic engineering from a company which went defunct shortly after delivering the system. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] xVM GRUB entry incorrect with ZFS root
On Thu, Aug 28, 2008 at 09:25:14AM -0700, Trevor Watson wrote:
> Looking at the GRUB menu, it appears as though the flags "-B $ZFS-BOOTFS" are
> needed to be passed to the kernel. Is this something I can add to: kernel$
> /boot/$ISADIR/xen.gz or is there some other mechanism required for booting
> Solaris xVM from ZFS ?

You need to add it to the next line (module$ ...). This was a bug that's now
fixed in the latest LU.

regards
john
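(For reference, a working xVM-on-ZFS-root entry typically looks something like
the following; the pool name and paths are illustrative, so adjust to match
your own menu.lst:

title Solaris xVM (ZFS root)
findroot (pool_rpool,0,a)
kernel$ /boot/$ISADIR/xen.gz
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive

i.e. the -B $ZFS-BOOTFS option goes on the module$ line carrying the unix
kernel, not on the kernel$ line that loads xen.gz.)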
Re: [zfs-discuss] ZFS Pools 1+TB
Kenny wrote:
>
> How did you determine from the format output the GB vs MB amount??
>
> Where do you compute 931 GB vs 931 MB from this??
>
> 2. c6t600A0B800049F93C030A48B3EA2Cd0 /scsi_vhci/[EMAIL PROTECTED]
>
> 3. c6t600A0B800049F93C030D48B3EAB6d0
> /scsi_vhci/[EMAIL PROTECTED]
>
It's in the part you didn't cut and paste:

AVAILABLE DISK SELECTIONS:
>   3. c6t600A0B800049F93C030D48B3EAB6d0
>      /scsi_vhci/[EMAIL PROTECTED]
>   4. c6t600A0B800049F93C031C48B3EC76d0
>      /scsi_vhci/[EMAIL PROTECTED]
>   8. c6t600A0B800049F93C031048B3EB44d0
>      /scsi_vhci/[EMAIL PROTECTED]

Look at the label: the last field.

> Please educate me!!
>
No problem. Things like this have happened to me from time to time.

 -Kyle

> Thanks again!
>
> --Kenny
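(For illustration only -- this is not Kenny's actual output, and the array
model/firmware string is made up -- the size Kyle is pointing at appears in
the drive label on the selection line, so the difference shows up like this:

AVAILABLE DISK SELECTIONS:
   2. c6t600A0B800049F93C030A48B3EA2Cd0 <SUN-LCSM100_F-0670-931.01GB>
      /scsi_vhci/[EMAIL PROTECTED]
   3. c6t600A0B800049F93C030D48B3EAB6d0 <SUN-LCSM100_F-0670-931.01MB>
      /scsi_vhci/[EMAIL PROTECTED]

i.e. disk 2 presents 931.01GB while disk 3 presents only 931.01MB.)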
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross, thanks for the feedback. A couple points here - A lot of work went into improving the error handling around build 77 of Nevada. There are still problems today, but a number of the complaints we've seen are on s10 software or older nevada builds that didn't have these fixes. Anything from the pre-2008 (or pre-s10u5) timeframe should be taken with grain of salt. There is a fix in the immediate future to prevent I/O timeouts from hanging other parts of the system - namely administrative commands and other pool activity. So I/O to that particular pool will hang, but you'll still be able to run your favorite ZFS commands, and it won't impact the ability of other pools to run. We have some good ideas on how to improve the retry logic. There is a flag in Solaris, B_FAILFAST, that tells the drive to not try too hard getting the data. However, it can return failure when trying harder would produce the correct results. Currently, we try the first I/O with B_FAILFAST, and if that fails immediately retry without the flag. The idea is to elevate the retry logic to a higher level, so when a read from a side of a mirror fails with B_FAILFAST, instead of immediately retrying the same device without the failfast flag, we push the error higher up the stack, and issue another B_FAILFAST I/O to the other half of the mirror. Only if both fail with failfast do we try a more thorough request (though with ditto blocks we may try another vdev alltogether). This should improve I/O error latency for a subset of failure scenarios, and biasing reads away from degraded (but not faulty) devices should also improve response time. The tricky part is incoporating this into the FMA diagnosis engine, as devices may fail B_FAILFAST requests for a variety of non-fatal reasons. Finally, imposing additional timeouts in ZFS is a bad idea. ZFS is designed to be a generic storage consumer. It can be layered on top of directly attached disks, SSDs, SAN devices, iSCSI targets, files, and basically anything else. As such, it doesn't have the necessary context to know what constitutes a reasonable timeout. This is explicitly delegated to the underlying storage subsystem. If a storage subsystem is timing out for excessive periods of time when B_FAILFAST is set, then that's a bug in the storage subsystem, and working around it in ZFS with yet another set of tunables is not practical. It will be interesting to see if this is an issue after the retry logic is modified as described above. Hope that helps, - Eric On Thu, Aug 28, 2008 at 01:08:26AM -0700, Ross wrote: > Since somebody else has just posted about their entire system locking up when > pulling a drive, I thought I'd raise this for discussion. > > I think Ralf made a very good point in the other thread. ZFS can guarantee > data integrity, what it can't do is guarantee data availability. The problem > is, the way ZFS is marketed people expect it to be able to do just that. > > This turned into a longer thread than expected, so I'll start with what I'm > asking for, and then attempt to explain my thinking. I'm essentially asking > for two features to improve the availability of ZFS pools: > > - Isolation of storage drivers so that buggy drivers do not bring down the OS. > > - ZFS timeouts to improve pool availability when no timely response is > received from storage drivers. 
> > And my reasons for asking for these is that there are now many, many posts on > here about people experiencing either total system lockup or ZFS lockup after > removing a hot swap drive, and indeed while some of them are using consumer > hardware, others have reported problems with server grade kit that definately > should be able to handle these errors: > > Aug 2008: AMD SB600 - System hang > - http://www.opensolaris.org/jive/thread.jspa?threadID=70349 > Aug 2008: Supermicro SAT2-MV8 - System hang > - http://www.opensolaris.org/jive/thread.jspa?messageID=271218 > May 2008: Sun hardware - ZFS hang > - http://opensolaris.org/jive/thread.jspa?messageID=240481 > Feb 2008: iSCSI - ZFS hang > - http://www.opensolaris.org/jive/thread.jspa?messageID=206985 > Oct 2007: Supermicro SAT2-MV8 - system hang > - http://www.opensolaris.org/jive/thread.jspa?messageID=166037 > Sept 2007: Fibre channel > - http://opensolaris.org/jive/thread.jspa?messageID=151719 > ... etc > > Now while the root cause of each of these may be slightly different, I feel > it would still be good to address this if possible as it's going to affect > the perception of ZFS as a reliable system. > > The common factor in all of these is that either the solaris driver hangs and > locks the OS, or ZFS hangs and locks the pool. Most of these are for > hardware that should handle these failures fine (mine occured for hardware > that definately works fine under windows), so I'm wondering: Is there > anything that can be done to prevent either type of lockup in these > situations? > > Firstly, for the OS
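To make the proposed retry ordering concrete, here is a tiny sketch in plain C
with an invented, stubbed-out issue_read() helper; it is not the actual ZFS
vdev code, just "B_FAILFAST everywhere first, thorough retries only afterwards":

#include <stdint.h>
#include <stdio.h>

#define	MIRROR_CHILDREN	2

/* Stub: pretend child 0 always fails and child 1 only succeeds on a slow retry. */
static int
issue_read(int child, uint64_t offset, void *buf, int failfast)
{
	(void) offset; (void) buf;
	return ((child == 1 && !failfast) ? 0 : -1);
}

static int
mirror_read(uint64_t offset, void *buf)
{
	/* Pass 1: cheap B_FAILFAST attempt against every child. */
	for (int c = 0; c < MIRROR_CHILDREN; c++)
		if (issue_read(c, offset, buf, 1) == 0)
			return (0);

	/* Pass 2: only now pay for slow, thorough retries. */
	for (int c = 0; c < MIRROR_CHILDREN; c++)
		if (issue_read(c, offset, buf, 0) == 0)
			return (0);

	return (-1);	/* all children failed; surface the error (or try ditto blocks) */
}

int
main(void)
{
	char buf[512];

	printf("read %s\n", mirror_read(0, buf) == 0 ? "succeeded" : "failed");
	return (0);
}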
Re: [zfs-discuss] ZFS Pools 1+TB
Ok, so I knew it had to be operator headspace... I found my error and have
fixed it in CAM. Thanks to all for helping my education!!

However I do have a question. And pardon if it's a 101 type... How did you
determine from the format output the GB vs MB amount?? Where do you compute
931 GB vs 931 MB from this??

   2. c6t600A0B800049F93C030A48B3EA2Cd0
      /scsi_vhci/[EMAIL PROTECTED]
   3. c6t600A0B800049F93C030D48B3EAB6d0
      /scsi_vhci/[EMAIL PROTECTED]

Please educate me!!

Thanks again!

--Kenny
Re: [zfs-discuss] ZFS Pools 1+TB
On Thu, 28 Aug 2008, Kenny wrote: > 2. c6t600A0B800049F93C030A48B3EA2Cd0 > /scsi_vhci/[EMAIL PROTECTED] Good. > 3. c6t600A0B800049F93C030D48B3EAB6d0 > /scsi_vhci/[EMAIL PROTECTED] Oops! Oops! Oops! It seems that some of your drives have the full 931.01GB exported while others have only 931.01MB exported. The smallest device size will be used to size the vdev in your pool. I sense a user error in the tedious CAM interface. CAM is slow so you need to be patient and take extra care when configuring the 2540 volumes. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pools 1+TB
On Thu, 28 Aug 2008, Kenny wrote:
> Bob, Thanks for the reply. Yes I did read your white paper and am using
> it!! Thanks again!!
>
> I used zpool iostat -v and it didn't give the information as advertised...
> see below

The lack of size information seems quite odd.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] xVM GRUB entry incorrect with ZFS root
Take a look at my xVM/GRUB config: http://malsserver.blogspot.com/2008/08/installing-xvm.html On Thu, Aug 28, 2008 at 9:25 AM, Trevor Watson <[EMAIL PROTECTED]>wrote: > I just ran live-upgrade of my system from nv94/UFS to nv96/ZFS on x86. > > nv96/ZFS boots okay. However, I can't boot the Solaris xVM partition as the > GRUB entry does not contain the necessary magic to tell grub to use ZFS > instead of UFS. > > Looking at the GRUB menu, it appears as though the flags "-B $ZFS-BOOTFS" > are needed to be passed to the kernel. Is this something I can add to: > kernel$ /boot/$ISADIR/xen.gz or is there some other mechanism required for > booting Solaris xVM from ZFS ? > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> "rm" == Robert Milkowski <[EMAIL PROTECTED]> writes: rm> Please look for slides 23-27 at rm> http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf yeah, ok, ONCE AGAIN, I never said that checksums are worthless. relling: some drives don't return errors on unrecoverable read events. carton: I doubt that. Tell me a story about one that doesn't. Your stories are about storage subsystems again, not drives. Also most or all of the slides aren't about unrecoverable read events. pgpitPlQ325Eo.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> "re" == Richard Elling <[EMAIL PROTECTED]> writes: re> There is no error in my math. I presented a failure rate for re> a time interval, What is a ``failure rate for a time interval''? AIUI, the failure rate for a time interval is 0.46% / yr, no matter how many drives you have. pgpeGoMP0F3vv.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Ross wrote: > > I believe ZFS should apply the same tough standards to pool > availability as it does to data integrity. A bad checksum makes ZFS > read the data from elsewhere, why shouldn't a timeout do the same > thing? A problem is that for some devices, a five minute timeout is ok. For others, there must be a problem if the device does not respond in a second or two. If the system or device is simply overwelmed with work, then you would not want the system to go haywire and make the problems much worse. Which of these do you prefer? o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost. o System immediately ignores slow devices and switches to non-redundant non-fail-safe non-fault-tolerant may-lose-your-data mode. When system is under intense load, it automatically switches to the may-lose-your-data mode. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] xVM GRUB entry incorrect with ZFS root
I just ran live-upgrade of my system from nv94/UFS to nv96/ZFS on x86. nv96/ZFS boots okay. However, I can't boot the Solaris xVM partition as the GRUB entry does not contain the necessary magic to tell grub to use ZFS instead of UFS. Looking at the GRUB menu, it appears as though the flags "-B $ZFS-BOOTFS" are needed to be passed to the kernel. Is this something I can add to: kernel$ /boot/$ISADIR/xen.gz or is there some other mechanism required for booting Solaris xVM from ZFS ? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] xVM GRUB entry incorrect with ZFS root
I just ran live-upgrade of my system from nv94/UFS to nv96/ZFS on x86. nv96/ZFS boots okay. However, I can't boot the Solaris xVM partition as the GRUB entry does not contain the necessary magic to tell grub to use ZFS instead of UFS. Looking at the GRUB menu, it appears as though the flags "-B $ZFS-BOOTFS" are needed to be passed to the kernel. Is this something I can add to: kernel$ /boot/$ISADIR/xen.gz or is there some other mechanism required for booting Solaris xVM from ZFS ? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Subversion repository on ZFS
On Aug 27, 2008, at 4:38 PM, Tim wrote: On Wed, Aug 27, 2008 at 3:29 PM, Ian Collins <[EMAIL PROTECTED]> wrote: Does anyone have any tuning tips for a Subversion repository on ZFS? The repository will mainly be storing binary (MS Office documents). It looks like a vanilla, uncompressed file system is the best bet. I have a SVN on ZFS repository with ~75K relatively small files and few binaries. That is working well without any special tuning. Shawn -- Shawn Ferry shawn.ferry at sun.com Senior Primary Systems Engineer Sun Managed Operations ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pools 1+TB
exactly :) On 8/28/08, Kyle McDonald <[EMAIL PROTECTED]> wrote: > Daniel Rock wrote: >> >> Kenny schrieb: >> >2. c6t600A0B800049F93C030A48B3EA2Cd0 >> >> > /scsi_vhci/[EMAIL PROTECTED] >> >3. c6t600A0B800049F93C030D48B3EAB6d0 >> >> > /scsi_vhci/[EMAIL PROTECTED] >> >> Disk 2: 931GB >> Disk 3: 931MB >> >> Do you see the difference? >> > Not just disk 3: > >> AVAILABLE DISK SELECTIONS: >>3. c6t600A0B800049F93C030D48B3EAB6d0 >> >> /scsi_vhci/[EMAIL PROTECTED] >>4. c6t600A0B800049F93C031C48B3EC76d0 >> >> /scsi_vhci/[EMAIL PROTECTED] >>8. c6t600A0B800049F93C031048B3EB44d0 >> >> /scsi_vhci/[EMAIL PROTECTED] >> > This all makes sense now, since a RAIDZ (or RAIDZ2) vdev can only be as > big as it's *smallest* component device. > >-Kyle > >> >> >> Daniel >> ___ >> zfs-discuss mailing list >> zfs-discuss@opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
[EMAIL PROTECTED] wrote on 08/28/2008 09:00:23 AM: > > On 28-Aug-08, at 10:54 AM, Toby Thain wrote: > > > > > On 28-Aug-08, at 10:11 AM, Richard Elling wrote: > > > >> It is rare to see this sort of "CNN Moment" attributed to file > >> corruption. > >> http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought- > >> Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 > >> > > > > "two 20-year-old redundant mainframe configurations ... that > > apparently are hanging on for dear life until reinforcements arrive > > in the form of a new, state-of-the-art system this winter." > > > > And we all know that 'new, state-of-the-art systems' are silver > > bullets and good value for money. > > > > What goes unremarked here is how the original system has coped > > reliably for decades of (one guesses) geometrically growing load. > > D'oh! It was remarked below the fold. I should have read page 2 > before writing. > > The original architects seem to have done an excellent job, how many > of us are designing systems expected to run for 2 decades? (Yes I > know the cycles are shorter these days. If you bought a PDP-11 you > were expected to keep it running forever with component level repairs.) > > --Toby > Then you also missed the all important crescendo where eweek uses the last quarter of a poorly written article to shill completely unrelated but yet inference to tie to the story software. -Wade ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Robert Milkowski wrote: > Hello Miles, > > Wednesday, August 27, 2008, 10:51:49 PM, you wrote: > > MN> It's not really enough for me, but what's more the case doesn't match > MN> what we were looking for: a device which ``never returns error codes, > MN> always returns silently bad data.'' I asked for this because you said > MN> ``However, not all devices return error codes which indicate > MN> unrecoverable reads,'' which I think is wrong. Rather, most devices > MN> sometimes don't, not some devices always don't. > > > > Please look for slides 23-27 at > http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf > > You really don't have to look very far to find this sort of thing. The scar just below my left knee is directly attributed to a bugid fixed in patch 106129-12. Warning: the following link may frighten experienced datacenter personnel, fortunately, the affected device is long since EOL. http://sunsolve.sun.com/search/document.do?assetkey=1-21-106129-12-1 -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On 28-Aug-08, at 10:54 AM, Toby Thain wrote: > > On 28-Aug-08, at 10:11 AM, Richard Elling wrote: > >> It is rare to see this sort of "CNN Moment" attributed to file >> corruption. >> http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought- >> Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 >> > > "two 20-year-old redundant mainframe configurations ... that > apparently are hanging on for dear life until reinforcements arrive > in the form of a new, state-of-the-art system this winter." > > And we all know that 'new, state-of-the-art systems' are silver > bullets and good value for money. > > What goes unremarked here is how the original system has coped > reliably for decades of (one guesses) geometrically growing load. D'oh! It was remarked below the fold. I should have read page 2 before writing. The original architects seem to have done an excellent job, how many of us are designing systems expected to run for 2 decades? (Yes I know the cycles are shorter these days. If you bought a PDP-11 you were expected to keep it running forever with component level repairs.) --Toby ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] liveupgrade ufs root -> zfs ?
Hi, I think LU 94->96 would be fine. If there are no zones on your system, simply do: # cd /Solaris_11/Tools/Installers # liveupgrade20 --nodisplay # lucreate -c BE94 -n BE96 -p newpool (the new pool must be on a disk with an SMI label) # luupgrade -u -n BE96 -s <path to snv_96 install image> # luactivate BE96 # init 6 Quite a lot of LU bugs were fixed between snv_90 and snv_96, so I think you should be able to complete the process successfully, barring any special cases. Paul Floyd wrote: > Hi > > On my opensolaris machine I currently have SXCEs 95 and 94 in two BEs. The > same fdisk partition contains /export/home and swap. In a separate fdisk > partition on another disk I have a ZFS pool. > > Does anyone have a pointer to a howto for doing a liveupgrade such that I can > convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 while I'm at > it) if this is possible? Searching with google shows a lot of blogs that > describe the early problems that existed when ZFS was first available (ON 90 > or so). > > A+ > Paul > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Hello Miles, Wednesday, August 27, 2008, 10:51:49 PM, you wrote: MN> It's not really enough for me, but what's more the case doesn't match MN> what we were looking for: a device which ``never returns error codes, MN> always returns silently bad data.'' I asked for this because you said MN> ``However, not all devices return error codes which indicate MN> unrecoverable reads,'' which I think is wrong. Rather, most devices MN> sometimes don't, not some devices always don't. Please look for slides 23-27 at http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf -- Best regards, Robert Milkowski mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On 28-Aug-08, at 10:11 AM, Richard Elling wrote: > It is rare to see this sort of "CNN Moment" attributed to file > corruption. > http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought- > Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 > "two 20-year-old redundant mainframe configurations ... that apparently are hanging on for dear life until reinforcements arrive in the form of a new, state-of-the-art system this winter." And we all know that 'new, state-of-the-art systems' are silver bullets and good value for money. What goes unremarked here is how the original system has coped reliably for decades of (one guesses) geometrically growing load. --Toby > -- richard > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] liveupgrade ufs root -> zfs ?
On Thu, 28 Aug 2008, Paul Floyd wrote: > Does anyone have a pointer to a howto for doing a liveupgrade such that > I can convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 > while I'm at it) if this is possible? Searching with google shows a lot > of blogs that describe the early problems that existed when ZFS was > first available (ON 90 or so). It should be fairly straightforward to convert to ZFS: lucreate -p <zfs pool> -n <new BE name> Doing the upgrade to 96 should be: luupgrade -u -n <BE name> -s <path to install image> Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote: > re> Indeed. Intuitively, the AFR and population is more easily > re> grokked by the masses. > > It's nothing to do with masses. There's an error in your math. It's > not right under any circumstance. > There is no error in my math. I presented a failure rate for a time interval, you presented a probability of failure over a time interval. The two are both correct, but say different things. Mathematically, an AFR > 100% is quite possible and quite common. A probability of failure > 100% (1.0) is not. In my experience, failure rates described as annualized failure rates (AFR) are more intuitive than their mathematically equivalent counterpart: MTBF. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
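A made-up example of the distinction: a drive population with an MTBF of 4,380 hours has an AFR of 8,760 / 4,380 = 200% per year, which is a perfectly sensible failure *rate* (two failures per drive-year on average). The *probability* that a given drive fails at least once during the year, assuming a simple exponential model, is 1 - e^-2, roughly 86%, which stays below 100% as a probability must.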
Re: [zfs-discuss] ZFS Pools 1+TB
Daniel Rock wrote: > > Kenny schrieb: > >2. c6t600A0B800049F93C030A48B3EA2Cd0 > > > /scsi_vhci/[EMAIL PROTECTED] > >3. c6t600A0B800049F93C030D48B3EAB6d0 > > > /scsi_vhci/[EMAIL PROTECTED] > > Disk 2: 931GB > Disk 3: 931MB > > Do you see the difference? > Not just disk 3: > AVAILABLE DISK SELECTIONS: >3. c6t600A0B800049F93C030D48B3EAB6d0 > /scsi_vhci/[EMAIL PROTECTED] >4. c6t600A0B800049F93C031C48B3EC76d0 > /scsi_vhci/[EMAIL PROTECTED] >8. c6t600A0B800049F93C031048B3EB44d0 > /scsi_vhci/[EMAIL PROTECTED] > This all makes sense now, since a RAIDZ (or RAIDZ2) vdev can only be as big as its *smallest* component device. -Kyle > > > Daniel > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
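To put rough numbers on it (illustrative only): each member of a raidz vdev contributes no more than the capacity of the smallest member, so an 11-device raidz1 whose smallest LUN is ~931 MB tops out around (11 - 1) x 931 MB ≈ 9 GB of usable space — consistent with the ~9.8 GB pool shown in the zpool iostat output earlier in the thread (pool-level figures include parity). A quick way to spot undersized LUNs before building the pool is iostat -En, which prints a Size: field for every device, e.g.:

# iostat -En | grep Size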
[zfs-discuss] liveupgrade ufs root -> zfs ?
Hi On my opensolaris machine I currently have SXCEs 95 and 94 in two BEs. The same fdisk partition contains /export/home and swap. In a separate fdisk partition on another disk I have a ZFS pool. Does anyone have a pointer to a howto for doing a liveupgrade such that I can convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 while I'm at it) if this is possible? Searching with google shows a lot of blogs that describe the early problems that existed when ZFS was first available (ON 90 or so). A+ Paul This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On Thu, Aug 28, 2008 at 06:11:06AM -0700, Richard Elling wrote: > It is rare to see this sort of "CNN Moment" attributed to file corruption. > http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought-Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 `file corruption' takes the blame all the time, in my experience, but what does it mean? It likely has nothing to do with the filesystem. Probably an application wrote incorrect information into a file. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pools 1+TB
Kenny schrieb: >2. c6t600A0B800049F93C030A48B3EA2Cd0 > /scsi_vhci/[EMAIL PROTECTED] >3. c6t600A0B800049F93C030D48B3EAB6d0 > /scsi_vhci/[EMAIL PROTECTED] Disk 2: 931GB Disk 3: 931MB Do you see the difference? Daniel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] trouble with resilver after removing drive from 3510
Hello all, I tried to test the behavior of a zpool recovering after removing one drive, with strange results. Setup: SunFire V240 / 4 GB RAM, Solaris 10u5, fully patched (last week); one 3510 with 12x 140 GB FC drives, 12 LUNs (every drive is one LUN — I don't want to use the RAID hardware, letting ZFS do it all); one pool with 5x2 disks and 2 spares (details below). After pulling drive 2 it took about two minutes to recognise the situation. zpool status output and zpool iostat 1 output are very slow: some lines come fast, then it stops for about 30-60 seconds, but they do complete eventually. The resilver has started but is VERY slow and shows strange data — the % done value goes up and down all the time. I don't think it is working correctly. zpool iostat 1 (when it works) shows many reads but very few writes; I would have expected roughly equal read and write rates, reading from the intact mirror side and writing to the spare disk. Most of the time during the resilver the machine is 99% idle, with at most 10% kernel load for short periods. I have now waited for more than one day but nothing is getting better. I did not put a new drive in; I wanted to see one spare getting into use.

Snip of zpool iostat 1:

tank  337G  343G  313   2  37.4M  19.3K
tank  337G  343G  240   5  29.0M  38.6K
tank  337G  343G  355   6  44.4M  45.0K
tank  337G  343G  336   8  41.6M  57.9K
tank  337G  343G  422   0  46.0M  0
tank  337G  343G  415  10  49.4M  70.8K
tank  337G  343G  358   0  43.3M  0
tank  337G  343G  340  10  42.6M  70.8K
tank  337G  343G  323   5  38.1M  38.6K
tank  337G  343G  315   0  35.0M  0
tank  337G  343G  336   0  40.0M  6.43K
tank  337G  343G  388  10  46.8M  70.8K
tank  337G  343G  351   4  43.9M  32.2K
tank  337G  343G    5   5   620K  285K

Nothing useful (at least for me) in messages. After grep -v of both of these lines:

date+time nftp scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],70/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0/[EMAIL PROTECTED],1 (ssd48):
date+time nftp drive offline

only these entries are left:

Aug 27 13:04:22 nftp i/o to invalid geometry
Aug 27 13:04:32 nftp i/o to invalid geometry
Aug 27 13:04:37 nftp i/o to invalid geometry
Aug 27 13:04:37 nftp i/o to invalid geometry
Aug 27 13:04:47 nftp i/o to invalid geometry
Aug 27 13:04:52 nftp i/o to invalid geometry
Aug 27 13:05:23 nftp fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 27 13:05:23 nftp EVENT-TIME: Wed Aug 27 13:05:22 CEST 2008
Aug 27 13:05:23 nftp PLATFORM: SUNW,Sun-Fire-V240, CSN: -, HOSTNAME: nftp
Aug 27 13:05:23 nftp SOURCE: zfs-diagnosis, REV: 1.0
Aug 27 13:05:23 nftp EVENT-ID: ea01afff-c58e-6b32-e345-81da8bf43146
Aug 27 13:05:23 nftp DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
Aug 27 13:05:23 nftp AUTO-RESPONSE: No automated response will occur.
Aug 27 13:05:23 nftp IMPACT: Fault tolerance of the pool may be compromised.
Aug 27 13:05:23 nftp REC-ACTION: Run 'zpool status -x' and replace the bad device.
uname -a SunOS nftp 5.10 Generic_137111-04 sun4u sparc SUNW,Sun-Fire-V240 before pulling drive: sccli> show disk Ch Id Size Speed LD Status IDs Rev 2(3) 0 136.73GB 200MB ld0ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY602V37412 WWNN 200C505EB811 2(3) 1 136.73GB 200MB ld1ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY61JX47412 WWNN 200C505EB885 2(3) 2 136.73GB 200MB ld2ONLINE SEAGATE ST3146807FC 0006 S/N 3HY62EGZ7443 WWNN 200C50D76130 2(3) 3 136.73GB 200MB ld3ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY61JKG7411 WWNN 200C505EB815 2(3) 4 136.73GB 200MB ld4ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY60YHX7410 WWNN 200C505EBCBB 2(3) 5 136.73GB 200MB ld5ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY61FQ07412 WWNN 200C505E98B9 2(3) 6 136.73GB 200MB ld6
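If the spare never gets pulled in automatically, it can be attached by hand — a sketch only, with placeholder device names (substitute the actual c#t#d# names from your own zpool status output):

# zpool status -x tank
# zpool replace tank <pulled-disk> <spare-disk>

zpool status should then show the spare resilvering in place of the pulled drive. Once the failed drive has been physically replaced and resilvered back in (zpool replace tank <pulled-disk> <new-disk>), the spare can be returned to the available list with zpool detach tank <spare-disk>.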
[zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
It is rare to see this sort of "CNN Moment" attributed to file corruption. http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought-Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import sees two pools
Victor Latushkin wrote: On 28.08.08 15:06, Chris Gerhard wrote: I have a USB disk with a pool on it called removable. On one laptop zpool import removable works just fine but on another with the same disk attached it tells me there is more than one matching pool: : sigma TS 6 $; pfexec zpool import removable cannot import 'removable': more than one matching pool import by numeric ID instead : sigma TS 7 $; pfexec zpool importpool: removable id: 16711095403932498465 state: ONLINE status: The pool is formatted using an older on-disk version. action: The pool can be imported using its name or numeric identifier, though some features will not be available without an explicit 'zpool upgrade'. config: removable ONLINE c3t0d0ONLINE pool: removable id: 13348174994041916803 state: FAULTED status: The pool metadata is corrupted. action: The pool cannot be imported due to damaged devices or data. see: http://www.sun.com/msg/ZFS-8000-72 config: removable FAULTED corrupted data c3t0d0p0 ONLINE : sigma TS 8 $; What I find curious is that this only happens on one system. Any ideas? What Solaris/ZFS versions are these systems running? it is a wild guess but may be there's some stale label with newer version which is recognized by one system and not recognized by another? Both are running snv_94. The system with the problem is nevada the system without the problem is OpenSolaris. What does zdb -l say? # zdb -l /dev/rdsk/c3t0d0p0 LABEL 0 failed to unpack label 0 LABEL 1 failed to unpack label 1 LABEL 2 version=1 name='removable' state=1 txg=18676 pool_guid=13348174994041916803 top_guid=17964267360868847787 guid=17964267360868847787 vdev_tree type='disk' id=0 guid=17964267360868847787 path='/vol/dev/dsk/c5t0d0/unknown_format' whole_disk=0 metaslab_array=13 metaslab_shift=30 ashift=9 asize=164691705856 LABEL 3 version=1 name='removable' state=1 txg=18676 pool_guid=13348174994041916803 top_guid=17964267360868847787 guid=17964267360868847787 vdev_tree type='disk' id=0 guid=17964267360868847787 path='/vol/dev/dsk/c5t0d0/unknown_format' whole_disk=0 metaslab_array=13 metaslab_shift=30 ashift=9 asize=164691705856 # # zdb -l /dev/rdsk/c3t0d0 LABEL 0 failed to unpack label 0 LABEL 1 failed to unpack label 1 LABEL 2 version=1 name='removable' state=1 txg=18676 pool_guid=13348174994041916803 top_guid=17964267360868847787 guid=17964267360868847787 vdev_tree type='disk' id=0 guid=17964267360868847787 path='/vol/dev/dsk/c5t0d0/unknown_format' whole_disk=0 metaslab_array=13 metaslab_shift=30 ashift=9 asize=164691705856 LABEL 3 version=1 name='removable' state=1 txg=18676 pool_guid=13348174994041916803 top_guid=17964267360868847787 guid=17964267360868847787 vdev_tree type='disk' id=0 guid=17964267360868847787 path='/vol/dev/dsk/c5t0d0/unknown_format' whole_disk=0 metaslab_array=13 metaslab_shift=30 ashift=9 asize=164691705856 # # zdb -l /dev/rdsk/c3t0d0s0 LABEL 0 version=10 name='removable' state=1 txg=75874 pool_guid=16711095403932498465 hostid=696785690 hostname='sigma' top_guid=18371933882888483558 guid=18371933882888483558 vdev_tree type='disk' id=0 guid=18371933882888483558 path='/dev/dsk/c3t0d0s0' devid='id1,[EMAIL PROTECTED]/a' phys_path='/[EMAIL PROTECTED],0/pci1028,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a' whole_disk=1 metaslab_array=14 metaslab_shift=30 ashift=9 asize=164683055104 is_log=0 DTL=18 LABEL 1 version=10 name='removable' state=1 txg=75874 p
Re: [zfs-discuss] ZFS Pools 1+TB
Tim, Per your request... df -h bash-3.00# df -h Filesystem size used avail capacity Mounted on /dev/md/dsk/d10 98G 4.2G92G 5%/ /devices 0K 0K 0K 0%/devices ctfs 0K 0K 0K 0%/system/contract proc 0K 0K 0K 0%/proc mnttab 0K 0K 0K 0%/etc/mnttab swap32G 1.4M32G 1%/etc/svc/volatile objfs0K 0K 0K 0%/system/object /platform/SUNW,SPARC-Enterprise-T5220/lib/libc_psr/libc_psr_hwcap1.so.1 98G 4.2G92G 5% /platform/sun4v/lib/libc_psr.so.1 /platform/SUNW,SPARC-Enterprise-T5220/lib/sparcv9/libc_psr/libc_psr_hwcap1.so.1 98G 4.2G92G 5% /platform/sun4v/lib/sparcv9/libc_psr.so.1 fd 0K 0K 0K 0%/dev/fd /dev/md/dsk/d50 19G 4.3G15G23%/var swap 512M 112K 512M 1%/tmp swap32G40K32G 1%/var/run /dev/md/dsk/d309.6G 1.5G 8.1G16%/opt /dev/md/dsk/d401.9G 142M 1.7G 8%/export/home /vol/dev/dsk/c0t0d0/fm540cd3 591M 591M 0K 100%/cdrom/fm540cd3 log_data 8.8G44K 8.8G 1%/log_data bash-3.00# bash-3.00# df -h v/dsk/c0t0d0/fm540cd3 591M 591M 0K 100%/cdrom/fm540cd3 log_data 8.8G44K 8.8G 1%/log_data zpool status bash-3.00# zpool status pool: log_data state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM log_data ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c6t600A0B800049F93C030A48B3EA2Cd0 ONLINE 0 0 0 c6t600A0B800049F93C030D48B3EAB6d0 ONLINE 0 0 0 c6t600A0B800049F93C031C48B3EC76d0 ONLINE 0 0 0 c6t600A0B800049F93C031F48B3ECA8d0 ONLINE 0 0 0 c6t600A0B800049F93C030448B3CDEEd0 ONLINE 0 0 0 c6t600A0B800049F93C030748B3E9F0d0 ONLINE 0 0 0 c6t600A0B800049F93C031048B3EB44d0 ONLINE 0 0 0 c6t600A0B800049F93C031348B3EB94d0 ONLINE 0 0 0 c6t600A0B800049F93C031648B3EBE4d0 ONLINE 0 0 0 c6t600A0B800049F93C031948B3EC28d0 ONLINE 0 0 0 c6t600A0B800049F93C032248B3ECDEd0 ONLINE 0 0 0 errors: No known data errors format bash-3.00# format Searching for disks...done AVAILABLE DISK SELECTIONS: 0. c1t0d0 /[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 1. c1t1d0 /[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 2. c6t600A0B800049F93C030A48B3EA2Cd0 /scsi_vhci/[EMAIL PROTECTED] 3. c6t600A0B800049F93C030D48B3EAB6d0 /scsi_vhci/[EMAIL PROTECTED] 4. c6t600A0B800049F93C031C48B3EC76d0 /scsi_vhci/[EMAIL PROTECTED] 5. c6t600A0B800049F93C031F48B3ECA8d0 /scsi_vhci/[EMAIL PROTECTED] 6. c6t600A0B800049F93C030448B3CDEEd0 /scsi_vhci/[EMAIL PROTECTED] 7. c6t600A0B800049F93C030748B3E9F0d0 /scsi_vhci/[EMAIL PROTECTED] 8. c6t600A0B800049F93C031048B3EB44d0 /scsi_vhci/[EMAIL PROTECTED] 9. c6t600A0B800049F93C031348B3EB94d0 /scsi_vhci/[EMAIL PROTECTED] 10. c6t600A0B800049F93C031648B3EBE4d0 /scsi_vhci/[EMAIL PROTECTED] 11. c6t600A0B800049F93C031948B3EC28d0 /scsi_vhci/[EMAIL PROTECTED] 12. c6t600A0B800049F93C032248B3ECDEd0 /scsi_vhci/[EMAIL PROTECTED] Specify disk (enter its number): This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pools 1+TB
Bob, Thanks for the reply. Yes I did read your white paper and am using it!! Thanks again!! I used zpool iostat -v and it did't give the information as advertised... see below bash-3.00# zpool iostat -v capacity operationsbandwidth poolused avail read write read write -- - - - - - - log_data 147K 9.81G 0 0 0 4 raidz1147K 9.81G 0 0 0 4 c6t600A0B800049F93C030A48B3EA2Cd0 - - 0 0 0 22 c6t600A0B800049F93C030D48B3EAB6d0 - - 0 0 0 22 c6t600A0B800049F93C031C48B3EC76d0 - - 0 0 0 22 c6t600A0B800049F93C031F48B3ECA8d0 - - 0 0 0 22 c6t600A0B800049F93C030448B3CDEEd0 - - 0 0 0 22 c6t600A0B800049F93C030748B3E9F0d0 - - 0 0 0 22 c6t600A0B800049F93C031048B3EB44d0 - - 0 0 0 22 c6t600A0B800049F93C031348B3EB94d0 - - 0 0 0 22 c6t600A0B800049F93C031648B3EBE4d0 - - 0 0 0 22 c6t600A0B800049F93C031948B3EC28d0 - - 0 0 0 22 c6t600A0B800049F93C032248B3ECDEd0 - - 0 0 0 22 -- - - - - - - (sorry but I can't get the horizontal format to set the columns correctly...) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import sees two pools
On 28.08.08 15:06, Chris Gerhard wrote: > I have a USB disk with a pool on it called removable. On one laptop > zpool import removable works just fine but on another with the same > disk attached it tells me there is more than one matching pool: > > : sigma TS 6 $; pfexec zpool import removable > cannot import 'removable': more than one matching pool > import by numeric ID instead > : sigma TS 7 $; pfexec zpool import > pool: removable > id: 16711095403932498465 > state: ONLINE > status: The pool is formatted using an older on-disk version. > action: The pool can be imported using its name or numeric identifier, though > some features will not be available without an explicit 'zpool > upgrade'. > config: > > removable ONLINE > c3t0d0ONLINE > > pool: removable > id: 13348174994041916803 > state: FAULTED > status: The pool metadata is corrupted. > action: The pool cannot be imported due to damaged devices or data. >see: http://www.sun.com/msg/ZFS-8000-72 > config: > > removable FAULTED corrupted data > c3t0d0p0 ONLINE > : sigma TS 8 $; > > What I find curious is that this only happens on one system. Any ideas? What Solaris/ZFS versions are these systems running? it is a wild guess but may be there's some stale label with newer version which is recognized by one system and not recognized by another? What does zdb -l say? victor ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS hangs/freezes after disk failure,
Hi Todd, sorry for the delay in responding, been head down rewriting a utility for the last few days. Todd H. Poole wrote: > Howdy James, > > While responding to halstead's post (see below), I had to restart several > times to complete some testing. I'm not sure if that's important to these > commands or not, but I just wanted to put it out there anyway. > >> A few commands that you could provide the output from >> include: >> >> >> (these two show any FMA-related telemetry) >> fmadm faulty >> fmdump -v > > This is the output from both commands: > > [EMAIL PROTECTED]:~# fmadm faulty > --- -- - > TIMEEVENT-ID MSG-ID SEVERITY > --- -- - > Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FDMajor > > Fault class : fault.fs.zfs.vdev.io > Description : The number of I/O errors associated with a ZFS device exceeded > acceptable levels. Refer to > http://sun.com/msg/ZFS-8000-FD > for more information. > Response: The device has been offlined and marked as faulted. An attempt > will be made to activate a hot spare if available. > Impact : Fault tolerance of the pool may be compromised. > Action : Run 'zpool status -x' and replace the bad device. > > [EMAIL PROTECTED]:~# fmdump -v > TIME UUID SUNW-MSG-ID > Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD > 100% fault.fs.zfs.vdev.io > >Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719 > Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719 > FRU: - > Location: - In other emails in this thread you've mentioned the desire to get an email (or some sort of notification) when Problems Happen(tm) in your system, and the FMA framework is how we achieve that in OpenSolaris. # fmadm config MODULE VERSION STATUS DESCRIPTION cpumem-retire1.1 active CPU/Memory Retire Agent disk-transport 1.0 active Disk Transport Agent eft 1.16active eft diagnosis engine fabric-xlate 1.0 active Fabric Ereport Translater fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis io-retire2.0 active I/O Retire Agent snmp-trapgen 1.0 active SNMP Trap Generation Agent sysevent-transport 1.0 active SysEvent Transport Agent syslog-msgs 1.0 active Syslog Messaging Agent zfs-diagnosis1.0 active ZFS Diagnosis Engine zfs-retire 1.0 active ZFS Retire Agent You'll notice that we've got an SNMP agent there... and you can acquire a copy of the FMA mib from the Fault Management community pages (http://opensolaris.org/os/community/fm and http://opensolaris.org/os/community/fm/mib/). >> (this shows your storage controllers and what's >> connected to them) cfgadm -lav > > This is the output from cfgadm -lav > > [EMAIL PROTECTED]:~# cfgadm -lav > Ap_Id Receptacle Occupant Condition > Information > When Type Busy Phys_Id > usb2/1 emptyunconfigured ok > unavailable unknown n/devices/[EMAIL > PROTECTED],0/pci1458,[EMAIL PROTECTED]:1 > usb2/2 connectedconfigured ok > Mfg: Microsoft Product: Microsoft 3-Button Mouse with IntelliEye(TM) > NConfigs: 1 Config: 0 > unavailable usb-mousen/devices/[EMAIL > PROTECTED],0/pci1458,[EMAIL PROTECTED]:2 > usb3/1 emptyunconfigured ok [snip] > usb7/2 emptyunconfigured ok > unavailable unknown n/devices/[EMAIL > PROTECTED],0/pci1458,[EMAIL PROTECTED],1:2 > > You'll notice that the only thing listed is my USB mouse... is that expected? Yup. One of the artefacts of the cfgadm architecture. cfgadm(1m) works by using plugins - usb, FC, SCSI, SATA, pci hotplug, InfiniBand... but not IDE. 
I think you also were wondering how to tell what controller instances your disks were using in IDE mode - two basic ways of achieving this: /usr/bin/iostat -En and /usr/sbin/format Your IDE disks will attach using the cmdk driver and show up like this: c1d0 c1d1 c2d0 c2d1 In AHCI/SATA mode they'd show up as c1t0d0 c1t1d0 c1t2d0 c1t3d0 or something similar, depending on how the bios and the actual controllers sort themselves out. >> You'll also find messages in /var/adm/messages which >> might prove >> useful to review. > > If you really want, I can list the output from /var/adm/messages, but it > doesn't seem to add anything new to what I've already copied and pasted. No need - you've got them if you need them. [snip] >> http://docs.sun.com/app/docs/coll/40.1
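If you want to watch the raw telemetry that feeds those diagnoses (as opposed to the faults that fmadm faulty reports), the error log is kept separately from the fault log — a quick sketch:

# fmdump -e       (one-line summary of each error report, i.e. ereport)
# fmdump -eV      (the same records in full detail)
# fmdump -e -f    (follow new ereports as they arrive, tail -f style)

These are the per-I/O disk and transport ereports that the zfs-diagnosis engine rolls up into faults like the ZFS-8000-FD shown above.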
Re: [zfs-discuss] Will there be a GUI for ZFS ?
On Thu, Aug 28, 2008 at 3:47 AM, Klaus Bergius <[EMAIL PROTECTED]>wrote: > I'll second the original questions, but would like to know specifically > when we will see (or how to install) the ZFS admin gui for OpenSolaris > 2008.05. > I installed 2008.05, then updated the system, so it is now snv_95. > There are no smc* commands, there is no service 'webconsole' to be seen in > svcs -a, > because: there is no SUNWzfsg package installed. > However, the SUNWzfsg package is also not in the > pkg.opensolaris.orgrepository. > > Any hint where to find the package? I would really love to have the zfs > admin gui on my system. > > -Klaus > > My personal conspiracy theory is it's part of "project fishworks" that is still under wraps. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS boot reservations
Hey folks, Tim Foster just linked this bug to the zfs auto backup mailing list, and I wondered if anybody knew if the work being done on ZFS boot includes making use of ZFS reservations to ensure the boot filesystems always have enough free space? http://defect.opensolaris.org/bz/show_bug.cgi?id=3132 Ross This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
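Until something like that is integrated, it's easy enough to approximate by hand — a sketch, assuming the usual root-pool layout of rpool/ROOT/<BE> (adjust the pool and BE names to your own system):

# zfs set reservation=2G rpool/ROOT/snv_96

That guarantees the boot environment 2 GB of space regardless of how full snapshots, /export, or the swap and dump volumes make the rest of the pool; refreservation could be used instead if space held by the BE's own snapshots shouldn't count towards the guarantee.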
[zfs-discuss] zpool import sees two pools
I have a USB disk with a pool on it called removable. On one laptop zpool import removable works just fine but on another with the same disk attached it tells me there is more than one matching pool: : sigma TS 6 $; pfexec zpool import removable cannot import 'removable': more than one matching pool import by numeric ID instead : sigma TS 7 $; pfexec zpool import pool: removable id: 16711095403932498465 state: ONLINE status: The pool is formatted using an older on-disk version. action: The pool can be imported using its name or numeric identifier, though some features will not be available without an explicit 'zpool upgrade'. config: removable ONLINE c3t0d0ONLINE pool: removable id: 13348174994041916803 state: FAULTED status: The pool metadata is corrupted. action: The pool cannot be imported due to damaged devices or data. see: http://www.sun.com/msg/ZFS-8000-72 config: removable FAULTED corrupted data c3t0d0p0 ONLINE : sigma TS 8 $; What I find curious is that this only happens on one system. Any ideas? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Will there be a GUI for ZFS ?
There is no good ZFS gui. Nothing that is actively maintained, anyway. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Will there be a GUI for ZFS ?
I'll second the original questions, but would like to know specifically when we will see (or how to install) the ZFS admin gui for OpenSolaris 2008.05. I installed 2008.05, then updated the system, so it is now snv_95. There are no smc* commands, there is no service 'webconsole' to be seen in svcs -a, because: there is no SUNWzfsg package installed. However, the SUNWzfsg package is also not in the pkg.opensolaris.org repository. Any hint where to find the package? I would really love to have the zfs admin gui on my system. -Klaus This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] [Fwd: Re: Review for 6729208 Optimize macros in sys/byteorder.h (due Sept. 3)]
Not the common case for ZFS, but a useful performance improvement when it does happen. This is a result of some follow-on work to the byteswapping optimisation Dan has done for the crypto algorithms in OpenSolaris.

 Original Message 
Subject: Re: Review for 6729208 Optimize macros in sys/byteorder.h (due Sept. 3)
Date: Wed, 27 Aug 2008 11:56:23 -0700 (PDT)
From: Dan Anderson <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]

Here are some performance results from running "find . -exec ls -l" on separate ZFS filesystems created on x86 and sparc and imported/exported to amd64, em64t, and sun4u platforms. This shows the performance gain from the optimized byteorder.h macros.

Percent savings, real time (ZFS filesystem created originally on x86 or sparc):

Platform   x86   sparc
amd64      4%    3%
em64t      3%    4%
sun4u      4%    2%

Environment:
* Create 2 separate ZFS filesystems with 1024 directories, each with 32 files, on x86 and sparc, and zpool export/import to the other systems.
* Run this command on the ZFS filesystem: find . -exec ls -l {} \; >/dev/null
* Run using NV97 with and without the fix for RFE 6729208 (byteorder.h macro optimization)

BTW, I could still use some code review comments: http://dan.drydog.com/reviews/6729208-bswap3/ -- This message posted from opensolaris.org ___ crypto-discuss mailing list [EMAIL PROTECTED] http://mail.opensolaris.org/mailman/listinfo/crypto-discuss -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion. I think Ralf made a very good point in the other thread. ZFS can guarantee data integrity; what it can't do is guarantee data availability. The problem is, the way ZFS is marketed, people expect it to be able to do just that. This turned into a longer thread than expected, so I'll start with what I'm asking for, and then attempt to explain my thinking. I'm essentially asking for two features to improve the availability of ZFS pools:

- Isolation of storage drivers so that buggy drivers do not bring down the OS.
- ZFS timeouts to improve pool availability when no timely response is received from storage drivers.

And my reason for asking for these is that there are now many, many posts on here about people experiencing either total system lockup or ZFS lockup after removing a hot swap drive, and indeed while some of them are using consumer hardware, others have reported problems with server grade kit that definitely should be able to handle these errors:

Aug 2008: AMD SB600 - System hang - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
Aug 2008: Supermicro SAT2-MV8 - System hang - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
May 2008: Sun hardware - ZFS hang - http://opensolaris.org/jive/thread.jspa?messageID=240481
Feb 2008: iSCSI - ZFS hang - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
Oct 2007: Supermicro SAT2-MV8 - system hang - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
Sept 2007: Fibre channel - http://opensolaris.org/jive/thread.jspa?messageID=151719
... etc

Now while the root cause of each of these may be slightly different, I feel it would still be good to address this if possible, as it's going to affect the perception of ZFS as a reliable system. The common factor in all of these is that either the Solaris driver hangs and locks the OS, or ZFS hangs and locks the pool. Most of these are for hardware that should handle these failures fine (mine occurred for hardware that definitely works fine under Windows), so I'm wondering: is there anything that can be done to prevent either type of lockup in these situations? Firstly, for the OS: if a storage component (hardware or driver) fails for a non-essential part of the system, the entire OS should not hang. I appreciate there isn't a lot you can do if the OS is using the same driver as its storage, but certainly in some of the cases above the OS and the data are using different drivers, and I expect more examples of that could be found with a bit of work. Is there any way storage drivers could be isolated such that the OS (and hence ZFS) can report a problem with that particular driver without hanging the entire system? Please note: I know work is being done on FMA to handle all kinds of bugs, I'm not talking about that. It seems to me that FMA involves proper detection and reporting of bugs, which involves knowing in advance what the problems are and how to report them. What I'm looking for is something much simpler, something that's able to keep the OS running when it encounters unexpected or unhandled behaviour from storage drivers or hardware. It seems to me that one of the benefits of ZFS is working against it here. It's such a flexible system that it's being used for many, many types of devices, and that means there are a whole host of drivers being used, and a lot of scope for bugs in those drivers.
I know that ultimately any driver issues will need to be sorted individually, but what I'm wondering is whether there's any possibility of putting some error checking code at a layer above the drivers in such a way it's able to trap major problems without hanging the OS? ie: update ZFS/Solaris so they can handle storage layer bugs gracefully without downing the entire system. My second suggestion is to ask if ZFS can be made to handle unexpected events more gracefully. In the past I've suggested that ZFS have a separate timeout so that a redundant pool can continue working even if one device is not responding, and I really think that would be worthwhile. My idea is to have a "WAITING" status flag for drives, so that if one isn't responding quickly, ZFS can flag it as "WAITING", and attempt to read or write the same data from elsewhere in the pool. That would work alongside the existing failure modes, and would allow ZFS to handle hung drivers much more smoothly, preventing redundant pools hanging when a single drive fails. The ZFS update I feel is particularly appropriate. ZFS already uses checksumming since it doesn't trust drivers or hardware to always return the correct data. But ZFS then trusts those same drivers and hardware absolutely when it comes to the availability of the pool. I believe ZFS should apply the same tough standards to pool availability as it does to
Re: [zfs-discuss] Subversion repository on ZFS
Toby Thain wrote: > On 27-Aug-08, at 5:47 PM, Ian Collins wrote: > >> Tim writes: >> >>> On Wed, Aug 27, 2008 at 3:29 PM, Ian Collins <[EMAIL PROTECTED]> >>> wrote: >>> Does anyone have any tuning tips for a Subversion repository on ZFS? The repository will mainly be storing binary (MS Office documents). It looks like a vanilla, uncompressed file system is the best bet. >>> I believe this is called sharepoint :D >> Don't mention that abomination! > > Amen. Don't mention _that_ abomination! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss