Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Bill Sommerfeld wrote:
> On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
>> A better option would be to not use this to perform FMA diagnosis, but
>> instead work into the mirror child selection code. This has already
>> been alluded to before, but it would be cool to keep track of latency
>> over time, and use this to both a) prefer one drive over another when
>> selecting the child and b) proactively timeout/ignore results from one
>> child and select the other if it's taking longer than some historical
>> standard deviation. This keeps away from diagnosing drives as faulty,
>> but does allow ZFS to make better choices and maintain response times.
>> It shouldn't be hard to keep track of the average and/or standard
>> deviation and use it for selection; proactively timing out the slow I/Os
>> is much trickier.
>
> tcp has to solve essentially the same problem: decide when a response is
> "overdue" based only on the timing of recent successful exchanges in a
> context where it's difficult to make assumptions about "reasonable"
> expected behavior of the underlying network.
>
> it tracks both the smoothed round trip time and the variance, and
> declares a response overdue after (SRTT + K * variance).
>
> I think you'd probably do well to start with something similar to what's
> described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on
> experience.

I think this is a good place to start. In general, we can see 3 orders of
magnitude range for magnetic disk I/Os, and 4 orders of magnitude for power
managed disks. With that range, I don't see the variance being small, at
least for magnetic disks. SSDs will have a much smaller variance, in general.

For lopsided mirrors, such as a magnetic disk mirrored to an SSD, or Bob's
Dallas vs New York paths, we should be able to automatically steer towards
the faster side. However, a comprehensive solution must also deal with
top-level vdev usage, which can be very different from the physical vdevs.
We can use driver-level FMA for the physical vdevs, but ultimately ZFS will
need to be able to make decisions based on the response time across the
top-level vdevs. This can be implemented in two phases, of course.

I've got some lopsided mirror TNF data, so we could fairly easily try some
algorithms... I'll whip it into shape for further analysis.
 -- richard
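As a concrete starting point for trying such algorithms against the TNF data,
here is a minimal userland sketch (plain C; the names, constants and structure
are illustrative only, not actual ZFS code) that keeps a smoothed latency
estimate per mirror child and steers the next read to the currently faster side:

/*
 * Illustrative sketch only -- not actual ZFS code.  Tracks a smoothed
 * latency estimate per mirror child and steers reads to the child with
 * the lowest estimate, which naturally favors the fast side of a
 * lopsided (SSD vs. magnetic, local vs. remote) mirror.
 */
#include <stdio.h>

#define	NCHILDREN	2

typedef struct child_stats {
	double	cs_srt;		/* smoothed response time (ms) */
	int	cs_primed;	/* have we seen an I/O yet? */
} child_stats_t;

static child_stats_t stats[NCHILDREN];

/* Fold a completed read's latency into a weighted moving average. */
static void
record_latency(int child, double latency_ms)
{
	child_stats_t *cs = &stats[child];

	if (!cs->cs_primed) {
		cs->cs_srt = latency_ms;
		cs->cs_primed = 1;
	} else {
		cs->cs_srt = 0.875 * cs->cs_srt + 0.125 * latency_ms;
	}
}

/* Pick the child with the lowest smoothed latency for the next read. */
static int
select_child(void)
{
	int best = 0;

	for (int c = 1; c < NCHILDREN; c++) {
		if (stats[c].cs_primed && (!stats[best].cs_primed ||
		    stats[c].cs_srt < stats[best].cs_srt))
			best = c;
	}
	return (best);
}

int
main(void)
{
	/* Hypothetical latencies: child 0 is an SSD, child 1 a spun-down disk. */
	double ssd[] = { 0.2, 0.3, 0.25 }, disk[] = { 8.0, 2000.0, 12.0 };

	for (int i = 0; i < 3; i++) {
		record_latency(0, ssd[i]);
		record_latency(1, disk[i]);
	}
	printf("next read goes to child %d\n", select_child());
	return (0);
}

In real ZFS this decision would presumably live in the mirror child-selection
path (vdev_mirror.c), and the estimate would also need to decay so that a
device which recovers can earn its share of reads back.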
Re: [zfs-discuss] ARCSTAT Kstat Definitions
G'Day Ben, ARC visibility is important; did you see Neel's arcstat?: http://www.solarisinternals.com/wiki/index.php/Arcstat Try -x for various sizes, and -v for definitions. On Thu, Aug 21, 2008 at 10:23:24AM -0700, Ben Rockwood wrote: > Its a starting point anyway. The key is to try and draw useful conclusions > from the info to answer the torrent of "why is my ARC 30GB???" > > There are several things I'm unclear on whether or not I'm properly > interpreting such as: > > * As you state, the anon pages. Even the comment in code is, to me anyway, a > little vague. I include them because otherwise you look at the hit counters > and wonder where a large chunk of them went. Yes, anon hits doesn't make sense - they are dirty pages and won't have a DVA, and so won't be findable by other threads in arc_read(). I can see why arc_summary.pl thinks they exist - accounting for the discrepancy between arcstats:hits and the sum of the hits from the four ARC lists. Ghost list hits aren't part of arcstats:hits - arcstats:hits are real hits, the ghost hits are an artifact of the ARC algorithm. If you do want to break down arcstats:hits into it's components, use: zfs:0:arcstats:demand_data_hits zfs:0:arcstats:demand_metadata_hits zfs:0:arcstats:prefetch_data_hits zfs:0:arcstats:prefetch_metadata_hits And for a different perspective on the demand hits: zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits Also, arc_summary.pl's reported MRU and MFU sizes aren't actual, these are target sizes. The ARC will try to steer itself towards them, but in at least one case (where the ARC has yet to fill) they can be very different from actual (until arc_adjust() is called to whip them back to size.) > * Prefetch... I want to use the Prefetch Data hit ratio as a judgment call on > the efficiency of prefetch. If the value is very low it might be best to > turn it off. but I'd like to hear that from someone else before I go > saying that. Sounds good to me. > In high latency environments, such as ZFS on iSCSI, prefetch can either > significantly help or hurt, determining which is difficult without some type > of metric as as above. > > * There are several instances (based on dtracing) in which the ARC is > bypassed... for ZIL I understand, in some other cases I need to spend more > time analyzing the DMU (dbuf_*) for why. > > * In answering the "Is having a 30GB ARC good?" question, I want to say that > if MFU is >60% of ARC, and if the hits are mostly MFU that you are deriving > significant benefit from your large ARC but on a system with a 2GB ARC or > a 30GB ARC the overall hit ratio tends to be 99%. Which is nuts, and tends > to reinforce a misinterpretation of anon hits. I wouldn't read *too* much into MRU vs MFU hits. MFU means 2 hits, MRU means 1. > The only way I'm seeing to _really_ understand ARC's efficiency is to look at > the overall number of reads and then how many are intercepted by ARC and how > many actually made it to disk... and why (prefetch or demand). This is > tricky to implement via kstats because you have to pick out and monitor the > zpool disks themselves. This would usually have more to do with the workload than the ARC's efficiency. > I've spent a lot of time in this code (arc.c) and still have a lot of > questions. I really wish there was an "Advanced ZFS Internals" talk coming > up; I simply can't keep spending so much time on this. Maybe you could try forgetting about the kstats for a moment and draw a fantasy arc_summary.pl output. 
Then we can look at adding kstats to make writing that script possible/easy (Mark and I could add the kstats, and Neel could provide the script, for example). Of course, if we do add more kstats, it's not going to help on older rev kernels out there... cheers, Brendan -- Brendan [CA, USA] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
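For anyone following along, the counters Brendan lists above can be pulled
directly with kstat(1M) on a live system, e.g.:

# kstat -p zfs:0:arcstats:hits
# kstat -p zfs:0:arcstats:demand_data_hits
# kstat -p zfs:0:arcstats:demand_metadata_hits
# kstat -p zfs:0:arcstats:prefetch_data_hits
# kstat -p zfs:0:arcstats:prefetch_metadata_hits
# kstat -p zfs:0:arcstats:mru_hits
# kstat -p zfs:0:arcstats:mfu_hits

Per the breakdown above, the four demand/prefetch counters should account for
arcstats:hits (modulo the anon-hit accounting discrepancy Brendan mentions).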
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote: > A better option would be to not use this to perform FMA diagnosis, but > instead work into the mirror child selection code. This has already > been alluded to before, but it would be cool to keep track of latency > over time, and use this to both a) prefer one drive over another when > selecting the child and b) proactively timeout/ignore results from one > child and select the other if it's taking longer than some historical > standard deviation. This keeps away from diagnosing drives as faulty, > but does allow ZFS to make better choices and maintain response times. > It shouldn't be hard to keep track of the average and/or standard > deviation and use it for selection; proactively timing out the slow I/Os > is much trickier. tcp has to solve essentially the same problem: decide when a response is "overdue" based only on the timing of recent successful exchanges in a context where it's difficult to make assumptions about "reasonable" expected behavior of the underlying network. it tracks both the smoothed round trip time and the variance, and declares a response overdue after (SRTT + K * variance). I think you'd probably do well to start with something similar to what's described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on experience. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
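A rough userland sketch of that estimator, using RFC 2988's constants but
applied to per-device I/O latency (the names are illustrative and this is not
taken from any existing driver):

/*
 * SRTT + K * RTTVAR as an "overdue" threshold that adapts to observed
 * behavior, per RFC 2988.  Illustrative sketch only.
 */
#include <math.h>
#include <stdio.h>

#define	ALPHA	0.125	/* gain for the smoothed mean */
#define	BETA	0.25	/* gain for the mean deviation */
#define	K	4.0	/* deviations before an I/O is "overdue" */

typedef struct lat_est {
	double	srtt;	/* smoothed latency */
	double	rttvar;	/* smoothed mean deviation */
	int	primed;
} lat_est_t;

/* Fold one observed I/O latency into the estimator. */
static void
lat_update(lat_est_t *le, double sample)
{
	if (!le->primed) {
		le->srtt = sample;
		le->rttvar = sample / 2.0;
		le->primed = 1;
		return;
	}
	le->rttvar = (1.0 - BETA) * le->rttvar + BETA * fabs(le->srtt - sample);
	le->srtt = (1.0 - ALPHA) * le->srtt + ALPHA * sample;
}

/* An outstanding I/O older than this is a candidate for re-issue elsewhere. */
static double
lat_overdue(const lat_est_t *le)
{
	return (le->srtt + K * le->rttvar);
}

int
main(void)
{
	lat_est_t le = { 0 };
	double samples[] = { 5.0, 6.0, 4.5, 7.0, 5.5 };	/* milliseconds */

	for (int i = 0; i < 5; i++)
		lat_update(&le, samples[i]);
	printf("overdue threshold: %.1f ms\n", lat_overdue(&le));
	return (0);
}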
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote:
> None of the decisions I described its making based on performance
> statistics are ``haywire''---I said it should funnel reads to the
> faster side of the mirror, and do this really quickly and
> unconservatively. What's your issue with that?

From what I understand, this is partially happening now based on average
service time. If I/O is backed up for a device, then the other device is
preferred. However, it is good to keep in mind that if data is never read,
then it is never validated and corrected. It is good for ZFS to read data
sometimes.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
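(A periodic scrub is the usual way to make sure even rarely-read data gets
read and validated; for example, with the pool name "tank" as a placeholder:

# zpool scrub tank
0 3 * * 0 /usr/sbin/zpool scrub tank    <- root crontab entry: weekly, Sundays at 03:00

The scrub walks every allocated block, verifies checksums, and repairs from
redundancy where needed.)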
Re: [zfs-discuss] liveupgrade ufs root -> zfs ?
Hi

I'm not sure that the ZFS pool meets this requirement. I have

# lufslist SXCE_94
Filesystem          fstype  device size  Mounted on    Mount Options
------------------  ------  -----------  ------------  -------------
/dev/dsk/c1t2d0s1   swap    2147880960   -             -
/dev/dsk/c1t2d0s0   ufs     8590202880   /             -
/dev/dsk/c1t2d0s7   ufs     5747496960   /export/home  -

# lufslist SXCE_95
Filesystem          fstype  device size  Mounted on    Mount Options
------------------  ------  -----------  ------------  -------------
/dev/dsk/c1t2d0s1   swap    2147880960   -             -
/dev/dsk/c1t2d0s4   ufs     8590202880   /             -
/dev/dsk/c1t2d0s7   ufs     5747496960   /export/home  -

Is it possible to delete SXCE_94, do a zpool create with /dev/dsk/c1t2d0s0,
and then do a liveupgrade? I have the impression that it's possible, but that
there are some extra steps needed (to specify the ZFS mount point?).

A+
Paul
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote: > > Personally, if a SATA disk wasn't responding to any requests after 2 > seconds I really don't care if an error has been detected, as far as > I'm concerned that disk is faulty. Unless you have power management enabled, or there's a bad region of the disk, or the bus was reset, or... > I do have a question though. From what you're saying, the response > time can't be consistent across all hardware, so you're once again at > the mercy of the storage drivers. Do you know how long does > B_FAILFAST takes to return a response on iSCSI? If that's over 1-2 > seconds I would still consider that too slow I'm afraid. It's main function is how it deals with retryable errors. If the drive responds with a retryable error, or any error at all, it won't attempt to retry again. If you have a device that is taking arbitrarily long to respond to successful commands (or to notice that a command won't succeed), it won't help you. > I understand that Sun in general don't want to add fault management to > ZFS, but I don't see how this particular timeout does anything other > than help ZFS when it's dealing with such a diverse range of media. I > agree that ZFS can't know itself what should be a valid timeout, but > that's exactly why this needs to be an optional administrator set > parameter. The administrator of a storage array who wants to set this > certainly knows what a valid timeout is for them, and these timeouts > are likely to be several orders of magnitude larger than the standard > response times. I would configure very different values for my SATA > drives as for my iSCSI connections, but in each case I would be > happier knowing that ZFS has more of a chance of catching bad drivers > or unexpected scenarios. The main problem with exposing tunables like this is that they have a direct correlation to service actions, and mis-diagnosing failures costs everybody (admin, companies, Sun, etc) lots of time and money. Once you expose such a tunable, it will be impossible to trust any FMA diagnosis, because you won't be able to know whether it was a mistaken tunable. A better option would be to not use this to perform FMA diagnosis, but instead work into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to both a) prefer one drive over another when selecting the child and b) proactively timeout/ignore results from one child and select the other if it's taking longer than some historical standard deviation. This keeps away from diagnosing drives as faulty, but does allow ZFS to make better choices and maintain response times. It shouldn't be hard to keep track of the average and/or standard deviation and use it for selection; proactively timing out the slow I/Os is much trickier. As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable, but any such "best effort RAS" is a little dicey because you have very little visibility into the state of the pool in this scenario - "is my data protected?" becomes a very difficult question to answer. - Eric -- Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk
Many mid-range/high-end RAID controllers work by having a small timeout on individual disk I/O operations. If the disk doesn't respond quickly, they'll issue an I/O to the redundant disk(s) to get the data back to the host in a reasonable time. Often they'll change parameters on the disk to limit how long the disk retries before returning an error for a bad sector (this is standardized for SCSI, I don't recall offhand whether any of this is standardized for ATA). RAID 3 units, e.g. DataDirect, issue I/O to all disks simultaneously and when enough (N-1 or N-2) disks return data, they'll return the data to the host. At least they do that for full stripes. But this strategy works better for sequential I/O, not so good for random I/O, since you're using up extra bandwidth. Host-based RAID/mirroring almost never takes this strategy for two reasons. First, the bottleneck is almost always the channel from disk to host, and you don't want to clog it. [Yes, I know there's more bandwidth there than the sum of the disks, but consider latency.] Second, to read from two disks on a mirror, you'd need two memory buffers. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes: bf> If the system or device is simply overwelmed with work, then bf> you would not want the system to go haywire and make the bf> problems much worse. None of the decisions I described its making based on performance statistics are ``haywire''---I said it should funnel reads to the faster side of the mirror, and do this really quickly and unconservatively. What's your issue with that? bf> You are saying that I can't split my mirrors between a local bf> disk in Dallas and a remote disk in New York accessed via bf> iSCSI? nope, you've misread. I'm saying reads should go to the local disk only, and writes should go to both. See SVM's 'metaparam -r'. I suggested that unlike the SVM feature it should be automatic, because by so being it becomes useful as an availability tool rather than just performance optimisation. The performance-statistic logic should influence read scheduling immediately, and generate events which are fed to FMA, then FMA can mark devices faulty. There's no need for both to make the same decision at the same time. If the events aren't useful for diagnosis, ZFS could not bother generating them, or fmd could ignore them in its diagnosis. I suspect they *would* be useful, though. I'm imagining the read rescheduling would happen very quickly, quicker than one would want a round-trip from FMA, in much less than a second. That's why it would have to compare devices to others in the same vdev, and to themselves over time, rather than use fixed timeouts or punt to haphazard driver and firmware logic. bf>o System waits substantial time for devices to (possibly) bf> recover in order to ensure that subsequently written data has bf> the least chance of being lost. There's no need for the filesystem to *wait* for data to be written, unless you are calling fsync. and maybe not even then if there's a slog. I said clearly that you read only one half of the mirror, but write to both. But you're right that the trick probably won't work perfectly---eventually dead devices need to be faulted. The idea is that normal write caching will buy you orders of magnitude longer time in which to make a better decision before anyone notices. Experience here is that ``waits substantial time'' usually means ``freezes for hours and gets rebooted''. There's no need to be abstract: we know what happens when a drive starts taking 1000x - 2000x longer than usual to respond to commands, and we know that this is THE common online failure mode for drives. That's what started the thread. so, think about this: hanging for an hour trying to write to a broken device may block other writes to devices which are still working, until the patiently-waiting data is eventually lost in the reboot. bf>o System immediately ignores slow devices and switches to bf> non-redundant non-fail-safe non-fault-tolerant bf> may-lose-your-data mode. When system is under intense load, bf> it automatically switches to the may-lose-your-data mode. nobody's proposing a system which silently rocks back and forth between faulted and online. That's not what we have now, and no such system would naturally arise. If FMA marked a drive faulty based on performance statistics, that drive would get retired permanently and hot-spare-replaced. Obviously false positives are bad, just as obviously as freezes/reboots are bad. It's not my idea to use FMA in this way. This is how FMA was pitched, and the excuse for leaving good exception handling out of ZFS for two years. so, where's the beef? 
Re: [zfs-discuss] ARCSTAT Kstat Definitions
On Thu, Aug 21, 2008 at 8:47 PM, Ben Rockwood <[EMAIL PROTECTED]> wrote: > New version is available (v0.2) : > > * Fixes divide by zero, > * includes tuning from /etc/system in output > * if prefetch is disabled I explicitly say so. > * Accounts for jacked anon count. Still need improvement here. > * Added friendly explanations for MRU/MFU & Ghost lists counts. > > Page and examples are updated: cuddletech.com/arc_summary.pl > > Still needs work, but hopefully interest in this will stimulate some improved > understanding of ARC internals. For a bit of light relief (in other words, with pretty graphs) I've hacked up a graphical java version of Ben's script as part of jkstat (updated to 0.24): http://www.petertribble.co.uk/Solaris/jkstat.html Now, this is pretty rough, and chews up a modest amount of CPU, and I'm not sure of the interpretation, but I've basically taken Ben's code and lifted it more or less as is. -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> "es" == Eric Schrock <[EMAIL PROTECTED]> writes: es> I don't think you understand how this works. Imagine two es> I/Os, just with different sd timeouts and retry logic - that's es> B_FAILFAST. It's quite simple, and independent of any es> hardware implementation. AIUI the main timeout to which we should be subject, at least for nearline drives, is about 30 seconds long and is decided by the drive's firmware, not the driver, and can't be negotiated in any way that's independent of the hardware implementation, although sometimes there are dependent ways to negotiate it. The driver could also decide through ``retry logic'' to time out the command sooner, before the drive completes it, but this won't do much good because the drive won't accept a second command until ITS timeout expires. which leads to the second problem, that we're talking about timeouts for individual I/O's, not marking whole devices. A ``fast'' timeout of even 1 second could cause a 100- or 1000-fold decrease in performance, which could end up being equivalent to a freeze depending on the type of load on the filesystem. pgphjTr74byaZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi guys, Bob, my thought was to have this timeout as something that can be optionally set by the administrator on a per pool basis. I'll admit I was mainly thinking about reads and hadn't considered the write scenario, but even having thought about that it's still a feature I'd like. After all, this would be a timeout set by the administrator based on the longest delay they can afford for that storage pool. Personally, if a SATA disk wasn't responding to any requests after 2 seconds I really don't care if an error has been detected, as far as I'm concerned that disk is faulty. I'd be quite happy for the array to drop to a degraded mode based on that and for writes to carry on with the rest of the array. Eric, thanks for the extra details, they're very much appreciated. It's good to hear you're working on this, and I love the idea of doing a B_FAILFAST read on both halves of the mirror. I do have a question though. From what you're saying, the response time can't be consistent across all hardware, so you're once again at the mercy of the storage drivers. Do you know how long does B_FAILFAST takes to return a response on iSCSI? If that's over 1-2 seconds I would still consider that too slow I'm afraid. I understand that Sun in general don't want to add fault management to ZFS, but I don't see how this particular timeout does anything other than help ZFS when it's dealing with such a diverse range of media. I agree that ZFS can't know itself what should be a valid timeout, but that's exactly why this needs to be an optional administrator set parameter. The administrator of a storage array who wants to set this certainly knows what a valid timeout is for them, and these timeouts are likely to be several orders of magnitude larger than the standard response times. I would configure very different values for my SATA drives as for my iSCSI connections, but in each case I would be happier knowing that ZFS has more of a chance of catching bad drivers or unexpected scenarios. I very much doubt hardware raid controllers would wait 3 minutes for a drive to return a response, they will have their own internal timeouts to know when a drive has failed, and while ZFS is dealing with very different hardware I can't help but feel it should have that same approach to management of its drives. However, that said, I'll be more than willing to test the new B_FAILFAST logic on iSCSI once it's released. Just let me know when it's out. Ross > Date: Thu, 28 Aug 2008 11:29:21 -0500 > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > CC: zfs-discuss@opensolaris.org > Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / > driver failure better > > On Thu, 28 Aug 2008, Ross wrote: > > > > I believe ZFS should apply the same tough standards to pool > > availability as it does to data integrity. A bad checksum makes ZFS > > read the data from elsewhere, why shouldn't a timeout do the same > > thing? > > A problem is that for some devices, a five minute timeout is ok. For > others, there must be a problem if the device does not respond in a > second or two. > > If the system or device is simply overwelmed with work, then you would > not want the system to go haywire and make the problems much worse. > > Which of these do you prefer? > >o System waits substantial time for devices to (possibly) recover in > order to ensure that subsequently written data has the least > chance of being lost. 
> >o System immediately ignores slow devices and switches to > non-redundant non-fail-safe non-fault-tolerant may-lose-your-data > mode. When system is under intense load, it automatically > switches to the may-lose-your-data mode. > > Bob > == > Bob Friesenhahn > [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ > _ Get Hotmail on your mobile from Vodafone http://clk.atdmt.com/UKM/go/107571435/direct/01/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote: > > you're right in terms of fixed timeouts, but there's no reason it > can't compare the performance of redundant data sources, and if one > vdev performs an order of magnitude slower than another set of vdevs > with sufficient redundancy, stop issuing reads except scrubs/healing > to the underperformer (issue writes only), and pass an event to FMA. You are saying that I can't split my mirrors between a local disk in Dallas and a remote disk in New York accessed via iSCSI? Why don't you want me to be able to do that? ZFS already backs off from writing to slow vdevs. > ZFS can also compare the performance of a drive to itself over time, > and if the performance suddenly decreases, do the same. While this may be useful for reads, I would hate to disable redundancy just because a device is currently slow. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote: > > you're right in terms of fixed timeouts, but there's no reason it > can't compare the performance of redundant data sources, and if one > vdev performs an order of magnitude slower than another set of vdevs > with sufficient redundancy, stop issuing reads except scrubs/healing > to the underperformer (issue writes only), and pass an event to FMA. Yep, latency would be a useful metric to add to mirroring choices. The current logic is rather naive (round-robin) and could easily be enhanced. Making diagnoses based on this is much trickier, particularly at the ZFS level. A better option would be to leverage the SCSI FMA work going on to do a more intimate diagnosis at the scsa level. Also, the problem you are trying to solve - timing out the first I/O to take a long time - is not captured well by the type of hysteresis you would need to perform in order to do this diagnosis. It certainly can be done, but is much better suited to diagnosising a failing drive over time, not aborting a transaction in response to immediate failure. > This B_FAILFAST architecture captures the situation really poorly. I don't think you understand how this works. Imagine two I/Os, just with different sd timeouts and retry logic - that's B_FAILFAST. It's quite simple, and independent of any hardware implementation. - Eric -- Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> "jl" == Jonathan Loran <[EMAIL PROTECTED]> writes: jl> Fe = 46% failures/month * 12 months = 5.52 failures the original statistic wasn't of this kind. It was ``likelihood a single drive will experience one or more failures within 12 months''. so, you could say, ``If I have a thousand drives, about 4.66 of those drives will silently-corrupt at least once within 12 months.'' It is 0.466% no matter how many drives you have. And it's 4.66 drives, not 4.66 corruptions. The estimated number of corruptions is higher because some drives will corrupt twice, or thousands of times. It's not a BER, so you can't just add it like Richard did. If the original statistic in the paper were of the kind you're talking about, it would be larger than 0.466%. I'm not sure it would capture the situation well, though. I think you'd want to talk about bits of recoverable data after one year, not corruption ``events'', and this is not really measured well by the type of telemetry NetApp has. If it were, though, it would still be the same size number no matter how many drives you had. The 37% I gave was ``one or more within a population of 100 drives silently corrupts within 12 months.'' The 46% Richard gave has no meaning, and doesn't mean what you just said. The only statistic under discussion which (a) gets intimidatingly large as you increase the number of drives, and (b) is a ratio rather than, say, an absolute number of bits, is the one I gave. pgpl2HghkrzU1.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On Thu, Aug 28, 2008 at 12:38 PM, Bob Friesenhahn < [EMAIL PROTECTED]> wrote: > On Thu, 28 Aug 2008, Toby Thain wrote: > > > What goes unremarked here is how the original system has coped > > reliably for decades of (one guesses) geometrically growing load. > > Fantastic engineering from a company which went defunct shortly after > delivering the system. And let this be a lesson to all of you not to write code that is too good. If you can't sell an "update" (patch) every 6 months, you'll be out of business as well :D --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> "es" == Eric Schrock <[EMAIL PROTECTED]> writes: es> Finally, imposing additional timeouts in ZFS is a bad idea. es> [...] As such, it doesn't have the necessary context to know es> what constitutes a reasonable timeout. you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy, stop issuing reads except scrubs/healing to the underperformer (issue writes only), and pass an event to FMA. ZFS can also compare the performance of a drive to itself over time, and if the performance suddenly decreases, do the same. The former case eliminates the need for the mirror policies in SVM, which Ian requested a few hours ago for the situation that half the mirror is a slow iSCSI target for geographic redundancy and half is faster/local. Some care would have to be taken for targets shared by ZFS and some other initiator, but I'm not sure the care would really be that difficult to take, or that the oscillations induced by failing to take it would really be particularly harmful compared to unsupervised contention for a device. The latter notices quickly drives that have been pulled, or for Richard's ``overwhelmingly dominant'' case, for drives which are stalled for 30 seconds pending their report of an unrecovered read. Developing meaningful performance statistics for drives and a tool for displaying them would be useful for itself, not just for stopping freezes and preventing a failing drive from degrading performance a thousandfold. Issuing reads to redundant devices is cheap compared to freezing. The policy with which it's done is highly tunable and should be fun to tune and watch, and the consequence if the policy makes the wrong choice isn't incredibly dire. This B_FAILFAST architecture captures the situation really poorly. First, it's not implementable in any serious way with near-line drives, or really with any drives with which you're not intimately familiar and in control of firmware/release-engineering, and perhaps not with any drives period. I suspect in practice it's more a controller-level feature, about whether or not you'd like to distrust the device's error report and start resetting busses and channels and mucking everything up trying to recover from some kind of ``weirdness''. It's not an answer to the known problem of drives stalling for 30 seconds when they start to fail. First and a half, when it's not implemented, the system degrades to doubling your timeout pointlessly. A driver-level block cache of UNC's would probably have more value toward this speed/read-aggressiveness tradeoff than the whole B_FAILFAST architecture---just cache known unrecoverable read sectors, and refuse to issue further I/O for them until a timeout of 3 - 10 minutes passes. I bet this would speed up most failures tremendously, and without burdening upper layers with retry logic. Second, B_FAILFAST entertains the fantasy that I/O's are independent, while what happens in practice is that the drive hits a UNC on one I/O, and won't entertain any further I/O's no matter what flags the request has on it or how many times you ``reset'' things. Maybe you could try to rescue B_FAILFAST by putting clever statistics into the driver to compare the drive's performance to recent past as I suggested ZFS do, and admit no B_FAILFAST requests to queues of drives that have suddenly slowed down, just fail them immediately without even trying. 
I submit this queueing and statistic collection is actually _better_ managed by ZFS than the driver because ZFS can compare a whole floating-point statistic across a whole vdev, while even a driver which is fancier than we ever dreamed, is still playing poker with only 1 bit of input ``I'll call,'' or ``I'll fold.'' ZFS can see all the cards and get better results while being stupider and requiring less clever poker-guessing than would be required by a hypothetical driver B_FAILFAST implementation that actually worked. pgpqZb7GbAEgk.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
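For what it's worth, the UNC-cache idea above is small enough to sketch. A
rough userland mock-up follows; the names, table size and the 5-minute TTL are
all made up for illustration, and no such driver facility exists today:

/*
 * Remember sectors that recently returned unrecoverable-read errors and
 * fail further reads of them immediately, rather than letting the drive
 * stall for ~30 seconds on every retry.  Illustrative sketch only.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define	UNC_CACHE_SLOTS	64
#define	UNC_TTL_SECS	(5 * 60)	/* forget an entry after ~5 minutes */

typedef struct unc_entry {
	uint64_t	lba;
	time_t		when;
	int		valid;
} unc_entry_t;

static unc_entry_t unc_cache[UNC_CACHE_SLOTS];

/* Record a sector that just failed with an unrecoverable read error. */
static void
unc_remember(uint64_t lba)
{
	unc_entry_t *e = &unc_cache[lba % UNC_CACHE_SLOTS];

	e->lba = lba;
	e->when = time(NULL);
	e->valid = 1;
}

/* Should this read be failed immediately instead of bothering the drive? */
static int
unc_known_bad(uint64_t lba)
{
	unc_entry_t *e = &unc_cache[lba % UNC_CACHE_SLOTS];

	if (!e->valid || e->lba != lba)
		return (0);
	if (time(NULL) - e->when > UNC_TTL_SECS) {
		e->valid = 0;		/* expired; let the drive try again */
		return (0);
	}
	return (1);
}

int
main(void)
{
	unc_remember(123456789ULL);
	printf("LBA 123456789 known bad? %d\n", unc_known_bad(123456789ULL));
	printf("LBA 42 known bad? %d\n", unc_known_bad(42ULL));
	return (0);
}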
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:
> What is a ``failure rate for a time interval''?
>

Failure rate => Failures/unit time
Failure rate for a time interval => (Failures/unit time) * time

For example, if we have a failure rate:

  Fr = 46% failures/month

Then the expectation value of a failure in one year:

  Fe = 46% failures/month * 12 months = 5.52 failures

Jon

--
Jonathan Loran  -  IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  [EMAIL PROTECTED]
AST:7731^29u18e3
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On Aug 28, 2008, at 11:38 AM, Bob Friesenhahn wrote: > The old FORTRAN code > either had to be ported or new code written from scratch. Assuming it WAS written in FORTRAN there is no reason to believe it wouldn't just compile with a modern Fortran compiler. I've often run codes originally written in the sixties without any significant changes (very old codes may have used the frequency statement, toggled front panel lights or sensed toggle switches ... but that's pretty rare). -- Keith H. Bierman [EMAIL PROTECTED] | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 Copyright 2008 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On Thu, 28 Aug 2008, Toby Thain wrote: > > "two 20-year-old redundant mainframe configurations ... that > apparently are hanging on for dear life until reinforcements arrive > in the form of a new, state-of-the-art system this winter." > > And we all know that 'new, state-of-the-art systems' are silver > bullets and good value for money. The problem is that the replacement system is almost certain to be less reliable and cause problems for a while. The old FORTRAN code either had to be ported or new code written from scratch. If they used off the shelf software for the replacement then there is no way that the new system can be supported for 20 years. > What goes unremarked here is how the original system has coped > reliably for decades of (one guesses) geometrically growing load. Fantastic engineering from a company which went defunct shortly after delivering the system. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] xVM GRUB entry incorrect with ZFS root
On Thu, Aug 28, 2008 at 09:25:14AM -0700, Trevor Watson wrote:
> Looking at the GRUB menu, it appears as though the flags "-B $ZFS-BOOTFS" are
> needed to be passed to the kernel. Is this something I can add to: kernel$
> /boot/$ISADIR/xen.gz or is there some other mechanism required for booting
> Solaris xVM from ZFS ?

You need to add it to the next line (module$ ...). This was a bug that's now
fixed in the latest LU.

regards
john
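(For reference, a working xVM-on-ZFS-root entry typically looks something like
the following; the pool name and paths are illustrative, so adjust to match
your own menu.lst:

title Solaris xVM (ZFS root)
findroot (pool_rpool,0,a)
kernel$ /boot/$ISADIR/xen.gz
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive

i.e. the -B $ZFS-BOOTFS option goes on the module$ line carrying the unix
kernel, not on the kernel$ line that loads xen.gz.)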
Re: [zfs-discuss] ZFS Pools 1+TB
Kenny wrote:
>
> How did you determine from the format output the GB vs MB amount??
>
> Where do you compute 931 GB vs 931 MB from this??
>
> 2. c6t600A0B800049F93C030A48B3EA2Cd0 /scsi_vhci/[EMAIL PROTECTED]
>
> 3. c6t600A0B800049F93C030D48B3EAB6d0
> /scsi_vhci/[EMAIL PROTECTED]
>
It's in the part you didn't cut and paste:

AVAILABLE DISK SELECTIONS:
>   3. c6t600A0B800049F93C030D48B3EAB6d0
>      /scsi_vhci/[EMAIL PROTECTED]
>   4. c6t600A0B800049F93C031C48B3EC76d0
>      /scsi_vhci/[EMAIL PROTECTED]
>   8. c6t600A0B800049F93C031048B3EB44d0
>      /scsi_vhci/[EMAIL PROTECTED]

Look at the label: the last field.

> Please educate me!!
>
No problem. Things like this have happened to me from time to time.

 -Kyle

> Thanks again!
>
> --Kenny
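(For illustration only -- this is not Kenny's actual output, and the array
model/firmware string is made up -- the size Kyle is pointing at appears in
the drive label on the selection line, so the difference shows up like this:

AVAILABLE DISK SELECTIONS:
   2. c6t600A0B800049F93C030A48B3EA2Cd0 <SUN-LCSM100_F-0670-931.01GB>
      /scsi_vhci/[EMAIL PROTECTED]
   3. c6t600A0B800049F93C030D48B3EAB6d0 <SUN-LCSM100_F-0670-931.01MB>
      /scsi_vhci/[EMAIL PROTECTED]

i.e. disk 2 presents 931.01GB while disk 3 presents only 931.01MB.)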
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross, thanks for the feedback. A couple points here - A lot of work went into improving the error handling around build 77 of Nevada. There are still problems today, but a number of the complaints we've seen are on s10 software or older nevada builds that didn't have these fixes. Anything from the pre-2008 (or pre-s10u5) timeframe should be taken with grain of salt. There is a fix in the immediate future to prevent I/O timeouts from hanging other parts of the system - namely administrative commands and other pool activity. So I/O to that particular pool will hang, but you'll still be able to run your favorite ZFS commands, and it won't impact the ability of other pools to run. We have some good ideas on how to improve the retry logic. There is a flag in Solaris, B_FAILFAST, that tells the drive to not try too hard getting the data. However, it can return failure when trying harder would produce the correct results. Currently, we try the first I/O with B_FAILFAST, and if that fails immediately retry without the flag. The idea is to elevate the retry logic to a higher level, so when a read from a side of a mirror fails with B_FAILFAST, instead of immediately retrying the same device without the failfast flag, we push the error higher up the stack, and issue another B_FAILFAST I/O to the other half of the mirror. Only if both fail with failfast do we try a more thorough request (though with ditto blocks we may try another vdev alltogether). This should improve I/O error latency for a subset of failure scenarios, and biasing reads away from degraded (but not faulty) devices should also improve response time. The tricky part is incoporating this into the FMA diagnosis engine, as devices may fail B_FAILFAST requests for a variety of non-fatal reasons. Finally, imposing additional timeouts in ZFS is a bad idea. ZFS is designed to be a generic storage consumer. It can be layered on top of directly attached disks, SSDs, SAN devices, iSCSI targets, files, and basically anything else. As such, it doesn't have the necessary context to know what constitutes a reasonable timeout. This is explicitly delegated to the underlying storage subsystem. If a storage subsystem is timing out for excessive periods of time when B_FAILFAST is set, then that's a bug in the storage subsystem, and working around it in ZFS with yet another set of tunables is not practical. It will be interesting to see if this is an issue after the retry logic is modified as described above. Hope that helps, - Eric On Thu, Aug 28, 2008 at 01:08:26AM -0700, Ross wrote: > Since somebody else has just posted about their entire system locking up when > pulling a drive, I thought I'd raise this for discussion. > > I think Ralf made a very good point in the other thread. ZFS can guarantee > data integrity, what it can't do is guarantee data availability. The problem > is, the way ZFS is marketed people expect it to be able to do just that. > > This turned into a longer thread than expected, so I'll start with what I'm > asking for, and then attempt to explain my thinking. I'm essentially asking > for two features to improve the availability of ZFS pools: > > - Isolation of storage drivers so that buggy drivers do not bring down the OS. > > - ZFS timeouts to improve pool availability when no timely response is > received from storage drivers. 
> > And my reasons for asking for these is that there are now many, many posts on > here about people experiencing either total system lockup or ZFS lockup after > removing a hot swap drive, and indeed while some of them are using consumer > hardware, others have reported problems with server grade kit that definately > should be able to handle these errors: > > Aug 2008: AMD SB600 - System hang > - http://www.opensolaris.org/jive/thread.jspa?threadID=70349 > Aug 2008: Supermicro SAT2-MV8 - System hang > - http://www.opensolaris.org/jive/thread.jspa?messageID=271218 > May 2008: Sun hardware - ZFS hang > - http://opensolaris.org/jive/thread.jspa?messageID=240481 > Feb 2008: iSCSI - ZFS hang > - http://www.opensolaris.org/jive/thread.jspa?messageID=206985 > Oct 2007: Supermicro SAT2-MV8 - system hang > - http://www.opensolaris.org/jive/thread.jspa?messageID=166037 > Sept 2007: Fibre channel > - http://opensolaris.org/jive/thread.jspa?messageID=151719 > ... etc > > Now while the root cause of each of these may be slightly different, I feel > it would still be good to address this if possible as it's going to affect > the perception of ZFS as a reliable system. > > The common factor in all of these is that either the solaris driver hangs and > locks the OS, or ZFS hangs and locks the pool. Most of these are for > hardware that should handle these failures fine (mine occured for hardware > that definately works fine under windows), so I'm wondering: Is there > anything that can be done to prevent either type of lockup in these > situations? > > Firstly, for the OS
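To make the proposed retry ordering concrete, here is a tiny sketch in plain C
with an invented, stubbed-out issue_read() helper; it is not the actual ZFS
vdev code, just "B_FAILFAST everywhere first, thorough retries only afterwards":

#include <stdint.h>
#include <stdio.h>

#define	MIRROR_CHILDREN	2

/* Stub: pretend child 0 always fails and child 1 only succeeds on a slow retry. */
static int
issue_read(int child, uint64_t offset, void *buf, int failfast)
{
	(void) offset; (void) buf;
	return ((child == 1 && !failfast) ? 0 : -1);
}

static int
mirror_read(uint64_t offset, void *buf)
{
	/* Pass 1: cheap B_FAILFAST attempt against every child. */
	for (int c = 0; c < MIRROR_CHILDREN; c++)
		if (issue_read(c, offset, buf, 1) == 0)
			return (0);

	/* Pass 2: only now pay for slow, thorough retries. */
	for (int c = 0; c < MIRROR_CHILDREN; c++)
		if (issue_read(c, offset, buf, 0) == 0)
			return (0);

	return (-1);	/* all children failed; surface the error (or try ditto blocks) */
}

int
main(void)
{
	char buf[512];

	printf("read %s\n", mirror_read(0, buf) == 0 ? "succeeded" : "failed");
	return (0);
}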
Re: [zfs-discuss] ZFS Pools 1+TB
Ok, so I knew it had to be operator headspace... I found my error and have
fixed it in CAM. Thanks to all for helping my education!!

However I do have a question. And pardon if it's a 101 type... How did you
determine from the format output the GB vs MB amount?? Where do you compute
931 GB vs 931 MB from this??

   2. c6t600A0B800049F93C030A48B3EA2Cd0
      /scsi_vhci/[EMAIL PROTECTED]
   3. c6t600A0B800049F93C030D48B3EAB6d0
      /scsi_vhci/[EMAIL PROTECTED]

Please educate me!!

Thanks again!

--Kenny
Re: [zfs-discuss] ZFS Pools 1+TB
On Thu, 28 Aug 2008, Kenny wrote: > 2. c6t600A0B800049F93C030A48B3EA2Cd0 > /scsi_vhci/[EMAIL PROTECTED] Good. > 3. c6t600A0B800049F93C030D48B3EAB6d0 > /scsi_vhci/[EMAIL PROTECTED] Oops! Oops! Oops! It seems that some of your drives have the full 931.01GB exported while others have only 931.01MB exported. The smallest device size will be used to size the vdev in your pool. I sense a user error in the tedious CAM interface. CAM is slow so you need to be patient and take extra care when configuring the 2540 volumes. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pools 1+TB
On Thu, 28 Aug 2008, Kenny wrote:
> Bob, Thanks for the reply. Yes I did read your white paper and am using
> it!! Thanks again!!
>
> I used zpool iostat -v and it didn't give the information as advertised...
> see below

The lack of size information seems quite odd.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] xVM GRUB entry incorrect with ZFS root
Take a look at my xVM/GRUB config: http://malsserver.blogspot.com/2008/08/installing-xvm.html On Thu, Aug 28, 2008 at 9:25 AM, Trevor Watson <[EMAIL PROTECTED]>wrote: > I just ran live-upgrade of my system from nv94/UFS to nv96/ZFS on x86. > > nv96/ZFS boots okay. However, I can't boot the Solaris xVM partition as the > GRUB entry does not contain the necessary magic to tell grub to use ZFS > instead of UFS. > > Looking at the GRUB menu, it appears as though the flags "-B $ZFS-BOOTFS" > are needed to be passed to the kernel. Is this something I can add to: > kernel$ /boot/$ISADIR/xen.gz or is there some other mechanism required for > booting Solaris xVM from ZFS ? > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> "rm" == Robert Milkowski <[EMAIL PROTECTED]> writes: rm> Please look for slides 23-27 at rm> http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf yeah, ok, ONCE AGAIN, I never said that checksums are worthless. relling: some drives don't return errors on unrecoverable read events. carton: I doubt that. Tell me a story about one that doesn't. Your stories are about storage subsystems again, not drives. Also most or all of the slides aren't about unrecoverable read events. pgpitPlQ325Eo.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> "re" == Richard Elling <[EMAIL PROTECTED]> writes: re> There is no error in my math. I presented a failure rate for re> a time interval, What is a ``failure rate for a time interval''? AIUI, the failure rate for a time interval is 0.46% / yr, no matter how many drives you have. pgpeGoMP0F3vv.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Ross wrote: > > I believe ZFS should apply the same tough standards to pool > availability as it does to data integrity. A bad checksum makes ZFS > read the data from elsewhere, why shouldn't a timeout do the same > thing? A problem is that for some devices, a five minute timeout is ok. For others, there must be a problem if the device does not respond in a second or two. If the system or device is simply overwelmed with work, then you would not want the system to go haywire and make the problems much worse. Which of these do you prefer? o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost. o System immediately ignores slow devices and switches to non-redundant non-fail-safe non-fault-tolerant may-lose-your-data mode. When system is under intense load, it automatically switches to the may-lose-your-data mode. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] xVM GRUB entry incorrect with ZFS root
I just ran live-upgrade of my system from nv94/UFS to nv96/ZFS on x86. nv96/ZFS boots okay. However, I can't boot the Solaris xVM partition as the GRUB entry does not contain the necessary magic to tell grub to use ZFS instead of UFS. Looking at the GRUB menu, it appears as though the flags "-B $ZFS-BOOTFS" are needed to be passed to the kernel. Is this something I can add to: kernel$ /boot/$ISADIR/xen.gz or is there some other mechanism required for booting Solaris xVM from ZFS ? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] xVM GRUB entry incorrect with ZFS root
I just ran live-upgrade of my system from nv94/UFS to nv96/ZFS on x86. nv96/ZFS boots okay. However, I can't boot the Solaris xVM partition as the GRUB entry does not contain the necessary magic to tell grub to use ZFS instead of UFS. Looking at the GRUB menu, it appears as though the flags "-B $ZFS-BOOTFS" are needed to be passed to the kernel. Is this something I can add to: kernel$ /boot/$ISADIR/xen.gz or is there some other mechanism required for booting Solaris xVM from ZFS ? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Subversion repository on ZFS
On Aug 27, 2008, at 4:38 PM, Tim wrote: On Wed, Aug 27, 2008 at 3:29 PM, Ian Collins <[EMAIL PROTECTED]> wrote: Does anyone have any tuning tips for a Subversion repository on ZFS? The repository will mainly be storing binary (MS Office documents). It looks like a vanilla, uncompressed file system is the best bet. I have a SVN on ZFS repository with ~75K relatively small files and few binaries. That is working well without any special tuning. Shawn -- Shawn Ferry shawn.ferry at sun.com Senior Primary Systems Engineer Sun Managed Operations ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pools 1+TB
exactly :) On 8/28/08, Kyle McDonald <[EMAIL PROTECTED]> wrote: > Daniel Rock wrote: >> >> Kenny schrieb: >> >2. c6t600A0B800049F93C030A48B3EA2Cd0 >> >> > /scsi_vhci/[EMAIL PROTECTED] >> >3. c6t600A0B800049F93C030D48B3EAB6d0 >> >> > /scsi_vhci/[EMAIL PROTECTED] >> >> Disk 2: 931GB >> Disk 3: 931MB >> >> Do you see the difference? >> > Not just disk 3: > >> AVAILABLE DISK SELECTIONS: >>3. c6t600A0B800049F93C030D48B3EAB6d0 >> >> /scsi_vhci/[EMAIL PROTECTED] >>4. c6t600A0B800049F93C031C48B3EC76d0 >> >> /scsi_vhci/[EMAIL PROTECTED] >>8. c6t600A0B800049F93C031048B3EB44d0 >> >> /scsi_vhci/[EMAIL PROTECTED] >> > This all makes sense now, since a RAIDZ (or RAIDZ2) vdev can only be as > big as it's *smallest* component device. > >-Kyle > >> >> >> Daniel >> ___ >> zfs-discuss mailing list >> zfs-discuss@opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
[EMAIL PROTECTED] wrote on 08/28/2008 09:00:23 AM: > > On 28-Aug-08, at 10:54 AM, Toby Thain wrote: > > > > > On 28-Aug-08, at 10:11 AM, Richard Elling wrote: > > > >> It is rare to see this sort of "CNN Moment" attributed to file > >> corruption. > >> http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought- > >> Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 > >> > > > > "two 20-year-old redundant mainframe configurations ... that > > apparently are hanging on for dear life until reinforcements arrive > > in the form of a new, state-of-the-art system this winter." > > > > And we all know that 'new, state-of-the-art systems' are silver > > bullets and good value for money. > > > > What goes unremarked here is how the original system has coped > > reliably for decades of (one guesses) geometrically growing load. > > D'oh! It was remarked below the fold. I should have read page 2 > before writing. > > The original architects seem to have done an excellent job, how many > of us are designing systems expected to run for 2 decades? (Yes I > know the cycles are shorter these days. If you bought a PDP-11 you > were expected to keep it running forever with component level repairs.) > > --Toby > Then you also missed the all important crescendo where eweek uses the last quarter of a poorly written article to shill completely unrelated but yet inference to tie to the story software. -Wade ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Robert Milkowski wrote: > Hello Miles, > > Wednesday, August 27, 2008, 10:51:49 PM, you wrote: > > MN> It's not really enough for me, but what's more the case doesn't match > MN> what we were looking for: a device which ``never returns error codes, > MN> always returns silently bad data.'' I asked for this because you said > MN> ``However, not all devices return error codes which indicate > MN> unrecoverable reads,'' which I think is wrong. Rather, most devices > MN> sometimes don't, not some devices always don't. > > > > Please look for slides 23-27 at > http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf > > You really don't have to look very far to find this sort of thing. The scar just below my left knee is directly attributed to a bugid fixed in patch 106129-12. Warning: the following link may frighten experienced datacenter personnel, fortunately, the affected device is long since EOL. http://sunsolve.sun.com/search/document.do?assetkey=1-21-106129-12-1 -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On 28-Aug-08, at 10:54 AM, Toby Thain wrote: > > On 28-Aug-08, at 10:11 AM, Richard Elling wrote: > >> It is rare to see this sort of "CNN Moment" attributed to file >> corruption. >> http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought- >> Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 >> > > "two 20-year-old redundant mainframe configurations ... that > apparently are hanging on for dear life until reinforcements arrive > in the form of a new, state-of-the-art system this winter." > > And we all know that 'new, state-of-the-art systems' are silver > bullets and good value for money. > > What goes unremarked here is how the original system has coped > reliably for decades of (one guesses) geometrically growing load. D'oh! It was remarked below the fold. I should have read page 2 before writing. The original architects seem to have done an excellent job, how many of us are designing systems expected to run for 2 decades? (Yes I know the cycles are shorter these days. If you bought a PDP-11 you were expected to keep it running forever with component level repairs.) --Toby ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] liveupgrade ufs root -> zfs ?
Hi, I think LU 94->96 would be fine. If there are no zones on your system, simply do: # cd /Solaris_11/Tools/Installers # liveupgrade20 --nodisplay # lucreate -c BE94 -n BE96 -p newpool (the new pool must be on a disk with an SMI label) # luupgrade -u -n BE96 -s <path to snv_96 install image> # luactivate BE96 # init 6 Quite a lot of LU bugs were fixed between snv_90 and snv_96, so I think you should be able to complete the process successfully, barring any special cases. Paul Floyd wrote: > Hi > > On my opensolaris machine I currently have SXCEs 95 and 94 in two BEs. The > same fdisk partition contains /export/home and swap. In a separate fdisk > partition on another disk I have a ZFS pool. > > Does anyone have a pointer to a howto for doing a liveupgrade such that I can > convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 while I'm at > it) if this is possible? Searching with google shows a lot of blogs that > describe the early problems that existed when ZFS was first available (ON 90 > or so). > > A+ > Paul > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Hello Miles, Wednesday, August 27, 2008, 10:51:49 PM, you wrote: MN> It's not really enough for me, but what's more the case doesn't match MN> what we were looking for: a device which ``never returns error codes, MN> always returns silently bad data.'' I asked for this because you said MN> ``However, not all devices return error codes which indicate MN> unrecoverable reads,'' which I think is wrong. Rather, most devices MN> sometimes don't, not some devices always don't. Please look for slides 23-27 at http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf -- Best regards, Robert Milkowski mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On 28-Aug-08, at 10:11 AM, Richard Elling wrote: > It is rare to see this sort of "CNN Moment" attributed to file > corruption. > http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought- > Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 > "two 20-year-old redundant mainframe configurations ... that apparently are hanging on for dear life until reinforcements arrive in the form of a new, state-of-the-art system this winter." And we all know that 'new, state-of-the-art systems' are silver bullets and good value for money. What goes unremarked here is how the original system has coped reliably for decades of (one guesses) geometrically growing load. --Toby > -- richard > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] liveupgrade ufs root -> zfs ?
On Thu, 28 Aug 2008, Paul Floyd wrote: > Does anyone have a pointer to a howto for doing a liveupgrade such that > I can convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 > while I'm at it) if this is possible? Searching with google shows a lot > of blogs that describe the early problems that existed when ZFS was > first available (ON 90 or so). It should be fairly straightforward to convert to ZFS: lucreate -p <zfs pool> -n <new BE name> Doing the upgrade to 96 should be: luupgrade -u -n <BE name> -s <path to install image> Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote: > re> Indeed. Intuitively, the AFR and population is more easily > re> grokked by the masses. > > It's nothing to do with masses. There's an error in your math. It's > not right under any circumstance. > There is no error in my math. I presented a failure rate for a time interval, you presented a probability of failure over a time interval. The two are both correct, but say different things. Mathematically, an AFR > 100% is quite possible and quite common. A probability of failure > 100% (1.0) is not. In my experience, failure rates described as annualized failure rates (AFR) are more intuitive than their mathematically equivalent counterpart: MTBF. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
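A made-up example of the distinction: a drive population with an MTBF of 4,380 hours has an AFR of 8,760 / 4,380 = 200% per year, which is a perfectly sensible failure *rate* (two failures per drive-year on average). The *probability* that a given drive fails at least once during the year, assuming a simple exponential model, is 1 - e^-2, roughly 86%, which stays below 100% as a probability must.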
Re: [zfs-discuss] ZFS Pools 1+TB
Daniel Rock wrote: > > Kenny schrieb: > >2. c6t600A0B800049F93C030A48B3EA2Cd0 > > > /scsi_vhci/[EMAIL PROTECTED] > >3. c6t600A0B800049F93C030D48B3EAB6d0 > > > /scsi_vhci/[EMAIL PROTECTED] > > Disk 2: 931GB > Disk 3: 931MB > > Do you see the difference? > Not just disk 3: > AVAILABLE DISK SELECTIONS: >3. c6t600A0B800049F93C030D48B3EAB6d0 > /scsi_vhci/[EMAIL PROTECTED] >4. c6t600A0B800049F93C031C48B3EC76d0 > /scsi_vhci/[EMAIL PROTECTED] >8. c6t600A0B800049F93C031048B3EB44d0 > /scsi_vhci/[EMAIL PROTECTED] > This all makes sense now, since a RAIDZ (or RAIDZ2) vdev can only be as big as its *smallest* component device. -Kyle > > > Daniel > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
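To put rough numbers on it (illustrative only): each member of a raidz vdev contributes no more than the capacity of the smallest member, so an 11-device raidz1 whose smallest LUN is ~931 MB tops out around (11 - 1) x 931 MB ≈ 9 GB of usable space — consistent with the ~9.8 GB pool shown in the zpool iostat output earlier in the thread (pool-level figures include parity). A quick way to spot undersized LUNs before building the pool is iostat -En, which prints a Size: field for every device, e.g.:

# iostat -En | grep Size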
[zfs-discuss] liveupgrade ufs root -> zfs ?
Hi On my opensolaris machine I currently have SXCEs 95 and 94 in two BEs. The same fdisk partition contains /export/home and swap. In a separate fdisk partition on another disk I have a ZFS pool. Does anyone have a pointer to a howto for doing a liveupgrade such that I can convert the SXCE 94 UFS BE to ZFS (and liveupgrade to SXCE 96 while I'm at it) if this is possible? Searching with google shows a lot of blogs that describe the early problems that existed when ZFS was first available (ON 90 or so). A+ Paul This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
On Thu, Aug 28, 2008 at 06:11:06AM -0700, Richard Elling wrote: > It is rare to see this sort of "CNN Moment" attributed to file corruption. > http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought-Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 `file corruption' takes the blame all the time, in my experience, but what does it mean? It likely has nothing to do with the filesystem. Probably an application wrote incorrect information into a file. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pools 1+TB
Kenny schrieb: >2. c6t600A0B800049F93C030A48B3EA2Cd0 > /scsi_vhci/[EMAIL PROTECTED] >3. c6t600A0B800049F93C030D48B3EAB6d0 > /scsi_vhci/[EMAIL PROTECTED] Disk 2: 931GB Disk 3: 931MB Do you see the difference? Daniel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] trouble with resilver after removing drive from 3510
Hello all, I tried to test the behavior of a zpool recovering after removing one drive, with strange results. Setup: SunFire V240 / 4 GB RAM, Solaris 10u5, fully patched (last week); one 3510 with 12x 140 GB FC drives, 12 LUNs (every drive is one LUN — I don't want to use the RAID hardware, letting ZFS do it all); one pool with 5x2 disks and 2 spares (details below). After pulling drive 2 it took about two minutes to recognise the situation. zpool status output and zpool iostat 1 output are very slow: some lines come fast, then it stops for about 30-60 seconds, but they do complete eventually. The resilver has started but is VERY slow and shows strange data — the % done value goes up and down all the time. I don't think it is working correctly. zpool iostat 1 (when it works) shows many reads but very few writes; I would have expected roughly equal read and write rates, reading from the intact mirror side and writing to the spare disk. Most of the time during the resilver the machine is 99% idle, with at most 10% kernel load for short periods. I have now waited for more than one day but nothing is getting better. I did not put a new drive in; I wanted to see one spare getting into use.

Snip of zpool iostat 1:

tank  337G  343G  313   2  37.4M  19.3K
tank  337G  343G  240   5  29.0M  38.6K
tank  337G  343G  355   6  44.4M  45.0K
tank  337G  343G  336   8  41.6M  57.9K
tank  337G  343G  422   0  46.0M  0
tank  337G  343G  415  10  49.4M  70.8K
tank  337G  343G  358   0  43.3M  0
tank  337G  343G  340  10  42.6M  70.8K
tank  337G  343G  323   5  38.1M  38.6K
tank  337G  343G  315   0  35.0M  0
tank  337G  343G  336   0  40.0M  6.43K
tank  337G  343G  388  10  46.8M  70.8K
tank  337G  343G  351   4  43.9M  32.2K
tank  337G  343G    5   5   620K  285K

Nothing useful (at least for me) in messages. After grep -v of both of these lines:

date+time nftp scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],70/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0/[EMAIL PROTECTED],1 (ssd48):
date+time nftp drive offline

only these entries are left:

Aug 27 13:04:22 nftp i/o to invalid geometry
Aug 27 13:04:32 nftp i/o to invalid geometry
Aug 27 13:04:37 nftp i/o to invalid geometry
Aug 27 13:04:37 nftp i/o to invalid geometry
Aug 27 13:04:47 nftp i/o to invalid geometry
Aug 27 13:04:52 nftp i/o to invalid geometry
Aug 27 13:05:23 nftp fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 27 13:05:23 nftp EVENT-TIME: Wed Aug 27 13:05:22 CEST 2008
Aug 27 13:05:23 nftp PLATFORM: SUNW,Sun-Fire-V240, CSN: -, HOSTNAME: nftp
Aug 27 13:05:23 nftp SOURCE: zfs-diagnosis, REV: 1.0
Aug 27 13:05:23 nftp EVENT-ID: ea01afff-c58e-6b32-e345-81da8bf43146
Aug 27 13:05:23 nftp DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
Aug 27 13:05:23 nftp AUTO-RESPONSE: No automated response will occur.
Aug 27 13:05:23 nftp IMPACT: Fault tolerance of the pool may be compromised.
Aug 27 13:05:23 nftp REC-ACTION: Run 'zpool status -x' and replace the bad device.
uname -a SunOS nftp 5.10 Generic_137111-04 sun4u sparc SUNW,Sun-Fire-V240 before pulling drive: sccli> show disk Ch Id Size Speed LD Status IDs Rev 2(3) 0 136.73GB 200MB ld0ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY602V37412 WWNN 200C505EB811 2(3) 1 136.73GB 200MB ld1ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY61JX47412 WWNN 200C505EB885 2(3) 2 136.73GB 200MB ld2ONLINE SEAGATE ST3146807FC 0006 S/N 3HY62EGZ7443 WWNN 200C50D76130 2(3) 3 136.73GB 200MB ld3ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY61JKG7411 WWNN 200C505EB815 2(3) 4 136.73GB 200MB ld4ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY60YHX7410 WWNN 200C505EBCBB 2(3) 5 136.73GB 200MB ld5ONLINE SEAGATE ST314680FSUN146G 0407 S/N 3HY61FQ07412 WWNN 200C505E98B9 2(3) 6 136.73GB 200MB ld6
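If the spare never gets pulled in automatically, it can be attached by hand — a sketch only, with placeholder device names (substitute the actual c#t#d# names from your own zpool status output):

# zpool status -x tank
# zpool replace tank <pulled-disk> <spare-disk>

zpool status should then show the spare resilvering in place of the pulled drive. Once the failed drive has been physically replaced and resilvered back in (zpool replace tank <pulled-disk> <new-disk>), the spare can be returned to the available list with zpool detach tank <spare-disk>.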
[zfs-discuss] eWeek: corrupt file brought down FAA's antiquated IT system
It is rare to see this sort of "CNN Moment" attributed to file corruption. http://www.eweek.com/c/a/IT-Infrastructure/Corrupt-File-Brought-Down-FAAs-Antiquated-IT-System/?kc=EWKNLNAV08282008STR4 -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import sees two pools
Victor Latushkin wrote: On 28.08.08 15:06, Chris Gerhard wrote: I have a USB disk with a pool on it called removable. On one laptop zpool import removable works just fine but on another with the same disk attached it tells me there is more than one matching pool: : sigma TS 6 $; pfexec zpool import removable cannot import 'removable': more than one matching pool import by numeric ID instead : sigma TS 7 $; pfexec zpool importpool: removable id: 16711095403932498465 state: ONLINE status: The pool is formatted using an older on-disk version. action: The pool can be imported using its name or numeric identifier, though some features will not be available without an explicit 'zpool upgrade'. config: removable ONLINE c3t0d0ONLINE pool: removable id: 13348174994041916803 state: FAULTED status: The pool metadata is corrupted. action: The pool cannot be imported due to damaged devices or data. see: http://www.sun.com/msg/ZFS-8000-72 config: removable FAULTED corrupted data c3t0d0p0 ONLINE : sigma TS 8 $; What I find curious is that this only happens on one system. Any ideas? What Solaris/ZFS versions are these systems running? it is a wild guess but may be there's some stale label with newer version which is recognized by one system and not recognized by another? Both are running snv_94. The system with the problem is nevada the system without the problem is OpenSolaris. What does zdb -l say? # zdb -l /dev/rdsk/c3t0d0p0 LABEL 0 failed to unpack label 0 LABEL 1 failed to unpack label 1 LABEL 2 version=1 name='removable' state=1 txg=18676 pool_guid=13348174994041916803 top_guid=17964267360868847787 guid=17964267360868847787 vdev_tree type='disk' id=0 guid=17964267360868847787 path='/vol/dev/dsk/c5t0d0/unknown_format' whole_disk=0 metaslab_array=13 metaslab_shift=30 ashift=9 asize=164691705856 LABEL 3 version=1 name='removable' state=1 txg=18676 pool_guid=13348174994041916803 top_guid=17964267360868847787 guid=17964267360868847787 vdev_tree type='disk' id=0 guid=17964267360868847787 path='/vol/dev/dsk/c5t0d0/unknown_format' whole_disk=0 metaslab_array=13 metaslab_shift=30 ashift=9 asize=164691705856 # # zdb -l /dev/rdsk/c3t0d0 LABEL 0 failed to unpack label 0 LABEL 1 failed to unpack label 1 LABEL 2 version=1 name='removable' state=1 txg=18676 pool_guid=13348174994041916803 top_guid=17964267360868847787 guid=17964267360868847787 vdev_tree type='disk' id=0 guid=17964267360868847787 path='/vol/dev/dsk/c5t0d0/unknown_format' whole_disk=0 metaslab_array=13 metaslab_shift=30 ashift=9 asize=164691705856 LABEL 3 version=1 name='removable' state=1 txg=18676 pool_guid=13348174994041916803 top_guid=17964267360868847787 guid=17964267360868847787 vdev_tree type='disk' id=0 guid=17964267360868847787 path='/vol/dev/dsk/c5t0d0/unknown_format' whole_disk=0 metaslab_array=13 metaslab_shift=30 ashift=9 asize=164691705856 # # zdb -l /dev/rdsk/c3t0d0s0 LABEL 0 version=10 name='removable' state=1 txg=75874 pool_guid=16711095403932498465 hostid=696785690 hostname='sigma' top_guid=18371933882888483558 guid=18371933882888483558 vdev_tree type='disk' id=0 guid=18371933882888483558 path='/dev/dsk/c3t0d0s0' devid='id1,[EMAIL PROTECTED]/a' phys_path='/[EMAIL PROTECTED],0/pci1028,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a' whole_disk=1 metaslab_array=14 metaslab_shift=30 ashift=9 asize=164683055104 is_log=0 DTL=18 LABEL 1 version=10 name='removable' state=1 txg=75874 p
Re: [zfs-discuss] ZFS Pools 1+TB
Tim, Per your request... df -h bash-3.00# df -h Filesystem size used avail capacity Mounted on /dev/md/dsk/d10 98G 4.2G92G 5%/ /devices 0K 0K 0K 0%/devices ctfs 0K 0K 0K 0%/system/contract proc 0K 0K 0K 0%/proc mnttab 0K 0K 0K 0%/etc/mnttab swap32G 1.4M32G 1%/etc/svc/volatile objfs0K 0K 0K 0%/system/object /platform/SUNW,SPARC-Enterprise-T5220/lib/libc_psr/libc_psr_hwcap1.so.1 98G 4.2G92G 5% /platform/sun4v/lib/libc_psr.so.1 /platform/SUNW,SPARC-Enterprise-T5220/lib/sparcv9/libc_psr/libc_psr_hwcap1.so.1 98G 4.2G92G 5% /platform/sun4v/lib/sparcv9/libc_psr.so.1 fd 0K 0K 0K 0%/dev/fd /dev/md/dsk/d50 19G 4.3G15G23%/var swap 512M 112K 512M 1%/tmp swap32G40K32G 1%/var/run /dev/md/dsk/d309.6G 1.5G 8.1G16%/opt /dev/md/dsk/d401.9G 142M 1.7G 8%/export/home /vol/dev/dsk/c0t0d0/fm540cd3 591M 591M 0K 100%/cdrom/fm540cd3 log_data 8.8G44K 8.8G 1%/log_data bash-3.00# bash-3.00# df -h v/dsk/c0t0d0/fm540cd3 591M 591M 0K 100%/cdrom/fm540cd3 log_data 8.8G44K 8.8G 1%/log_data zpool status bash-3.00# zpool status pool: log_data state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM log_data ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c6t600A0B800049F93C030A48B3EA2Cd0 ONLINE 0 0 0 c6t600A0B800049F93C030D48B3EAB6d0 ONLINE 0 0 0 c6t600A0B800049F93C031C48B3EC76d0 ONLINE 0 0 0 c6t600A0B800049F93C031F48B3ECA8d0 ONLINE 0 0 0 c6t600A0B800049F93C030448B3CDEEd0 ONLINE 0 0 0 c6t600A0B800049F93C030748B3E9F0d0 ONLINE 0 0 0 c6t600A0B800049F93C031048B3EB44d0 ONLINE 0 0 0 c6t600A0B800049F93C031348B3EB94d0 ONLINE 0 0 0 c6t600A0B800049F93C031648B3EBE4d0 ONLINE 0 0 0 c6t600A0B800049F93C031948B3EC28d0 ONLINE 0 0 0 c6t600A0B800049F93C032248B3ECDEd0 ONLINE 0 0 0 errors: No known data errors format bash-3.00# format Searching for disks...done AVAILABLE DISK SELECTIONS: 0. c1t0d0 /[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 1. c1t1d0 /[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 2. c6t600A0B800049F93C030A48B3EA2Cd0 /scsi_vhci/[EMAIL PROTECTED] 3. c6t600A0B800049F93C030D48B3EAB6d0 /scsi_vhci/[EMAIL PROTECTED] 4. c6t600A0B800049F93C031C48B3EC76d0 /scsi_vhci/[EMAIL PROTECTED] 5. c6t600A0B800049F93C031F48B3ECA8d0 /scsi_vhci/[EMAIL PROTECTED] 6. c6t600A0B800049F93C030448B3CDEEd0 /scsi_vhci/[EMAIL PROTECTED] 7. c6t600A0B800049F93C030748B3E9F0d0 /scsi_vhci/[EMAIL PROTECTED] 8. c6t600A0B800049F93C031048B3EB44d0 /scsi_vhci/[EMAIL PROTECTED] 9. c6t600A0B800049F93C031348B3EB94d0 /scsi_vhci/[EMAIL PROTECTED] 10. c6t600A0B800049F93C031648B3EBE4d0 /scsi_vhci/[EMAIL PROTECTED] 11. c6t600A0B800049F93C031948B3EC28d0 /scsi_vhci/[EMAIL PROTECTED] 12. c6t600A0B800049F93C032248B3ECDEd0 /scsi_vhci/[EMAIL PROTECTED] Specify disk (enter its number): This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pools 1+TB
Bob, Thanks for the reply. Yes I did read your white paper and am using it!! Thanks again!! I used zpool iostat -v and it did't give the information as advertised... see below bash-3.00# zpool iostat -v capacity operationsbandwidth poolused avail read write read write -- - - - - - - log_data 147K 9.81G 0 0 0 4 raidz1147K 9.81G 0 0 0 4 c6t600A0B800049F93C030A48B3EA2Cd0 - - 0 0 0 22 c6t600A0B800049F93C030D48B3EAB6d0 - - 0 0 0 22 c6t600A0B800049F93C031C48B3EC76d0 - - 0 0 0 22 c6t600A0B800049F93C031F48B3ECA8d0 - - 0 0 0 22 c6t600A0B800049F93C030448B3CDEEd0 - - 0 0 0 22 c6t600A0B800049F93C030748B3E9F0d0 - - 0 0 0 22 c6t600A0B800049F93C031048B3EB44d0 - - 0 0 0 22 c6t600A0B800049F93C031348B3EB94d0 - - 0 0 0 22 c6t600A0B800049F93C031648B3EBE4d0 - - 0 0 0 22 c6t600A0B800049F93C031948B3EC28d0 - - 0 0 0 22 c6t600A0B800049F93C032248B3ECDEd0 - - 0 0 0 22 -- - - - - - - (sorry but I can't get the horizontal format to set the columns correctly...) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import sees two pools
On 28.08.08 15:06, Chris Gerhard wrote: > I have a USB disk with a pool on it called removable. On one laptop > zpool import removable works just fine but on another with the same > disk attached it tells me there is more than one matching pool: > > : sigma TS 6 $; pfexec zpool import removable > cannot import 'removable': more than one matching pool > import by numeric ID instead > : sigma TS 7 $; pfexec zpool import > pool: removable > id: 16711095403932498465 > state: ONLINE > status: The pool is formatted using an older on-disk version. > action: The pool can be imported using its name or numeric identifier, though > some features will not be available without an explicit 'zpool > upgrade'. > config: > > removable ONLINE > c3t0d0ONLINE > > pool: removable > id: 13348174994041916803 > state: FAULTED > status: The pool metadata is corrupted. > action: The pool cannot be imported due to damaged devices or data. >see: http://www.sun.com/msg/ZFS-8000-72 > config: > > removable FAULTED corrupted data > c3t0d0p0 ONLINE > : sigma TS 8 $; > > What I find curious is that this only happens on one system. Any ideas? What Solaris/ZFS versions are these systems running? it is a wild guess but may be there's some stale label with newer version which is recognized by one system and not recognized by another? What does zdb -l say? victor ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS hangs/freezes after disk failure,
Hi Todd, sorry for the delay in responding, been head down rewriting a utility for the last few days. Todd H. Poole wrote: > Howdy James, > > While responding to halstead's post (see below), I had to restart several > times to complete some testing. I'm not sure if that's important to these > commands or not, but I just wanted to put it out there anyway. > >> A few commands that you could provide the output from >> include: >> >> >> (these two show any FMA-related telemetry) >> fmadm faulty >> fmdump -v > > This is the output from both commands: > > [EMAIL PROTECTED]:~# fmadm faulty > --- -- - > TIMEEVENT-ID MSG-ID SEVERITY > --- -- - > Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FDMajor > > Fault class : fault.fs.zfs.vdev.io > Description : The number of I/O errors associated with a ZFS device exceeded > acceptable levels. Refer to > http://sun.com/msg/ZFS-8000-FD > for more information. > Response: The device has been offlined and marked as faulted. An attempt > will be made to activate a hot spare if available. > Impact : Fault tolerance of the pool may be compromised. > Action : Run 'zpool status -x' and replace the bad device. > > [EMAIL PROTECTED]:~# fmdump -v > TIME UUID SUNW-MSG-ID > Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD > 100% fault.fs.zfs.vdev.io > >Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719 > Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719 > FRU: - > Location: - In other emails in this thread you've mentioned the desire to get an email (or some sort of notification) when Problems Happen(tm) in your system, and the FMA framework is how we achieve that in OpenSolaris. # fmadm config MODULE VERSION STATUS DESCRIPTION cpumem-retire1.1 active CPU/Memory Retire Agent disk-transport 1.0 active Disk Transport Agent eft 1.16active eft diagnosis engine fabric-xlate 1.0 active Fabric Ereport Translater fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis io-retire2.0 active I/O Retire Agent snmp-trapgen 1.0 active SNMP Trap Generation Agent sysevent-transport 1.0 active SysEvent Transport Agent syslog-msgs 1.0 active Syslog Messaging Agent zfs-diagnosis1.0 active ZFS Diagnosis Engine zfs-retire 1.0 active ZFS Retire Agent You'll notice that we've got an SNMP agent there... and you can acquire a copy of the FMA mib from the Fault Management community pages (http://opensolaris.org/os/community/fm and http://opensolaris.org/os/community/fm/mib/). >> (this shows your storage controllers and what's >> connected to them) cfgadm -lav > > This is the output from cfgadm -lav > > [EMAIL PROTECTED]:~# cfgadm -lav > Ap_Id Receptacle Occupant Condition > Information > When Type Busy Phys_Id > usb2/1 emptyunconfigured ok > unavailable unknown n/devices/[EMAIL > PROTECTED],0/pci1458,[EMAIL PROTECTED]:1 > usb2/2 connectedconfigured ok > Mfg: Microsoft Product: Microsoft 3-Button Mouse with IntelliEye(TM) > NConfigs: 1 Config: 0 > unavailable usb-mousen/devices/[EMAIL > PROTECTED],0/pci1458,[EMAIL PROTECTED]:2 > usb3/1 emptyunconfigured ok [snip] > usb7/2 emptyunconfigured ok > unavailable unknown n/devices/[EMAIL > PROTECTED],0/pci1458,[EMAIL PROTECTED],1:2 > > You'll notice that the only thing listed is my USB mouse... is that expected? Yup. One of the artefacts of the cfgadm architecture. cfgadm(1m) works by using plugins - usb, FC, SCSI, SATA, pci hotplug, InfiniBand... but not IDE. 
I think you also were wondering how to tell what controller instances your disks were using in IDE mode - two basic ways of achieving this: /usr/bin/iostat -En and /usr/sbin/format Your IDE disks will attach using the cmdk driver and show up like this: c1d0 c1d1 c2d0 c2d1 In AHCI/SATA mode they'd show up as c1t0d0 c1t1d0 c1t2d0 c1t3d0 or something similar, depending on how the bios and the actual controllers sort themselves out. >> You'll also find messages in /var/adm/messages which >> might prove >> useful to review. > > If you really want, I can list the output from /var/adm/messages, but it > doesn't seem to add anything new to what I've already copied and pasted. No need - you've got them if you need them. [snip] >> http://docs.sun.com/app/docs/coll/40.1
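If you want to watch the raw telemetry that feeds those diagnoses (as opposed to the faults that fmadm faulty reports), the error log is kept separately from the fault log — a quick sketch:

# fmdump -e       (one-line summary of each error report, i.e. ereport)
# fmdump -eV      (the same records in full detail)
# fmdump -e -f    (follow new ereports as they arrive, tail -f style)

These are the per-I/O disk and transport ereports that the zfs-diagnosis engine rolls up into faults like the ZFS-8000-FD shown above.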
Re: [zfs-discuss] Will there be a GUI for ZFS ?
On Thu, Aug 28, 2008 at 3:47 AM, Klaus Bergius <[EMAIL PROTECTED]>wrote: > I'll second the original questions, but would like to know specifically > when we will see (or how to install) the ZFS admin gui for OpenSolaris > 2008.05. > I installed 2008.05, then updated the system, so it is now snv_95. > There are no smc* commands, there is no service 'webconsole' to be seen in > svcs -a, > because: there is no SUNWzfsg package installed. > However, the SUNWzfsg package is also not in the > pkg.opensolaris.orgrepository. > > Any hint where to find the package? I would really love to have the zfs > admin gui on my system. > > -Klaus > > My personal conspiracy theory is it's part of "project fishworks" that is still under wraps. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS boot reservations
Hey folks, Tim Foster just linked this bug to the zfs auto backup mailing list, and I wondered if anybody knew if the work being done on ZFS boot includes making use of ZFS reservations to ensure the boot filesystems always have enough free space? http://defect.opensolaris.org/bz/show_bug.cgi?id=3132 Ross This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
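Until something like that is integrated, it's easy enough to approximate by hand — a sketch, assuming the usual root-pool layout of rpool/ROOT/<BE> (adjust the pool and BE names to your own system):

# zfs set reservation=2G rpool/ROOT/snv_96

That guarantees the boot environment 2 GB of space regardless of how full snapshots, /export, or the swap and dump volumes make the rest of the pool; refreservation could be used instead if space held by the BE's own snapshots shouldn't count towards the guarantee.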
[zfs-discuss] zpool import sees two pools
I have a USB disk with a pool on it called removable. On one laptop zpool import removable works just fine but on another with the same disk attached it tells me there is more than one matching pool: : sigma TS 6 $; pfexec zpool import removable cannot import 'removable': more than one matching pool import by numeric ID instead : sigma TS 7 $; pfexec zpool import pool: removable id: 16711095403932498465 state: ONLINE status: The pool is formatted using an older on-disk version. action: The pool can be imported using its name or numeric identifier, though some features will not be available without an explicit 'zpool upgrade'. config: removable ONLINE c3t0d0ONLINE pool: removable id: 13348174994041916803 state: FAULTED status: The pool metadata is corrupted. action: The pool cannot be imported due to damaged devices or data. see: http://www.sun.com/msg/ZFS-8000-72 config: removable FAULTED corrupted data c3t0d0p0 ONLINE : sigma TS 8 $; What I find curious is that this only happens on one system. Any ideas? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Will there be a GUI for ZFS ?
There is no good ZFS gui. Nothing that is actively maintained, anyway. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Will there be a GUI for ZFS ?
I'll second the original questions, but would like to know specifically when we will see (or how to install) the ZFS admin gui for OpenSolaris 2008.05. I installed 2008.05, then updated the system, so it is now snv_95. There are no smc* commands, there is no service 'webconsole' to be seen in svcs -a, because: there is no SUNWzfsg package installed. However, the SUNWzfsg package is also not in the pkg.opensolaris.org repository. Any hint where to find the package? I would really love to have the zfs admin gui on my system. -Klaus This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] [Fwd: Re: Review for 6729208 Optimize macros in sys/byteorder.h (due Sept. 3)]
Not the common case for ZFS, but a useful performance improvement when it does happen. This is a result of some follow-on work to the byteswapping optimisation Dan has done for the crypto algorithms in OpenSolaris.

 Original Message 
Subject: Re: Review for 6729208 Optimize macros in sys/byteorder.h (due Sept. 3)
Date: Wed, 27 Aug 2008 11:56:23 -0700 (PDT)
From: Dan Anderson <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]

Here are some performance results from running "find . -exec ls -l" on separate ZFS filesystems created on x86 and sparc and imported/exported to amd64, em64t, and sun4u platforms. This shows the performance gain from the optimized byteorder.h macros.

Percent savings, real time (ZFS filesystem created originally on x86 or sparc):

Platform   x86   sparc
amd64      4%    3%
em64t      3%    4%
sun4u      4%    2%

Environment:
* Create 2 separate ZFS filesystems with 1024 directories, each with 32 files, on x86 and sparc, and zpool export/import to the other systems.
* Run this command on the ZFS filesystem: find . -exec ls -l {} \; >/dev/null
* Run using NV97 with and without the fix for RFE 6729208 (byteorder.h macro optimization)

BTW, I could still use some code review comments: http://dan.drydog.com/reviews/6729208-bswap3/ -- This message posted from opensolaris.org ___ crypto-discuss mailing list [EMAIL PROTECTED] http://mail.opensolaris.org/mailman/listinfo/crypto-discuss -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion. I think Ralf made a very good point in the other thread. ZFS can guarantee data integrity; what it can't do is guarantee data availability. The problem is, the way ZFS is marketed, people expect it to be able to do just that. This turned into a longer thread than expected, so I'll start with what I'm asking for, and then attempt to explain my thinking. I'm essentially asking for two features to improve the availability of ZFS pools:

- Isolation of storage drivers so that buggy drivers do not bring down the OS.
- ZFS timeouts to improve pool availability when no timely response is received from storage drivers.

And my reason for asking for these is that there are now many, many posts on here about people experiencing either total system lockup or ZFS lockup after removing a hot swap drive, and indeed while some of them are using consumer hardware, others have reported problems with server grade kit that definitely should be able to handle these errors:

Aug 2008: AMD SB600 - System hang - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
Aug 2008: Supermicro SAT2-MV8 - System hang - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
May 2008: Sun hardware - ZFS hang - http://opensolaris.org/jive/thread.jspa?messageID=240481
Feb 2008: iSCSI - ZFS hang - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
Oct 2007: Supermicro SAT2-MV8 - system hang - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
Sept 2007: Fibre channel - http://opensolaris.org/jive/thread.jspa?messageID=151719
... etc

Now while the root cause of each of these may be slightly different, I feel it would still be good to address this if possible, as it's going to affect the perception of ZFS as a reliable system. The common factor in all of these is that either the Solaris driver hangs and locks the OS, or ZFS hangs and locks the pool. Most of these are for hardware that should handle these failures fine (mine occurred for hardware that definitely works fine under Windows), so I'm wondering: is there anything that can be done to prevent either type of lockup in these situations? Firstly, for the OS: if a storage component (hardware or driver) fails for a non-essential part of the system, the entire OS should not hang. I appreciate there isn't a lot you can do if the OS is using the same driver as its storage, but certainly in some of the cases above the OS and the data are using different drivers, and I expect more examples of that could be found with a bit of work. Is there any way storage drivers could be isolated such that the OS (and hence ZFS) can report a problem with that particular driver without hanging the entire system? Please note: I know work is being done on FMA to handle all kinds of bugs, I'm not talking about that. It seems to me that FMA involves proper detection and reporting of bugs, which involves knowing in advance what the problems are and how to report them. What I'm looking for is something much simpler, something that's able to keep the OS running when it encounters unexpected or unhandled behaviour from storage drivers or hardware. It seems to me that one of the benefits of ZFS is working against it here. It's such a flexible system that it's being used for many, many types of devices, and that means there are a whole host of drivers being used, and a lot of scope for bugs in those drivers.
I know that ultimately any driver issues will need to be sorted individually, but what I'm wondering is whether there's any possibility of putting some error checking code at a layer above the drivers in such a way it's able to trap major problems without hanging the OS? ie: update ZFS/Solaris so they can handle storage layer bugs gracefully without downing the entire system. My second suggestion is to ask if ZFS can be made to handle unexpected events more gracefully. In the past I've suggested that ZFS have a separate timeout so that a redundant pool can continue working even if one device is not responding, and I really think that would be worthwhile. My idea is to have a "WAITING" status flag for drives, so that if one isn't responding quickly, ZFS can flag it as "WAITING", and attempt to read or write the same data from elsewhere in the pool. That would work alongside the existing failure modes, and would allow ZFS to handle hung drivers much more smoothly, preventing redundant pools hanging when a single drive fails. The ZFS update I feel is particularly appropriate. ZFS already uses checksumming since it doesn't trust drivers or hardware to always return the correct data. But ZFS then trusts those same drivers and hardware absolutely when it comes to the availability of the pool. I believe ZFS should apply the same tough standards to pool availability as it does to
Re: [zfs-discuss] Subversion repository on ZFS
Toby Thain wrote: > On 27-Aug-08, at 5:47 PM, Ian Collins wrote: > >> Tim writes: >> >>> On Wed, Aug 27, 2008 at 3:29 PM, Ian Collins <[EMAIL PROTECTED]> >>> wrote: >>> Does anyone have any tuning tips for a Subversion repository on ZFS? The repository will mainly be storing binary (MS Office documents). It looks like a vanilla, uncompressed file system is the best bet. >>> I believe this is called sharepoint :D >> Don't mention that abomination! > > Amen. Don't mention _that_ abomination! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss