> "es" == Eric Schrock <[EMAIL PROTECTED]> writes:
es> Are you running your experiments on build 101 or later?
no.
Aside from that quick one for copies=2, I'm pretty bad about running
well-designed experiments. And I do have old builds. I need to buy
more hardware.
It's hard to know how
On Wed, Nov 26, 2008 at 07:02:11PM -0500, Miles Nordin wrote:
> (2) The FMA model of collecting telemetry, taking it into
> user-space, chin-strokingly contemplating it for a while, then
> decreeing a diagnosis, is actually a rather limited one. I can
> think of two kinds of limit
On Wed, 26 Nov 2008, Miles Nordin wrote:
>
> (2) The FMA model of collecting telemetry, taking it into
> user-space, chin-strokingly contemplating it for a while, then
> decreeing a diagnosis, is actually a rather limited one. I can
> think of two kinds of limit:
>
> (a) you're di
> "rs" == Ross Smith <[EMAIL PROTECTED]> writes:
> "nw" == Nicolas Williams <[EMAIL PROTECTED]> writes:
rs> I disagree Bob, I think this is a very different function to
rs> that which FMA provides.
I see two problems.
(1) FMA doesn't seem to work very well, and was used as an ex
It's hard to tell exactly what you are asking for, but this sounds
similar to how ZFS already works. If ZFS decides that a device is
pathologically broken (as evidenced by vdev_probe() failure), it knows
that FMA will come back and diagnose the drive as faulty (because we
generate a probe_failure
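A minimal sketch of the division of labour Eric describes, telemetry first
and diagnosis afterwards, using invented names (nothing below is a real ZFS
or FMA interface): the I/O layer only posts an error report, and a separate
diagnosis step decides whether that report is enough to retire the device.

    /*
     * Hypothetical sketch only: the I/O layer records telemetry, a separate
     * diagnosis step decides whether the device should be marked faulted.
     * All names are invented for illustration.
     */
    #include <stdio.h>
    #include <stdbool.h>

    enum ereport_class { EREPORT_IO_ERROR, EREPORT_PROBE_FAILURE };

    struct ereport {
        enum ereport_class class;
        int                device_id;
    };

    /* Toy diagnosis rule: a probe failure is immediately diagnosable,
     * an isolated I/O error is not. */
    static bool diagnose_faulty(const struct ereport *er)
    {
        return er->class == EREPORT_PROBE_FAILURE;
    }

    static void post_ereport(struct ereport er)
    {
        if (diagnose_faulty(&er))
            printf("device %d: diagnosed faulty, retiring it\n", er.device_id);
        else
            printf("device %d: telemetry recorded, no fault yet\n", er.device_id);
    }

    int main(void)
    {
        post_ereport((struct ereport){ EREPORT_IO_ERROR, 3 });
        post_ereport((struct ereport){ EREPORT_PROBE_FAILURE, 3 });
        return 0;
    }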
On Tue, 25 Nov 2008, Ross Smith wrote:
> I disagree Bob, I think this is a very different function to that
> which FMA provides.
>
> As far as I know, FMA doesn't have access to the big picture of pool
> configuration that ZFS has, so why shouldn't ZFS use that information
> to increase the reliab
I disagree Bob, I think this is a very different function to that
which FMA provides.
As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability of the pool while still using FMA to
handle d
On Tue, Nov 25, 2008 at 11:55:17AM +0100, [EMAIL PROTECTED] wrote:
> >My idea is simply to allow the pool to continue operation while
> >waiting for the drive to fault, even if that's a faulty write. It
> >just means that the rest of the operations (reads and writes) can keep
> >working for the mi
Scara Maccai wrote:
>> Oh, and regarding the original post -- as several readers correctly
>> surmised, we weren't faking anything, we just didn't want to wait
>> for all the device timeouts. Because the disks were on USB, which
>> is a hotplug-capable bus, unplugging the dead disk gen
On Tue, 25 Nov 2008, Ross Smith wrote:
>
> Good to hear there's work going on to address this.
>
> What did you guys think to my idea of ZFS supporting a "waiting for a
> response" status for disks as an interim solution that allows the pool
> to continue operation while it's waiting for FMA or the
Ross Smith wrote:
> My justification for this is that it seems to me that you can split
> disk behavior into two states:
> - returns data ok
> - doesn't return data ok
>
> And for the state where it's not returning data, you can again split
> that in two:
> - returns wrong data
> - doesn't return
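A minimal sketch of the split Ross describes, with invented names; the only
point is that every I/O lands in one of three buckets, and the last bucket
(no answer at all) is the case the thread keeps arguing about.

    /* Hypothetical classification of I/O outcomes, names invented. */
    #include <stdio.h>

    enum io_outcome {
        IO_OK,            /* returned data, checksum matched      */
        IO_BAD_DATA,      /* returned data, checksum mismatch     */
        IO_NO_RESPONSE    /* nothing came back within the timeout */
    };

    static const char *describe(enum io_outcome o)
    {
        switch (o) {
        case IO_OK:          return "returns data ok";
        case IO_BAD_DATA:    return "doesn't return data ok: wrong data";
        case IO_NO_RESPONSE: return "doesn't return data ok: no response";
        }
        return "unknown";
    }

    int main(void)
    {
        for (enum io_outcome o = IO_OK; o <= IO_NO_RESPONSE; o++)
            printf("%d -> %s\n", (int)o, describe(o));
        return 0;
    }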
> Oh, and regarding the original post -- as several readers correctly
> surmised, we weren't faking anything, we just didn't want to wait
> for all the device timeouts. Because the disks were on USB, which
> is a hotplug-capable bus, unplugging the dead disk generated an
> interrupt that b
> The shortcomings of timeouts have been discussed on this list before. How do
> you tell the difference between a drive that is dead and a path that is just
> highly loaded?
A path that is dead either returns bad data or doesn't return anything
at all. A highly loaded path is by definition readi
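A minimal sketch of the distinction being drawn here, under the assumption
(invented names and thresholds) that a busy path still completes I/Os,
however slowly, while a dead path completes nothing; tracking "time since
the last completion while requests are outstanding" separates the two
without guessing at per-command latency.

    /* Hypothetical sketch; all names and thresholds are invented. */
    #include <stdbool.h>
    #include <time.h>

    struct path_state {
        int    outstanding;       /* I/Os issued but not yet completed  */
        time_t last_completion;   /* wall-clock time of last completion */
    };

    /* Suspect: work is outstanding and nothing has completed for `window`
     * seconds.  A merely loaded path keeps completing something. */
    static bool path_suspect(const struct path_state *p, time_t now, int window)
    {
        if (p->outstanding == 0)
            return false;                    /* idle, nothing to judge */
        return (now - p->last_completion) > window;
    }

    int main(void)
    {
        time_t now = time(NULL);
        struct path_state busy = { .outstanding = 40, .last_completion = now };
        struct path_state dead = { .outstanding = 40, .last_completion = now - 120 };

        /* busy path completed something recently; dead path hasn't for 120 s */
        int ok = !path_suspect(&busy, now, 30) && path_suspect(&dead, now, 30);
        return ok ? 0 : 1;
    }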
On 25-Nov-08, at 5:10 AM, Ross Smith wrote:
> Hey Jeff,
>
> Good to hear there's work going on to address this.
>
> What did you guys think to my idea of ZFS supporting a "waiting for a
> response" status for disks as an interim solution that allows the pool
> to continue operation while it's wai
Hmm, true. The idea doesn't work so well if you have a lot of writes,
so there needs to be some thought as to how you handle that.
Just thinking aloud, could the missing writes be written to the log
file on the rest of the pool? Or temporarily stored somewhere else in
the pool? Would it be an o
>My idea is simply to allow the pool to continue operation while
>waiting for the drive to fault, even if that's a faulty write. It
>just means that the rest of the operations (reads and writes) can keep
>working for the minute (or three) it takes for FMA and the rest of the
>chain to flag a dev
No, I count that as "doesn't return data ok", but my post wasn't very
clear at all on that.
Even for a write, the disk will return something to indicate that the
action has completed, so that can also be covered by just those two
scenarios, and right now ZFS can lock the whole pool up if it's
wait
>My justification for this is that it seems to me that you can split
>disk behavior into two states:
>- returns data ok
>- doesn't return data ok
I think you're missing "won't write".
There's clearly a difference between "get data from a different copy"
which you can fix by retrying and data to a
PS. I think this also gives you a chance at making the whole problem
much simpler. Instead of the hard question of "is this faulty",
you're just trying to say "is it working right now?".
In fact, I'm now wondering if the "waiting for a response" flag
wouldn't be better as "possibly faulty". Tha
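A minimal sketch of the interim status Ross is proposing, with invented
names (this is not the real ZFS vdev state machine): a device that has
stopped answering is first marked "possibly faulty" so reads can be served
from other copies, while the final online/faulted decision is still left to
FMA and the driver.

    /* Hypothetical sketch of an interim "possibly faulty" status. */
    enum dev_status { DEV_ONLINE, DEV_POSSIBLY_FAULTY, DEV_FAULTED };

    struct dev {
        enum dev_status status;
    };

    /* Called when an I/O has had no answer for longer than the grace window. */
    static void dev_mark_unresponsive(struct dev *d)
    {
        if (d->status == DEV_ONLINE)
            d->status = DEV_POSSIBLY_FAULTY;  /* reads now prefer other copies */
    }

    /* Called when the device answers again, or when FMA delivers its verdict. */
    static void dev_resolve(struct dev *d, int fma_says_faulty)
    {
        if (d->status != DEV_POSSIBLY_FAULTY)
            return;
        d->status = fma_says_faulty ? DEV_FAULTED : DEV_ONLINE;
    }

    int main(void)
    {
        struct dev d = { DEV_ONLINE };
        dev_mark_unresponsive(&d);   /* "is it working right now?" -> no       */
        dev_resolve(&d, 0);          /* FMA later says it was only a slow path */
        return d.status == DEV_ONLINE ? 0 : 1;
    }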
Hey Jeff,
Good to hear there's work going on to address this.
What did you guys think to my idea of ZFS supporting a "waiting for a
response" status for disks as an interim solution that allows the pool
to continue operation while it's waiting for FMA or the driver to
fault the drive?
I do appre
I think we (the ZFS team) all generally agree with you. The current
nevada code is much better at handling device failures than it was
just a few months ago. And there are additional changes that were
made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
product line that will make
But that's exactly the problem, Richard: AFAIK.
Can you state, absolutely and categorically, that there is no failure mode out
there (caused by hardware faults, or bad drivers) that will lock a drive up
for hours? You can't, obviously, which is why we keep saying that ZFS should
have this kind of
Scara Maccai wrote:
>> In the worst case, the device would be selectable, but not responding
>> to data requests which would lead through the device retry logic and
>> can take minutes.
>>
>
> that's what I didn't know: that a driver could take minutes (hours???) to
> decide that a devi
> In the worst case, the device would be selectable, but not responding
> to data requests which would lead through the device retry logic and
> can take minutes.
that's what I didn't know: that a driver could take minutes (hours???) to
decide that a device is not working anymore.
Now it come
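To put rough numbers on it (illustrative figures only, not the documented
defaults of any particular driver): with a 60-second command timeout and 5
retries, a single stuck command already ties up an I/O for 5 minutes, and if
a layer above then retries the whole sequence a few times, the wait
multiplies again, which is how "minutes" can stretch toward hours in
pathological cases.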
Toby Thain wrote:
> On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:
>
>
>>> "tt" == Toby Thain <[EMAIL PROTECTED]> writes:
>>>
>> tt> Why would it be assumed to be a bug in Solaris? Seems more
>> tt> likely on balance to be a problem in the error reporting path
>>
On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:
>> "tt" == Toby Thain <[EMAIL PROTECTED]> writes:
>
> tt> Why would it be assumed to be a bug in Solaris? Seems more
> tt> likely on balance to be a problem in the error reporting path
> tt> or a controller/ firmware weakness.
>
> It's
> "tt" == Toby Thain <[EMAIL PROTECTED]> writes:
tt> Why would it be assumed to be a bug in Solaris? Seems more
tt> likely on balance to be a problem in the error reporting path
tt> or a controller/ firmware weakness.
It's not really an assumption. It's been discussed in here a l
"C. Bergström" wrote:
> Will Murnane wrote:
> > On Mon, Nov 24, 2008 at 10:40, Scara Maccai <[EMAIL PROTECTED]> wrote:
> >
> >> Still don't understand why even the one on http://www.opensolaris.com/,
> >> "ZFS – A Smashing Hit", doesn't show the app running in the moment the
> >> HD is smashed... we
> if a disk vanishes like a sledgehammer hit it, ZFS will wait on the
> device driver to decide it's dead.
OK I see it.
> That said, there have been several threads about wanting configurable
> device timeouts handled at the ZFS level rather than the device driver
> level.
Uh, so I can
Will Murnane wrote:
> On Mon, Nov 24, 2008 at 10:40, Scara Maccai <[EMAIL PROTECTED]> wrote:
>
>> Still don't understand why even the one on http://www.opensolaris.com/, "ZFS
>> – A Smashing Hit", doesn't show the app running in the moment the HD is
>> smashed... weird...
>>
Sorry this is
On Mon, Nov 24, 2008 at 10:40, Scara Maccai <[EMAIL PROTECTED]> wrote:
> Still don't understand why even the one on http://www.opensolaris.com/, "ZFS
> – A Smashing Hit", doesn't show the app running in the moment the HD is
> smashed... weird...
ZFS is primarily about protecting your data: correc
On 24-Nov-08, at 10:40 AM, Scara Maccai wrote:
>> Why would it be assumed to be a bug in Solaris? Seems more likely on
>> balance to be a problem in the error reporting path or a
>> controller/firmware weakness.
>
> Weird: they would use a controller/firmware that doesn't work? Bad
> cal
> Why would it be assumed to be a bug in Solaris? Seems more likely on
> balance to be a problem in the error reporting path or a
> controller/firmware weakness.
Weird: they would use a controller/firmware that doesn't work? Bad call...
> I'm pretty sure the first 2 versions of this demo
On 23-Nov-08, at 12:21 PM, Scara Maccai wrote:
> I watched both the youtube video
>
> http://www.youtube.com/watch?v=CN6iDzesEs0
>
> and the one on http://www.opensolaris.com/, "ZFS – A Smashing Hit".
>
> In the first one it is obvious that the app stops working when they
> smash the drives; they
I watched both the youtube video
http://www.youtube.com/watch?v=CN6iDzesEs0
and the one on http://www.opensolaris.com/, "ZFS – A Smashing Hit".
In the first one it is obvious that the app stops working when they smash the
drives; they have to physically detach the drive before the array
reconstru