Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-28 Thread Miles Nordin
> "es" == Eric Schrock <[EMAIL PROTECTED]> writes: es> Are you running your experiments on build 101 or later? no. aside from that quick one for copies=2 im pretty bad about running well-designed experiments. and I do have old builds. I need to buy more hardware. It's hard to know how

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-26 Thread Eric Schrock
On Wed, Nov 26, 2008 at 07:02:11PM -0500, Miles Nordin wrote: > (2) The FMA model of collecting telemetry, taking it into > user-space, chin-strokingly contemplating it for a while, then > decreeing a diagnosis, is actually a rather limited one. I can > think of two kinds of limit

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-26 Thread Bob Friesenhahn
On Wed, 26 Nov 2008, Miles Nordin wrote: > > (2) The FMA model of collecting telemetry, taking it into > user-space, chin-strokingly contemplating it for a while, then > decreeing a diagnosis, is actually a rather limited one. I can > think of two kinds of limit: > > (a) you're di

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-26 Thread Miles Nordin
> "rs" == Ross Smith <[EMAIL PROTECTED]> writes: > "nw" == Nicolas Williams <[EMAIL PROTECTED]> writes: rs> I disagree Bob, I think this is a very different function to rs> that which FMA provides. I see two problems. (1) FMA doesn't seem to work very well, and was used as an ex

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Eric Schrock
It's hard to tell exactly what you are asking for, but this sounds similar to how ZFS already works. If ZFS decides that a device is pathologically broken (as evidenced by vdev_probe() failure), it knows that FMA will come back and diagnose the drive as faulty (because we generate a probe_failure

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Bob Friesenhahn
On Tue, 25 Nov 2008, Ross Smith wrote: > I disagree Bob, I think this is a very different function to that > which FMA provides. > > As far as I know, FMA doesn't have access to the big picture of pool > configuration that ZFS has, so why shouldn't ZFS use that information > to increase the reliab

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
I disagree Bob, I think this is a very different function to that which FMA provides. As far as I know, FMA doesn't have access to the big picture of pool configuration that ZFS has, so why shouldn't ZFS use that information to increase the reliability of the pool while still using FMA to handle d

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Nicolas Williams
On Tue, Nov 25, 2008 at 11:55:17AM +0100, [EMAIL PROTECTED] wrote: > >My idea is simply to allow the pool to continue operation while > >waiting for the drive to fault, even if that's a faulty write. It > >just means that the rest of the operations (reads and writes) can keep > >working for the mi

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Richard Elling
Scara Maccai wrote: >> Oh, and regarding the original post -- as several >> readers correctly >> surmised, we weren't faking anything, we just didn't >> want to wait >> for all the device timeouts. Because the disks were >> on USB, which >> is a hotplug-capable bus, unplugging the dead disk >> gen

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Bob Friesenhahn
On Tue, 25 Nov 2008, Ross Smith wrote: > > Good to hear there's work going on to address this. > > What did you guys think to my idea of ZFS supporting a "waiting for a > response" status for disks as an interim solution that allows the pool > to continue operation while it's waiting for FMA or the

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Moore, Joe
Ross Smith wrote: > My justification for this is that it seems to me that you can split > disk behavior into two states: > - returns data ok > - doesn't return data ok > > And for the state where it's not returning data, you can again split > that in two: > - returns wrong data > - doesn't return

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Scara Maccai
> Oh, and regarding the original post -- as several > readers correctly > surmised, we weren't faking anything, we just didn't > want to wait > for all the device timeouts. Because the disks were > on USB, which > is a hotplug-capable bus, unplugging the dead disk > generated an > interrupt that b

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
> The shortcomings of timeouts have been discussed on this list before. How do > you tell the difference between a drive that is dead and a path that is just > highly loaded? A path that is dead is either returning bad data, or isn't returning anything. A highly loaded path is by definition readi

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Toby Thain
On 25-Nov-08, at 5:10 AM, Ross Smith wrote: > Hey Jeff, > > Good to hear there's work going on to address this. > > What did you guys think to my idea of ZFS supporting a "waiting for a > response" status for disks as an interim solution that allows the pool > to continue operation while it's wai

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
Hmm, true. The idea doesn't work so well if you have a lot of writes, so there needs to be some thought as to how you handle that. Just thinking aloud, could the missing writes be written to the log file on the rest of the pool? Or temporarily stored somewhere else in the pool? Would it be an o

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Casper . Dik
>My idea is simply to allow the pool to continue operation while >waiting for the drive to fault, even if that's a faulty write. It >just means that the rest of the operations (reads and writes) can keep >working for the minute (or three) it takes for FMA and the rest of the >chain to flag a dev

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
No, I count that as "doesn't return data ok", but my post wasn't very clear at all on that. Even for a write, the disk will return something to indicate that the action has completed, so that can also be covered by just those two scenarios, and right now ZFS can lock the whole pool up if it's wait

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Casper . Dik
>My justification for this is that it seems to me that you can split >disk behavior into two states: >- returns data ok >- doesn't return data ok I think you're missing "won't write". There's clearly a difference between "get data from a different copy" which you can fix but retrying data to a

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
PS. I think this also gives you a chance at making the whole problem much simpler. Instead of the hard question of "is this faulty", you're just trying to say "is it working right now?". In fact, I'm now wondering if the "waiting for a response" flag wouldn't be better as "possibly faulty". Tha
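To make the "possibly faulty" idea above concrete, here is a minimal sketch in Python. It is not how ZFS or FMA work. Every name in it (DiskState, Disk, classify, readable_members) and the 2-second threshold are assumptions made up for illustration. It only answers "is it working right now?" by looking at how long the oldest unanswered request has been waiting, and leaves the real "is this faulty" diagnosis to the driver or FMA.

```python
from dataclasses import dataclass
from enum import Enum


class DiskState(Enum):
    OK = "ok"
    POSSIBLY_FAULTY = "possibly faulty"


@dataclass
class Disk:
    name: str
    # Seconds the oldest unanswered request has been waiting (hypothetical metric).
    oldest_pending_io_age: float


def classify(disk: Disk, threshold: float = 2.0) -> DiskState:
    """Flag a disk as possibly faulty once a request has waited too long.

    This does not diagnose the disk; it only records that it is not
    answering right now. The threshold is an arbitrary example value.
    """
    if disk.oldest_pending_io_age > threshold:
        return DiskState.POSSIBLY_FAULTY
    return DiskState.OK


def readable_members(mirror, threshold: float = 2.0):
    """Return the mirror members that reads should be sent to for now.

    If every member is flagged, fall back to all of them, since there is
    nowhere better to send the request.
    """
    healthy = [d for d in mirror if classify(d, threshold) is DiskState.OK]
    return healthy or list(mirror)


if __name__ == "__main__":
    mirror = [Disk("c1t0d0", 0.1), Disk("c1t1d0", 45.0)]
    for d in mirror:
        print(d.name, classify(d).value)
    print("read from:", [d.name for d in readable_members(mirror)])
```

The sketch only covers reads. As the replies in this thread point out, writes are harder: anything not written to the flagged disk has to be resynced once the disk either comes back or is finally faulted.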

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
Hey Jeff, Good to hear there's work going on to address this. What did you guys think to my idea of ZFS supporting a "waiting for a response" status for disks as an interim solution that allows the pool to continue operation while it's waiting for FMA or the driver to fault the drive? I do appre

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Jeff Bonwick
I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. And there are additional changes that were made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000) product line that will make

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Ross
But that's exactly the problem Richard: AFAIK. Can you state that absolutely, categorically, there is no failure mode out there (caused by hardware faults, or bad drivers) that won't lock a drive up for hours? You can't, obviously, which is why we keep saying that ZFS should have this kind of

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Richard Elling
Scara Maccai wrote: >> In the worst case, the device would be selectable, >> but not responding >> to data requests which would lead through the device >> retry logic and can >> take minutes. >> > > that's what I didn't know: that a driver could take minutes (hours???) to > decide that a devi

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Scara Maccai
> In the worst case, the device would be selectable, > but not responding > to data requests which would lead through the device > retry logic and can > take minutes. that's what I didn't know: that a driver could take minutes (hours???) to decide that a device is not working anymore. Now it come

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Richard Elling
Toby Thain wrote: > On 24-Nov-08, at 3:49 PM, Miles Nordin wrote: > > >>> "tt" == Toby Thain <[EMAIL PROTECTED]> writes: >>> >> tt> Why would it be assumed to be a bug in Solaris? Seems more >> tt> likely on balance to be a problem in the error reporting path >>

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Toby Thain
On 24-Nov-08, at 3:49 PM, Miles Nordin wrote: >> "tt" == Toby Thain <[EMAIL PROTECTED]> writes: > > tt> Why would it be assumed to be a bug in Solaris? Seems more > tt> likely on balance to be a problem in the error reporting path > tt> or a controller/ firmware weakness. > > It's

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Miles Nordin
> "tt" == Toby Thain <[EMAIL PROTECTED]> writes: tt> Why would it be assumed to be a bug in Solaris? Seems more tt> likely on balance to be a problem in the error reporting path tt> or a controller/ firmware weakness. It's not really an assumption. It's been discussed in here a l

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Moore, Joe
"C. Bergström" wrote: > Will Murnane wrote: > > On Mon, Nov 24, 2008 at 10:40, Scara Maccai <[EMAIL PROTECTED]> wrote: > > > >> Still don't understand why even the one on > http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn't > show the app running in the moment the HD is smashed... we

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Scara Maccai
> if a disk vanishes like > a sledgehammer > hit it, ZFS will wait on the device driver to decide > it's dead. OK I see it. > That said, there have been several threads about > wanting configurable > device timeouts handled at the ZFS level rather than > the device driver > level. Uh, so I can

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread C. Bergström
Will Murnane wrote: > On Mon, Nov 24, 2008 at 10:40, Scara Maccai <[EMAIL PROTECTED]> wrote: > >> Still don't understand why even the one on http://www.opensolaris.com/, "ZFS >> – A Smashing Hit", doesn't show the app running in the moment the HD is >> smashed... weird... >> Sorry this is

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Will Murnane
On Mon, Nov 24, 2008 at 10:40, Scara Maccai <[EMAIL PROTECTED]> wrote: > Still don't understand why even the one on http://www.opensolaris.com/, "ZFS > – A Smashing Hit", doesn't show the app running in the moment the HD is > smashed... weird... ZFS is primarily about protecting your data: correc

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Toby Thain
On 24-Nov-08, at 10:40 AM, Scara Maccai wrote: >> Why would it be assumed to be a bug in Solaris? Seems >> more likely on >> balance to be a problem in the error reporting path >> or a controller/ >> firmware weakness. > > Weird: they would use a controller/firmware that doesn't work? Bad > cal

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Scara Maccai
> Why would it be assumed to be a bug in Solaris? Seems > more likely on > balance to be a problem in the error reporting path > or a controller/ > firmware weakness. Weird: they would use a controller/firmware that doesn't work? Bad call... > I'm pretty sure the first 2 versions of this demo

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-24 Thread Toby Thain
On 23-Nov-08, at 12:21 PM, Scara Maccai wrote: > I watched both the youtube video > > http://www.youtube.com/watch?v=CN6iDzesEs0 > > and the one on http://www.opensolaris.com/, "ZFS – A Smashing Hit". > > In the first one it is obvious that the app stops working when they > smash the drives; they

[zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-23 Thread Scara Maccai
I watched both the youtube video http://www.youtube.com/watch?v=CN6iDzesEs0 and the one on http://www.opensolaris.com/, "ZFS – A Smashing Hit". In the first one it is obvious that the app stops working when they smash the drives; they have to physically detach the drive before the array reconstru