> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Yaverot
> 
> >I recommend:
> >When creating your new pool, use slices of the new disks, which are 99% of
> >the size of the new disks instead of using the whole new disks.  Because
> >this is a more reliable way of avoiding the problem "my new replacement disk
> >for the failed disk is slightly smaller than the failed disk and therefore I
> >can't replace."
> 
> 1. While performance isn't my top priority, doesn't using slices make a
> significant difference?

Somewhere in some guide it says so, but the answer is no.  If you look
more closely at that guide (the best practices guide, or something else?)
you'll see what it *really* says is "we don't recommend using slices,
because sharing the hardware cache across multiple pools hurts performance"
or "sharing cache across ZFS and UFS hurts performance" or something like
that.  But if you're only using one big slice covering 99% of the whole
disk and not using any other slice, then that whole argument is
irrelevant.  Also, thanks to the system RAM cache, I contend that the
disk-based hardware cache is useless anyway.  The disk's hardware cache
will never get a hit except in truly weird, esoteric cases.  In normal
cases, any disk sector that was read recently enough to still be in the
disk's hardware cache will also be in the system RAM cache, and therefore
the system will never request that sector from the disk again.

I know I did some benchmarking, with and without slices, and found no
difference.  I'd be interested if anyone in the world has a counterexample.
I know how to generate such a scenario, but like I said, it's an esoteric
corner case that's not important in reality.
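
For what it's worth, here is roughly what the slice approach looks like on
Solaris.  The pool name, device names, and raidz layout below are made up;
adjust them for your own controllers and pool geometry:

        # Check the geometry of the new disk (c2t1d0 is a made-up name):
        prtvtoc /dev/rdsk/c2t1d0s2

        # In format(1M) -> partition, create slice 0 covering roughly 99%
        # of the disk, leaving the last percent or so unallocated as slack.

        # Then build the pool from the slices instead of the whole disks:
        zpool create newpool raidz c2t1d0s0 c2t2d0s0 c2t3d0s0

The slack at the end is what saves you later, when the "same size"
replacement disk turns out to be a few sectors smaller.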


> 2. Doesn't snv_134 that I'm running already account for variances in these
> nominally-same disks?

Yes.  (I don't know which build introduced it, so I'm not confirming b134
specifically, but it exists in some build and everything after it.)
But as evidenced by a recent thread from Robert Hartzell, "cannot replace
c10t0d0 with c10t0d0: device is too small", it doesn't always work.
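
That failure shows up when you swap in the new disk and tell ZFS to replace
the old one in place.  Something like the following (the pool name is made
up; the error text is the one quoted from that thread):

        # New disk is physically in the same slot as the failed one:
        zpool replace tank c10t0d0
        # cannot replace c10t0d0 with c10t0d0: device is too small

If the original vdev had been built on a 99% slice instead of the whole
disk, the slightly smaller replacement would still have fit.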


> 3. The market refuses to sell disks under $50, therefore I won't be able to
> buy drives of 'matching' capacity anyway.

Normally, when you replace a drive with one of matching capacity, it's
either something you bought in advance (like a hot spare or cold spare) or
something you received via warranty.  Maybe it doesn't matter for you, but
it matters for some people.
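
If you do buy a spare in advance, attaching it to the pool as a hot spare
is a one-liner (the device name is made up):

        # Add a hot spare so a resilver can start without waiting for you:
        zpool add tank spare c3t0d0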


> >I also recommend:
> >In every pool, create some space reservation.  So when and if you ever hit
> >100% usage again and start to hit the system crash scenario, you can do a
> >zfs destroy (snapshot) and delete the space reservation, in order to avoid
> >the system crash scenario you just witnessed.  Hopefully.
> 
> 1. Why would tank being practically full affect management of other pools
> and start the crash scenario I encountered? rpool & rpool/swap remained at
> 1% use, the apparent trigger was doing a "zpool destroy others" which is
> neither the rpool the system runs out of, nor tank.

That wasn't the trigger - that was just the first symptom you noticed.
The actual trigger happened earlier, while tank was 100% full and some
operations were still in progress.  The precise trigger is difficult to
identify, because it only sends the system into a long, slow downward
spiral; it doesn't cause immediate system failure.  Generally, by the time
you notice any symptoms, it's already been spiraling downward for some
time, so even if you push the right buttons to pull it out of the spiral,
you won't know that they worked, because after you press them you still
have to wait "for some time" for it to recover.  A Sun support rep told me
someone else had the same problem, waited a week, and eventually it
recovered.  I wasn't able to wait that long.  I power cycled and fixed it
instantly.

I don't know the answer to your question, "why would it behave that way."
And it doesn't always happen.  But I've certainly seen it a few times
before.  Notice how precisely I told you exactly what you should expect to
happen next.  It's a clear pattern, but not clear enough or common enough to
get enough attention to get fixed, apparently.

Long ago I opened bug reports with Oracle support, but nobody seems to be
doing anything about it.


> 2. How can a zfs destroy ($snapshot) complete when both "zpool destroy"
> and "zfs list" fail to complete?

Precisely the problem.  The zfs destroy of a snapshot also hangs.  You're
hosed until you reboot.
But destroying a snapshot isn't the only way in the world to free up some
space.  You can also set a reservation now, while the pool is healthy:
        zfs set reservation=5G tank
and then release it when the pool fills up:
        zfs set reservation=none tank
When you're in the failure mode that you experienced, nobody has yet
confirmed whether setting the reservation back to none actually works.  IF
IT WORKS, then you could immediately afterward do a zfs destroy of a
snapshot.  Most likely the reservation won't do any good anyway, but it
doesn't hurt anything, and it's worth a try.
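
If it does work, the recovery sequence would look roughly like this (the
snapshot name is made up; pick a real one from the list):

        # Give ZFS a little breathing room by dropping the reservation:
        zfs set reservation=none tank

        # If that succeeds, free real space by destroying an old snapshot:
        zfs list -t snapshot -r tank
        zfs destroy tank@some-old-snapshot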

Add one more backup parachute to your toolkit, even if it might fail to open.
