Quoting Jeremy Chadwick <free...@jdc.parodius.com> (from Mon, 15 Feb 2010 04:27:44 -0800):

On Mon, Feb 15, 2010 at 10:50:00AM +0100, Alexander Leidinger wrote:
Quoting Jeremy Chadwick <free...@jdc.parodius.com> (from Mon, 15 Feb
2010 01:07:56 -0800):

>On Mon, Feb 15, 2010 at 10:49:47AM +0200, Dan Naumov wrote:
>>> I had a feeling someone would bring up L2ARC/cache devices.  This gives
>>> me the opportunity to ask something that's been on my mind for quite
>>> some time now:
>>>
>>> Aside from the capacity difference (e.g. 40GB vs. 1GB), is there a
>>> benefit to dedicating a RAM disk (e.g. md(4)) to a pool for
>>> L2ARC/cache?  The ZFS documentation explicitly states that cache
>>> device content is considered volatile.
>>
>>Using a ramdisk as an L2ARC vdev doesn't make any sense at all. If you
>>have RAM to spare, it should be used by regular ARC.
>
>...except that it's already been proven on FreeBSD that the ARC getting
>out of control can cause kernel panics[1], horrible performance until

First and foremost, sorry for the long post.  I tried to keep it short,
but sometimes there's just a lot to be said.

And sometimes a shorter answer takes longer...

There are other ways (not related to ZFS) to shoot yourself in the
foot too. I'm tempted to say that this is
 a) a documentation bug
and
 b) a lack of sanity checking of the values... anyone out there with
a good algorithm for something like this?

Normally you do some testing with the values you use, so once you
have resolved the issues, the system should be stable.

What documentation?  :-)  The Wiki?  If so, that's been outdated for

Hehe... :)

some time; I know Ivan Voras was doing his best to put good information
there, but it's hard given the below chaos.

Do you want write access to it (in case you don't have it already; I didn't check)?

The following tunables are recurrently mentioned as focal points, but no
one's explained in full how to tune these "properly", or which does what
(perfect example: vm.kmem_size_max vs. vm.kmem_size.  _max used to be
what you'd adjust to solve kmem exhaustion issues, but now people are
saying otherwise?).  I realise it may differ per system (given how much
RAM the system has), so different system configurations/examples would
need to be provided.  I realise that the behaviour of some of these has changed
too (e.g. -RELEASE differs from -STABLE, and 7.x differs from 8.x).
I've marked commonly-referred-to tunables with an asterisk:

It can also be that some people just repeat something without really knowing what they are saying (based upon some kind of observed evidence, not out of bad intent).

  kern.maxvnodes

Needs to be tuned if you run out of vnodes... ok, this is obvious. I do not know how it will show up (panic or graceful error handling, e.g. ENOMEM).
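
To see whether you are getting close to the limit, something like this should work (the value in the second command is only an example, not a recommendation):

  # compare the current number of vnodes with the limit
  sysctl vfs.numvnodes kern.maxvnodes
  # raise the limit at runtime if you keep hitting it (example value)
  sysctl kern.maxvnodes=400000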

* vm.kmem_size
* vm.kmem_size_max

I tried kmem_size_max on -current (this year) and got a panic during use; after I changed kmem_size to the same value I had for _max, it didn't panic anymore. It looks (from mails on the lists) like _max is supposed to set an upper limit for the auto-tuning, but at least it was not working with ZFS last month (and I doubt it works now).
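
So until the auto-tuning works, I would just set kmem_size explicitly in /boot/loader.conf, e.g. like this (the values below are only placeholders, adjust them to the amount of RAM in your machine):

  # /boot/loader.conf -- placeholder values, adjust for your amount of RAM
  vm.kmem_size="1536M"
  vm.kmem_size_max="1536M"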

* vfs.zfs.arc_min
* vfs.zfs.arc_max

_min = the minimum, even when the system is running out of memory (the ARC gives memory back if other parts of the kernel need it). _max = the maximum (with a recent ZFS on 7/8/9 (7.3 will have it, 8.1 will have it too) I have never seen the size exceed _max anymore).
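
If you tune these at all, keep arc_max well below kmem_size, e.g. (placeholder values again, not a recommendation):

  # /boot/loader.conf -- placeholder values; arc_max has to stay well
  # below vm.kmem_size
  vfs.zfs.arc_min="64M"
  vfs.zfs.arc_max="512M"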

  vfs.zfs.prefetch_disable  (auto-tuned based on available RAM on 8-STABLE)
  vfs.zfs.txg.timeout

It looks like the txg tunable is just a workaround. I've read a little bit in Brendan's blog and it seems they noticed the periodic writes too (with the nice graphical performance monitoring of OpenStorage) and are investigating the issue. It looks like we are more affected by this (for whatever reason). What lowering the timeout does (attention, this is an observation, not a technical description of code I've read!) seems to be to write data out to the disks earlier, so there is less data per transaction group -> less noticeable blocking.
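
If someone wants to experiment with it, it is just a loader tunable, e.g. (the value below is only one that people have been experimenting with, not a recommendation from me):

  # /boot/loader.conf -- experimental, shortens the txg interval
  vfs.zfs.txg.timeout="5"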

  vfs.zfs.vdev.cache.size
  vfs.zfs.vdev.cache.bshift
  vfs.zfs.vdev.max_pending

Uhm... this smells like you got it out of one of my posts where I mentioned that I was experimenting with this on a system. I can tell you that I have no system with this tuned anymore; tuning kmem_size (and KVA_PAGES during kernel compile) has a bigger impact.
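
For the record, the kernel config part I mean is only relevant on i386 (amd64 does not need it); a commonly used value looks like this:

  # i386 kernel config -- 512 * 4 MB = 2 GB of kernel virtual address
  # space (the default is 256, i.e. 1 GB); not needed on amd64
  options KVA_PAGES=512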

  vfs.zfs.zil_disable

What it does should be obvious. IMHO this should not help much regarding stability (changing kmem_size should have a bigger impact). As I don't know what was tested on systems where this is disabled, I want to highlight the "IMHO" in the sentence before...
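
For completeness, it is just a loader tunable; keep in mind that with the ZIL disabled the semantics of synchronous writes are no longer honoured, so an application can lose recently fsync()ed data after a crash or power loss:

  # /boot/loader.conf -- trades the synchronous write guarantees for speed
  vfs.zfs.zil_disable="1"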

Then, when it comes to debugging problems as a result of tuning
improperly (or entire lack of), the following counters (not tunables)
are thrown into the mix as "things people should look at":

  kstat.zfs.misc.arcstats.c
  kstat.zfs.misc.arcstats.c_min
  kstat.zfs.misc.arcstats.c_max

c_max is vfs.zfs.arc_max, c_min is vfs.zfs.arc_min.

  kstat.zfs.misc.arcstats.evict_skip
  kstat.zfs.misc.arcstats.memory_throttle_count
  kstat.zfs.misc.arcstats.size

I'm not entirely sure about size and c... both represent some kind of current size, but they are not the same (if I understand correctly, c is the ARC's current target size, while size is what is actually allocated).
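
To look at them, a plain sysctl query of the relevant counters is enough (the values are raw numbers; size/c/c_min/c_max are in bytes):

  # the ARC counters mentioned above in one go
  sysctl kstat.zfs.misc.arcstats.size \
         kstat.zfs.misc.arcstats.c \
         kstat.zfs.misc.arcstats.evict_skip \
         kstat.zfs.misc.arcstats.memory_throttle_count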


Regarding the tuning, I would recommend relying on a more human-readable representation of these values. I've seen someone post something like this, but I do not know how it was generated (some kind of script, but I do not know where to get it).
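
Something along these lines would do it (just a quick sketch of the idea, not the script I saw):

  #!/bin/sh
  # print the main ARC counters in megabytes instead of bytes
  for ctr in size c c_min c_max; do
      bytes=$(sysctl -n kstat.zfs.misc.arcstats.${ctr})
      echo "${ctr}: $((bytes / 1024 / 1024)) MB"
  done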

All that said:

I would be more than happy to write some coherent documentation that
folks could refer to "officially", but rather than spend my entire
lifetime reverse-engineering the ZFS code I think it'd make more sense
to get some official parties involved to explain things.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

I'd like to add some kind of monitoring section as well -- how
administrators could keep an eye on things and detect, semi-early, if
additional tuning is required or something along those lines.

>ZFS has had its active/inactive lists flushed[2], and brings into

Someone needs to sit down and play a little bit with ways to tell
the ARC that there is free memory. The mail you reference already
suggests that the inactive/cached lists should maybe be taken into
account too (I haven't had a look at this part of the ZFS code).

>question how proper tuning is to be established and what the effects are
>on the rest of the system[3].  There are still reports of people

That's what I'm talking about regarding b) above. If you specify an
arc_max which is too big (arc_max > kmem_size - SOME_SAFE_VALUE),
there should be a message from the kernel and the value should be
adjusted to a safe amount.
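
Until the kernel does such a check itself, it can at least be done by hand before putting new values into production; a rough sketch of the idea (the margin below stands in for SOME_SAFE_VALUE and is completely arbitrary):

  #!/bin/sh
  # warn if arc_max is set too close to kmem_size -- the margin is an
  # arbitrary placeholder, not a recommendation
  margin=$((256 * 1024 * 1024))
  kmem=$(sysctl -n vm.kmem_size)
  arc_max=$(sysctl -n vfs.zfs.arc_max)
  if [ "${arc_max}" -gt $((kmem - margin)) ]; then
      echo "warning: arc_max (${arc_max}) is too close to kmem_size (${kmem})"
  fi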

Until the problems are fixed, an md(4)-backed L2ARC may be a viable
alternative (if you have enough memory to spare for this). Feel free to
provide benchmark numbers, but in general I see this just as a
workaround for the current issues.
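
For reference, setting up such a workaround looks roughly like this (the pool name and the size are made up for the example):

  # create a swap-backed memory disk and add it as a cache device; this
  # has to be repeated after every reboot, the cache content is volatile
  # anyway
  mdconfig -a -t swap -s 256m -u 1
  zpool add tank cache /dev/md1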

I've played with this a bit (2-disk mirror + one 256MB md), but I'm not
completely sure how to read the bonnie++ results, nor am I sure I'm
using the right arguments (bonnie++ -s8192 -n64 -d/pool on a machine
that has 4GB).

L2ARC ("cache" vdev) is supposed to improve random reads, while a "log"

It is supposed to improve random reads, if the working set is in the cache...

vdev (presumably something that links in with the ZIL) improves random
writes.  I'm not sure where bonnie++ tests random reads, but I do see it

It is not supposed to improve random writes; it is supposed to improve synchronous writes (man 2 open, search for O_FSYNC... in Solaris it is O_DSYNC).

testing random seeks.

[...]

The options as I see them are (a) figure out some *reliable* way to
describe to folks how to tune their systems to not experience ARC or
memory exhaustion related issues, or (b) utilise L2ARC exclusively and
set the ARC (arc_max) to something fairly small.

I would prefer a) together with some more sanity checking when changing the values. :)

It is just that it is not easy to come up with correct sanity checks...

Bye,
Alexander.

--
If sarcasm were posted on Usenet, would anybody notice?
                -- James Nicoll

http://www.Leidinger.net    Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org       netchild @ FreeBSD.org  : PGP ID = 72077137