Re: UMA cache back pressure

Jeff Roberson Mon, 18 Nov 2013 19:56:07 -0800


On Mon, 18 Nov 2013, Alexander Motin wrote:

On 18.11.2013 21:11, Jeff Roberson wrote:

On Mon, 18 Nov 2013, Alexander Motin wrote:

I've created a patch, based on earlier work of avg@, to add back
pressure to UMA allocation caches. The problem of physical memory or
KVA exhaustion has existed there for many years, and it is now quite
critical for improving system performance while keeping stability.
Changes made to memory allocation in recent years improved the
situation, but haven't fixed it completely. My patch solves the
remaining problems from two sides: a) reducing bucket sizes every time
the system detects a low memory condition; and b) as a last-resort
mechanism for a very low memory condition, cycling over all CPUs to
purge their per-CPU UMA caches. The benefit of this approach is the
absence of any additional hard-coded limits on cache sizes -- they are
self-tuned, based on load and memory
pressure.
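
For readers following along, the two-sided approach described above can be sketched as a small user-space model. This is not code from the patch: struct zone_model, BUCKET_MIN, BUCKET_MAX, NCPU_MODEL and the step size of one are all assumptions made for illustration only.

```c
#include <assert.h>

#define NCPU_MODEL 4    /* hypothetical CPU count for the model */
#define BUCKET_MIN 1    /* assumed floor for bucket size */
#define BUCKET_MAX 128  /* assumed ceiling for bucket size */

/* Hypothetical model of one zone: a self-tuned bucket size plus
 * the number of items currently cached on each CPU. */
struct zone_model {
    int bucket_size;
    int cached[NCPU_MODEL];
};

/* Load grows the bucket size (more caching, fewer lock collisions). */
void zone_on_contention(struct zone_model *z)
{
    if (z->bucket_size < BUCKET_MAX)
        z->bucket_size++;
}

/* Last resort, part (b): visit every CPU and empty its cache.
 * Returns the number of items released. */
int zone_drain_all_cpus(struct zone_model *z)
{
    int freed = 0;
    for (int cpu = 0; cpu < NCPU_MODEL; cpu++) {
        freed += z->cached[cpu];
        z->cached[cpu] = 0;
    }
    return freed;
}

/* Called on each low-memory event. Part (a) always applies: shrink
 * the bucket size by one step. Part (b) applies only when memory is
 * very low. Returns the number of items released. */
int zone_memory_pressure(struct zone_model *z, int very_low)
{
    if (z->bucket_size > BUCKET_MIN)
        z->bucket_size--;
    return very_low ? zone_drain_all_cpus(z) : 0;
}
```

In this model no fixed cache limit is configured anywhere: the bucket size simply drifts up under load and down under memory pressure, which is the self-tuning property claimed above.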

With this change I believe it should be safe enough to enable UMA
allocation caches in ZFS via the vfs.zfs.zio.use_uma tunable (at least
for amd64). I ran many tests on a machine with 24 logical cores (and as
a result strong allocation cache effects), and can say that with 40GB
RAM, using UMA caches, as allowed by this change, doubles the results
of the SPEC NFS benchmark on a ZFS pool of several SSDs. To test system
stability I ran the same test with physical memory limited to just 2GB;
the system successfully survived that, and even showed results 1.5
times better than with just the last-resort measures of b). In both
cases tools/umastat no longer shows unbounded UMA cache growth, which
makes me believe in the viability of this approach for longer runs.

I would like to hear some comments about that:
http://people.freebsd.org/~mav/uma_pressure.patch


Hey Mav,

This is a great start and great results.  I think it could probably even
go in as-is, but I have a few suggestions.


Hey! Thanks for your review. I appreciate it.


And I appreciate more people being interested in working on the allocator.

First, let's test this with something that is really super allocator
heavy and doesn't benefit much from bucket sizing.  For example, a
network forwarding test.  Or maybe you could get someone like Netflix
that is using it to push a lot of bits with less filesystem cost than
zfs and spec.

I am not sure what simple forwarding may show in this case. Even on my workload with ZFS creating strong memory pressure I still have mbuf* zones' buckets almost (some totally) maxed out. Without other major (or even any) pressure in the system they just can't become bigger than the maximum. But if you can propose some interesting test case with pressure that I can reproduce -- I am all ears.

I think part of that is also because you're using min free pages right now as your threshold. It should probably be triggering slightly more often.

Second, the cpu binding is a very costly and very high-latency
operation. It would make sense to do CPU_FOREACH and then ZONE_FOREACH.
You're also biasing the first zones in the list.  The low memory
condition will more often clear after you check these first zones.  So
you might just check it once and equally penalize all zones.  I'm
concerned that doing CPU_FOREACH in every zone will slow the pagedaemon
more.

I completely agree with all you said here. This part of the code I just took as-is from earlier work. It definitely can be improved. I'll take a look at that. But as I have mentioned in one of my earlier responses, that code is used in _very_ rare cases, unless the system is heavily overloaded on memory, like doing ZFS on a box with 24 cores and 2GB RAM. During reasonable operation it is enough to have soft back pressure to keep the caches in shape and never call that.

We also have been working towards per-domain pagedaemons so
perhaps we should have a uma-reclaim taskqueue that we wake up to do the
work?

VM is not my area so far, so please propose "the right way". I took on this task now only because I have to, due to the huge performance bottleneck this problem causes and the years it has remained unsolved.

Well it's probably fine to keep abusing the first domain's pageout daemon for now but we won't want to in the future, especially if we want to keep each domain's page daemon on the socket that it's managing.

Third, using vm_page_count_min() will only trigger when the pageout
daemon can't keep up with the free target.  Typically this should only
happen with a lot of dirty mmap'd pages or incredibly high system load
coupled with frequent allocations.  So there may be many cases where
reclaiming the extra UMA memory is helpful but the pagedaemon can still
keep up while pushing out file pages that we'd prefer to keep.

As I have said, that is indeed a last resort. It does not need to be done often. Per-CPU caches just should not grow without real need to the point where they have to be cleaned.

Let me explain it differently. Right now you're handling cases of overloaded CPU; if we run this code under different conditions we could handle overloaded memory better as well. Imagine a system which has oversized buckets and lots of wasted memory but a pageout daemon which is still meeting targets by evicting page cache pages. Perhaps there was a temporary use of some very large zones which is no longer necessary. Since we meet the paging target quickly enough we will never discover this other memory that we can evict.

Look at the vm page targets. The target is very far from the min. So typically the thread just wakes up and evicts clean pages very quickly to accommodate this. ZFS is particularly affected because its pages can't be evicted by the page daemon, so you're more likely to run out, but other systems would benefit from this and they do have pages which could be evicted where you'd like to preserve them by trimming the uma cache.


Does that make sense?

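The gap between the min threshold and the free target described above can be illustrated with a toy model. Everything here is hypothetical (struct vm_model, pageout_pass and the page counts are invented for the sketch, not taken from the VM code); it only shows why a trigger tied to the min threshold never fires while the pageout daemon keeps meeting its target with clean file pages.

```c
#include <assert.h>

/* Hypothetical snapshot of memory state, in pages. */
struct vm_model {
    long free;          /* currently free pages */
    long free_min;      /* "min" threshold (last resort) */
    long free_target;   /* pageout daemon's free target */
    long file_pages;    /* clean page cache pages, cheap to evict */
    long uma_cached;    /* pages held in idle UMA caches */
};

/* Move up to `want` pages out of a pool; returns pages taken. */
static long take(long *pool, long want)
{
    long n = want < *pool ? want : *pool;
    if (n < 0)
        n = 0;
    *pool -= n;
    return n;
}

/* One pageout pass. With trim_uma_early == 0 the UMA caches are only
 * touched once free memory drops below the min threshold; with
 * trim_uma_early != 0 they are trimmed whenever we are below target. */
void pageout_pass(struct vm_model *vm, int trim_uma_early)
{
    long shortage = vm->free_target - vm->free;
    if (shortage <= 0)
        return;
    if (trim_uma_early || vm->free < vm->free_min) {
        long n = take(&vm->uma_cached, shortage);
        vm->free += n;
        shortage -= n;
    }
    /* Whatever is still missing comes out of the page cache. */
    vm->free += take(&vm->file_pages, shortage);
}
```

With free = 900, min = 100 and target = 1000, the min-only policy evicts 100 file pages and leaves the idle UMA memory untouched, while the early-trim policy meets the same target without evicting any file pages.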
I think the perfect heuristic would have some idea of how likely the UMA
pages are to be re-used immediately so we can more effectively tradeoff
between file pages and kernel memory cache.  As it is now we limit the
uma_reclaim() calls to every 10 seconds when there is memory pressure.
Perhaps we could keep a timestamp for when the last slab was allocated
to a zone and do the more expensive reclaim on zones who have timestamps
that exceed some threshold?  Then have a lower threshold for reclaiming
at all? Again, it doesn't need to be perfect, but I believe we can catch
a wider set of cases by carefully scheduling this.

I was thinking about that too. But I think timestamps should be set not on the slab, but on the bucket. The fact that a zone is not allocating new slabs does not mean it does not use its already allocated buckets. If we put the time of the last refill into each bucket, then we should be able to purge all buckets unused for a specified period of time. Additionally we could put a timestamp on the zone and update it every time the zone runs out of its cache. If a zone does not run out of cache for some time -- probably it has unused buckets. So when we need some RAM we should take a first look at zones that have a stale timestamp.

Many healthy flow control algorithms maintain a relatively steady state by periodically testing the edges. I would prefer to maintain the timestamp on a per-zone basis and not per-bucket anyway as it saves some space and we'd have to resize all the buckets if we take up another pointer's space.

Anyway, I'm not too dogmatic about it. There are probably several convenient ways to write it and no perfect one.

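As a rough illustration of the per-zone timestamp idea, here is a user-space sketch with two idle thresholds: a lower one for reclaiming at all and a higher one for the more expensive per-CPU drain. The struct zone_ts layout, the field names and the threshold values are assumptions for the example, not part of UMA.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone bookkeeping: one timestamp per zone rather
 * than one per bucket, so bucket layout is unaffected. */
struct zone_ts {
    const char *name;
    long last_slab_alloc;   /* time of the last slab allocation, seconds */
    long bucket_cached;     /* items in the zone's shared bucket cache */
    long percpu_cached;     /* items in per-CPU caches (costly to drain) */
};

/* Reclaim from zones that have been idle long enough.
 * idle >= cheap_after     : free the shared bucket cache only.
 * idle >= expensive_after : also drain the per-CPU caches.
 * Returns the number of items released. */
long reclaim_idle_zones(struct zone_ts *zones, size_t n, long now,
                        long cheap_after, long expensive_after)
{
    long freed = 0;
    for (size_t i = 0; i < n; i++) {
        long idle = now - zones[i].last_slab_alloc;
        if (idle < cheap_after)
            continue;               /* recently active: leave it alone */
        freed += zones[i].bucket_cached;
        zones[i].bucket_cached = 0;
        if (idle >= expensive_after) {
            freed += zones[i].percpu_cached;
            zones[i].percpu_cached = 0;
        }
    }
    return freed;
}
```

A per-bucket timestamp would allow finer-grained purging, as suggested earlier in the thread, at the cost of extra space in every bucket.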
May I suggest that you make the change to only CPU_FOREACH once and then commit with your current heuristic? Then we can try to take it one step further.

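The difference between the two loop orders can be sketched by counting binds in a small model. bind_to_cpu() below is a hypothetical stand-in for the costly thread-to-CPU binding step, and the array sizes are arbitrary; nothing here is taken from the patch.

```c
#include <assert.h>

#define SIM_NCPUS  4    /* arbitrary sizes for the model */
#define SIM_NZONES 3

/* Hypothetical state: per-zone, per-CPU cached items and a bind counter. */
struct sim {
    int cache[SIM_NZONES][SIM_NCPUS];
    int binds;
};

/* Stand-in for binding the current thread to a CPU (expensive). */
static void bind_to_cpu(struct sim *s, int cpu)
{
    (void)cpu;
    s->binds++;
}

/* Zone-major order: every zone walks all CPUs, so we bind
 * SIM_NZONES * SIM_NCPUS times. */
int drain_zone_major(struct sim *s)
{
    int freed = 0;
    for (int z = 0; z < SIM_NZONES; z++)
        for (int cpu = 0; cpu < SIM_NCPUS; cpu++) {
            bind_to_cpu(s, cpu);
            freed += s->cache[z][cpu];
            s->cache[z][cpu] = 0;
        }
    return freed;
}

/* CPU-major order: bind once per CPU, then drain every zone's cache
 * for that CPU, so we bind only SIM_NCPUS times. */
int drain_cpu_major(struct sim *s)
{
    int freed = 0;
    for (int cpu = 0; cpu < SIM_NCPUS; cpu++) {
        bind_to_cpu(s, cpu);
        for (int z = 0; z < SIM_NZONES; z++) {
            freed += s->cache[z][cpu];
            s->cache[z][cpu] = 0;
        }
    }
    return freed;
}
```

Both orders release the same memory; the CPU-major one does it with a number of binds that is independent of the zone count, and it treats all zones equally because the low-memory condition is checked once up front rather than between zones.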

Thanks,
Jeff


--
Alexander Motin

_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
