On Thu, Feb 04, 2010 at 11:51:11AM +0500, Anton Maksimenkov wrote:
> 2010/2/3 Ariane van der Steldt <ari...@stack.nl>:
> > On Tue, Feb 02, 2010 at 01:21:55AM +0500, Anton Maksimenkov wrote:
> >> uvm_map_lookup_entry()... in uvm_map_findspace()...
> >
> > I'm pretty sure that function is bugged: I think it doesn't do
> > wrap-around on the search for an entry.
> >
> > Both functions are rather complex. And both functions pre-date random
> > allocation. All this map->hint business has no reason to remain, since
> > random allocation is here to stay and the hint has a high failure rate
> > since then (if it is used, which it usually isn't anyway).
> >
> > The RB_TREE code in uvm_map_findspace is a pretty good solution given
> > the map entry struct. You will need to understand the RB_AUGMENT hack
> > that is only used in this place of the kernel though.
> 
> 2010/2/3 Owain Ainsworth <zer...@googlemail.com>:
> > On Wed, Feb 03, 2010 at 04:33:50PM +0100, Ariane van der Steldt wrote:
> >> I'm pretty sure that function is bugged...
> > Yes. Henning has a bgpd core router where bgpd occasionally dies due to
> 
> 
> Let me introduce my idea.
> 
> In fact, we have only two functions in uvm_map.c which perform
> searches in maps. These are uvm_map_lookup_entry() and
> uvm_map_findspace(). uvm_map_lookup_entry() searches by address (VA),
> while uvm_map_findspace() searches by free space.

uvm_map_findspace() is also used when searching by addr:
mmap(..., MAP_FIXED) requires a search by addr.
uvm_map_findspace() should also be able to take an addr constraint: like
Oga said, i386 has a segmented view of memory wrt W^X.

> And we have a RB_HEAD(uvm_tree, vm_map_entry) rbhead, which is
> "indexed by address" (see uvm_compare()). This RB_TREE therefore
> provides "search by address" to us. That is what the RB_TREE is
> used for.
> But when we try to use that RB_TREE to track "free space" and to
> SEARCH by free space, we do dirty and ugly things!
> Then we add the RB_AUGMENT hack (it is so ugly and hard to use right
> with others that you can't find it anywhere in the source tree...
> except in uvm_map.c) and other tricks...

Nah, RB_AUGMENT is easy. When an item in the tree is inserted or
deleted, each node in the tree that is altered (position or
insertion/removal) has each of its children and each of its parents
RB_AUGMENTed. The RB_AUGMENT calls are done in such a way that each
node is processed prior to RB_PARENT(node) being processed.
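A minimal sketch of the idea (made-up "demo" names, not the actual
uvm_map.c code): cache in each node the largest free gap anywhere in
its subtree, and let RB_AUGMENT recompute that cache from the
children, bottom-up. Note that iirc a plain insertion without a
rotation doesn't call RB_AUGMENT along the path, which is why uvm_map
also fixes up the parent chain by hand (uvm_rb_fixup) after linking or
resizing an entry.

/*
 * Sketch only: RB_AUGMENT must be defined before <sys/tree.h>, so the
 * generated rebalancing code calls it, child before parent.
 */
#define RB_AUGMENT(x)   demo_augment(x)
#include <sys/tree.h>
#include <stddef.h>

struct demo_node {
    RB_ENTRY(demo_node) rb;
    unsigned long start;     /* sort key: address */
    unsigned long ownspace;  /* free gap directly after this entry */
    unsigned long space;     /* largest gap in this subtree */
};

RB_HEAD(demo_tree, demo_node);

static int
demo_cmp(struct demo_node *a, struct demo_node *b)
{
    return (a->start < b->start ? -1 : a->start > b->start);
}

static void
demo_augment(struct demo_node *n)
{
    unsigned long space = n->ownspace;

    /* The children were augmented first, so their caches are valid. */
    if (RB_LEFT(n, rb) != NULL && RB_LEFT(n, rb)->space > space)
        space = RB_LEFT(n, rb)->space;
    if (RB_RIGHT(n, rb) != NULL && RB_RIGHT(n, rb)->space > space)
        space = RB_RIGHT(n, rb)->space;
    n->space = space;
}

RB_PROTOTYPE(demo_tree, demo_node, rb, demo_cmp)
RB_GENERATE(demo_tree, demo_node, rb, demo_cmp)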

> In the end we get very unclear uvm_map.c code in these places, which
> is very hard to understand and track, and which is overloaded and
> overcomplicated by those tricks and crutches.
> We must stop this, because it ties up our hands and brains.
> 
> We can do things very clearly and easily. Let me show exactly how.
> 
> Let's just add a second RB_TREE, which is "indexed by space"; let's
> call it RB_HEAD(uvm_tree_by_space, vm_map_entry) rbhead_by_space, for
> example. That uvm_tree_by_space will contain vm_map_entry'es sorted by
> free space. And that uvm_tree_by_space will be used only in
> uvm_map_findspace() to search for a vm_map_entry with the needed free
> space. RB_NFIND() is our best friend here! Simple, reasonably fast,
> clear and easy to understand.

I use that trick in uvm_pmemrange. It's a nice algorithm, easy to
implement. Keep in mind however: you'll need to support random
allocation, or the diff will be rejected.
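The lookup itself is tiny. A userland toy (made-up "gap" naming, not a
diff) of the by-space tree, with the start address as a tie-break so
equal-sized gaps still compare:

#include <sys/tree.h>
#include <stddef.h>

struct gap {
    RB_ENTRY(gap) rb;
    unsigned long start;  /* first address of the free range */
    unsigned long space;  /* size of the free range */
};

static int
gap_cmp(struct gap *a, struct gap *b)
{
    if (a->space != b->space)
        return (a->space < b->space ? -1 : 1);
    return (a->start < b->start ? -1 : a->start > b->start);
}

RB_HEAD(gap_tree, gap);
RB_PROTOTYPE(gap_tree, gap, rb, gap_cmp)
RB_GENERATE(gap_tree, gap, rb, gap_cmp)

/* First fit: the smallest gap that still has 'need' bytes. */
static struct gap *
gap_first_fit(struct gap_tree *t, unsigned long need)
{
    struct gap key;

    key.space = need;
    key.start = 0;  /* lowest tie-break */
    return (RB_NFIND(gap_tree, t, &key));
}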

I was actually thinking along the same lines: rb-tree of free space,
ordered by size.
Let: start = first rb-entry with enough space for allocation
Let: end = rb_max(free tree)
Let: chosen = random chunk in [start, end]

You'll want indices on the tree, so the random generator can be given a
number to use. (If the number is too large, you'll get bias towards the
largest segments; if too small, you won't consider them.)

Once you have 'chosen', you need to generate a random address inside.

Let: offset = random number between 0 and
     (chosen.end - chosen.start - allocation length)
Now you have a random address inside chosen:
chosen.start + offset.

A disadvantage of the algorithm is that it requires 2 random numbers
(the current algorithm only requires 1). The advantage is that it
always produces a valid address, so no forward searching is necessary
anymore.
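Sketched out (again a toy: the per-subtree nelem count is mine,
maintained through RB_AUGMENT exactly like the space cache; computing
the rank of the first fit is a symmetric walk that I omit):

#define RB_AUGMENT(x)   gap_augment(x)
#include <sys/tree.h>
#include <stdlib.h>  /* arc4random_uniform(3) */

struct gap {
    RB_ENTRY(gap) rb;
    unsigned long start;  /* first address of the gap */
    unsigned long space;  /* size of the gap */
    unsigned long nelem;  /* nodes in this subtree */
};

RB_HEAD(gap_tree, gap);

static int
gap_cmp(struct gap *a, struct gap *b)
{
    if (a->space != b->space)
        return (a->space < b->space ? -1 : 1);
    return (a->start < b->start ? -1 : a->start > b->start);
}

static void
gap_augment(struct gap *g)
{
    g->nelem = 1;
    if (RB_LEFT(g, rb) != NULL)
        g->nelem += RB_LEFT(g, rb)->nelem;
    if (RB_RIGHT(g, rb) != NULL)
        g->nelem += RB_RIGHT(g, rb)->nelem;
}

RB_PROTOTYPE(gap_tree, gap, rb, gap_cmp)
RB_GENERATE(gap_tree, gap, rb, gap_cmp)

/* Walk down to the idx-th gap (0-based, size order) via the counts. */
static struct gap *
gap_select(struct gap_tree *t, unsigned long idx)
{
    struct gap *g = RB_ROOT(t);
    unsigned long lsize;

    while (g != NULL) {
        lsize = RB_LEFT(g, rb) != NULL ? RB_LEFT(g, rb)->nelem : 0;
        if (idx < lsize)
            g = RB_LEFT(g, rb);
        else if (idx == lsize)
            return (g);
        else {
            idx -= lsize + 1;
            g = RB_RIGHT(g, rb);
        }
    }
    return (NULL);
}

/*
 * Two random numbers: one picks a gap in [rank of first fit, total),
 * one picks the address inside it. The caller guarantees a fit
 * exists; page alignment and pmap_prefer adjustment are left out.
 */
static unsigned long
gap_random_addr(struct gap_tree *t, unsigned long firstfit_rank,
    unsigned long len)
{
    unsigned long total = RB_ROOT(t)->nelem;  /* O(1) tree size */
    struct gap *g;

    g = gap_select(t, firstfit_rank +
        arc4random_uniform(total - firstfit_rank));
    return (g->start + arc4random_uniform(g->space - len + 1));
}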

The address may have to be shifted because pmap_prefer says so. If
that's the case, do so whenever possible: it removes aliasing problems
on some architectures (afaik pmap has a way of dealing with it, but
it's hideously expensive when it has to).
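From memory, the shift is just a PMAP_PREFER() call on the candidate
address, roughly what uvm_map.c does today on the architectures that
define it (vaddr_t comes from the uvm headers, uoffset is the file
offset being mapped):

#ifdef PMAP_PREFER
        /*
         * Let pmap bias the virtual address for this file offset,
         * so the mapping lands on a cache-friendly color.
         */
        PMAP_PREFER(uoffset, &addr);
#endif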

> Actually, since we can have two or more vm_map_entry'es with equal
> space, each RB_TREE element will be a list of vm_map_entry'es with
> that equal space. We can use some additional logic here when we
> push/pop a vm_map_entry from that list: we can pop the vm_map_entry
> with the smallest start address, for example.

You can't simply take the lowest start address. It would make addresses
completely predictable.

> Then we must free the first RB_TREE (uvm_tree, "indexed by address")
> from tracking space/ownspace and other nasty tricks, and stop using it
> in uvm_map_findspace(). This uvm_tree must be used only in
> uvm_map_lookup_entry(). Again, RB_NFIND() is our best friend here (of
> course, coupled with RB_LEFT and other logic, etc). And again, it is
> simple, reasonably fast, clear and easy to understand!
> And we can (and must) COMPLETELY STOP using the RB_AUGMENT hack and
> forget about it like a bad dream, at last.

RB_AUGMENT (as it is implemented) is a hack. However the idea behind
RB_AUGMENT is sound. It enables some algorithms that would be hideously
expensive without it (for example, RB_AUGMENT can produce O(1) size
lookups and make the tree indexed). I agree that it is not always the
best solution to a problem, but don't discard it right away: it has its
uses. If you choose the algorithm I described above, you'll need
RB_AUGMENT.

> If we decide that using the RB_TREE at each lookup/findspace is
> "heavy" when there are few vm_map_entry'es in the map, we can just use
> linear search in that case, and use RB_FIND/NFIND only when the number
> of vm_map_entry'es grows beyond some threshold.

I don't like switching search strategies: it makes the code harder
to validate and introduces additional edge cases. Stick to one search
strategy and only introduce a second one if the first one becomes a
bottleneck.

> We can forget about the "hint" and probably drop its usage here in
> uvm_map.c (I'll review it later). RB_FIND/RB_NFIND will do their best,
> especially with "random allocation", and especially when there are
> many vm_map_entry'es in the map.

Hint currently has a dual meaning: it's a hint (you can ignore it)
unless MAP_FIXED is specified, in which case it's a requirement.
The hint is often pretty good: mquery(2) often generates it in a
previous call. However for kernel memory it's horrible: all invocations
of uvm_map_* specify vm_map_min(map) as the hint.
It's just vm_map.hint that is really bad.
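For illustration, the userland flow where the hint is actually good
(toy example, error handling trimmed):

#include <sys/mman.h>
#include <stdio.h>

int
main(void)
{
    size_t len = 64 * 1024;
    void *hint, *p;

    /* Ask the kernel where len bytes would fit. */
    hint = mquery(NULL, len, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    if (hint == MAP_FAILED)
        return (1);

    /* The hint now becomes a hard requirement. */
    p = mmap(hint, len, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        return (1);
    printf("mapped at %p\n", p);
    return (0);
}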

> So, in the end: I have some pieces of a "patch" in my mind, and I'll
> create a diff with the ideas introduced here.

I don't think patches will solve uvm_map issues completely. Proper
redesign is probably the way to go.

> If you have any positives or negatives, please say so here.

You'll want to be able to join free entries that are adjacent back
together. So the free entries will have to be in both the addr and the
free tree.

There are two more issues: kva and guard pages.

kva is limited and running random allocation on sparc (4MB total kva)
would add a lot of memory pressure on the kernel. It's easy to eliminate
that need of course: just don't use the random algorithm when kva is
concerned.

guard pages in the kernel are useful to check for access before the
start or after the end of the allocated range. The guard pages also
introduce slow recycling: all (actually, most) other memory will be
visited before the most recently freed entry is revisited.
Guard pages are a hassle in some cases, when entries are put in
separately, but need to be joined together (no guard page in between)
for some operations in the kernel (mapping userspace memory into the
kernel). A proper implementation would probably count the guard pages as
free as a last resort, when a mapping may not fail, but count them as
in-use otherwise (except for fixed mappings).
Fortunately, those guard pages are only in the kernel map, not in
process maps.

To summarize:
- random allocation for any userspace (and maybe for some kernel space?
  it'd be really sweet to be able to turn that on)
- smallest fit for kernel maps (reduce fragmentation, important on small
  kva architectures like sparc)
- address lookup for free memory (needed for MAP_FIXED in mmap(2))
- space lookup for free memory (it's fast and provable, I want it)
- guard pages (optionally turned on for when we suspect a
  mis-allocation in kernel map, you can tie it to kva_guardpages)
- slow recycle (optionally turned on for when we suspect a
  use-after-free bug in kernel map)
- pmap_prefer may disagree with your suggested vaddr, in which case you
  may need to select a different segment after all

One more problem you may face, is uvm_km_getpage starvation. If it
happens, you'll be unable to allocate a vm_map_entry. In your design,
you'll need 2 per allocation most of the time. If that's too much
pressure, the kernel may starve itself and be unable to create new
entries, becoming completely jammed. This is a problem on any
non-pmap_direct architectures (like i386).

I think you'll need 2 trees for this design to work, one of them
RB_AUGMENTed.

Should you choose to implement this, I'd be happy to buy you a beer:
I'm always happy if others want to do things on my todo list ;). You may
want to have a look at uvm_pmemrange; it exploits the same tricks wrt
trees as you are proposing, although it's only concerned with free
memory (since it's the physmem allocator). An old version (that won't
apply to a current tree anymore) lies around in cvs' attic, or bug me
for a version that does apply.

Ciao,
-- 
Ariane
