Re: Examining the VM splay tree effectiveness

2010-10-03 Thread Ivan Voras
On 10/01/10 20:28, Ed Schouten wrote:
> Andre,
> 
> * Andre Oppermann  wrote:
>> A splay tree is an interesting binary search tree with insertion,
>> lookup and removal performed in O(log n) *amortized* time.  With
>> the *amortized* time being the crucial difference to other binary trees.
>> On every access *including* lookup it rotates the tree to make the
>> just found element the new root node.  For all gory details see:
>>  http://en.wikipedia.org/wiki/Splay_tree
> 
> Even though a red-black tree is quite good since it guarantees a $2 \log
> n$ upper bound, the problem is that it's quite computationally intensive.
> 
> Maybe it would be worth looking at other types of balanced trees? For
> example, another type of tree which has only $O(\log n)$ amortized
> insertion/removal/lookup time, but could already be a lot better in
> practice, is a Treap.

How many elements are held in vm_map trees? From the source it looks
like one tree with all pages in the system and then one per-process?

Trees have widely varying real-world performance characteristics, e.g.:

http://attractivechaos.awardspace.com/udb.html
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.6795&rep=rep1&type=pdf




Re: Examining the VM splay tree effectiveness

2010-10-03 Thread Ivan Voras
On 10/01/10 10:54, Andre Oppermann wrote:
> On 30.09.2010 19:51, Ivan Voras wrote:
>> On 09/30/10 18:37, Andre Oppermann wrote:
>>
>>> Both the vmmap and page table make use of splay trees to manage the
>>> entries and to speed up lookups compared to long to traverse linked
>>> lists or more memory expensive hash tables.  Some structures though
>>> do have an additional linked list to simplify ordered traversals.
>>
>> The property of a splay tree requiring *writes* for nearly every read
>> really is a thorn in the side for SMP. It seems to me that even if the
>> immediate benefits from converting to something else are not directly
>> observable, it will still be worth doing it.
> 
> Fully agreed.
> 
>> It's a shame that RCU is still a patent minefield :/
>>
>> http://mirror.leaseweb.com/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf
>>
> 
> I'm not convinced that RCU is the only scalable way of sharing a
> data structure across a possibly large number of CPU's.

Of course, it's just the one that is well understood currently.

> The term "lockless" is often used and frequently misunderstood.
> Having a lockless data structure *doesn't* mean that it is either
> performant, scalable or both.  It heavily depends on a number
> of additional factors.  Many times "lockless" just replaces a
> simple lock/unlock cycle with a number of atomic operations on
> the data structure.  This can easily backfire because an atomic

Yes.

> operation just hides the computational complexity and also dirties
> the CPU's cache lines.  Generally on cache coherent architectures
> almost all of the atomic operation is in HW with bus lock cycles,
> bus snooping and whatnot.  While seemingly simple from the programmer's
> point of view, the overhead and latency is still there.  Needless

Yes, you basically just offload the operation to hardware, but the steps
it needs to take are conceptually the same.

>  a) make sure the lock is held for only a small amount of time
> to avoid lock contention.
>  b) do everything you can outside of the lock.
>  c) if the lock is found to be heavily contended rethink the
> whole approach and check if other data structures can be used.
>  d) minimize write accesses to memory in the lock protected
> shared data structure.
>  e) PROFILE, DON'T SPECULATE! Measure the access pattern and
> measure the locking/data access strategy's cost in terms
> of CPU cycles consumed.
> 
>  f) on lookup heavy data structures avoid writing to memory and
> by it dirtying CPU caches.
>  g) on modify heavy data structures avoid touching too many
> elements.
>  h) on lookup and modify heavy data structure that are used
> across many CPU's all bets are off and a different data
> structure approach should be considered resulting ideally
> in case f).
> 
> It all starts with the hypothesis that a data structure is not
> optimally locked.

This looks like material for a developer-centric wiki page :) There is a
lot of dispersed wisdom in this thread that would be nice to gather
in one place.





Re: Examining the VM splay tree effectiveness

2010-10-01 Thread Ed Schouten
Andre,

* Andre Oppermann  wrote:
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree

Even though a red-black tree is quite good since it guarantees a $2 \log
n$ upper bound, the problem is that it's quite computationally intensive.

Maybe it would be worth looking at other types of balanced trees? For
example, another type of tree which has only $O(\log n)$ amortized
insertion/removal/lookup time, but could already be a lot better in
practice, is a Treap.
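
To make that concrete, here is a minimal treap sketch (purely
illustrative, hypothetical names, userland C using arc4random(3) for the
priorities):

#include <stdlib.h>

struct treap {
    unsigned long  key;            /* e.g. a start address */
    unsigned int   prio;           /* random heap priority */
    struct treap  *left, *right;
};

static struct treap *
rotate_right(struct treap *n)
{
    struct treap *l = n->left;
    n->left = l->right;
    l->right = n;
    return (l);
}

static struct treap *
rotate_left(struct treap *n)
{
    struct treap *r = n->right;
    n->right = r->left;
    r->left = n;
    return (r);
}

/* BST insert by key; the random priority is bubbled up heap-wise, which
 * keeps the expected depth at O(log n) without any work on the lookup
 * path. */
static struct treap *
treap_insert(struct treap *root, struct treap *n)
{
    if (root == NULL) {
        n->prio = arc4random();
        n->left = n->right = NULL;
        return (n);
    }
    if (n->key < root->key) {
        root->left = treap_insert(root->left, n);
        if (root->left->prio > root->prio)
            root = rotate_right(root);
    } else {
        root->right = treap_insert(root->right, n);
        if (root->right->prio > root->prio)
            root = rotate_left(root);
    }
    return (root);
}

The point for this discussion: lookups are ordinary BST searches that
write nothing; only insert and remove rotate.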

Greetings,
-- 
 Ed Schouten 
 WWW: http://80386.nl/




Re: Examining the VM splay tree effectiveness

2010-10-01 Thread Andre Oppermann

On 30.09.2010 19:51, Ivan Voras wrote:

On 09/30/10 18:37, Andre Oppermann wrote:


Both the vmmap and page table make use of splay trees to manage the
entries and to speed up lookups compared to long to traverse linked
lists or more memory expensive hash tables.  Some structures though
do have an additional linked list to simplify ordered traversals.


The property of a splay tree requiring *writes* for nearly every read
really is a thorn in the side for SMP. It seems to me that even if the
immediate benefits from converting to something else are not directly
observable, it will still be worth doing it.


Fully agreed.


It's a shame that RCU is still a patent minefield :/

http://mirror.leaseweb.com/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf


I'm not convinced that RCU is the only scalable way of sharing a
data structure across a possibly large number of CPU's.

The term "lockless" is often used and frequently misunderstood.
Having a lockless data structure *doesn't* mean that it is either
performant, scalable or both.  It heavily depends on a number
of additional factors.  Many times "lockless" just replaces a
simple lock/unlock cycle with a number of atomic operations on
the data structure.  This can easily backfire because an atomic
operation just hides the computational complexity and also dirties
the CPU's cache lines.  Generally on cache coherent architectures
almost all of the atomic operation is in HW with bus lock cycles,
bus snooping and whatnot.  While seemingly simple from the programmer's
point of view, the overhead and latency are still there.  Needless
to say, on architectures with more relaxed memory ordering (SPARC,
Alpha, ...) the overhead is even higher.  Also important in the overall
picture are the memory barrier semantics of locks.  Some newer CPU's
have started to provide hardware implemented lock managers and combine
them with SMT features so that access to an already held lock causes an
immediate HW thread switch, and a switch back on unlock.  We also
have rm_locks (read mostly locks) that do not require synchronization
on read but have a more expensive write lock.  In UMA we use a mix
of global pools of elements with per-CPU caching of free elements.
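
To illustrate the read-mostly case, a hedged sketch of rmlock(9) usage on
a hypothetical "foo" table (made-up names, not code from the tree; see
the man page for the exact KPI):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

/* Hypothetical read-mostly lookup table. */
struct foo {
    unsigned long  key;
    struct foo    *next;
};

static struct foo    *foo_head;
static struct rmlock  foo_rm;

static void
foo_init(void)
{
    rm_init(&foo_rm, "foo_rm");
}

/* Read side: no writes to the shared structure, so concurrent lookups
 * don't bounce its cache lines between CPUs. */
static struct foo *
foo_lookup(unsigned long key)
{
    struct rm_priotracker tracker;
    struct foo *f;

    rm_rlock(&foo_rm, &tracker);
    for (f = foo_head; f != NULL; f = f->next)
        if (f->key == key)
            break;
    rm_runlock(&foo_rm, &tracker);
    return (f);
}

/* Write side pays the full synchronization cost, which is acceptable as
 * long as modifications are rare compared to lookups. */
static void
foo_insert(struct foo *f)
{
    rm_wlock(&foo_rm);
    f->next = foo_head;
    foo_head = f;
    rm_wunlock(&foo_rm);
}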

As always the best approach depends on the dominant access pattern
of a structure.  It all comes down to the amortization cost of the
different locking or "lockless" strategies.  It's also important
to make sure that worst case behavior doesn't bring down the box.

As a rule of thumb I use this:

 a) make sure the lock is held for only a small amount of time
to avoid lock contention.
 b) do everything you can outside of the lock.
 c) if the lock is found to be heavily contended rethink the
whole approach and check if other data structures can be used.
 d) minimize write accesses to memory in the lock protected
shared data structure.
 e) PROFILE, DON'T SPECULATE! Measure the access pattern and
measure the locking/data access strategy's cost in terms
of CPU cycles consumed.

 f) on lookup heavy data structures avoid writing to memory and
by it dirtying CPU caches.
 g) on modify heavy data structures avoid touching too many
elements.
 h) on lookup and modify heavy data structure that are used
across many CPU's all bets are off and a different data
structure approach should be considered resulting ideally
in case f).

It all starts with the hypothesis that a data structure is not
optimally locked.
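
A tiny, hypothetical userland sketch of rules a), b) and d): all
allocation and initialization happens outside the lock, and the critical
section is reduced to a single pointer update:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    unsigned long  key;
    char           payload[64];
    struct entry  *next;
};

static pthread_mutex_t table_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct entry *table_head;

int
table_insert(unsigned long key, const char *data)
{
    /* Rule b): do the expensive work outside of the lock. */
    struct entry *e = malloc(sizeof(*e));
    if (e == NULL)
        return (-1);
    e->key = key;
    strncpy(e->payload, data, sizeof(e->payload) - 1);
    e->payload[sizeof(e->payload) - 1] = '\0';

    /* Rules a) and d): the critical section is one pointer update. */
    pthread_mutex_lock(&table_mtx);
    e->next = table_head;
    table_head = e;
    pthread_mutex_unlock(&table_mtx);
    return (0);
}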

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:38, Alfred Perlstein wrote:

Andre,

Your observations on the effectiveness of the splay tree
mirror the concerns I have with it when I read about it.

I have always wondered though if the splay-tree algorithm
was modified to only perform rotations when a lookup required
more than "N" traversals to reach a node.

This would self-balance the tree and maintain cache without
the expense of writes for nearly all lookups.


This would break the underlying assumption of the splay tree.
A splay tree is not a balanced tree.  With this approach you
can easily get stuck in a sub-optimal situation where many
lookups traverse N-1 nodes without getting the cache benefit.
When N nodes are traversed for an outlier and the rotate happens,
you either rotate again on the next lookup or get stuck in another
sub-optimal situation.  In effect you get the worst of a splay
tree without being able to gain from the amortization effect
when the root node hit rate is high.


I'm wondering if you have the time/interest in rerunning
your tests, but modifying the algorithm to only rebalance
the splay if a lookup requires more than let's say 3, 5, 7
or 10 traversals.


As the assumption of the algorithm is broken I don't think it's
useful to spend time on this.  I'd rather try adding a last-match
pointer to the RB tree to capture the locality effect in vmmap.
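
A toy sketch of that idea using the tree(3) RB macros (hypothetical
names, not the actual vm_map code; how the hint update interacts with
read locking is exactly the part that needs more thought):

#include <sys/tree.h>

struct mapent {
    unsigned long     start;
    RB_ENTRY(mapent)  link;
};

static int
mapent_cmp(struct mapent *a, struct mapent *b)
{
    return (a->start < b->start ? -1 : a->start > b->start);
}

RB_HEAD(maptree, mapent);
RB_GENERATE(maptree, mapent, link, mapent_cmp);

struct map {
    struct maptree  tree;   /* populated with RB_INSERT() as usual */
    struct mapent  *hint;   /* last node found; purely an optimization */
};

static struct mapent *
map_lookup(struct map *m, unsigned long start)
{
    struct mapent key, *e;

    /* Hint hit: no tree descent and no tree modification at all. */
    if (m->hint != NULL && m->hint->start == start)
        return (m->hint);

    key.start = start;
    e = RB_FIND(maptree, &m->tree, &key);
    if (e != NULL)
        m->hint = e;    /* single word write, done lazily, unlike a splay */
    return (e);
}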

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:01, Alan Cox wrote:

On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann  wrote:


On 30.09.2010 18:37, Andre Oppermann wrote:


Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.



Correcting myself regarding the history: The splay tree for vmmap was
done about 8 years ago by alc@ to replace a simple linked list and was
a huge improvement.  The change in vmpage from a hash to the same splay
tree as in vmmap was committed by dillon@ about 7.5 years ago with some
involvement of a...@.



Yes, and there is a substantial difference in the degree of locality of
access to these different structures, and thus the effectiveness of a splay
tree.  When I did the last round of changes to the locking on the vm map, I
made some measurements of the splay tree's performance on a JVM running a
moderately large bioinformatics application.  The upshot was that the
average number of map entries visited on an access to the vm map's splay
tree was less than the expected depth of a node in a perfectly balanced
tree.


This is indeed an expected property of the splay tree.  For the vmmap
the root node hit rate (the root being the most recently accessed node)
is very high.  Other frequently accessed nodes should be high up too.
Nonetheless the churn rate on the nodes from tree rotations is
considerable.

The question now is whether we can get the same positive effect with
fewer negative side effects.  Of particular concern is the CPU cache
dirtying and thrashing caused by the splay rotations, especially with
today's and tomorrow's many-core, many-socket, many-threads/processes
machines.  Another concern is worst case behavior when the root node
hit rate is low.  Currently this is certainly the case in vmpage.  It
may also become a concern in vmmap with the use of memguard (fragmenting
the address space) or an adversary trying to exploit worst case behavior
on a shared host by mapping only every other page.

This concern is supported by the measurements of the 2008 SoC student
Mayur (thanks to rdivacky@ for digging up his page):
 http://wiki.freebsd.org/MayurShardul/VM_Algorithm_Improvement

In the measurements of his alternative radix trie implementation the
lookup time is *230 times less* than for the splay tree (as measured
with the TSC).  The description of the performed benchmark is extremely
sparse and it is not clear what access pattern he simulated; probably
a random lookup pattern, which is of course very bad for the splay tree.
Any balanced tree will be better for random lookups.

Whether his measured improvement is due to using a radix trie, or whether
a normal balanced binary tree would perform equally well, I can't say (yet).
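
For comparison, the core of a radix trie lookup performs no writes and is
bounded by the key width rather than by tree balance; a minimal,
hypothetical sketch (not Mayur's code) over 32-bit page indices:

#include <stddef.h>
#include <stdint.h>

/* Fixed-stride radix tree, 4 bits per level => 8 levels for 32 bits.
 * A lookup follows at most 8 pointers and modifies nothing, regardless
 * of the access pattern. */
#define RADIX_BITS   4
#define RADIX_SLOTS  (1 << RADIX_BITS)
#define RADIX_LEVELS (32 / RADIX_BITS)

struct radix_node {
    void *slot[RADIX_SLOTS];   /* interior: child nodes; leaf: page pointers */
};

static void *
radix_lookup(struct radix_node *root, uint32_t pindex)
{
    struct radix_node *n = root;
    int level;

    for (level = RADIX_LEVELS - 1; n != NULL && level > 0; level--)
        n = n->slot[(pindex >> (level * RADIX_BITS)) & (RADIX_SLOTS - 1)];
    if (n == NULL)
        return (NULL);
    return (n->slot[pindex & (RADIX_SLOTS - 1)]);
}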


I teach class shortly.  I'll provide more details later.


--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Ivan Voras
On 09/30/10 20:38, Alfred Perlstein wrote:
> Andre,
> 
> Your observations on the effectiveness of the splay tree
> mirror the concerns I have with it when I read about it.
> 
> I have always wondered though if the splay-tree algorithm
> was modified to only perform rotations when a lookup required
> more than "N" traversals to reach a node.
> 
> This would self-balance the tree and maintain cache without 
> the expense of writes for nearly all lookups.
> 
> I'm wondering if you have the time/interest in rerunning
> your tests, but modifying the algorithm to only rebalance
> the splay if a lookup requires more than let's say 3, 5, 7
> or 10 traversals.

I see two possible problems with this:

1: the validity of this heuristic: the splay is not meant to help the
current lookup but future lookups, and if a lookup "now" requires e.g. a
5-deep traversal, then (barring external information about the structures -
maybe some inner relationship of the nodes can be exploited) the next
lookup is, I think, about equally likely to hit that rotated node as the
former root node.

2: rotating only on the N'th lookup would have to go like this:

   1. take a read-only lock
   2. make the lookup, count the depth
   3. if depth > N:
  1. relock for write (lock upgrade will not always work)
  2. recheck if the tree is still the same; bail if it isn't
  3. do the splay
   4. unlock

i.e. suspiciously complicated, at least if you want to take advantage of
read parallelism; if the tree is write-locked all the time it's simpler,
but then it's just inefficient.
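
Roughly, in code (a hypothetical sketch; node lifetime across the unlock
is ignored, and a generation counter stands in for "recheck if the tree
is still the same"):

#include <pthread.h>

struct node { unsigned long key; struct node *left, *right; };

struct stree {
    pthread_rwlock_t  lock;    /* initialized elsewhere */
    struct node      *root;
    unsigned long     gen;     /* bumped on every structural change */
};

/* Splays the node with 'key' to the root; implementation elided. */
static void splay_to_root(struct stree *t, unsigned long key);

struct node *
stree_lookup(struct stree *t, unsigned long key, int maxdepth)
{
    struct node *n;
    unsigned long gen;
    int depth = 0;

    pthread_rwlock_rdlock(&t->lock);                          /* 1. read lock */
    gen = t->gen;
    for (n = t->root; n != NULL && n->key != key; depth++)    /* 2. lookup    */
        n = (key < n->key) ? n->left : n->right;
    pthread_rwlock_unlock(&t->lock);

    if (n != NULL && depth > maxdepth) {                      /* 3. too deep  */
        pthread_rwlock_wrlock(&t->lock);                      /* 3.1 relock   */
        if (t->gen == gen) {                                  /* 3.2 recheck  */
            splay_to_root(t, key);                            /* 3.3 splay    */
            t->gen++;
        }
        pthread_rwlock_unlock(&t->lock);                      /* 4. unlock    */
    }
    return (n);
}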

Of course, real-world measurements trump theory :)



Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alfred Perlstein
Andre,

Your observations on the effectiveness of the splay tree
mirror the concerns I have with it when I read about it.

I have always wondered, though, what would happen if the splay-tree
algorithm were modified to only perform rotations when a lookup
required more than "N" traversals to reach a node.

This would self-balance the tree and maintain cache without 
the expense of writes for nearly all lookups.
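
A hypothetical sketch of such a lookup (plain BST descent counting the
depth, splaying only past a threshold; the splay itself is elided):

struct snode { unsigned long key; struct snode *left, *right, *parent; };

/* Standard splay rotations moving 'n' to the root; elided here. */
static void splay(struct snode **rootp, struct snode *n);

struct snode *
lookup_splay_if_deep(struct snode **rootp, unsigned long key, int threshold)
{
    struct snode *n = *rootp;
    int depth = 0;

    while (n != NULL && n->key != key) {
        n = (key < n->key) ? n->left : n->right;
        depth++;
    }
    /* Shallow hits stay read-only; only deep hits pay for rotations. */
    if (n != NULL && depth > threshold)
        splay(rootp, n);
    return (n);
}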

I'm wondering if you have the time/interest in rerunning
your tests, but modifying the algorithm to only rebalance
the splay if a lookup requires more than let's say 3, 5, 7
or 10 traversals.

what do you think?

-Alfred

* Andre Oppermann  [100930 09:38] wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach if I read the history correctly) in
> the FreeBSD VM system suspecting an interesting journey.
> 
> The VM system has two major structures:
>  1) the VM map which is per process and manages its address space
> by tracking all regions that are allocated and their permissions.
>  2) the VM page table which is per VM map and provides the backing
> memory pages to the regions and interfaces with the pmap layer
> (virtual->physical).
> 
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability is crucial for many operations beginning with fork/
> exec, going through malloc/mmap and ending with free/exit of a process.
> 
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long to traverse linked
> lists or more memory expensive hash tables.  Some structures though
> do have an additional linked list to simplify ordered traversals.
> 
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree
> 
> This behavior has a few important implications:
>  1) the root node and constant rotation work like a cache, with the
> most recently looked-up node at the top and less recently used ones
> close to the top;
>  2) every lookup that doesn't match the root node, ie. a cache miss,
> causes a rotation of the tree to make the newly found node the new
> root;
>  3) during the rotate it has to re-arrange and touch possibly a large
> number of other nodes;
>  4) in worst case the tree may resemble a linked list.  A splay tree
> makes no guarantee that it is balanced.
> 
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balance if the majority of
> lookups are root node (cache) hits, otherwise the cost of
> rebalancing destroys any advantage and quickly turns into a
> large overhead.
>  b) the overhead becomes even worse due to touching many nodes and
> causing a lot of CPU cache line dirtying.  For processes with
> shared memory or threads across CPU's this overhead becomes
> almost excessive because the CPU's stomp on each others cached
> nodes and get their own caches invalidated.  The penalty is a
> high-latency memory refetch not only for the root node lookup
> but every hop in the following rebalancing operation => thrashing.
> 
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay tree in the vmmap and page table.
> 
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
> 
> The results are very interesting:
> 
>  The page table shows a *very* poor root node hit rate and an excessive
>  amount of rotational node modifications representing almost the
>  worst case for splay trees.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly horrible even on small
> deviations from the normal case in the vmmap.  For the page table it
> is always horrible.  Not to mention the locking complications due to
> modifying the tree on lookups.
> 
> Based on an examinati

Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Ivan Voras
On 09/30/10 20:01, Alan Cox wrote:
> On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann  wrote:
> 
>> On 30.09.2010 18:37, Andre Oppermann wrote:
>>
>>> Just for the kick of it I decided to take a closer look at the use of
>>> splay trees (inherited from Mach if I read the history correctly) in
>>> the FreeBSD VM system suspecting an interesting journey.
>>>
>>
>> Correcting myself regarding the history: The splay tree for vmmap was
>> done about 8 years ago by alc@ to replace a simple linked list and was
>> a huge improvement.  The change in vmpage from a hash to the same splay
>> tree as in vmmap was committed by dillon@ about 7.5 years ago with some
>> involvement of a...@.

> Yes, and there is a substantial difference in the degree of locality of
> access to these different structures, and thus the effectiveness of a splay
> tree.  When I did the last round of changes to the locking on the vm map, I
> made some measurements of the splay tree's performance on a JVM running a
> moderately large bioinformatics application.  The upshot was that the
> average number of map entries visited on an access to the vm map's splay
> tree was less than the expected depth of a node in a perfectly balanced
> tree.

Sorry, I'm not sure how to parse that - are you saying that the splaying
helped, making the number of nodes visited during a tree search smaller
than the average node depth in a balanced tree?

Even if so, wouldn't the excessive bandwidth lost in the splaying ops
(and worse, write bandwidth) make it unsuitable today?




Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alan Cox
On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann  wrote:

> On 30.09.2010 18:37, Andre Oppermann wrote:
>
>> Just for the kick of it I decided to take a closer look at the use of
>> splay trees (inherited from Mach if I read the history correctly) in
>> the FreeBSD VM system suspecting an interesting journey.
>>
>
> Correcting myself regarding the history: The splay tree for vmmap was
> done about 8 years ago by alc@ to replace a simple linked list and was
> a huge improvement.  The change in vmpage from a hash to the same splay
> tree as in vmmap was committed by dillon@ about 7.5 years ago with some
> involvement of a...@.
>

Yes, and there is a substantial difference in the degree of locality of
access to these different structures, and thus the effectiveness of a splay
tree.  When I did the last round of changes to the locking on the vm map, I
made some measurements of the splay tree's performance on a JVM running a
moderately large bioinformatics application.  The upshot was that the
average number of map entries visited on an access to the vm map's splay
tree was less than the expected depth of a node in a perfectly balanced
tree.

I teach class shortly.  I'll provide more details later.

Regards,
Alan


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 19:15, Matthew Fleming wrote:

On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann  wrote:

Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.

The VM system has two major structures:
  1) the VM map which is per process and manages its address space
by tracking all regions that are allocated and their permissions.
  2) the VM page table which is per VM map and provides the backing
memory pages to the regions and interfaces with the pmap layer
(virtual->physical).

Both of these are very frequently accessed through memory allocations
from userspace and page faults from the pmap layer.  Their efficiency
and SMP scalability is crucial for many operations beginning with fork/
exec, going through malloc/mmap and ending with free/exit of a process.

Both the vmmap and page table make use of splay trees to manage the
entries and to speed up lookups compared to long to traverse linked
lists or more memory expensive hash tables.  Some structures though
do have an additional linked list to simplify ordered traversals.

A splay tree is an interesting binary search tree with insertion,
lookup and removal performed in O(log n) *amortized* time.  With
the *amortized* time being the crucial difference to other binary trees.
On every access *including* lookup it rotates the tree to make the
just found element the new root node.  For all gory details see:
  http://en.wikipedia.org/wiki/Splay_tree

This behavior has a few important implications:
  1) the root node and constant rotation work like a cache, with the
most recently looked-up node at the top and less recently used ones
close to the top;
  2) every lookup that doesn't match the root node, ie. a cache miss,
causes a rotation of the tree to make the newly found node the new
root;
  3) during the rotate it has to re-arrange and touch possibly a large
number of other nodes;
  4) in worst case the tree may resemble a linked list.  A splay tree
makes no guarantee that it is balanced.

For the kernel and the splay tree usage in the VM map and page table
some more implications come up:
  a) the amortization effect/cost only balance if the majority of
lookups are root node (cache) hits, otherwise the cost of
rebalancing destroys any advantage and quickly turns into a
large overhead.
  b) the overhead becomes even worse due to touching many nodes and
causing a lot of CPU cache line dirtying.  For processes with
shared memory or threads across CPU's this overhead becomes
almost excessive because the CPU's stomp on each others cached
nodes and get their own caches invalidated.  The penalty is a
high-latency memory refetch not only for the root node lookup
but every hop in the following rebalancing operation =>  thrashing.

To quantify the positive and negative effects of the splay tree I
instrumented the code and added counters to track the behavior of
the splay tree in the vmmap and page table.

The test case is an AMD64 kernel build to get a first overview.
Other system usages may have different results depending on their
fork and memory usage patterns.

The results are very interesting:

  The page table shows a *very* poor root node hit rate and an excessive
  amount of rotational node modifications representing almost the
  worst case for splay trees.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

  The vmmap shows a high root node hit rate but still a significant
  amount of rotational node modifications.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

 From these observations I come to the conclusion that a splay tree
is suboptimal for the normal case and quickly horrible even on small
deviations from the normal case in the vmmap.  For the page table it
is always horrible.  Not to mention the locking complications due to
modifying the tree on lookups.

Based on an examination of other binary trees I put forward the
hypothesis that a Red-Black tree is a much better, if not the optimal,
fit for the vmmap and page table.  RB trees are balanced binary trees
and O(log n) in all cases.  The big advantage in this context is that
lookups are pure reads and do not cause CPU cache invalidations on
other CPU's and always only require a read lock without the worst
case properties of the unbalanced splay tree.  The high cache locality
of the vmmap lookups can be used with the RB tree as well by simply
adding a pointer to the most recently found node.  To prevent write locking
this can be done lazily.  More profiling is required to make
a non-speculative statement on this t

Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Roman Divacky
are you aware of the Summer of Code 2008 project by Mayur Shardul?

quoting: http://www.freebsd.org/projects/summerofcode-2008.html

Project: VM Algorithm Improvement
Student: Mayur Shardul
Mentor: Jeff Roberson
Summary:
A new data structure, viz. radix tree, was implemented and used for
management of the resident pages.  The objective is efficient use of
memory and faster performance.  The biggest challenge was to service
insert requests on the data structure without blocking.  Because of this
constraint memory allocation failures were not acceptable; to solve the
problem the required memory was allocated at boot time.  Both data
structures were used in parallel to check the correctness, and we also
benchmarked the data structures and found that radix trees gave much
better performance than splay trees.

Ready to enter CVS/SVN: We will investigate some more approaches to
handle allocation failures before the new data structure goes in CVS.
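
The non-blocking insert constraint mentioned above can be met with a node
pool carved out at boot; a hypothetical, simplified sketch (not Mayur's
actual code, locking omitted):

#include <stddef.h>

#define POOL_NODES 4096

struct rnode {
    struct rnode *freelink;
    void         *slot[16];
};

static struct rnode  pool[POOL_NODES];   /* sized once, at boot */
static struct rnode *freelist;

static void
pool_init(void)
{
    int i;

    for (i = 0; i < POOL_NODES; i++) {
        pool[i].freelink = freelist;
        freelist = &pool[i];
    }
}

/* Never sleeps: either hands out a preallocated node or fails fast. */
static struct rnode *
node_alloc(void)
{
    struct rnode *n = freelist;

    if (n != NULL)
        freelist = n->freelink;
    return (n);    /* NULL means the boot-time reserve is exhausted */
}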


On Thu, Sep 30, 2010 at 06:37:38PM +0200, Andre Oppermann wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach if I read the history correctly) in
> the FreeBSD VM system suspecting an interesting journey.
> 
> The VM system has two major structures:
>  1) the VM map which is per process and manages its address space
> by tracking all regions that are allocated and their permissions.
>  2) the VM page table which is per VM map and provides the backing
> memory pages to the regions and interfaces with the pmap layer
> (virtual->physical).
> 
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability is crucial for many operations beginning with fork/
> exec, going through malloc/mmap and ending with free/exit of a process.
> 
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long to traverse linked
> lists or more memory expensive hash tables.  Some structures though
> do have an additional linked list to simplify ordered traversals.
> 
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree
> 
> This behavior has a few important implications:
>  1) the root node and constant rotation work like a cache, with the
> most recently looked-up node at the top and less recently used ones
> close to the top;
>  2) every lookup that doesn't match the root node, ie. a cache miss,
> causes a rotation of the tree to make the newly found node the new
> root;
>  3) during the rotate it has to re-arrange and touch possibly a large
> number of other nodes;
>  4) in worst case the tree may resemble a linked list.  A splay tree
> makes no guarantee that it is balanced.
> 
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balance if the majority of
> lookups are root node (cache) hits, otherwise the cost of
> rebalancing destroys any advantage and quickly turns into a
> large overhead.
>  b) the overhead becomes even worse due to touching many nodes and
> causing a lot of CPU cache line dirtying.  For processes with
> shared memory or threads across CPU's this overhead becomes
> almost excessive because the CPU's stomp on each others cached
> nodes and get their own caches invalidated.  The penalty is a
> high-latency memory refetch not only for the root node lookup
> but every hop in the following rebalancing operation => thrashing.
> 
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay tree in the vmmap and page table.
> 
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
> 
> The results are very interesting:
> 
>  The page table shows a *very* poor root node hit rate and an excessive
>  amount of rotational node modifications representing almost the
>  worst case for splay trees.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|ca

Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 18:37, Andre Oppermann wrote:

Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.


Correcting myself regarding the history: The splay tree for vmmap was
done about 8 years ago by alc@ to replace a simple linked list and was
a huge improvement.  The change in vmpage from a hash to the same splay
tree as in vmmap was committed by dillon@ about 7.5 years ago with some
involvement of a...@.

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Matthew Fleming
On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann  wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach if I read the history correctly) in
> the FreeBSD VM system suspecting an interesting journey.
>
> The VM system has two major structures:
>  1) the VM map which is per process and manages its address space
>    by tracking all regions that are allocated and their permissions.
>  2) the VM page table which is per VM map and provides the backing
>    memory pages to the regions and interfaces with the pmap layer
>    (virtual->physical).
>
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability is crucial for many operations beginning with fork/
> exec, going through malloc/mmap and ending with free/exit of a process.
>
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long to traverse linked
> lists or more memory expensive hash tables.  Some structures though
> do have an additional linked list to simplify ordered traversals.
>
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree
>
> This behavior has a few important implications:
>  1) the root node and constant rotation work like a cache, with the
>    most recently looked-up node at the top and less recently used ones
>    close to the top;
>  2) every lookup that doesn't match the root node, ie. a cache miss,
>    causes a rotation of the tree to make the newly found node the new
>    root;
>  3) during the rotate it has to re-arrange and touch possibly a large
>    number of other nodes;
>  4) in worst case the tree may resemble a linked list.  A splay tree
>    makes no guarantee that it is balanced.
>
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balance if the majority of
>    lookups are root node (cache) hits, otherwise the cost of
>    rebalancing destroys any advantage and quickly turns into a
>    large overhead.
>  b) the overhead becomes even worse due to touching many nodes and
>    causing a lot of CPU cache line dirtying.  For processes with
>    shared memory or threads across CPU's this overhead becomes
>    almost excessive because the CPU's stomp on each others cached
>    nodes and get their own caches invalidated.  The penalty is a
>    high-latency memory refetch not only for the root node lookup
>    but every hop in the following rebalancing operation => thrashing.
>
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay tree in the vmmap and page table.
>
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
>
> The results are very interesting:
>
>  The page table shows a *very* poor root node hit rate and an excessive
>  amount of rotational node modifications representing almost the
>  worst case for splay trees.
>
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
>
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly horrible even on small
> deviations from the normal case in the vmmap.  For the page table it
> is always horrible.  Not to mention the locking complications due to
> modifying the tree on lookups.
>
> Based on an examination of other binary trees I put forward the
> hypothesis that a Red-Black tree is a much better, if not the optimal,
> fit for the vmmap and page table.  RB trees are balanced binary trees
> and O(log n) in all cases.  The big advantage in this context is that
> lookups are pure reads and do not cause CPU cache invalidations on
> other CPU's and always only require a read lock without the worst
> case properties of the unbalanced splay tree.  The high cache locality
> of the vmmap lookups can be used with the RB tree as well by simply
> adding a pointer to the most recently found node.  To prevent write locking
> this can be 

Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.

The VM system has two major structures:
 1) the VM map which is per process and manages its address space
by tracking all regions that are allocated and their permissions.
 2) the VM page table which is per VM map and provides the backing
memory pages to the regions and interfaces with the pmap layer
(virtual->physical).

Both of these are very frequently accessed through memory allocations
from userspace and page faults from the pmap layer.  Their efficiency
and SMP scalability is crucial for many operations beginning with fork/
exec, going through malloc/mmap and ending with free/exit of a process.

Both the vmmap and page table make use of splay trees to manage the
entries and to speed up lookups compared to long-to-traverse linked
lists or more memory-expensive hash tables.  Some structures though
do have an additional linked list to simplify ordered traversals.

A splay tree is an interesting binary search tree with insertion,
lookup and removal performed in O(log n) *amortized* time.  With
the *amortized* time being the crucial difference to other binary trees.
On every access *including* lookup it rotates the tree to make the
just found element the new root node.  For all gory details see:
 http://en.wikipedia.org/wiki/Splay_tree

This behavior has a few important implications:
 1) the root node and constant rotation work like a cache, with the
most recently looked-up node at the top and less recently used ones
close to the top;
 2) every lookup that doesn't match the root node, i.e. a cache miss,
causes a rotation of the tree to make the newly found node the new
root;
 3) during the rotate it has to re-arrange and touch possibly a large
number of other nodes;
 4) in worst case the tree may resemble a linked list.  A splay tree
makes no guarantee that it is balanced.

For the kernel and the splay tree usage in the VM map and page table
some more implications come up:
 a) the amortization effect/cost only balance if the majority of
lookups are root node (cache) hits, otherwise the cost of
rebalancing destroys any advantage and quickly turns into a
large overhead.
 b) the overhead becomes even worse due to touching many nodes and
causing a lot of CPU cache line dirtying.  For processes with
shared memory or threads across CPU's this overhead becomes
almost excessive because the CPU's stomp on each other's cached
nodes and get their own caches invalidated.  The penalty is a
high-latency memory refetch not only for the root node lookup
but every hop in the following rebalancing operation => thrashing.
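
To see where implication b) comes from, here is a stripped-down
rotate-to-root lookup (a simplified stand-in for the real zig-zig/zig-zag
splay, and it only splays on hits, but the write pattern is the point):

struct spnode {
    unsigned long key;
    struct spnode *left, *right, *parent;
};

/* Rotate n up one level; note how many pointer *writes* this touches. */
static void
rotate_up(struct spnode **rootp, struct spnode *n)
{
    struct spnode *p = n->parent;

    if (p->left == n) {                 /* right rotation */
        p->left = n->right;
        if (n->right != NULL)
            n->right->parent = p;
        n->right = p;
    } else {                            /* left rotation */
        p->right = n->left;
        if (n->left != NULL)
            n->left->parent = p;
        n->left = p;
    }
    n->parent = p->parent;
    if (p->parent == NULL)
        *rootp = n;
    else if (p->parent->left == p)
        p->parent->left = n;
    else
        p->parent->right = n;
    p->parent = n;
}

/* Even a successful lookup ends with the found node rotated to the root,
 * dirtying a cache line for every node on the search path. */
struct spnode *
splay_lookup(struct spnode **rootp, unsigned long key)
{
    struct spnode *n = *rootp;

    while (n != NULL && n->key != key)
        n = (key < n->key) ? n->left : n->right;
    while (n != NULL && n->parent != NULL)
        rotate_up(rootp, n);
    return (n);
}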

To quantify the positive and negative effects of the splay tree I
instrumented the code and added counters to track the behavior of
the splay tree in the vmmap and page table.

The test case is an AMD64 kernel build to get a first overview.
Other system usages may have different results depending on their
fork and memory usage patterns.

The results are very interesting:

 The page table shows a *very* poor root node hit rate and an excessive
 amount of rotational node modifications representing almost the
 worst case for splay trees.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

 The vmmap shows a high root node hit rate but still a significant
 amount of rotational node modifications.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
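
Spelled out, the counter values encoded in the two chart URLs above, and
the root node (cache) hit rates they imply:

                      page table          vmmap
  insert              16,946,915         724,293
  remove              16,719,791         723,701
  lookup              48,872,230      20,881,583
  cachehit               131,057      19,044,544
  rotates             74,858,589       3,719,582
  rotatemodifies     105,206,121       4,553,350

  root node hit rate      ~0.3%            ~91%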

From these observations I come to the conclusion that a splay tree
is suboptimal for the normal case and quickly horrible even on small
deviations from the normal case in the vmmap.  For the page table it
is always horrible.  Not to mention the locking complications due to
modifying the tree on lookups.

Based on an examination of other binary trees I put forward the
hypothesis that a Red-Black tree is a much better, if not the optimal,
fit for the vmmap and page table.  RB trees are balanced binary trees
and O(log n) in all cases.  The big advantage in this context is that
lookups are pure reads and do not cause CPU cache invalidations on
other CPU's and always only require a read lock without the worst
case properties of the unbalanced splay tree.  The high cache locality
of the vmmap lookups can be used with the RB tree as well by simply
adding a pointer to the most recently found node.  To prevent write
locking this can be done lazily.  More profiling is required to make
a non-speculative statement on this though.  In addition a few of
the additional linked lists that currently accompany the vmmap and
page structures are n