Re: sysctl for querying kmem_map->size

2010-09-30 Thread Garrett Cooper
On Thu, Sep 30, 2010 at 1:26 PM, Andriy Gapon  wrote:
> on 30/09/2010 21:52 Garrett Cooper said the following:
>> On Thu, Sep 30, 2010 at 10:17 AM, Andriy Gapon  wrote:
>>>
>>> Here's a patch that adds a sysctl for querying kmem_map->size,
>>> which may be useful for system state/resources monitoring:
>>> http://people.freebsd.org/~avg/sysctl-kmem_map_size.diff
>>>
>>> I am quite unsure about the sizeof(kmem_map->size) == sizeof(int)
>>> hack, but I couldn't think of another way to decide whether to use
>>> SYSCTL_ADD_UINT or SYSCTL_ADD_ULONG depending on the real type
>>> behind vm_size_t.
>>
>>     Is the base value of the field size_t? If so, then it's ulong on
>> 64-bit archs and uint on 32-bit archs. Maybe it's a good time then to
>
> No, it's vm_size_t, but it's defined similarly to size_t I guess:
> vm_size_t -> __vm_size_t -> {__uint32_t or __uint64_t depending on arch}

Well... that's basically size_t, right? At first inspection it
seems silly to define it as __uint32_t or __uint64_t if it's really
the same thing (in theory) as __size_t.  Of course the vmem folks
(like alc@) might have more context as to why things were done this
way.
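
For reference, the chain above boils down to roughly this (an abridged
sketch; the real definitions live in the per-architecture
machine/_types.h headers rather than behind a single #ifdef):

/* sys/types.h: */
typedef	__vm_size_t	vm_size_t;

/* machine/_types.h, per architecture (collapsed into one #ifdef
 * here purely for illustration): */
#ifdef	__LP64__
typedef	__uint64_t	__vm_size_t;	/* e.g. amd64, sparc64 */
#else
typedef	__uint32_t	__vm_size_t;	/* e.g. i386, arm */
#endif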

>> actually get the sysctl and tunables work that I started on into base.
>> I have a functioning and tested copy of the tunables work, but I'll
>> need to do similar for the sysctls as well (des@ and I kind of got out
>> of sync a few months back).
>
> I believe that this is the first time I have heard about this project.

It wasn't an official project. It was more or less an email thread
that turned into a challenge, which in turn became a project :) :
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/hackers/2010-08/msg00054.html
In any case, I'm not a committer, so unless someone works with me to
review and commit my code, it will exist purely in the ether for all
eternity. That's one of the drawbacks of being in GSoC purgatory.
Thanks!
-Garrett


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:38, Alfred Perlstein wrote:

> Andre,
>
> Your observations on the effectiveness of the splay tree
> mirror the concerns I have with it when I read about it.
>
> I have always wondered though if the splay-tree algorithm
> was modified to only perform rotations when a lookup required
> more than "N" traversals to reach a node.
>
> This would self-balance the tree and maintain cache without
> the expense of writes for nearly all lookups.


This would break the underlying assumption of the splay tree.
A splay tree is not a balanced tree.  With this approach you
can easily get stuck in a sub-optimal situation with many
lookups traversing N-1 nodes and not getting the cache benefit.
When N nodes are traversed for an outlier and the rotation happens,
you either rotate again on the next lookup or get stuck in another
sub-optimal situation.  In effect you get the worst of a splay
tree while not being able to gain from the amortization effect
when the root node hit rate is high.


> I'm wondering if you have the time/interest in rerunning
> your tests, but modifying the algorithm to only rebalance
> the splay if a lookup requires more than let's say 3, 5, 7
> or 10 traversals.


As the assumption of the algorithm is broken I don't think it's
useful to spend time on this.  I'd rather try and add a last-match
pointer to the RB tree to take home the locality effect in vmmap.
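
A rough sketch of that last-match idea, using the sys/tree.h RB
macros (the entry layout and names here are made up for illustration,
not the real vmmap structures):

#include <sys/types.h>
#include <sys/tree.h>

struct map_entry {
	RB_ENTRY(map_entry) link;
	u_long start;			/* entry covers [start, end) */
	u_long end;
};

static int
map_entry_cmp(struct map_entry *a, struct map_entry *b)
{
	if (a->end <= b->start)
		return (-1);
	if (b->end <= a->start)
		return (1);
	return (0);			/* overlapping ranges compare equal */
}

RB_HEAD(map_tree, map_entry);
RB_PROTOTYPE(map_tree, map_entry, link, map_entry_cmp);
RB_GENERATE(map_tree, map_entry, link, map_entry_cmp);

struct map {
	struct map_tree root;
	struct map_entry *last_found;	/* the last-match hint */
};

struct map_entry *
map_lookup(struct map *m, u_long addr)
{
	struct map_entry key = { .start = addr, .end = addr + 1 };
	struct map_entry *e = m->last_found;

	/* A hint hit costs two comparisons and writes nothing. */
	if (e != NULL && e->start <= addr && addr < e->end)
		return (e);
	e = RB_FIND(map_tree, &m->root, &key);
	if (e != NULL)
		m->last_found = e;	/* lazy update; entry removal
					 * must clear the hint */
	return (e);
}

The lookup itself never modifies the tree, so it can run under a read
lock; only the hint pointer is written, which is the "lazily" part.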

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:01, Alan Cox wrote:

> On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann  wrote:
>> On 30.09.2010 18:37, Andre Oppermann wrote:
>>>
>>> Just for the kick of it I decided to take a closer look at the use of
>>> splay trees (inherited from Mach if I read the history correctly) in
>>> the FreeBSD VM system suspecting an interesting journey.
>>
>> Correcting myself regarding the history: The splay tree for vmmap was
>> done about 8 years ago by alc@ to replace a simple linked list and was
>> a huge improvement.  The change in vmpage from a hash to the same splay
>> tree as in vmmap was committed by dillon@ about 7.5 years ago with some
>> involvement of a...@.
>
> Yes, and there is a substantial difference in the degree of locality of
> access to these different structures, and thus the effectiveness of a splay
> tree.  When I did the last round of changes to the locking on the vm map, I
> made some measurements of the splay tree's performance on a JVM running a
> moderately large bioinformatics application.  The upshot was that the
> average number of map entries visited on an access to the vm map's splay
> tree was less than the expected depth of a node in a perfectly balanced
> tree.


This is indeed an expected property of the splay tree.  For the vmmap
the hit rate on the root node, that is the most recently accessed node,
is very high.  Other frequently accessed nodes should sit high up in
the tree too.  Nonetheless the churn rate on the nodes from tree
rotations is considerable.

The question now is whether we can get the same positive effect but with
less negative side effects.  Of particular concern is the CPU cache
dirtying and thrashing by the splay rotations.  Especially with today's
and tomorrow's many-core, many-socket, many-thread machines.
Another concern is worst case behavior when the root node hit rate is
low.  Currently this is the case in vmpage for certain.  It may also
become a concern in vmmap with the use of memguard (fragmenting the
address space) or an adversary trying to exploit worst case behavior
on a shared host by mapping only every other page.

This concern is supported by the measurements of the 2008 SoC student
Mayur (thanks to rdivacky@ for digging up his page):
 http://wiki.freebsd.org/MayurShardul/VM_Algorithm_Improvement

In the measurements of his alternative radix trie implementation the
lookup time is *230 times less* than for the splay tree (as measured
with the TSC).  The description of the performed benchmark is extremely
sparse and it is not clear what access pattern he simulated.  Probably
a random lookup pattern which is of course very bad for the splay tree.
Any balanced tree will be better for random lookups.

Whether his measured improvement is due to using a radix trie or if it
would perform equally to a normal balanced binary tree I can't say (yet).


> I teach class shortly.  I'll provide more details later.


--
Andre


Re: sysctl for querying kmem_map->size

2010-09-30 Thread Andriy Gapon
on 30/09/2010 21:52 Garrett Cooper said the following:
> On Thu, Sep 30, 2010 at 10:17 AM, Andriy Gapon  wrote:
>>
>> Here's a patch that adds a sysctl for querying kmem_map->size,
>> which may be useful for system state/resources monitoring:
>> http://people.freebsd.org/~avg/sysctl-kmem_map_size.diff
>>
>> I am quite unsure about the sizeof(kmem_map->size) == sizeof(int)
>> hack, but I couldn't think of another way to decide whether to use
>> SYSCTL_ADD_UINT or SYSCTL_ADD_ULONG depending on the real type
>> behind vm_size_t.
> 
> Is the base value of the field size_t? If so, then it's ulong on
> 64-bit archs and uint on 32-bit archs. Maybe it's a good time then to

No, it's vm_size_t, but it's defined similarly to size_t I guess:
vm_size_t -> __vm_size_t -> {__uint32_t or __uint64_t depending on arch}

> actually get the sysctl and tunables work that I started on into base.
> I have a functioning and tested copy of the tunables work, but I'll
> need to do similar for the sysctls as well (des@ and I kind of got out
> of sync a few months back).

I believe that this is the first time I have heard about this project.

-- 
Andriy Gapon


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Ivan Voras
On 09/30/10 20:38, Alfred Perlstein wrote:
> Andre,
> 
> Your observations on the effectiveness of the splay tree
> mirror the concerns I have with it when I read about it.
> 
> I have always wondered though if the splay-tree algorithm
> was modified to only perform rotations when a lookup required
> more than "N" traversals to reach a node.
> 
> This would self-balance the tree and maintain cache without 
> the expense of writes for nearly all lookups.
> 
> I'm wondering if you have the time/interest in rerunning
> your tests, but modifying the algorithm to only rebalance
> the splay if a lookup requires more than let's say 3, 5, 7
> or 10 traversals.

I see two possible problems with this:

1: the validity of this heuristic: the splay is not meant to help the
current lookup but future lookups, and if a lookup "now" requires e.g.
a 5-deep traversal, then (barring external information about the
structures - maybe some inner relationship of the nodes can be
exploited) the next lookup is, I think, about equally likely to hit
the rotated node as the former root node.

2: rotating only on the N'th lookup would have to go like this:

   1. take a read-only lock
   2. make the lookup, count the depth
   3. if depth > N:
      1. relock for write (lock upgrade will not always work)
      2. recheck if the tree is still the same; bail if it isn't
      3. do the splay
   4. unlock

i.e. suspiciously complicated (a sketch of the lock dance follows
below). That is, if you want to take advantage of read parallelism;
if the tree is write-locked all the time it's simpler but inefficient.
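
In rough C that would be (tree_lookup_depth() and tree_splay() are
hypothetical helpers; rw_try_upgrade(9) is real but may fail, hence
the generation recheck - here redoing the lookup instead of bailing):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>

struct node;

struct tree {
	struct rwlock lock;
	u_int gen;			/* bumped on every modification */
	struct node *root;
};

struct node *tree_lookup_depth(struct tree *, u_long, int *);
void tree_splay(struct tree *, struct node *);

struct node *
lookup_maybe_splay(struct tree *t, u_long key, int maxdepth)
{
	struct node *n;
	u_int gen;
	int depth;

	rw_rlock(&t->lock);			/* 1. read-only lock */
	n = tree_lookup_depth(t, key, &depth);	/* 2. lookup, count depth */
	gen = t->gen;
	if (n == NULL || depth <= maxdepth) {
		rw_runlock(&t->lock);		/* 4. common, cheap case */
		return (n);
	}
	if (!rw_try_upgrade(&t->lock)) {	/* 3.1 upgrade may fail */
		rw_runlock(&t->lock);
		rw_wlock(&t->lock);
		if (t->gen != gen)		/* 3.2 tree changed under us */
			n = tree_lookup_depth(t, key, &depth);
	}
	if (n != NULL && depth > maxdepth) {
		tree_splay(t, n);		/* 3.3 the write-side splay */
		t->gen++;
	}
	rw_wunlock(&t->lock);			/* 4. */
	return (n);
}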

Of course, real-world measurements trump theory :)



Re: sysctl for querying kmem_map->size

2010-09-30 Thread Garrett Cooper
On Thu, Sep 30, 2010 at 10:17 AM, Andriy Gapon  wrote:
>
> Here's a patch that adds a sysctl for querying kmem_map->size,
> which may be useful for system state/resources monitoring:
> http://people.freebsd.org/~avg/sysctl-kmem_map_size.diff
>
> I am quite unsure about the sizeof(kmem_map->size) == sizeof(int)
> hack, but I couldn't think of another way to decide whether to use
> SYSCTL_ADD_UINT or SYSCTL_ADD_ULONG depending on the real type
> behind vm_size_t.

Is the base value of the field size_t? If so, then it's ulong on
64-bit archs and uint on 32-bit archs. Maybe it's a good time then to
actually get the sysctl and tunables work that I started on into base.
I have a functioning and tested copy of the tunables work, but I'll
need to do similar for the sysctls as well (des@ and I kind of got out
of sync a few months back).
Thanks,
-Garrett


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alfred Perlstein
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly horrible even on small
> deviations from the normal case in the vmmap.  For the page table it
> is always horrible.  Not to mention the locking complications due to
> modifying the tree on lookups.
> 
> Based on an examination of other binary trees I put forward the
> hypothesis that a Red-Black tree is a much better, if not the optimal,
> fit for the vmmap and page table.  RB trees are balanced binary trees
> and O(log n) in all cases.  The big advantage in this context is that
> lookups are pure reads and do not cause CPU cache invalidations on
> other CPU's and always only require a read lock without the worst
> case properties of the unbalanced splay tree.  The high cache locality
> of the vmmap lookups can be used with the RB tree as well by simply
> adding a pointer to the most recently found node.  To prevent write 
> locking this can be done lazily.  More profiling is required to make
> a non-speculative statement on this though.  In addition a few of
> the additional linked lists that currently accompany the vmmap and
> page structures are no longer necessary as they easily can be done
> with standard RB tree accessors.  Our standard userspace jemalloc
> also uses RB trees for its internal housekeeping.  RB tree details:
>  http://en.wikipedia.org/wiki/Red-black_tree
> 
> I say hypothesis because I haven't measured the difference to an
> RB tree implementation yet.  I've hacked up a crude and somewhat
> mechanical patch to convert the vmmap and page VM structures to
> use RB trees, the vmmap part is not stable yet.  The page part
> seems to work fine though.
> 
> This is what I've hacked together so far:
>  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
> 
> The diffs are still in their early stages and do not make use of
> any code simplifications becoming possible by using RB trees instead
> of the current splay trees.
> 
> Comments on the VM issue and splay vs. RB tree hypothesis welcome.
> 
> -- 
> Andre

-- 
- Alfred Perlstein
.- VMOA #5191, 03 vmax, 92 gs500, 85 ch250, 07 zx10
.- FreeBSD committer


Re: 8.1-R - Marvell 88SX6081 SATA controller via mvs = lots of errors

2010-09-30 Thread Alexander Motin
Hi.

Karl Pielorz wrote:
> I just switched my 8.1-R/amd64 (dual Opteron) system from ATA over to
> the new mvs driver, and started seeing a whole bunch of errors (which
> appear to have hosed one of my zfs volumes during a scrub) - anyone know
> what the following errors actually mean?
> 
> The machine has 2 * 88SX6081's in it:
> 
> "
> Sep 28 19:58:49 kernel: mvs0:  port
> 0x3000-0x30ff mem 0xd010-0xd01f,0xd040-0xd07f irq 24 at
> device 4.0 on pci17
> Sep 28 19:58:49 kernel: mvs0: Gen-II, 8 3Gbps ports, Port Multiplier
> ...
> Sep 28 19:58:49 kernel: mvs1:  port
> 0x4000-0x40ff mem 0xd0c0-0xd0cf,0xd080-0xd0bf irq 28 at
> device 4.0 on pci18
> Sep 28 19:58:49 kernel: mvs1: Gen-II, 8 3Gbps ports, Port Multiplier
> supported
> "
> 
> Under 7.2 they ran fine, with the ATA driver. I use ZFS on this machine
> - and both pools were scrubbed before the upgrade (and backed up
> fortunately!).
> 
> 
> With the mvs driver, during a scrub of the main volume, I see:
> 
> "
> Sep 29 08:56:13 kernel: mvsch12: EMPTY CRPB 6 (->14) 1 4000
> Sep 29 08:56:13 kernel: mvsch12: EMPTY CRPB 7 (->14) 0 4000
> Sep 29 08:56:13 kernel: mvsch12: EMPTY CRPB 8 (->14) 2 4000
> "
> 
> [repeated a lot - interspersed with zfs reporting problems with files,
> on all the devices in the pool]
> 
> I then also get a whole bunch of:
> 
> "
> Sep 29 08:56:56 kernel: mvsch0: Timeout on slot 1
> Sep 29 08:56:56 kernel: mvsch0: iec 0200 sstat 0123 serr
>  edma_s 1020 dma_c  dma_s  rs 0006 statu
> s 40
> Sep 29 08:56:56 kernel: mvsch0:  ... waiting for slots 0004
> Sep 29 08:56:56 kernel: mvsch12: Timeout on slot 5
> Sep 29 08:56:56 kernel: mvsch12: iec 0200 sstat 0123 serr
>  edma_s 1121 dma_c  dma_s  rs 0028 stat
> us 40
> "

"EMPTY CRPB" error means that controller reported completion for command
slot that driver counted as empty at the moment. Can't say if it is
hardware or driver issue. Timeouts could be related but I am not sure
what is the reason and what is consequence here. It could help if you
send me full log of those messages to create full picture.

-- 
Alexander Motin


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Ivan Voras
On 09/30/10 20:01, Alan Cox wrote:
> On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann  wrote:
> 
>> On 30.09.2010 18:37, Andre Oppermann wrote:
>>
>>> Just for the kick of it I decided to take a closer look at the use of
>>> splay trees (inherited from Mach if I read the history correctly) in
>>> the FreeBSD VM system suspecting an interesting journey.
>>>
>>
>> Correcting myself regarding the history: The splay tree for vmmap was
>> done about 8 years ago by alc@ to replace a simple linked list and was
>> a huge improvement.  The change in vmpage from a hash to the same splay
>> tree as in vmmap was committed by dillon@ about 7.5 years ago with some
>> involvement of a...@.

> Yes, and there is a substantial difference in the degree of locality of
> access to these different structures, and thus the effectiveness of a splay
> tree.  When I did the last round of changes to the locking on the vm map, I
> made some measurements of the splay tree's performance on a JVM running a
> moderately large bioinformatics application.  The upshot was that the
> average number of map entries visited on an access to the vm map's splay
> tree was less than the expected depth of a node in a perfectly balanced
> tree.

Sorry, I'm not sure how to parse that - are you saying that the
splaying helped, making the number of nodes visited during a tree
search smaller than the average node depth in a balanced tree?

Even if so, wouldn't the excessive bandwidth lost in the splaying ops
(and worse - write bandwidth) make it unsuitable today?




Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alan Cox
On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann  wrote:

> On 30.09.2010 18:37, Andre Oppermann wrote:
>
>> Just for the kick of it I decided to take a closer look at the use of
>> splay trees (inherited from Mach if I read the history correctly) in
>> the FreeBSD VM system suspecting an interesting journey.
>>
>
> Correcting myself regarding the history: The splay tree for vmmap was
> done about 8 years ago by alc@ to replace a simple linked list and was
> a huge improvement.  The change in vmpage from a hash to the same splay
> tree as in vmmap was committed by dillon@ about 7.5 years ago with some
> involvement of a...@.
>

Yes, and there is a substantial difference in the degree of locality of
access to these different structures, and thus the effectiveness of a splay
tree.  When I did the last round of changes to the locking on the vm map, I
made some measurements of the splay tree's performance on a JVM running a
moderately large bioinformatics application.  The upshot was that the
average number of map entries visited on an access to the vm map's splay
tree was less than the expected depth of a node in a perfectly balanced
tree.

I teach class shortly.  I'll provide more details later.

Regards,
Alan


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann
>> More profiling is required to make
>> a non-speculative statement on this though.  In addition a few of
>> the additional linked lists that currently accompany the vmmap and
>> page structures are no longer necessary as they easily can be done
>> with standard RB tree accessors.  Our standard userspace jemalloc
>> also uses RB trees for its internal housekeeping.  RB tree details:
>>   http://en.wikipedia.org/wiki/Red-black_tree
>>
>> I say hypothesis because I haven't measured the difference to an
>> RB tree implementation yet.  I've hacked up a crude and somewhat
>> mechanical patch to convert the vmmap and page VM structures to
>> use RB trees, the vmmap part is not stable yet.  The page part
>> seems to work fine though.
>>
>> This is what I've hacked together so far:
>>   http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>>   http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
>>
>> The diffs are still in their early stages and do not make use of
>> any code simplifications becoming possible by using RB trees instead
>> of the current splay trees.
>>
>> Comments on the VM issue and splay vs. RB tree hypothesis welcome.


> Do you have actual performance numbers from a real workload?


No, not yet.  I obviously would have posted those numbers as well.
What do you consider a representative real workload?


> I implemented an AA tree locally (basically a 2,3 tree that uses
> coloring to represent the 3 nodes) and the performance was much worse
> than the splay tree.  I assume this is due to the overhead of keeping
> the tree balanced; I didn't instrument either the splay or the AA
> code.  I suspect that the overhead of an RB tree would similarly wash
> out the benefits of O(log_2 n) time, but this is theory.  The facts
> would be measuring several different workloads and looking at the
> system/real times for them between the two solutions.


I think there are two pronounced effects to consider:
 a) the algorithmic complexity
 b) the real world implications of said complexity and operations
on today's CPU's multi-hierarchy caches and associated SMP side
effects.


> We've talked internally at $work about using a B+-tree with maybe
> branching factor 5-7; whatever makes sense for the size of a cache
> line.  This seems likely to be better for performance than an RB-tree
> but does require a lot more changes, since separate memory is needed
> for the tree's nodes outside the vm_page structure.  There just hasn't
> been time to implement it and try it out.


> Unfortunately I won't have the time to experiment at $work for a few
> months on this problem.


--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Roman Divacky
> high root node hit rate but still a significant
>  amount of rotational node modifications.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly horrible even on small
> deviations from the normal case in the vmmap.  For the page table it
> is always horrible.  Not to mention the locking complications due to
> modifying the tree on lookups.
> 
> Based on an examination of other binary trees I put forward the
> hypothesis that a Red-Black tree is a much better, if not the optimal,
> fit for the vmmap and page table.  RB trees are balanced binary trees
> and O(log n) in all cases.  The big advantage in this context is that
> lookups are pure reads and do not cause CPU cache invalidations on
> other CPU's and always only require a read lock without the worst
> case properties of the unbalanced splay tree.  The high cache locality
> of the vmmap lookups can be used with the RB tree as well by simply
> adding a pointer to the most recently found node.  To prevent write 
> locking this can be done lazily.  More profiling is required to make
> a non-speculative statement on this though.  In addition a few of
> the additional linked lists that currently accompany the vmmap and
> page structures are no longer necessary as they easily can be done
> with standard RB tree accessors.  Our standard userspace jemalloc
> also uses RB trees for its internal housekeeping.  RB tree details:
>  http://en.wikipedia.org/wiki/Red-black_tree
> 
> I say hypothesis because I haven't measured the difference to an
> RB tree implementation yet.  I've hacked up a crude and somewhat
> mechanical patch to convert the vmmap and page VM structures to
> use RB trees, the vmmap part is not stable yet.  The page part
> seems to work fine though.
> 
> This is what I've hacked together so far:
>  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
> 
> The diffs are still in their early stages and do not make use of
> any code simplifications becoming possible by using RB trees instead
> of the current splay trees.
> 
> Comments on the VM issue and splay vs. RB tree hypothesis welcome.
> 
> -- 
> Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 18:37, Andre Oppermann wrote:

> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach if I read the history correctly) in
> the FreeBSD VM system suspecting an interesting journey.


Correcting myself regarding the history: The splay tree for vmmap was
done about 8 years ago by alc@ to replace a simple linked list and was
a huge improvement.  The change in vmpage from a hash to the same splay
tree as in vmmap was committed by dillon@ about 7.5 years ago with some
involvement of a...@.

--
Andre


sysctl for querying kmem_map->size

2010-09-30 Thread Andriy Gapon

Here's a patch that adds a sysctl for querying kmem_map->size, which
may be useful for system state/resources monitoring:
http://people.freebsd.org/~avg/sysctl-kmem_map_size.diff

I am quite unsure about the sizeof(kmem_map->size) == sizeof(int)
hack, but I couldn't think of another way to decide whether to use
SYSCTL_ADD_UINT or SYSCTL_ADD_ULONG depending on the real type behind
vm_size_t.
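
In sketch form the runtime dispatch looks like this (the oid name and
location here are placeholders of mine; the real change is in the diff
above):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>
#include <vm/vm.h>
#include <vm/vm_kern.h>		/* kmem_map */
#include <vm/vm_map.h>

static void
kmem_map_size_sysctl_init(void *arg __unused)
{

	/* vm_size_t is 32 or 64 bits wide depending on the arch, so
	 * pick the matching sysctl type at runtime; the cast is safe
	 * only because of the sizeof check. */
	if (sizeof(kmem_map->size) == sizeof(u_int))
		SYSCTL_ADD_UINT(NULL, SYSCTL_STATIC_CHILDREN(_vm),
		    OID_AUTO, "kmem_map_size", CTLFLAG_RD,
		    (u_int *)&kmem_map->size, 0, "Current kmem_map size");
	else
		SYSCTL_ADD_ULONG(NULL, SYSCTL_STATIC_CHILDREN(_vm),
		    OID_AUTO, "kmem_map_size", CTLFLAG_RD,
		    (u_long *)&kmem_map->size, "Current kmem_map size");
}
SYSINIT(kmem_map_size_sysctl, SI_SUB_KMEM, SI_ORDER_ANY,
    kmem_map_size_sysctl_init, NULL);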

Comments, suggestions?

-- 
Andriy Gapon


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Matthew Fleming
> lookups are pure reads and do not cause CPU cache invalidations on
> other CPU's and always only require a read lock without the worst
> case properties of the unbalanced splay tree.  The high cache locality
> of the vmmap lookups can be used with the RB tree as well by simply
> adding a pointer to the most recently found node.  To prevent write locking
> this can be done lazily.  More profiling is required to make
> a non-speculative statement on this though.  In addition a few of
> the additional linked lists that currently accompany the vmmap and
> page structures are no longer necessary as they easily can be done
> with standard RB tree accessors.  Our standard userspace jemalloc
> also uses RB trees for its internal housekeeping.  RB tree details:
>  http://en.wikipedia.org/wiki/Red-black_tree
>
> I say hypothesis because I haven't measured the difference to an
> RB tree implementation yet.  I've hacked up a crude and somewhat
> mechanical patch to convert the vmmap and page VM structures to
> use RB trees, the vmmap part is not stable yet.  The page part
> seems to work fine though.
>
> This is what I've hacked together so far:
>  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
>
> The diffs are still in their early stages and do not make use of
> any code simplifications becoming possible by using RB trees instead
> of the current splay trees.
>
> Comments on the VM issue and splay vs. RB tree hypothesis welcome.

Do you have actual performance numbers from a real workload?

I implemented an AA tree locally (basically a 2,3 tree that uses
coloring to represent the 3 nodes) and the performance was much worse
than the splay tree.  I assume this is due to the overhead of keeping
the tree balanced; I didn't instrument either the splay or the AA
code.  I suspect that the overhead of an RB tree would similarly wash
out the benefits of O(log_2 n) time, but this is theory.  The facts
would be measuring several different workloads and looking at the
system/real times for them between the two solutions.

We've talked internally at $work about using a B+-tree with maybe
branching factor 5-7; whatever makes sense for the size of a cache
line.  This seems likely to be better for performance than an RB-tree
but does require a lot more changes, since separate memory is needed
for the tree's nodes outside the vm_page structure.  There just hasn't
been time to implement it and try it out.
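
For the arithmetic behind that branching factor, assuming 64-byte
cache lines and 64-bit keys and pointers (an illustrative layout of
mine, not code from any actual tree):

#include <stdint.h>	/* sys/stdint.h in the kernel */

/*
 * An interior node with k keys and k+1 children needs
 * 8*k + 8*(k+1) bytes plus a header: with the 8-byte count below,
 * k = 7 keys and 8 children come to exactly two cache lines.
 */
struct btnode {
	uint64_t nkeys;			/*   8 bytes of header */
	uint64_t key[7];		/*  56 bytes */
	struct btnode *child[8];	/*  64 bytes; a leaf would hold
					 *  vm_page pointers instead */
};					/* 128 bytes = two 64-byte lines */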

Unfortunately I won't have the time to experiment at $work for a few
months on this problem.

Thanks,
matthew


Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann
the additional linked lists that currently accompany the vmmap and
page structures are no longer necessary as they easily can be done
with standard RB tree accessors.  Our standard userspace jemalloc
also uses RB trees for its internal housekeeping.  RB tree details:
 http://en.wikipedia.org/wiki/Red-black_tree

I say hypothesis because I haven't measured the difference to an
RB tree implementation yet.  I've hacked up a crude and somewhat
mechanical patch to convert the vmmap and page VM structures to
use RB trees, the vmmap part is not stable yet.  The page part
seems to work fine though.

This is what I've hacked together so far:
 http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
 http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff

The diffs are still in their early stages and do not make use of
any code simplifications becoming possible by using RB trees instead
of the current splay trees.

Comments on the VM issue and splay vs. RB tree hypothesis welcome.

--
Andre


Re: page table fault, which should map kernel virtual address space

2010-09-30 Thread Svatopluk Kraus
On Tue, Sep 21, 2010 at 7:38 PM, Alan Cox  wrote:
> On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus  wrote:
>> Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
>> pager_map,...) exist as a result of the 'kmem_suballoc' function call.
>> When these submaps are used (for example by the 'kmem_alloc_nofault'
>> function) and their virtual address subspace is at the end of the
>> used kernel virtual address space at the moment (and above the 'NKPT'
>> preallocation), then missing page tables are not allocated
>> and a double fault can happen.
>>
>
> No, the page tables are allocated.  If you create a submap X of the kernel
> map using kmem_suballoc(), then a vm_map_findspace() is performed by
> vm_map_find() on the kernel map to find space for the submap X.  As you note
> above, the call to vm_map_findspace() on the kernel map will call
> pmap_growkernel() if needed to extend the kernel page table.
>
> If you create another submap X' of X, then that submap X' can only map
> addresses that fall within the range for X.  So, any necessary page table
> pages were allocated when X was created.

You are right. Mea culpa. I was focused on a solution and came to
a conclusion too quickly. The page table fault hit in 'pager_map',
which is a submap of 'clean_map', and when I debugged the problem
I didn't see the submap machinery as a whole.

> That said, there may actually be a problem with the implementation of the
> superpage_align parameter to kmem_suballoc().  If a submap is created with
> superpage_align equal to TRUE, but the submap's size is not a multiple of
> the superpage size, then vm_map_find() may not allocate a page table page
> for the last megabyte or so of the submap.
>
> There are only a few places where kmem_suballoc() is called with
> superpage_align set to TRUE.  If you changed them to FALSE, that is an easy
> way to test this hypothesis.

Yes, it helps.

My story is that the problem started when I updated a project (the
'coldfire' port) based on FreeBSD 8.0 to the current FreeBSD version.
In the current version the 'clean_map' submap is created with
superpage_align set to TRUE.

I have looked at vm_map_find() and debugged the page table fault once
again. IMO, it looks like the do-while loop in that function does not
work as intended. vm_map_findspace() finds a space and calls
pmap_growkernel() if needed. pmap_align_superpage() adjusts the space
but never calls pmap_growkernel(). vm_map_insert() then inserts the
aligned space into the map without error, never calls
pmap_growkernel(), and does not trigger another loop iteration.
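
In outline the loop looks like this (a simplified paraphrase of
sys/vm/vm_map.c as I read it, not a verbatim copy):

int
vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
    vm_offset_t *addr, vm_size_t length, int find_space, vm_prot_t prot,
    vm_prot_t max, int cow)
{
	vm_offset_t start;
	int result;

	start = *addr;
	vm_map_lock(map);
	do {
		/* Finds space and calls pmap_growkernel() if needed,
		 * but only up to the address it returns. */
		if (vm_map_findspace(map, start, length, addr)) {
			vm_map_unlock(map);
			return (KERN_NO_SPACE);
		}
		if (find_space == VMFS_ALIGNED_SPACE)
			/* May push *addr past the grown page tables;
			 * never calls pmap_growkernel() itself. */
			pmap_align_superpage(object, offset, addr, length);
		start = *addr;
		/* Succeeds even beyond kernel_vm_end, so the loop
		 * never retries and the tail of the range is left
		 * without a page table page. */
		result = vm_map_insert(map, object, offset, start,
		    start + length, prot, max, cow);
	} while (result == KERN_NO_SPACE && find_space == VMFS_ALIGNED_SPACE);
	vm_map_unlock(map);
	return (result);
}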

I don't know too much about how the virtual memory model is implemented
and used in other modules. But it seems that it could be more reasonable
to align the address space inside vm_map_findspace() than to loop
externally.

I have tried to add a check in vm_map_insert() that tests the 'end'
parameter against the 'kernel_vm_end' variable and returns a
KERN_NO_SPACE error if needed. In this case the loop in vm_map_find()
works and I have no problem with the page table fault. But the
'kernel_vm_end' variable must be initialized properly before the first
use of vm_map_insert(). The 'kernel_vm_end' variable could
self-initialize in pmap_growkernel() in FreeBSD 8.0 (which is too
late), but that was changed in the current version ('i386' port).
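
The check amounts to a couple of lines in vm_map_insert() (a sketch of
the workaround, not a committed fix; it also assumes 'kernel_vm_end'
is valid before the first insert):

	/* In vm_map_insert(), after the usual start/end checks: */
	if (map == kernel_map && end > kernel_vm_end)
		return (KERN_NO_SPACE);	/* lets vm_map_find() retry */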

Thanks for your answer, but I'm still looking for a permanent
and approved solution.

 Regards, Svata


Re: Adding a V=R mapping for amd64?

2010-09-30 Thread Kostik Belousov
On Wed, Sep 29, 2010 at 01:22:56PM -0700, Matthew Fleming wrote:
> On Wed, Sep 29, 2010 at 1:12 PM, Kostik Belousov  wrote:
> > On Wed, Sep 29, 2010 at 12:40:57PM -0700, Matthew Fleming wrote:
> >> I'm hacking around with making a "fast reboot" that puts a copy of the
> >> MBR from disk into address 0x7c00 and, after disabling various
> >> translation bits and stopping other CPUs, branches to it, to skip the
> >> hardware self test that normally happens on boot.
> >>
> >> I haven't gotten to the point of attempting to run the code at 0x7c00
> >> because I'm first hitting a different error.  Despite my attempts to
> >> enter a translation into the hardware page table, I get a panic trying
> >> to write to address 0x7000, where I intended to put the trampoline
> >> code that turns off translation.
> >>
> >> Rebooting...
> >> Attempt to reset to MBR...
> >> XXX attempting pmap_kenter()...
> >> XXX copying bootstrap code...
> >> panic @ time 1285760103.939, thread 0xff000775d960: Fatal trap 12:
> >> page fault while in kernel mode
> >>
> >> cpuid = 0
> >> Panic occurred in module kernel loaded at 0x8010:
> >>
> >> Stack: --
> >> kernel:trap_fatal+0xac
> >> kernel:trap_pfault+0x24c
> >> kernel:trap+0x42e
> >> kernel:bcopy+0x16
> >> kernel:shutdown_reset+0x48
> >> kernel:boot+0x317
> >> kernel:reboot+0x60
> >> kernel:ia32_syscall+0x1cd
> >> --
> >> cpuid = 0; apic id = 00
> >> fault virtual address   = 0x7000
> >> fault code              = supervisor write data, page not present
> >> stack pointer           = 0x10:0xff8059e07670
> >> frame pointer           = 0x10:0xff8059e07780
> >>
> >> Here's what I think is the relevant snippets of code.  Note that I
> >> reserved the vm_page_t for physical page 7 as mbr_page early in boot,
> >> so I know the memory is free.
> >>
> >> void
> >> pmap_kenter_VR(vm_paddr_t pa)
> >> {
> >>       pmap_t pmap = kernel_pmap;
> >>       vm_page_t mpte;
> >>       pd_entry_t *pde;
> >>       pt_entry_t *pte;
> >>
> >>       vm_page_lock_queues();
> >>       PMAP_LOCK(pmap);
> >>       mpte = pmap_allocpte(pmap, pa, M_WAITOK);
> >>
> >>       pde = pmap_pde(pmap, pa);
> >>       if (pde == NULL || (*pde & PG_V) == 0)
> >>               panic("%s: invalid page directory va=%#lx", __func__, pa);
> >>       if ((*pde & PG_PS) != 0)
> >>               panic("%s: attempted pmap_enter on 2MB page", __func__);
> >>       pte = pmap_pde_to_pte(pde, pa);
> >>       if (pte == NULL)
> >>               panic("%s: no pte va=%#lx", __func__, pa);
> >>
> >>       if (*pte != 0) {
> >>               /* Remove extra pte reference. */
> >>               mpte->wire_count--;
> >>       }
> >>         pte_store(pte, pa | PG_RW | PG_V | PG_G | pg_nx);
> >>
> >>       vm_page_unlock_queues();
> >>       PMAP_UNLOCK(pmap);
> >> }
> >>
> >> Then in cpu_reset():
> >>
> >>       /*
> >>        * Establish a V=R mapping for the MBR page, and copy a
> >>        * reasonable guess at the size of the bootstrap code into the
> >>        * beginning of the page.
> >>        */
> >>       printf("XXX attempting pmap_kenter()...\n");
> >>       pmap_kenter_VR(trunc_page(mbaddr));
> >>       printf("XXX copying bootstrap code...\n");
> >>       to_copy = (uintptr_t)xxx_reset_end - (uintptr_t)xxx_reset_real;
> >>       if (to_copy > mbaddr - trunc_page(mbaddr))
> >>               to_copy = mbaddr - trunc_page(mbaddr);
> >>       bcopy(xxx_reset_real, (void *)trunc_page(mbaddr), to_copy);  /* die 
> >> here */
> >>       printf("XXX attempting to turn off xlation and re-run MBR...\n");
> >>       xxx_reset_real(mbaddr);
> >>
> >>
> >> My first attempt was a call to
> >>       pmap_kenter(trunc_page(0x7c00), trunc_page(0x7c00));
> >> which failed trying to dereference the non-existent PDE.
> >>
> >> My second attempt called
> >>       pmap_enter(kernel_pmap, trunc_page(0x7c00), VM_PROT_WRITE, mbr_page,
> >>           VM_PROT_ALL, 0);
> >> That failed with the same crash as the attempt using pmap_kenter_VR().
> >>
> >> So... any thoughts as to why, after an apparently successful
> >> installation of an xlation, I still get a panic as though there were
> >> no xlation?
> >
> > Weird formatting of backtrace. Is this some proprietary code ?
> 
> Yeah, there's some Isilon local modifications.
> 
> > Why do you try to create 1-1 mapping at all ? The MBR code should be
> > executing in real mode anyway, and the mapping address at the moment of
> > bcopy does not matter at all. I think that the use of PHYS_TO_DMAP()
> > should give you direct mapping.
> 
> I assumed I need trampoline code.  The moment I turn off the xlation
> bits, if the instruction pointer I'm running from is the normal kernel
> addresses, won't I die horribly trying to access 0x8000 or
> so in real mode?
Then, there is more than identity mapping needed. You need to gradually
turn off protected mode, in particular, load segment registers
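
For reference, the PHYS_TO_DMAP() route suggested above would reduce
the copy step to something like this (a sketch against the cpu_reset()
snippet earlier; on amd64 the direct map already covers low physical
memory, so no new PTE is needed just to write the trampoline):

	/* amd64: write through the direct map instead of a V=R PTE;
	 * xxx_reset_real, mbaddr and to_copy as in the snippet above. */
	bcopy(xxx_reset_real, (void *)PHYS_TO_DMAP(trunc_page(mbaddr)),
	    to_copy);

Executing the trampoline still requires an identity-mapped address
while translation is being switched off, plus the real-mode segment
setup described above.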