Re: Why is the fib benchmark still slow - part 1

2004-11-06 Thread Leopold Toetsch
Dan Sugalski [EMAIL PROTECTED] wrote:

 The current calling scheme stands.

[ ... ]

 No more performance changes. Period. We get parrot fully functional first.

Ok. As long as PIR code is concerned call internals are nicely hidden by
the function call and return syntax. But PASM could be a problem if
there is a growing amount of code around.
We have to change calling conventions slightly for more speed
eventually.

leo


Re: Why is the fib benchmark still slow - part 1

2004-11-06 Thread Dan Sugalski
At 9:38 AM +0100 11/6/04, Leopold Toetsch wrote:
Dan Sugalski [EMAIL PROTECTED] wrote:
 The current calling scheme stands.
[ ... ]
 No more performance changes. Period. We get parrot fully functional first.
Ok. As long as PIR code is concerned call internals are nicely hidden by
the function call and return syntax. But PASM could be a problem if
there is a growing amount of code around.
We have to change calling conventions slightly for more speed
eventually.
We can lift the requirement that registers be saved for calls. Beyond 
that, the calling conventions are fixed.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Why is the fib benchmark still slow - part 1

2004-11-05 Thread Leopold Toetsch
Below (inline/attached) is a longish analysis of fib benchmark timings.
leo
Why is the fib benchmark still slow - part 1

Python runs the fib test roughly twice as fast as Parrot. I don't
like that ;) So what's the problem?

1) CPU cache issues

First, if you like to investigate the issue yourself (and run
i386/linux or similar, and you don't have it) get valgrind. Great
tool. It contains a plugin called cachegrind which shows detailed
information about executed inctructions including cache misses.

For parrot -j -Oc examples/benchmarks/fib.imc[1] (and perl, python ...)
we get:

 IRefs DRrefsTime   L2 Misses
Perl 5.8.0   2.0e9  1.2e92.5s   7.600
Perl 5.8.0-thr   2.2e9  1.4e93.0s   8.500
Python 2.3.3 1.2e9  0.7e91.4s  54.000
Parrot IMS   1.1e9  0.7e93.3s   7.000.000
Parrot  MS   1.7e9  0.7e92.6s   4.800.000
Lua 5.0 [2]  0.7e9  0.4e90.85 !!!   4.000

IRefs ... instructions executed
DRefs ... data read + write
IMS   ... Parrot with incremental MS (settings.h), -O3 compiled
MS... Parrot CVS with stop-the-world MS, -O3

DRefs are boring - except that Perl5 takes 50% more. IRefs are quite
interesting, IMS has fewest, even less then Python. But benchmark
timings are differing totally. IMS is slower then MS and both are much
slower then Python.

The reason is in the last columns. The valgrind manual states (and
above numbers second it) that:
*one Level 2 cache miss takes roughly the time of 200 instructions*.

These 7 Mio cache misses account for ~1.4e9 IRefs, which more then
doubles the exucution time. Timings might be better on a fat Xeon CPU
but that doesn't really matter (above is w. AMD 800, 256 KB L2 cache)

The cache misses are basically in two places

a) accessing a new register frame and context
b) during DOD/GC

We have to address both areas to get rid of the majority of cache
misses.

ad a)

For each level of recursion, we are allocating a new context structure
and a new register frame. Half of these is coming from the recently
implemented return continuation and register frame chaches. The other
half has to be freshly allocated. We get exactly for every second
function call L2 cache misses for both the context and the register
structure.

We can't do much against the cache misses in the context, just make it
as small as possible: e.g by moving the now unused old register stack
pointers (4 words) out of the context or toss these 4 pointers
entirely.

But we can address the problem with the register frame, by --well
making it smaller. Python is running the code on the stack. It's
touching only SP(0) and SP(1) mostly.

Register usage analysis of fib shows that it could run on Parrot with
just *one* persistent register per kind.

More about that is coming and was already said: sliding register
window.

ad b)

The (currently broken) Parrot setting ARENA_DOD_FLAGS shows one
possibility to reduce cache misses in DOD. During a sweep (which runs
through all allocated object memory) the memory itself isn't touched,
just a nibble per object is used, which holds the relevant information
like is_live.

A second way to address DOD issues it to make the PMCs variable sized.
I've proposed that already not too long ago.

Third and independent of these is not to touch most of the memory at
all by implementing a generatioal garbage collector. After an object
has survived a few GC cycles, its moved into an older generation,
which isn't subject to a GC sweep.

Both the make PMCs variable sized and the generatioal GC need very
likely another indirection to access a PMC. But by avoiding just one
cache miss, you can do 200 of such indirect accesses.

Thanks for reading til here  comments welcome,
leo

[1] fib.imc is using integers, which is already an optimization. But
that's not the point here.
[2] valgrind --skin=cachegrind /opt/src/lua-5.0/bin/lua
 /opt/src/lua-5.0/test/fib.lua  28
and that's even comparing a memoized fib with plain too.


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Leopold Toetsch
Miroslav Silovic wrote:
[EMAIL PROTECTED] wrote:
a) accessing a new register frame and context
b) during DOD/GC
Or it would make sense to use multi-frame register chunks. I kept 
locality of access in mind but somehow never spelled it out. But I 
*think* I mentioned 64kb as a good chunk size precisely because it fits 
well into the CPU cache - without ever specifying this as the reason.
Yep that's the idea, as originally proposed The one frame per chunk was 
the intermediate step. And I'm thinking of 64 Kb too. The Parrot 
register structure is 640 bytes on a 32bit system w. 8byte double. With 
that size we have worst-case 1% overhead for allocating a new chunk.
Or, 100 levels nesting are w/o any allocation.

Anyway, if you can pop both register frames -and- context structures, 
you won't run GC too often, and everything will nicely fit into the 
cache.  
I thought about that too, but I'd rather have registers adjacent, so 
that the caller can place function arguments directly into the callers 
in arguments.

OTOH it doesn't really matter, if the context structure is in the frame 
too. We'd just need to skip that gap. REG_INT(64) or I64 is as valid as 
I0 or I4, as long as it's assured, that it's exactly addressing the 
incoming argument area of the called function.

... Is the context structure a PMC now (and does it have to be, if 
the code doesn't specifically request access to it?)
The context structure is inside the interpreter structure. When doing a 
function call, its a malloced copy of the caller's state, hanging off 
the continuation PMC.

ad b)

Is there a way to find out how many misses came out from DoD, compared 
to register frames allocation?
Sure. Cachegrind is showing you the line in C source code ;) With 
incremental MS we have (top n, rounded):

L2 write misses (all in context and registers)
500.000 Parrot_Sub_invoke  touch interpreter-ctx.bp
500.000   -- touch registers e.g. REG_PMC(0) = foo
200.000 copy_registers   --
700.000  ???:???   very likely JIT code writing regs first
600.000 malloc.c:chunk_alloc
L2 read misses (DOD)
1.300.000 Parrot_dod_sweep
1.300.000 contained_in_pool
  600.000 get_free_object free_list handling
plus 500.000 more in __libc_free.
I believe that you shouldn't litter (i.e. create an immediately GCable 
object) on each function call - at least not without generational 
collector specifically optimised to work with this.
The problem isn't the object creation per se, but the sweep through the 
*whole object memory* to detect dead objects. It's of course true, that 
we don't need the return continuation PMC for the fib benchmark. But a 
HLL translated fib would use Integer PMCs for calculations.

...  This would entail 
the first generation that fits into the CPU cache and copying out live 
objects from it. And this means copying GC for Parrot, something that 
(IMHO) would be highly nontrivial to retrofit.
A copying GC isn't really hard to implement. And it has the additional 
benefit of providing better cache locality. Nontrivial to retrofit or 
not, we need a generational GC.

   Miro
leo


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Miroslav Silovic
[EMAIL PROTECTED] wrote:
a) accessing a new register frame and context
b) during DOD/GC
We have to address both areas to get rid of the majority of cache
misses.
ad a)
For each level of recursion, we are allocating a new context structure
and a new register frame. Half of these is coming from the recently
implemented return continuation and register frame chaches. The other
half has to be freshly allocated. We get exactly for every second
function call L2 cache misses for both the context and the register
structure.
 

Or it would make sense to use multi-frame register chunks. I kept 
locality of access in mind but somehow never spelled it out. But I 
*think* I mentioned 64kb as a good chunk size precisely because it fits 
well into the CPU cache - without ever specifying this as the reason.

Anyway, if you can pop both register frames -and- context structures, 
you won't run GC too often, and everything will nicely fit into the 
cache.  Is the context structure a PMC now (and does it have to be, if 
the code doesn't specifically request access to it?)

ad b)
The (currently broken) Parrot setting ARENA_DOD_FLAGS shows one
possibility to reduce cache misses in DOD. During a sweep (which runs
through all allocated object memory) the memory itself isn't touched,
just a nibble per object is used, which holds the relevant information
like is_live.
 

Is there a way to find out how many misses came out from DoD, compared 
to register frames allocation?

I believe that you shouldn't litter (i.e. create an immediately GCable 
object) on each function call - at least not without generational 
collector specifically optimised to work with this. This would entail 
the first generation that fits into the CPU cache and copying out live 
objects from it. And this means copying GC for Parrot, something that 
(IMHO) would be highly nontrivial to retrofit.

   Miro


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Miroslav Silovic
Leopold Toetsch wrote:
I believe that you shouldn't litter (i.e. create an immediately 
GCable object) on each function call - at least not without 
generational collector specifically optimised to work with this.

The problem isn't the object creation per se, but the sweep through 
the *whole object memory* to detect dead objects. It's of course true, 
that we don't need the return continuation PMC for the fib benchmark.
Well, creation is also the problem if you crawl the entire free heap
before triggering the next GC round. You get a potential cache miss on
each creation and on each mark and on each destruction. To keep GC out
of the way, the entire arena has to be confined to cache size or less.
But a HLL translated fib would use Integer PMCs for calculations.
Hmm, I'm nitpicking here, but it's not how e.g. Psyco works. It
specialises each function to specific argument types and recompiles for
each new argument type set. Assuming that you'll call only very few
functions with more than 1-2 type combinations, this is a good tradeoff.
It also removes a lot of consing, especially for arithmetics.

...  This would entail the first generation that fits into the CPU 
cache and copying out live objects from it. And this means copying GC 
for Parrot, something that (IMHO) would be highly nontrivial to 
retrofit.

A copying GC isn't really hard to implement. And it has the additional 
benefit of providing better cache locality. Nontrivial to retrofit or 
not, we need a generational GC.
Copying and generational are orthogonal concepts. It's quite possible to
have noncopying gengc (and nongenerational copying GC, but that's beside
the point). This gives you quick mark phases but without so much gain
with locality (I think Boehm GC can do this).
The problem with copying GC is that pointers can change under your feet
at every opportunity. Embedded C libraries that try to manipulate GCed
objects really hate when that happens - in particular where some
pointers get cached in the CPU registers - and telling GC to (un)protect
a pointer is a chore on C programmers (as bad as manual refcounting). I
suppose that there are good solutions to this, I'm just not aware of any.
The gain is that you can guarantee that the youngest generation will be
no bigger than X kb. This can be very good thing.
However, for the problem at hand - namely, littering during function
calls, custom deallocator (that'd be chunks) could be enough. In
particular, it makes sense to use it in conjunction with a non-copying
gengc.
   Miro



Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Leopold Toetsch
Miroslav Silovic [EMAIL PROTECTED] wrote:

 The problem with copying GC is that pointers can change under your feet
 at every opportunity.

Basically yes. We have that problem currently with the copying
collection of strings. Normally this is solved by keeping the old object
in place, so that pointer stores into the object don't fail. And a write
barrier redirects the desired action to the new object location with a
forward pointer.

 However, for the problem at hand - namely, littering during function
 calls, custom deallocator (that'd be chunks) could be enough.

Yes. As said, there are several issues, and we've to address one by one.

 Miro

leo


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Piers Cawley
Miroslav Silovic [EMAIL PROTECTED] writes:

 Leopold Toetsch wrote:

 I believe that you shouldn't litter (i.e. create an immediately
 GCable object) on each function call - at least not without
 generational collector specifically optimised to work with this.


 The problem isn't the object creation per se, but the sweep through
 the *whole object memory* to detect dead objects. It's of course true,
 that we don't need the return continuation PMC for the fib benchmark.

 Well, creation is also the problem if you crawl the entire free heap
 before triggering the next GC round. You get a potential cache miss on
 each creation and on each mark and on each destruction. To keep GC out
 of the way, the entire arena has to be confined to cache size or less.

 But a HLL translated fib would use Integer PMCs for calculations.

 Hmm, I'm nitpicking here, but it's not how e.g. Psyco works. It
 specialises each function to specific argument types and recompiles for
 each new argument type set. Assuming that you'll call only very few
 functions with more than 1-2 type combinations, this is a good tradeoff.
 It also removes a lot of consing, especially for arithmetics.


 ...  This would entail the first generation that fits into the CPU
 cache and copying out live objects from it. And this means copying GC
 for Parrot, something that (IMHO) would be highly nontrivial to
 retrofit.


 A copying GC isn't really hard to implement. And it has the additional
 benefit of providing better cache locality. Nontrivial to retrofit or
 not, we need a generational GC.

The catch with generation GC is that, once you have guaranteed
destructors being called promptly, you still have to sweep the whole
arena every time you leave a scope.


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Dan Sugalski
At 11:39 AM +0100 11/5/04, Leopold Toetsch wrote:
The cache misses are basically in two places
a) accessing a new register frame and context
b) during DOD/GC
b) is relatively easy -- I'd bet that the vast majority of the cache 
misses are because of the copying collector. That could be cleared up 
my moving to a non-copying collector. I'm perfectly fine with that.

Having said that, I do *not* want the copying collector to be the 
default, or even enabled, for any non-production release, the same 
way that the default build is with no C compiler optimizations. When 
we cut 1.0.0, or want to do a real benchmark test, it can be turned 
on, but until then I want it off. The copying collector right now is 
truly vicious on dangling pointer bugs, and makes parrot die at the 
drop of a hat when there's a GC/DOD problem. I rather like that.

a), the register frame/context stuff. We've batted around a number of 
solutions to this, and I think a variant of the we get change 
interpreter structure on invoke will get us what we need.

If, generally speaking, invoking something gets you a new interpreter 
structure with most of the contents of the current structure, we can 
tidy some stuff up and, I think, make things a bit cleaner. As a side 
effect, a return continuation becomes the interpreter structure 
itself. If we do this we have five basic 'destination' PMCs:

1) Sub PMC
2) Closure PMC
3) Coroutine PMC
4) Continuation PMC
5) Return continuation
On invocation:
Sub PMC) Allocates a new interpreter structure from the interpreter 
cache. Copies the calling convention registers into the new 
interpreter. Copies most of the environment data from the old 
interpreter into the new one. Copies a few bits of info for the new 
sub (default namespace pointer and such) from the sub PMC into the 
new interpreter

Closure PMC) Pretty much the same as the sub PMC, except that the 
closure PMC copies its cached lexical pad pointer into the new 
interpreter rather than starting fresh

Coroutine PMC) Like the closure PMC, except that it caches the top 
half of the register sets and copies those in on invocation, and 
holds two start addresses (the default start and the current start).

Continuation PMC) Allocates a new interpreter and copies the entire 
cached interpreter into it with one massive memcpy. (Whether the low 
half of the registers gets copied in is up in the air) If the current 
interpreter's not had a continuation taken on it, then it's 
immediately put on the free interpreter list.

Return Continuation PMC) We just make it the interpreter we use, 
copying the low half registers of the (now old) interpreter in. The 
current interpreter's immediately put on the free interpreter list.

I think this covers it all. It means that returning from a sub can 
immediately recycle an interpreter, and needs only a potentially 
small memcpy to work. Calling a sub does require an allocation off a 
special queue and some memory copying, but since each sub potentially 
has a very different environment (lexicals, namespaces, opcode files, 
bytecode file, security settings, and such) we're going to have to do 
this regardless of what we want.

We *could*, to cut down on some of this, split out the 'constant 
across calls' parts, like the counters, memory pools, arenas, and 
whatnot, into a separate structure that gets accessed indirectly. 
Depending on how much there was we may or may not see a win, since 
we're trading off a little extra access time (through a pointer) for 
a bit less memory copying on sub invocation. It's probably worth it, 
though.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Leopold Toetsch
Dan Sugalski [EMAIL PROTECTED] wrote:
 At 11:39 AM +0100 11/5/04, Leopold Toetsch wrote:
The cache misses are basically in two places

a) accessing a new register frame and context
b) during DOD/GC

 b) is relatively easy -- I'd bet that the vast majority of the cache
 misses are because of the copying collector.

Which copying collector? We have one for buffer memory. That isn't even
triggered. Please (re)read the thread. It's all during MS sweep phase.
And there are details, where the misses are ...

[ ... ]

 If, generally speaking, invoking something gets you a new interpreter
 structure

That's cache-wise worse then the current scheme and doesn't address the
cache issues at all, sorry. The interpreter structure is bigger then a
register frame so it'll blow caches more.

leo


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Leopold Toetsch
Piers Cawley [EMAIL PROTECTED] wrote:

 The catch with generation GC is that, once you have guaranteed
 destructors being called promptly, you still have to sweep the whole
 arena every time you leave a scope.

Yes. I hope that we can track objects needing timely destruction
directly. We have write barriers that should be able to track the
liveness of these biests.

leo


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Leopold Toetsch
Dan Sugalski [EMAIL PROTECTED] wrote:

[ Yet another f'up ]

 ..., except that it caches the top
 half of the register sets

[ ... ]

 copying the low half registers of the (now old)

Dan, the split in lower and upper half of registers was a premature
optimization with zig opcodes to address the problem, what we already
had at that time - memory copying.

Please remember how these opcodes and the split came in. It was wrong and
it remains wrong.

That approach tried to address the real problem (memory copying cost and
cache pullution) by only doing half of the work. This makes either a
16x4-register machine out of Parrot or a broken 32x4-register machine,
because some needed registers might not be preserved.

This does just not work as it doesn't account for the actual register
usage, neither of the caller nor the called sub.

And it's a PITA for the register allocation code.

leo


Re: Why is the fib benchmark still slow - part 1

2004-11-05 Thread Dan Sugalski
At 9:34 PM +0100 11/5/04, Leopold Toetsch wrote:
Dan Sugalski [EMAIL PROTECTED] wrote:
[ Yet another f'up ]
 ..., except that it caches the top
 half of the register sets
[ ... ]
 copying the low half registers of the (now old)
Dan, the split in lower and upper half of registers was a premature
optimization
You're right. At this point, all this stuff is a premature optimization.
The current calling scheme stands. We're not changing it now. We're 
not going to tune it now. PMC sizes stay the way they are. The DOD 
sweep can pick up dead continuations when it needs to, since that 
works well enough. The source changes for GC barriers can stay, and 
any changes to allow later replacement of the GC or DOD systems can 
go in, but if anyone wants to screw around with a generational 
collector they can do it in a branch -- the existing MS DOD system 
and copying collector are fine for now.

No more performance changes. Period. We get parrot fully functional first.
--
Dan
--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk