Re: Why is the fib benchmark still slow - part 1
Dan Sugalski [EMAIL PROTECTED] wrote:

> The current calling scheme stands. [ ... ] No more performance changes.
> Period. We get parrot fully functional first.

OK. As far as PIR code is concerned, call internals are nicely hidden by
the function call and return syntax. But PASM could be a problem if there
is a growing amount of code around. We will have to change the calling
conventions slightly for more speed eventually.

leo
Re: Why is the fib benchmark still slow - part 1
At 9:38 AM +0100 11/6/04, Leopold Toetsch wrote:

> Dan Sugalski [EMAIL PROTECTED] wrote:
>> The current calling scheme stands. [ ... ] No more performance
>> changes. Period. We get parrot fully functional first.
>
> OK. As far as PIR code is concerned, call internals are nicely hidden
> by the function call and return syntax. But PASM could be a problem if
> there is a growing amount of code around. We will have to change the
> calling conventions slightly for more speed eventually.

We can lift the requirement that registers be saved for calls. Beyond
that, the calling conventions are fixed.
-- 
Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Why is the fib benchmark still slow - part 1
Below (inline/attached) is a longish analysis of fib benchmark timings.

leo

Why is the fib benchmark still slow - part 1

Python runs the fib test roughly twice as fast as Parrot. I don't like
that ;) So what's the problem?

1) CPU cache issues

First, if you'd like to investigate the issue yourself (and run
i386/linux or similar, and you don't have it): get valgrind. Great tool.
It contains a plugin called cachegrind which shows detailed information
about executed instructions, including cache misses.

For parrot -j -Oc examples/benchmarks/fib.imc [1] (and perl, python ...)
we get:

                   IRefs   DRefs   Time   L2 Misses
  Perl 5.8.0       2.0e9   1.2e9   2.5s       7.600
  Perl 5.8.0-thr   2.2e9   1.4e9   3.0s       8.500
  Python 2.3.3     1.2e9   0.7e9   1.4s      54.000
  Parrot IMS       1.1e9   0.7e9   3.3s   7.000.000
  Parrot MS        1.7e9   0.7e9   2.6s   4.800.000
  Lua 5.0 [2]      0.7e9   0.4e9   0.85s !!!  4.000

  IRefs ... instructions executed
  DRefs ... data read + write
  IMS   ... Parrot with incremental MS (settings.h), -O3 compiled
  MS    ... Parrot CVS with stop-the-world MS, -O3

DRefs are boring - except that Perl5 takes 50% more. IRefs are quite
interesting: IMS has the fewest, even fewer than Python. But the
benchmark timings differ totally: IMS is slower than MS, and both are
much slower than Python. The reason is in the last column. The valgrind
manual states (and the numbers above second it) that *one L2 cache miss
takes roughly the time of 200 instructions*. These 7 Mio cache misses
account for ~1.4e9 IRefs, which more than doubles the execution time.
Timings might be better on a fat Xeon CPU, but that doesn't really
matter (the above is w. AMD 800, 256 KB L2 cache).

The cache misses are basically in two places:

a) accessing a new register frame and context
b) during DOD/GC

We have to address both areas to get rid of the majority of cache
misses.

ad a) For each level of recursion, we are allocating a new context
structure and a new register frame. Half of these come from the recently
implemented return continuation and register frame caches. The other
half has to be freshly allocated. So for exactly every second function
call we get L2 cache misses for both the context and the register
structure.

We can't do much against the cache misses in the context, other than
making it as small as possible: e.g. by moving the now unused old
register stack pointers (4 words) out of the context, or tossing these 4
pointers entirely. But we can address the problem with the register
frame by - well - making it smaller. Python is running the code on the
stack; it's mostly touching only SP(0) and SP(1). Register usage
analysis of fib shows that it could run on Parrot with just *one*
persistent register per kind. More about that is coming and was already
said: sliding register window.

ad b) The (currently broken) Parrot setting ARENA_DOD_FLAGS shows one
possibility to reduce cache misses in DOD. During a sweep (which runs
through all allocated object memory) the memory itself isn't touched;
just a nibble per object is used, which holds the relevant information
like is_live (a sketch follows at the end of this mail).

A second way to address the DOD issues is to make the PMCs variable
sized. I've proposed that already, not too long ago.

A third way, independent of these, is not to touch most of the memory at
all, by implementing a generational garbage collector. After an object
has survived a few GC cycles, it's moved into an older generation, which
isn't subject to a GC sweep.

Both the variable-sized PMCs and the generational GC very likely need
another indirection to access a PMC. But by avoiding just one cache
miss, you can do 200 of such indirect accesses.
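To make the ARENA_DOD_FLAGS idea above concrete, here is a minimal
sketch of a side flag table in C. All names and sizes are illustrative
only, not the actual Parrot arena code:

    /* Minimal sketch: mark flags in a side table, one nibble per
     * object, so the DOD sweep never touches the object memory.
     * Hypothetical names -- not the real Parrot arenas. */
    #include <stdlib.h>

    #define OBJS_PER_ARENA 4096
    #define FLAG_LIVE      0x1

    typedef struct Arena {
        unsigned char flags[OBJS_PER_ARENA / 2]; /* 2 nibbles per byte */
        void         *objects;                   /* the PMC storage */
    } Arena;

    /* marking touches only the 2 KB flag table, not the PMCs */
    static void mark_live(Arena *a, size_t i)
    {
        a->flags[i / 2] |= (i & 1) ? (FLAG_LIVE << 4) : FLAG_LIVE;
    }

    /* sweep: read one nibble per object; only dead objects are handed
     * to the free routine, live ones get their flag cleared for the
     * next DOD run */
    static void sweep(Arena *a, void (*free_obj)(Arena *, size_t))
    {
        size_t i;
        for (i = 0; i < OBJS_PER_ARENA; i++) {
            unsigned char nib = (i & 1) ? (a->flags[i / 2] >> 4)
                                        : (a->flags[i / 2] & 0x0F);
            if (nib & FLAG_LIVE)
                a->flags[i / 2] &= (i & 1) ? 0x0F : 0xF0;
            else
                free_obj(a, i);
        }
    }

The point of the side table is that the sweep's memory traffic is
bounded by the flag array, which easily stays inside the L2 cache,
instead of by the whole object arena.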
Thanks for reading til here, comments welcome,

leo

[1] fib.imc is using integers, which is already an optimization. But
    that's not the point here.

[2] valgrind --skin=cachegrind /opt/src/lua-5.0/bin/lua
    /opt/src/lua-5.0/test/fib.lua 28
    and that's even comparing a memoized fib with a plain one, too.
Re: Why is the fib benchmark still slow - part 1
Miroslav Silovic wrote:

> [EMAIL PROTECTED] wrote:
>> a) accessing a new register frame and context
>> b) during DOD/GC
>
> Or it would make sense to use multi-frame register chunks. I kept
> locality of access in mind but somehow never spelled it out. But I
> *think* I mentioned 64kb as a good chunk size precisely because it
> fits well into the CPU cache - without ever specifying this as the
> reason.

Yep, that's the idea, as originally proposed. The one frame per chunk
was the intermediate step. And I'm thinking of 64 KB too. The Parrot
register structure is 640 bytes on a 32-bit system w. 8-byte double.
With that size we have a worst-case 1% overhead for allocating a new
chunk. Or: 100 levels of nesting run w/o any allocation. (A sketch of
such an allocator follows below.)

> Anyway, if you can pop both register frames -and- context structures,
> you won't run GC too often, and everything will nicely fit into the
> cache.

I thought about that too, but I'd rather have registers adjacent, so
that the caller can place function arguments directly into the callee's
incoming arguments. OTOH it doesn't really matter if the context
structure is in the frame too; we'd just need to skip that gap.
REG_INT(64) or I64 is as valid as I0 or I4, as long as it's assured
that it's exactly addressing the incoming argument area of the called
function.

> ... Is the context structure a PMC now (and does it have to be, if
> the code doesn't specifically request access to it?)

The context structure is inside the interpreter structure. When doing a
function call, it's a malloced copy of the caller's state, hanging off
the continuation PMC.

> ad b) Is there a way to find out how many misses came out from DoD,
> compared to register frame allocation?

Sure. Cachegrind is showing you the line in the C source code ;) With
incremental MS we have (top n, rounded):

  L2 write misses (all in context and registers)
      500.000  Parrot_Sub_invoke     touch interpreter->ctx.bp
      500.000    --                  touch registers e.g. REG_PMC(0) = foo
      200.000  copy_registers          --
      700.000  ???:???               very likely JIT code writing regs first
      600.000  malloc.c:chunk_alloc

  L2 read misses (DOD)
    1.300.000  Parrot_dod_sweep
    1.300.000  contained_in_pool
      600.000  get_free_object       free_list handling

plus 500.000 more in __libc_free.

> I believe that you shouldn't litter (i.e. create an immediately
> GCable object) on each function call - at least not without a
> generational collector specifically optimised to work with this.

The problem isn't the object creation per se, but the sweep through the
*whole object memory* to detect dead objects. It's of course true that
we don't need the return continuation PMC for the fib benchmark. But a
HLL-translated fib would use Integer PMCs for calculations.

> ... This would entail the first generation that fits into the CPU
> cache and copying out live objects from it. And this means copying GC
> for Parrot, something that (IMHO) would be highly nontrivial to
> retrofit.

A copying GC isn't really hard to implement. And it has the additional
benefit of providing better cache locality. Nontrivial to retrofit or
not, we need a generational GC.

> Miro

leo
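A minimal sketch of such a multi-frame register chunk allocator,
using the sizes from the discussion above; all names are made up and
this is not the actual Parrot code:

    /* Bump-allocate 640-byte register frames out of 64 KB chunks so
     * that nested calls stay inside one cache-warm block. */
    #include <stdlib.h>

    #define CHUNK_SIZE (64 * 1024)   /* fits the L2 cache */
    #define FRAME_SIZE 640           /* 32-bit system, 8-byte double */

    typedef struct Chunk {
        struct Chunk *prev;          /* older chunk, for deep recursion */
        char         *top;           /* next free frame */
        char          space[CHUNK_SIZE];
    } Chunk;

    static Chunk *new_chunk(Chunk *prev)
    {
        Chunk *c = malloc(sizeof(*c));
        if (!c)
            abort();
        c->prev = prev;
        c->top  = c->space;
        return c;
    }

    /* ~100 call levels run from one chunk before malloc is hit again,
     * i.e. a worst-case ~1% allocation overhead */
    static void *push_frame(Chunk **cp)
    {
        char *frame;
        if ((*cp)->top + FRAME_SIZE > (*cp)->space + CHUNK_SIZE)
            *cp = new_chunk(*cp);
        frame = (*cp)->top;
        (*cp)->top += FRAME_SIZE;
        return frame;
    }

    static void pop_frame(Chunk **cp)
    {
        if ((*cp)->top == (*cp)->space && (*cp)->prev) {
            Chunk *empty = *cp;      /* a real allocator would cache this */
            *cp = empty->prev;
            free(empty);
        }
        (*cp)->top -= FRAME_SIZE;
    }

Because frames are adjacent within a chunk, a caller's outgoing
arguments can sit directly in the callee's incoming argument area, as
described above.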
Re: Why is the fib benchmark still slow - part 1
[EMAIL PROTECTED] wrote:

> a) accessing a new register frame and context
> b) during DOD/GC
>
> We have to address both areas to get rid of the majority of cache
> misses.
>
> ad a) For each level of recursion, we are allocating a new context
> structure and a new register frame. Half of these come from the
> recently implemented return continuation and register frame caches.
> The other half has to be freshly allocated. So for exactly every
> second function call we get L2 cache misses for both the context and
> the register structure.

Or it would make sense to use multi-frame register chunks. I kept
locality of access in mind but somehow never spelled it out. But I
*think* I mentioned 64kb as a good chunk size precisely because it fits
well into the CPU cache - without ever specifying this as the reason.

Anyway, if you can pop both register frames -and- context structures,
you won't run GC too often, and everything will nicely fit into the
cache. Is the context structure a PMC now (and does it have to be, if
the code doesn't specifically request access to it?)

> ad b) The (currently broken) Parrot setting ARENA_DOD_FLAGS shows one
> possibility to reduce cache misses in DOD. During a sweep (which runs
> through all allocated object memory) the memory itself isn't touched;
> just a nibble per object is used, which holds the relevant
> information like is_live.

Is there a way to find out how many misses came out from DoD, compared
to register frame allocation?

I believe that you shouldn't litter (i.e. create an immediately GCable
object) on each function call - at least not without a generational
collector specifically optimised to work with this. This would entail
a first generation that fits into the CPU cache and copying out live
objects from it. And this means copying GC for Parrot, something that
(IMHO) would be highly nontrivial to retrofit.

Miro
Re: Why is the fib benchmark still slow - part 1
Leopold Toetsch wrote:

>> I believe that you shouldn't litter (i.e. create an immediately
>> GCable object) on each function call - at least not without a
>> generational collector specifically optimised to work with this.
>
> The problem isn't the object creation per se, but the sweep through
> the *whole object memory* to detect dead objects. It's of course true
> that we don't need the return continuation PMC for the fib benchmark.

Well, creation is also a problem if you crawl the entire free heap
before triggering the next GC round. You get a potential cache miss on
each creation, and on each mark, and on each destruction. To keep GC
out of the way, the entire arena has to be confined to cache size or
less.

> But a HLL-translated fib would use Integer PMCs for calculations.

Hmm, I'm nitpicking here, but that's not how e.g. Psyco works. It
specialises each function to specific argument types and recompiles for
each new argument type set. Assuming that you'll call only very few
functions with more than 1-2 type combinations, this is a good
tradeoff. It also removes a lot of consing, especially for arithmetic.

>> ... This would entail a first generation that fits into the CPU
>> cache and copying out live objects from it. And this means copying
>> GC for Parrot, something that (IMHO) would be highly nontrivial to
>> retrofit.
>
> A copying GC isn't really hard to implement. And it has the
> additional benefit of providing better cache locality. Nontrivial to
> retrofit or not, we need a generational GC.

Copying and generational are orthogonal concepts. It's quite possible
to have a noncopying gengc (and a nongenerational copying GC, but
that's beside the point). This gives you quick mark phases, but without
so much gain in locality (I think the Boehm GC can do this). A barrier
sketch follows below.

The problem with copying GC is that pointers can change under your feet
at every opportunity. Embedded C libraries that try to manipulate GCed
objects really hate it when that happens - in particular where some
pointers get cached in CPU registers - and telling the GC to
(un)protect a pointer is a chore for C programmers (as bad as manual
refcounting). I suppose there are good solutions to this, I'm just not
aware of any. The gain is that you can guarantee that the youngest
generation will be no bigger than X KB. This can be a very good thing.

However, for the problem at hand - namely, littering during function
calls - a custom deallocator (that'd be chunks) could be enough. In
particular, it makes sense to use it in conjunction with a non-copying
gengc.

Miro
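A minimal sketch of the write barrier a non-copying generational GC
relies on - a standard remembered-set scheme, not anything currently in
Parrot, with all names hypothetical:

    /* Stores of young pointers into old objects land in a remembered
     * set, so a minor DOD run can skip the old generation entirely. */
    #include <stddef.h>

    enum Gen { GEN_YOUNG, GEN_OLD };

    typedef struct Obj {
        enum Gen    gen;
        struct Obj *field;           /* a single pointer slot, for show */
    } Obj;

    #define REMEMBERED_MAX 1024      /* a real set would grow on demand */
    static Obj   *remembered[REMEMBERED_MAX];
    static size_t n_remembered;

    /* every pointer store in the VM goes through this barrier */
    static void write_field(Obj *holder, Obj *value)
    {
        if (holder->gen == GEN_OLD && value != NULL
            && value->gen == GEN_YOUNG && n_remembered < REMEMBERED_MAX)
            remembered[n_remembered++] = holder; /* root for minor GC */
        holder->field = value;
    }

A minor collection then marks from the normal roots plus the remembered
set and sweeps only the young arena, which is exactly what keeps the GC
traffic inside the cache.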
Re: Why is the fib benchmark still slow - part 1
Miroslav Silovic [EMAIL PROTECTED] wrote:

> The problem with copying GC is that pointers can change under your
> feet at every opportunity.

Basically yes. We have that problem currently with the copying
collection of strings. Normally this is solved by keeping the old
object in place, so that pointer stores into the object don't fail, and
a write barrier redirects the desired action to the new object location
via a forward pointer (sketched below).

> However, for the problem at hand - namely, littering during function
> calls - a custom deallocator (that'd be chunks) could be enough.

Yes. As said, there are several issues, and we have to address them one
by one.

> Miro

leo
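A tiny sketch of that forward-pointer scheme, with a hypothetical
header layout rather than Parrot's actual string/buffer headers:

    /* The old header stays valid and points at the moved data;
     * accesses chase the pointer, so stale references keep working. */
    #include <stddef.h>

    typedef struct Hdr {
        struct Hdr *forward;   /* NULL until the collector moves us */
        /* ... payload ... */
    } Hdr;

    /* read barrier: resolve forwarding before touching the object */
    static Hdr *resolve(Hdr *obj)
    {
        while (obj->forward != NULL)
            obj = obj->forward;
        return obj;
    }

    /* write barrier: store the current location and thereby heal
     * the stale reference in the slot */
    static void store_ref(Hdr **slot, Hdr *value)
    {
        *slot = resolve(value);
    }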
Re: Why is the fib benchmark still slow - part 1
Miroslav Silovic [EMAIL PROTECTED] writes:

> Leopold Toetsch wrote:
>> The problem isn't the object creation per se, but the sweep through
>> the *whole object memory* to detect dead objects. It's of course
>> true that we don't need the return continuation PMC for the fib
>> benchmark.
>
> Well, creation is also a problem if you crawl the entire free heap
> before triggering the next GC round. [ ... ]
>
>> A copying GC isn't really hard to implement. And it has the
>> additional benefit of providing better cache locality. Nontrivial to
>> retrofit or not, we need a generational GC.

The catch with generational GC is that, once you have guaranteed
destructors being called promptly, you still have to sweep the whole
arena every time you leave a scope.
Re: Why is the fib benchmark still slow - part 1
At 11:39 AM +0100 11/5/04, Leopold Toetsch wrote:

> The cache misses are basically in two places:
>
> a) accessing a new register frame and context
> b) during DOD/GC

b) is relatively easy -- I'd bet that the vast majority of the cache
misses are because of the copying collector. That could be cleared up
by moving to a non-copying collector. I'm perfectly fine with that.

Having said that, I do *not* want the copying collector to be the
default, or even enabled, for any non-production release, the same way
that the default build is with no C compiler optimizations. When we cut
1.0.0, or want to do a real benchmark test, it can be turned on, but
until then I want it off. The copying collector right now is truly
vicious on dangling pointer bugs, and makes parrot die at the drop of a
hat when there's a GC/DOD problem. I rather like that.

a), the register frame/context stuff. We've batted around a number of
solutions to this, and I think a variant of the "you get a new
interpreter structure on invoke" scheme will get us what we need. If,
generally speaking, invoking something gets you a new interpreter
structure with most of the contents of the current structure, we can
tidy some stuff up and, I think, make things a bit cleaner. As a side
effect, a return continuation becomes the interpreter structure itself.

If we do this we have five basic 'destination' PMCs:

1) Sub PMC
2) Closure PMC
3) Coroutine PMC
4) Continuation PMC
5) Return continuation PMC

On invocation:

Sub PMC) Allocates a new interpreter structure from the interpreter
cache. Copies the calling convention registers into the new
interpreter. Copies most of the environment data from the old
interpreter into the new one. Copies a few bits of info for the new sub
(default namespace pointer and such) from the sub PMC into the new
interpreter.

Closure PMC) Pretty much the same as the Sub PMC, except that the
closure PMC copies its cached lexical pad pointer into the new
interpreter rather than starting fresh.

Coroutine PMC) Like the Closure PMC, except that it caches the top half
of the register sets and copies those in on invocation, and holds two
start addresses (the default start and the current start).

Continuation PMC) Allocates a new interpreter and copies the entire
cached interpreter into it with one massive memcpy. (Whether the low
half of the registers gets copied in is up in the air.) If the current
interpreter's not had a continuation taken on it, then it's immediately
put on the free interpreter list.

Return Continuation PMC) We just make it the interpreter we use,
copying in the low-half registers of the (now old) interpreter. The
current interpreter's immediately put on the free interpreter list.

I think this covers it all. It means that returning from a sub can
immediately recycle an interpreter, and needs only a potentially small
memcpy to work. Calling a sub does require an allocation off a special
queue and some memory copying, but since each sub potentially has a
very different environment (lexicals, namespaces, opcode files,
bytecode file, security settings, and such) we're going to have to do
this regardless of what we want.

We *could*, to cut down on some of this, split out the 'constant across
calls' parts, like the counters, memory pools, arenas, and whatnot,
into a separate structure that gets accessed indirectly. Depending on
how much there was, we may or may not see a win, since we're trading a
little extra access time (through a pointer) for a bit less memory
copying on sub invocation. It's probably worth it, though.
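As a rough illustration of the recycling idea, here's a minimal,
hypothetical C sketch. It ignores the "continuation taken" check and
all the real environment details, and every name is made up:

    /* Interpreters come off a free list, a Sub invoke clones the
     * caller's environment, and a return recycles the callee. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct Interp {
        struct Interp *next_free;  /* free-list link */
        struct Interp *caller;     /* doubles as the return continuation */
        /* registers, namespace pointer, lexical pad, ... */
    } Interp;

    static Interp *free_list;

    static Interp *interp_alloc(void)
    {
        Interp *i = free_list;
        if (i != NULL) {
            free_list = i->next_free;
            return i;              /* hot path: no malloc, warm cache */
        }
        return calloc(1, sizeof(*i));
    }

    /* Sub PMC invoke: new interpreter, environment copied over */
    static Interp *invoke_sub(Interp *caller)
    {
        Interp *callee = interp_alloc();
        memcpy(callee, caller, sizeof(*callee)); /* env + conv. regs */
        callee->caller = caller;
        return callee;
    }

    /* return continuation invoke: the callee goes straight back on
     * the free list, so fib-style recursion reuses one hot structure */
    static Interp *invoke_return(Interp *callee)
    {
        Interp *caller = callee->caller;
        callee->next_free = free_list;
        free_list = callee;
        return caller;
    }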
-- 
Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: Why is the fib benchmark still slow - part 1
Dan Sugalski [EMAIL PROTECTED] wrote:

> At 11:39 AM +0100 11/5/04, Leopold Toetsch wrote:
>> The cache misses are basically in two places:
>> a) accessing a new register frame and context
>> b) during DOD/GC
>
> b) is relatively easy -- I'd bet that the vast majority of the cache
> misses are because of the copying collector.

Which copying collector? We have one for buffer memory. That isn't even
triggered. Please (re)read the thread. It's all during the MS sweep
phase. And there are details of where the misses are ...

[ ... ]

> If, generally speaking, invoking something gets you a new interpreter
> structure

That's cache-wise worse than the current scheme and doesn't address the
cache issues at all, sorry. The interpreter structure is bigger than a
register frame, so it'll blow the caches more.

leo
Re: Why is the fib benchmark still slow - part 1
Piers Cawley [EMAIL PROTECTED] wrote:

> The catch with generational GC is that, once you have guaranteed
> destructors being called promptly, you still have to sweep the whole
> arena every time you leave a scope.

Yes. I hope that we can track objects needing timely destruction
directly. We have write barriers that should be able to track the
liveness of these beasts.

leo
Re: Why is the fib benchmark still slow - part 1
Dan Sugalski [EMAIL PROTECTED] wrote:

[ Yet another f'up ]

> ..., except that it caches the top half of the register sets
[ ... ]
> copying in the low-half registers of the (now old)

Dan, the split into lower and upper halves of the registers was a
premature optimization, with umpteen opcodes to address the problem we
already had at that time - memory copying. Please remember how these
opcodes and the split came in. It was wrong and it remains wrong.

That approach tried to address the real problem (memory copying cost
and cache pollution) by doing only half of the work. This makes either
a 16x4-register machine out of Parrot or a broken 32x4-register
machine, because some needed registers might not be preserved. It just
doesn't work, as it doesn't account for the actual register usage of
either the caller or the called sub. And it's a PITA for the register
allocation code.

leo
Re: Why is the fib benchmark still slow - part 1
At 9:34 PM +0100 11/5/04, Leopold Toetsch wrote:

> Dan Sugalski [EMAIL PROTECTED] wrote:
>> ..., except that it caches the top half of the register sets
> [ ... ]
>
> Dan, the split into lower and upper halves of the registers was a
> premature optimization

You're right. At this point, all this stuff is a premature
optimization.

The current calling scheme stands. We're not changing it now. We're not
going to tune it now. PMC sizes stay the way they are. The DOD sweep
can pick up dead continuations when it needs to, since that works well
enough.

The source changes for GC barriers can stay, and any changes to allow
later replacement of the GC or DOD systems can go in, but if anyone
wants to screw around with a generational collector they can do it in a
branch -- the existing MS DOD system and copying collector are fine for
now.

No more performance changes. Period. We get parrot fully functional
first.
-- 
Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk