Re: Why is the fib benchmark still slow - part 1

Leopold Toetsch Fri, 05 Nov 2004 05:24:29 -0800

Miroslav Silovic wrote:

[EMAIL PROTECTED] wrote:
a) accessing a new register frame and context
b) during DOD/GC
Or it would make sense to use multi-frame register chunks. I kept locality of access in mind but somehow never spelled it out. But I *think* I mentioned 64kb as a good chunk size precisely because it fits well into the CPU cache - without ever specifying this as the reason.

Yep, that's the idea, as originally proposed. The one-frame-per-chunk scheme was the intermediate step. And I'm thinking of 64 KB too. The Parrot register structure is 640 bytes on a 32-bit system with 8-byte doubles. With that size we have a worst-case overhead of 1% for allocating a new chunk. Or, put differently, about 100 levels of nesting need no allocation at all.
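The arithmetic above can be sketched in C. This is only an illustration under stated assumptions, not Parrot's allocator: the 640-byte frame and the 64 KB chunk come from the message, while `frame_stack`, `frame_chunk`, `frame_push` and `frame_pop` are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sizes from the message; all type and function names are hypothetical. */
#define FRAME_SIZE       640u                       /* one register frame */
#define CHUNK_SIZE       (64u * 1024u)              /* proposed chunk size */
#define FRAMES_PER_CHUNK (CHUNK_SIZE / FRAME_SIZE)  /* 65536 / 640 = 102 */

typedef struct frame_chunk {
    struct frame_chunk *prev;  /* chunk below this one on the stack */
    size_t used;               /* frames currently in use in this chunk */
    unsigned char *mem;        /* CHUNK_SIZE bytes of frame storage */
} frame_chunk;

typedef struct frame_stack {
    frame_chunk *top;          /* chunk holding the newest frame */
    size_t chunk_allocs;       /* how often a new chunk had to be allocated */
} frame_stack;

/* Hand out the next frame; allocate a chunk only when the top one is full. */
static void *frame_push(frame_stack *s)
{
    frame_chunk *c = s->top;
    if (c == NULL || c->used == FRAMES_PER_CHUNK) {
        frame_chunk *n = malloc(sizeof *n);
        if (n == NULL)
            return NULL;
        n->mem = malloc(CHUNK_SIZE);
        if (n->mem == NULL) {
            free(n);
            return NULL;
        }
        n->prev = c;
        n->used = 0;
        s->top = n;
        s->chunk_allocs++;
        c = n;
    }
    return c->mem + (size_t)FRAME_SIZE * c->used++;
}

/* Drop the newest frame; release a chunk once it is empty (keep the first). */
static void frame_pop(frame_stack *s)
{
    frame_chunk *c = s->top;
    assert(c != NULL && c->used > 0);
    c->used--;
    if (c->used == 0 && c->prev != NULL) {
        s->top = c->prev;
        free(c->mem);
        free(c);
    }
}
```

With these sizes 102 frames fit into one chunk, and one frame is a bit under 1% of a chunk. Note that this sketch frees an empty chunk immediately, so a call depth oscillating around a chunk boundary would allocate repeatedly; keeping one spare chunk around would be one way to avoid that.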

Anyway, if you can pop both register frames -and- context structures, you won't run GC too often, and everything will nicely fit into the cache.

I thought about that too, but I'd rather have registers adjacent, so that the caller can place function arguments directly into the callee's in arguments.

OTOH it doesn't really matter if the context structure is in the frame too. We'd just need to skip that gap. REG_INT(64) or I64 is as valid as I0 or I4, as long as it's assured that it's exactly addressing the incoming argument area of the called function.
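A minimal sketch of that addressing idea, assuming frames laid out back to back in one flat array with the context stored in front of each frame's registers. The layout, the sizes and the names `TOY_REG_INT`, `OUT_ARG` and `callee_bp` are all hypothetical and do not reflect Parrot's actual macros or structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only to keep the example small. */
#define NUM_INT_REGS 32                          /* int registers per frame */
#define CTX_WORDS    4                           /* context size in words */
#define FRAME_WORDS  (CTX_WORDS + NUM_INT_REGS)  /* stride between frames */

/* Register i relative to a frame's base pointer (bp points at register 0). */
#define TOY_REG_INT(bp, i) ((bp)[(i)])

/* Index, seen from the caller, of the callee's k-th incoming register.
 * It skips the rest of the caller's frame and the callee's context gap. */
#define OUT_ARG(k) (FRAME_WORDS + (k))

/* With adjacent frames the callee's base is a fixed offset from the caller's. */
static long *callee_bp(long *caller_bp)
{
    return caller_bp + FRAME_WORDS;
}
```

The caller would then store an argument with `TOY_REG_INT(caller, OUT_ARG(0)) = value;` and the callee reads the same word as `TOY_REG_INT(callee, 0)`, with no copying in between.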

... Is the context structure a PMC now (and does it have to be, if the code doesn't specifically request access to it?)

The context structure is inside the interpreter structure. When doing a function call, it's a malloc'ed copy of the caller's state, hanging off the continuation PMC.

ad b)

Is there a way to find out how many misses came out from DoD, compared to register frames allocation?

Sure. Cachegrind is showing you the line in C source code ;) With incremental M&S we have (top n, rounded):

L2 write misses (all in context and registers)

500.000  Parrot_Sub_invoke     touch interpreter->ctx.bp
500.000    -""-                touch registers e.g. REG_PMC(0) = foo
200.000  copy_registers          -""-
700.000  ???:???               very likely JIT code writing regs first
600.000  malloc.c:chunk_alloc

L2 read misses (DOD)

1.300.000 Parrot_dod_sweep
1.300.000 contained_in_pool
  600.000 get_free_object     free_list handling

plus 500.000 more in __libc_free.

I believe that you shouldn't litter (i.e. create an immediately GCable object) on each function call - at least not without a generational collector specifically optimised to work with this.

The problem isn't the object creation per se, but the sweep through the *whole object memory* to detect dead objects. It's of course true that we don't need the return continuation PMC for the fib benchmark. But a HLL-translated fib would use Integer PMCs for calculations.
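To make the cost difference concrete, here is a toy comparison in C. It assumes objects without interior pointers and a known list of live roots, so it is neither Parrot's DOD code nor a complete copying collector; `obj`, `sweep_touches` and `evacuate` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object: a mark bit plus some payload, no pointers to other objects. */
typedef struct obj {
    int marked;
    int payload;
} obj;

/* Sweep phase of mark and sweep: every slot in the pool is inspected,
 * dead or alive. Returns the number of slots touched. */
static size_t sweep_touches(obj *pool, size_t n, size_t *freed)
{
    size_t touched = 0;
    *freed = 0;
    for (size_t i = 0; i < n; i++) {
        touched++;
        if (pool[i].marked)
            pool[i].marked = 0;   /* survivor: clear mark for the next cycle */
        else
            (*freed)++;           /* dead: would go back on the free list */
    }
    return touched;
}

/* Evacuation as in a copying collector: only live objects are visited
 * and copied; garbage is never looked at. Returns the number copied. */
static size_t evacuate(const obj *from, const size_t *roots, size_t nroots,
                       obj *to)
{
    size_t copied = 0;
    for (size_t i = 0; i < nroots; i++)
        to[copied++] = from[roots[i]];
    return copied;
}
```

For a pool of 1000 slots with 3 survivors, the sweep touches all 1000 slots while the evacuation touches 3, so the copying work scales with the live data and not with the pool size.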

... This would entail the first generation that fits into the CPU cache and copying out live objects from it. And this means copying GC for Parrot, something that (IMHO) would be highly nontrivial to retrofit.

A copying GC isn't really hard to implement. And it has the additional benefit of providing better cache locality. Nontrivial to retrofit or not, we need a generational GC.

Miro

leo
