Re: The Big Three Rakudo (and Parrot OO) Bottlenecks

2008-04-16 Thread chromatic
On Wednesday 16 April 2008 15:39:41 Jonathan Worthington wrote:

> chromatic wrote:

> > It helps the PIR Ackermann benchmark by 4.67%.  parrot_pass_args gets more
> > expensive, but next_arg_sig and everything else except for
> > Parrot_init_arg_indexes_and_sig_pmc gets called much, much less.  I
> > played with the patch a little bit, but didn't get it much faster.

> So is the optimization worth applying, do you think? Seems like a
> relatively small gain once we take it away from a benchmark that we'd
> expect it to directly help...

> > It may be possible to avoid the Parrot_init_arg_indexes_and_sig_pmc
> > calls, but I couldn't immediately see how.
> I pondered this too, but didn't really look into it properly.

If we can do this, the optimization might be worth it.

-- c


Re: The Big Three Rakudo (and Parrot OO) Bottlenecks

2008-04-16 Thread Jonathan Worthington

chromatic wrote:
It helps the PIR Ackermann benchmark by 4.67%.  parrot_pass_args gets more 
expensive, but next_arg_sig and everything else except for 
Parrot_init_arg_indexes_and_sig_pmc gets called much, much less.  I played 
with the patch a little bit, but didn't get it much faster.
So is the optimization worth applying, do you think? Seems like a 
relatively small gain once we take it away from a benchmark that we'd 
expect it to directly help...



It may be possible to avoid the Parrot_init_arg_indexes_and_sig_pmc calls, but 
I couldn't immediately see how.

I pondered this too, but didn't really look into it properly.

Jonathan



Re: The Big Three Rakudo (and Parrot OO) Bottlenecks

2008-04-15 Thread chromatic
On Tuesday 15 April 2008 17:19:42 Jonathan Worthington wrote:

> I thought that detecting when the signature on the caller and callee
> side were identical and fast-tracking that might help. I stuck in
> something to count how many times this happened. It was the case in 23%
> of calls while compiling Rakudo. So I did the optimisation in the
> attached patch, which basically just copies arguments from one set of
> registers to the other about as efficiently as I can see how to do it.
>
> The end result? I can't actually detect enough of a difference to make
> me think I've really improved anything. Not in the compilation time of
> Rakudo's actions.pm, nor when I took a Perl 6 Ackermann function
> implementation and ran it under Rakudo.

It helps the PIR Ackermann benchmark by 4.67%.  parrot_pass_args gets more 
expensive, but next_arg_sig and everything else except for 
Parrot_init_arg_indexes_and_sig_pmc gets called much, much less.  I played 
with the patch a little bit, but didn't get it much faster.

It may be possible to avoid the Parrot_init_arg_indexes_and_sig_pmc calls, but 
I couldn't immediately see how.

-- c


Re: The Big Three Rakudo (and Parrot OO) Bottlenecks

2008-04-15 Thread Patrick R. Michaud
On Wed, Apr 16, 2008 at 02:19:42AM +0200, Jonathan Worthington wrote:
> I thought that detecting when the signature on the caller and callee 
> side were identical and fast-tracking that might help. I stuck in 
> something to count how many times this happened. It was the case in 23% 
> of calls while compiling Rakudo. So I did the optimisation in the 
> attached patch, which basically just copies arguments from one set of 
> registers to the other about as efficiently as I can see how to do it.
> 
> The end result? I can't actually detect enough of a difference to make 
> me think I've really improved anything. Not in the compilation time of 
> Rakudo's actions.pm, nor when I took a Perl 6 Ackermann function 
> implementation and ran it under Rakudo.

FWIW, all of the regexes compiled by PGE end up with slurpy hashes
in their parameter list, so (IIUC from the earlier IRC discussion)
they aren't seeing any benefit from this optimization.

A similar situation applies for the methods in PCT -- most of them
also have slurpy hash parameters.  This is in order to be compatible
with Perl 6's default *%_ parameter in methods.

Pm


Re: The Big Three Rakudo (and Parrot OO) Bottlenecks

2008-04-15 Thread Jonathan Worthington

Hi,

chromatic wrote:
3) PCC argument processing is slow.  I've looked over Parrot_pass_args and 
Parrot_process_args several times in the past few months, and I didn't see 
any obvious speedups or tricks.  However, we do spend a lot of time shuffling 
data back and forth, and something (instinct, blind desire, lower blood sugar 
than normal) suggests that we already *know* some of this information.  That 
is, at the point of a call we know if there are slurpy or named arguments and 
if the invoked function wants to bind slurpy or named parameters.  It seems 
like we could have a fast path for that common operation... somehow.
I thought that detecting when the signature on the caller and callee 
side were identical and fast-tracking that might help. I stuck in 
something to count how many times this happened. It was the case in 23% 
of calls while compiling Rakudo. So I did the optimisation in the 
attached patch, which basically just copies arguments from one set of 
registers to the other about as efficiently as I can see how to do it.


The end result? I can't actually detect enough of a difference to make 
me think I've really improved anything. Not in the compilation time of 
Rakudo's actions.pm, nor when I took a Perl 6 Ackermann function 
implementation and ran it under Rakudo.


Jonathan
Index: src/inter_call.c
===================================================================
--- src/inter_call.c	(revision 26955)
+++ src/inter_call.c	(working copy)
@@ -1520,12 +1520,45 @@
     Parrot_init_arg_indexes_and_sig_pmc(interp, dest_ctx, dest_indexes,
             dest_signature, &st.dest);
 
-    Parrot_process_args(interp, &st, param_or_result);
+    if (st.src.u.op.signature == st.dest.u.op.signature
+            && !PMC_IS_NULL(st.src.u.op.signature)) {
+        /* Identical signatures, so we can fast-track it. */
+        PMC *sig = st.src.u.op.signature;
+        int sig_length = st.src.n;
+        opcode_t *src_idxs = st.src.u.op.pc;
+        opcode_t *dest_idxs = st.dest.u.op.pc;
+        int i;
+        for (i = 0; i < sig_length; i++) {
+            const int constant = PARROT_ARG_CONSTANT_ISSET(SIG_ITEM(sig, i));
+            int src_idx = src_idxs[i];
+            int dest_idx = dest_idxs[i];
+            switch (PARROT_ARG_TYPE_MASK_MASK(SIG_ITEM(sig, i))) {
+                case PARROT_ARG_INTVAL:
+                    CTX_REG_INT(st.dest.ctx, dest_idx) = constant ?
+                        src_idx : CTX_REG_INT(st.src.ctx, src_idx);
+                    break;
+                case PARROT_ARG_STRING:
+                    CTX_REG_STR(st.dest.ctx, dest_idx) = constant ?
+                        st.src.ctx->constants[src_idx]->u.string :
+                        CTX_REG_STR(st.src.ctx, src_idx);
+                    break;
+                case PARROT_ARG_FLOATVAL:
+                    CTX_REG_NUM(st.dest.ctx, dest_idx) = constant ?
+                        st.src.ctx->constants[src_idx]->u.number :
+                        CTX_REG_NUM(st.src.ctx, src_idx);
+                    break;
+                case PARROT_ARG_PMC:
+                    CTX_REG_PMC(st.dest.ctx, dest_idx) = constant ?
+                        st.src.ctx->constants[src_idx]->u.key :
+                        CTX_REG_PMC(st.src.ctx, src_idx);
+                    break;
+            }
+        }
+    }
+    else {
+        Parrot_process_args(interp, &st, param_or_result);
 
-    /* If we created a slurpy, we had to DOD register it so it did not get
-     * collected during arg processing; we'll now unregister it. */
-    if (st.dest.slurp)
-        dod_unregister_pmc(interp, st.dest.slurp);
+        /* If we created a slurpy, we had to DOD register it so it did not get
+         * collected during arg processing; we'll now unregister it. */
+        if (st.dest.slurp)
+            dod_unregister_pmc(interp, st.dest.slurp);
+    }
 }
 
 


Re: The Big Three Rakudo (and Parrot OO) Bottlenecks

2008-04-15 Thread Leopold Toetsch
On Saturday 12 April 2008 02:27, chromatic wrote:
> I've committed a couple of minor optimizations which speed up Rakudo and
> Perl OO in general by about 35%.  There may be a few more lurking, but I
> keep running into three spots which dominate most of the other optimization
> strategies I might pursue.
>
> 1) The Garbage Collector is algorithmically inefficient.  There are various
> potential optimization strategies here which don't require us walking every
> allocated header in the system multiple times (yes, it's that bad in at
> least two places), 

Last time I looked at it, a header isn't "walked" if the live bit is already 
set. pobject_lives() will be called, yes, but that used to be a macro and is 
cheap in the common case, when the live bit is already set.
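The cheap common case Leo describes can be sketched as an early-return mark 
routine. This is an illustrative mock-up, not Parrot's real PObj layout or 
API; the single next link stands in for whatever references an object holds.

```c
#include <stddef.h>

enum { POBJ_LIVE = 1 << 0 };

typedef struct PObj {
    unsigned flags;
    struct PObj *next;   /* stand-in for the object's references */
} PObj;

/* Counts the expensive marking work, to show it happens at most once. */
static int mark_work = 0;

void pobject_lives_sketch(PObj *obj)
{
    if (obj == NULL || (obj->flags & POBJ_LIVE))
        return;                       /* common case: a flag test, nothing more */
    obj->flags |= POBJ_LIVE;
    mark_work++;                      /* real work happens only on first visit */
    pobject_lives_sketch(obj->next);  /* recurse into referenced objects */
}
```

Repeated calls on an already-marked object fall straight through the first 
test, which is why the non-macro version is still cheap.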

> but the real solution I suspect is to replace the 
> current GC with at least the one specified in the new GC PDD.

Which pdd are you referring to? I don't see any substantial text in 
pdd09_gc.pod.

> The two worst spots are turning off the live flag (and destroying
> everything with a custom destroy flag) and dividing newly-allocated memory
> pools into headers and manually adding each header to a linked-list of free
> objects.

Bitmap handling for live flags was in the code base ~1.5 years ago. You could 
look at svn diffs near:

r14799 | leo | 2006-09-30 12:38:35 +0200 (Sa, 30 Sep 2006) | 4 lines
r14798 | leo | 2006-09-30 11:24:49 +0200 (Sa, 30 Sep 2006) | 6 lines

[GC] remove ARENA_DOD_FLAGS code

I've removed it, because I couldn't measure any remarkable performance 
improvements.
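For readers who never saw the removed code: a bitmap scheme in the spirit of 
ARENA_DOD_FLAGS keeps one mark bit per object out of band, so clearing every 
live flag in an arena is a single memset instead of a walk over all headers. 
All names and the layout below are mine, not the historical code.

```c
#include <stdint.h>
#include <string.h>

#define ARENA_OBJECTS 256

typedef struct Arena {
    uint8_t live_bits[ARENA_OBJECTS / 8];  /* one mark bit per object */
    /* ... object storage would live here ... */
} Arena;

/* Clearing all live flags is one compact, cache-friendly pass. */
static void arena_clear_live(Arena *a)
{
    memset(a->live_bits, 0, sizeof a->live_bits);
}

static void arena_set_live(Arena *a, int idx)
{
    a->live_bits[idx / 8] |= (uint8_t)(1u << (idx % 8));
}

static int arena_is_live(const Arena *a, int idx)
{
    return (a->live_bits[idx / 8] >> (idx % 8)) & 1;
}
```

Whether this beats a flag word in each header depends on access patterns, 
which matches Leo's experience that the win wasn't measurable.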

> Flag marking would be much cheaper (and likely cache-friendlier, though I
> haven't done the math to make sure of this) 

See above.

> The free object list is the reason that compacting/copying collectors are
> popular,

[ ... ]

Copying collectors need another indirection in PMC allocation, as a PMC isn't 
allowed to move - external C code might have stored a pointer to a PMC. Did 
this philosophy change at some point?

Anyway, due to better cache locality, and in combination with differently 
sized PMCs, I'd estimate that a copying collector could be faster.
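The extra indirection Leo mentions usually takes the form of a handle table: 
external code keeps a pointer to a stable slot rather than to the object 
itself, so the collector can move the payload and patch only the slot. A 
minimal sketch, with all names invented for illustration:

```c
#include <string.h>

/* External code holds a Handle*, never the raw payload pointer. */
typedef struct Handle {
    void *payload;
} Handle;

/* "Move" an object to new space; because every reference goes through
 * the handle, only the one slot needs updating. */
static void gc_move(Handle *h, void *new_space, size_t size)
{
    memcpy(new_space, h->payload, size);
    h->payload = new_space;
}
```

The cost is one extra dereference on every access, which is exactly the 
trade-off against the cache-locality win Leo estimates.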

> With that all said, one good optimization is to allocate fewer PObjs

Indeed.

> -- c

leo


Re: The Big Three Rakudo (and Parrot OO) Bottlenecks

2008-04-12 Thread chromatic
On Friday 11 April 2008 19:05:41 Bob Rogers wrote:

>From: chromatic <[EMAIL PROTECTED]>

>The free object list is the reason that compacting/copying collectors
>are popular, specifically that all you have to do to find the next
>free object is bump the pointer by sizeof (header) and see if that's
>still within the bounds of the pool.
>
> You don't even need the bounds check, strictly.  Several popular free
> Lisps just blindly initialize the last word of the new object, and if
> that segfaults, the allocation is restarted after creating a new pool.
> But I don't know if it's really worth it for anything smaller than a
> cons (two pointers).

I can see that working, but I'm a little leery of catching segfaults in a 
cross-platform way (and the cost of that context switch... hmm, I'd have to 
benchmark that).  I'm more worried that there won't be a segfault at all, 
because the next page happens to be in already-allocated memory somewhere and 
we'd be overwriting something else.  We have multiple pools of different sizes.
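The bounds-checked variant of the bump allocator Bob describes is a compare 
and an add on the fast path, with no signal handling. A sketch, names 
illustrative:

```c
#include <stddef.h>

typedef struct Pool {
    char *next;    /* first free byte */
    char *limit;   /* one past the end of the pool */
} Pool;

/* Bump-pointer allocation with an explicit bounds check: the safe
 * alternative to touching the last word and catching SIGSEGV. */
static void *pool_alloc(Pool *p, size_t size)
{
    if ((size_t)(p->limit - p->next) < size)
        return NULL;       /* pool exhausted: caller runs GC or grows */
    void *obj = p->next;
    p->next += size;       /* the whole fast path: one compare, one add */
    return obj;
}
```

This is the "bump the pointer and see if that's still within the bounds of 
the pool" operation from the quoted mail, made concrete.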

> Wouldn't it be better to have the GC trigger after the next X bytes or
> objects are allocated, rather than using the end of the pool?  The pool
> boundary seems pretty arbitrary to me.

Possibly.  I'm not sure how that interacts with the three-color incremental 
scheme in the PDD, though.  We could go for the low-latency system or a very 
conservative system or a high throughput system, but I need to think more 
about the implications of the documented system.
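Bob's allocation-budget trigger can be sketched independently of pool 
boundaries: count bytes since the last collection and fire when a threshold 
is crossed. Everything here is hypothetical, not Parrot code.

```c
#include <stddef.h>

enum { GC_BUDGET = 1024 };   /* hypothetical: collect per KB allocated */

typedef struct AllocState {
    size_t since_last_gc;
    int gc_runs;             /* stands in for real mark & sweep work */
} AllocState;

static void note_allocation(AllocState *s, size_t size)
{
    s->since_last_gc += size;
    if (s->since_last_gc >= GC_BUDGET) {
        s->gc_runs++;        /* trigger point is the budget, not a pool edge */
        s->since_last_gc = 0;
    }
}
```

How such a budget interacts with the tri-color incremental scheme in the PDD 
is exactly the open question chromatic raises.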

> Weren't we just recently having a problem with special treatment for
> constant PMCs?  Full support for magic flying ponies is gonna cost you,
> and it'll probably be in terms of reliability.

I know, but once in a while refcounting has its advantages.

-- c


Re: The Big Three Rakudo (and Parrot OO) Bottlenecks

2008-04-11 Thread chromatic
On Friday 11 April 2008 17:27:34 chromatic wrote:

> 1) The Garbage Collector is algorithmically inefficient.  There are various
> potential optimization strategies here which don't require us walking every
> allocated header in the system multiple times (yes, it's that bad in at
> least two places), but the real solution I suspect is to replace the
> current GC with at least the one specified in the new GC PDD.

I realize there's a third (potential) problem with the GC.  I don't recall 
offhand (and I don't have time at the moment to check) whether it keeps any 
statistics about whether the current GC run actually reclaimed anything 
substantial.  That is, if we exhausted a free list and ran the GC but only 
freed a couple of headers, we know we'll just have to run another full mark 
and sweep shortly anyway.  My evidenceless guess is that it's worth just 
allocating another empty pool unless we've reclaimed 1/3 to 1/2 of a pool's 
worth of headers.

This is probably an optimization we should consider for any GC scheme.
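The 1/3-to-1/2 heuristic above reduces to one comparison after each run. A 
sketch with the 1/3 threshold; the names and the exact ratio are illustrative:

```c
#include <stddef.h>

/* Grow with a fresh pool unless the last GC run reclaimed at least
 * RECLAIM_NUM/RECLAIM_DEN of a pool's worth of headers. */
enum { RECLAIM_NUM = 1, RECLAIM_DEN = 3 };

/* Returns 1 if we should allocate another pool rather than trust the
 * freshly swept free list (cross-multiplied to stay in integers). */
static int should_grow(size_t headers_reclaimed, size_t pool_size)
{
    return headers_reclaimed * RECLAIM_DEN < pool_size * RECLAIM_NUM;
}
```

Keeping the reclaimed-header count per run is the only bookkeeping this 
needs, which is the statistic chromatic wasn't sure the GC records.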

Oh, and while I'm asking for magic flying ponies who eat garbage and produce 
candy, wouldn't it be nice to have a way to allocate a PMC or other PObj you 
know you need for a couple of instructions and then never again and then 
immediately free it?
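The wished-for operation is cheap if the caller can promise no references 
survive: hand the header straight back to the free list instead of waiting 
for the next mark and sweep. A sketch with an invented free-list layout, not 
Parrot's actual header pools:

```c
#include <stddef.h>

typedef struct Header {
    struct Header *next_free;  /* reused as the free-list link */
    /* ... payload fields would live here ... */
} Header;

typedef struct FreeList {
    Header *head;
} FreeList;

static Header *alloc_header(FreeList *fl)
{
    Header *h = fl->head;
    if (h != NULL)
        fl->head = h->next_free;
    return h;                  /* NULL means: run GC or grow a pool */
}

/* The magic-pony operation: the caller asserts the header is dead, so
 * it skips collection entirely and is immediately reusable. */
static void free_header_now(FreeList *fl, Header *h)
{
    h->next_free = fl->head;
    fl->head = h;
}
```

The reliability cost Bob warns about is the "caller asserts" part: one stale 
reference to an explicitly freed header and the next allocation aliases it.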

-- c