On Sat, Jan 05, 2008 at 01:09:01AM -0600, Patrick R. Michaud wrote:
> On Sat, Jan 05, 2008 at 12:29:40AM -0600, Patrick R. Michaud wrote:
> > On Fri, Jan 04, 2008 at 07:43:18PM -0800, chromatic wrote:
> > > (Callgrind suggests that about 45% of the running time of 
> > > the NQP part of the build comes from utf8_set_position 
> > > and utf8_skip_forward.)
> > 
> > Even better might be to figure out why utf8_set_position
> > and utf8_skip_forward are slow and see if those can be sped
> > up somehow.
> 
> I just looked at this, and ouch.  Every call to get_codepoint()
> for utf8 strings scans from the beginning of the string to
> locate the corresponding character position.  Since get_codepoint()
> is repeatedly called for every find_cclass and find_not_cclass
> opcode using a utf8 string target, and since most strings tend
> to get "promoted" into utf8, this repeated scanning can end up 
> being really slow.  (For example, the find_not_cclass opcode 
> gets used for scanning whitespace.)
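
To make the problem concrete, here's roughly the pattern in
question (a minimal hypothetical sketch, not Parrot's actual
code; assumes valid, NUL-terminated utf8):

    #include <stddef.h>

    /* Skip one utf8-encoded character: one lead byte plus any
     * continuation bytes of the form 10xxxxxx. */
    static const unsigned char *skip_one(const unsigned char *p)
    {
        p++;
        while ((*p & 0xC0) == 0x80)
            p++;
        return p;
    }

    /* get_codepoint-style lookup: an O(n) scan from byte 0 on
     * every call, so asking for characters 0..n-1 in turn -- which
     * is what find_cclass/find_not_cclass effectively did -- costs
     * O(n^2) overall. */
    static const unsigned char *byte_pos(const unsigned char *buf,
                                         size_t char_idx)
    {
        const unsigned char *p = buf;
        while (char_idx-- > 0)
            p = skip_one(p);
        return p;
    }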

As of r24557 I've rewritten find_cclass and find_not_cclass
so that they use a string iterator instead of repeated calls
to ENCODING_GET_CODEPOINT.  I also improved utf8_set_position
a bit so that it doesn't always have to restart position
counting from the beginning of the string.  As a result,
compiling the actions.pl script on my machine goes from 39s to
a little over 28s -- roughly a 28% reduction in compile time.
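
The iterator approach boils down to keeping a single cursor and
decoding each character exactly once.  Roughly (again a
hypothetical sketch, not the actual r24557 code):

    #include <stddef.h>

    /* Minimal utf8 iterator; assumes valid input. */
    typedef struct {
        const unsigned char *p;      /* current byte position  */
        const unsigned char *end;    /* one past the last byte */
        size_t charpos;              /* characters consumed    */
    } Utf8Iter;

    /* Decode one codepoint and advance: O(1) per character, so a
     * full scan is O(n) instead of O(n^2). */
    static unsigned int iter_next(Utf8Iter *it)
    {
        static const unsigned int mask[4] = { 0x7F, 0x1F, 0x0F, 0x07 };
        unsigned int c = *it->p++;
        int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2
                  : (c >= 0xC0) ? 1 : 0;
        c &= mask[extra];
        while (extra-- > 0)
            c = (c << 6) | (*it->p++ & 0x3F);
        it->charpos++;
        return c;
    }

    /* find_not_cclass-style scan (here: skip ASCII whitespace). */
    static size_t scan_past_whitespace(const unsigned char *buf,
                                       size_t len)
    {
        Utf8Iter it = { buf, buf + len, 0 };
        while (it.p < it.end) {
            const unsigned char *save = it.p;
            unsigned int cp = iter_next(&it);
            if (cp != ' ' && cp != '\t' && cp != '\n' && cp != '\r') {
                it.p = save;         /* back up before the match */
                it.charpos--;
                break;
            }
        }
        return it.charpos;  /* index of first non-whitespace char */
    }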

It's likely that even with these improvements we still do a
fair bit of position counting.  For example, utf8_skip_forward
and ENCODING_GET_CODEPOINT are probably still called often --
if nothing else, the is_cclass opcode uses them, as do other
operations.  Some sort of memoization for utf8_skip_forward might
give us even more speedups, but the amount of improvement would
really depend on how/when these are being called.
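
One plausible shape for that memoization is to cache the last
byte-offset/character-position pair on the string and resume
from it on forward seeks (the struct and names below are made
up for illustration; this is not the actual Parrot string
header):

    #include <stddef.h>

    typedef struct {
        const unsigned char *buf;
        size_t byte_len;
        size_t cached_bytes;   /* byte offset of the mark...    */
        size_t cached_chars;   /* ...and its character position */
    } Utf8Str;

    /* set_position-style lookup with a cached mark: forward seeks
     * resume from the cache instead of rescanning from byte 0.
     * Assumes valid utf8 and an in-range target. */
    static size_t utf8_byte_offset(Utf8Str *s, size_t target_chars)
    {
        size_t bytes = 0, chars = 0;
        if (target_chars >= s->cached_chars) {
            bytes = s->cached_bytes;   /* resume at the mark */
            chars = s->cached_chars;
        }                              /* else: restart from 0 (or
                                          keep several marks) */
        while (chars < target_chars) {
            bytes++;                   /* past the lead byte */
            while (bytes < s->byte_len &&
                   (s->buf[bytes] & 0xC0) == 0x80)
                bytes++;               /* past continuation bytes */
            chars++;
        }
        s->cached_bytes = bytes;       /* remember for next time */
        s->cached_chars = chars;
        return bytes;
    }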

I think it will still be worthwhile to investigate
converting strings into a fixed-width encoding of some sort
instead of performing scans on variable-width encodings.
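
For comparison, with a fixed-width (UCS-4-style) buffer the
lookup collapses to array indexing -- O(1), no scanning at all --
at the cost of up to four bytes per character (sketch with
hypothetical names):

    #include <stdint.h>
    #include <stddef.h>

    /* Fixed-width buffer: character i is just buf[i], so
     * set_position and skip_forward become simple arithmetic. */
    static uint32_t fixed_get_codepoint(const uint32_t *buf, size_t i)
    {
        return buf[i];
    }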

Thanks!

Pm
