Re: Repeated Loopy Variable Width String Character Access is Slooooow-ish

Patrick R. Michaud Fri, 04 Jan 2008 23:09:25 -0800

On Sat, Jan 05, 2008 at 12:29:40AM -0600, Patrick R. Michaud wrote:
> On Fri, Jan 04, 2008 at 07:43:18PM -0800, chromatic wrote:
> > (Callgrind suggests that about 45% of the running time of 
> > the NQP part of the build comes from utf8_set_position 
> > and utf8_skip_forward.)
> 
> Even better might be to figure out why utf8_set_position
> and utf8_skip_forward are slow and see if those can be sped
> up somehow.


I just looked at this, and ouch.  Every call to get_codepoint()
for utf8 strings scans from the beginning of the string to
locate the corresponding character position.  Since get_codepoint()
is repeatedly called for every find_cclass and find_not_cclass
opcode using a utf8 string target, and since most strings tend
to get "promoted" into utf8, this repeated scanning can end up 
being really slow.  (For example, the find_not_cclass opcode 
gets used for scanning whitespace.)
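
To make the cost concrete, here is a minimal sketch in C of what an index-based whitespace scan ends up doing. The helper names (seq_len, offset_from_start, skip_spaces_indexed) are made up for illustration and are not the actual Parrot functions; the sketch also assumes well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not Parrot's real code. */

/* Length in bytes of the UTF-8 sequence that starts with lead byte b
   (assumes well-formed input). */
static size_t seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Byte offset of character number `pos`, found by walking from byte 0.
   `steps` accumulates how many characters were skipped along the way. */
static size_t offset_from_start(const unsigned char *s, size_t pos,
                                size_t *steps) {
    size_t off = 0;
    assert(s != NULL && steps != NULL);
    while (pos-- > 0) {
        off += seq_len(s[off]);
        ++*steps;
    }
    return off;
}

/* Index of the first non-space character at or after `start`.
   Every character is located again from the start of the string,
   which is the pattern described above. */
static size_t skip_spaces_indexed(const unsigned char *s, size_t nchars,
                                  size_t start, size_t *steps) {
    size_t i = start;
    while (i < nchars && s[offset_from_start(s, i, steps)] == ' ')
        ++i;
    return i;
}
```

Scanning the five-character string "    x" this way skips 0+1+2+3+4 = 10 characters in total to examine 5, so the work grows quadratically with the distance from the start of the string.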

Thus, I agree that using a fixed-width encoding on strings 
could be a big improvement for anything using PGE, because then
these opcodes would avoid the repeated scans from the start
of the string.  I also think this means we need a way in PIR to 
easily specify unicode string literals using fixed-width encodings.

Also, we ought to be able to speed up find_not_cclass and 
find_cclass by using string iterators instead of repeated
calls to get_codepoint.  This could reduce the repeated
scans from the beginning of the string.
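
As a rough sketch of the iterator approach in C: the iterator keeps its byte offset and character index together, so each step costs only the length of one character. The utf8_iter struct and the function names are hypothetical, not Parrot's actual string iterator API, and the decoder assumes well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator for illustration; not Parrot's real structure. */
typedef struct {
    const unsigned char *bytes;
    size_t bytepos;   /* byte offset of the current character */
    size_t charpos;   /* character index of the current character */
} utf8_iter;

/* Length in bytes of the sequence starting with lead byte b. */
static size_t seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Decode the current codepoint and advance by one character. */
static unsigned long iter_next(utf8_iter *it) {
    const unsigned char *p = it->bytes + it->bytepos;
    size_t n = seq_len(p[0]);
    size_t i;
    /* Lead byte payload: all 7 bits for ASCII, else the low bits
       left over after the length prefix. */
    unsigned long c = (n == 1) ? p[0] : (p[0] & (0xFFu >> (n + 1)));
    for (i = 1; i < n; i++)
        c = (c << 6) | (p[i] & 0x3Fu);
    it->bytepos += n;
    it->charpos += 1;
    return c;
}

/* Advance `it` to the first non-space character and return its index.
   No character is ever located again from the start of the string. */
static size_t find_not_space(utf8_iter *it, size_t nchars) {
    assert(it != NULL);
    while (it->charpos < nchars) {
        utf8_iter save = *it;
        if (iter_next(it) != ' ') {
            *it = save;   /* step back onto the non-space character */
            break;
        }
    }
    return it->charpos;
}
```

A real find_cclass or find_not_cclass would test character class membership rather than comparing against a single space, but the traversal pattern would be the same.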

Lastly, string iterators on utf8 encoded strings do some
basic memoizing of the last known character offset and
location, but the utf8_set_position function doesn't make
use of this information -- it always restarts from the
beginning.  (There's even an XXX note in src/encodings/utf8.c 
that says it should use the quickest direction instead
of scanning from the start.)
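
One possible shape for a set_position that uses the remembered offset is sketched below in C. This is only an illustration of the "quickest direction" idea under my own assumptions: the utf8_iter struct and set_position signature are hypothetical and do not match the real code in src/encodings/utf8.c, and the input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator; bytepos/charpos are the memoized location. */
typedef struct {
    const unsigned char *bytes;
    size_t bytepos;
    size_t charpos;
} utf8_iter;

/* Length in bytes of the sequence starting with lead byte b. */
static size_t seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Move the iterator to character index `pos`, starting from whichever
   known point is nearer: the string start or the remembered position. */
static void set_position(utf8_iter *it, size_t pos) {
    assert(it != NULL);
    if (pos < it->charpos) {
        size_t back = it->charpos - pos;
        if (pos < back) {
            /* The start of the string is closer: restart from it. */
            it->bytepos = 0;
            it->charpos = 0;
        } else {
            /* Walk backward, skipping continuation bytes (10xxxxxx). */
            while (it->charpos > pos) {
                do {
                    --it->bytepos;
                } while ((it->bytes[it->bytepos] & 0xC0) == 0x80);
                --it->charpos;
            }
            return;
        }
    }
    /* Walk forward from the current (possibly reset) position. */
    while (it->charpos < pos) {
        it->bytepos += seq_len(it->bytes[it->bytepos]);
        ++it->charpos;
    }
}
```

With something like this, a loop that asks for positions n, n+1, n+2, ... pays for one character per call instead of rescanning the whole prefix each time.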

If nobody else is likely to look into improving these
sections of the code, I suspect I should probably go
ahead and spend the time to do it.

c -- what are the individual runtime percentages for 
utf8_set_position and utf8_skip_forward?

Thanks!

Pm
