On Sat, Jan 05, 2008 at 12:29:40AM -0600, Patrick R. Michaud wrote:
> On Fri, Jan 04, 2008 at 07:43:18PM -0800, chromatic wrote:
> > (Callgrind suggests that about 45% of the running time of
> > the NQP part of the build comes from utf8_set_position
> > and utf8_skip_forward.)
>
> Even better might be to figure out why utf8_set_position
> and utf8_skip_forward are slow and see if those can be sped
> up somehow.

I just looked at this, and ouch. Every call to get_codepoint() for
utf8 strings scans from the beginning of the string to locate the
corresponding character position. Since get_codepoint() is repeatedly
called for every find_cclass and find_not_cclass opcode using a utf8
string target, and since most strings tend to get "promoted" into
utf8, this repeated scanning can end up being really slow. (For
example, the find_not_cclass opcode gets used for scanning whitespace.)

Thus, I agree that using a fixed-width encoding on strings could be a
big improvement for anything using PGE, because then these opcodes
would avoid the repeated scans from the start of the string. I also
think this means we need a way in PIR to easily create unicode string
literals using fixed-width encodings.

Also, we ought to be able to speed up find_not_cclass and find_cclass
by using string iterators instead of repeated calls to get_codepoint.
This could reduce the repeated scans from the beginning of the string.

Lastly, string iterators on utf8 encoded strings do some basic
memoizing of the last known character offset and location, but the
utf8_set_position function doesn't make use of this information -- it
always restarts from the beginning. (There's even an XXX note in
src/encodings/utf8.c that says it should use the quickest direction
instead of scanning from the start.)

If nobody else is likely to look into improving these sections of the
code, I suspect I should probably go ahead and spend the time to do it.

c -- what are the individual runtime percentages for utf8_set_position
and utf8_skip_forward?

Thanks!

Pm
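
P.S. To make the "quickest direction" idea concrete, here is a rough
standalone sketch in C. It is not Parrot's code: the utf8_cursor type
and the utf8_cursor_init, utf8_cursor_seek, and utf8_offset_from_start
names are made up for illustration, and it assumes well-formed UTF-8.
The cursor remembers the last (character index, byte offset) pair and
moves forward or backward from there, restarting from the beginning
only when the target is nearer to the start than to the cached spot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor (not a Parrot type): caches the last known
 * character index together with its byte offset. */
typedef struct {
    const unsigned char *buf;  /* UTF-8 bytes, assumed well-formed */
    size_t bytelen;            /* length of buf in bytes */
    size_t charpos;            /* last known character index */
    size_t bytepos;            /* byte offset of that character */
} utf8_cursor;

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* The slow pattern: every lookup walks from byte 0. */
size_t utf8_offset_from_start(const unsigned char *buf, size_t bytelen,
                              size_t charpos)
{
    size_t byte = 0;
    while (charpos > 0 && byte < bytelen) {
        byte++;  /* step over the lead byte */
        while (byte < bytelen && is_continuation(buf[byte]))
            byte++;
        charpos--;
    }
    return byte;
}

void utf8_cursor_init(utf8_cursor *c, const unsigned char *buf,
                      size_t bytelen)
{
    assert(c != NULL && buf != NULL);
    c->buf = buf;
    c->bytelen = bytelen;
    c->charpos = 0;
    c->bytepos = 0;
}

/* Move to character index `target` from whichever known point is
 * closer (string start or cached position); return its byte offset.
 * Targets past the end clamp to the end of the buffer. */
size_t utf8_cursor_seek(utf8_cursor *c, size_t target)
{
    /* Restart only if the start is nearer than the cached position. */
    if (target < c->charpos && target < c->charpos - target) {
        c->charpos = 0;
        c->bytepos = 0;
    }
    /* Scan forward one character at a time. */
    while (c->charpos < target && c->bytepos < c->bytelen) {
        c->bytepos++;
        while (c->bytepos < c->bytelen && is_continuation(c->buf[c->bytepos]))
            c->bytepos++;
        c->charpos++;
    }
    /* Scan backward: step back to the previous lead byte. */
    while (c->charpos > target) {
        c->bytepos--;
        while (c->bytepos > 0 && is_continuation(c->buf[c->bytepos]))
            c->bytepos--;
        c->charpos--;
    }
    return c->bytepos;
}
```

With something along these lines, a left-to-right scan such as
find_not_cclass would pay one short step per lookup instead of a full
rescan, since each request lands next to the cached position.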