On Sat, Jan 05, 2008 at 12:17:00PM +0100, Cosimo Streppone wrote:
> Patrick wrote:
> >
> >[...] I also improved utf8_set_position
> >a bit so that it doesn't always have to restart position
> >counting from the beginning of the string. As a result,
> >compiling the actions.pl script on my machine goes from 39s to
> >a little over 28s -- about a 25% speed increase.
>
> [... /me reads again the diff ...]
>
> I realized while writing this that if `i->charpos > pos',
> you simply end up re-scanning the string from the start.
> Is that correct?
Correct.

> Maybe it could be an idea to scan backwards in that case?
> Please don't yell at me, I'm just trying to follow up :)

It's a very good question. My suspicion is that we rarely "scan
backwards". The utf8_set_position function actually manipulates
a String_iter struct, as opposed to the string itself, and from
the cases I've looked at, it looks as though utf8_set_position is
used to set the iterator to a known offset after the iterator is
created. All subsequent scanning with the iterator tends to go
forward after that (using the get_and_advance and set_and_advance
methods).

So, scanning backwards feels a lot like a premature optimization
to me -- i.e., we could implement it, but there's a good chance it
never comes up in real use.

In fact, I just did a quick check of this by adding a print message
to utf8_set_position to send a message whenever a backwards scan is
encountered, and the only times it occurs we're moving from offset 1
to offset 0 -- i.e., "scanning backwards" would actually take longer
than the current algorithm. (I'm also a little curious as to the
conditions when we'd be moving from offset 1 to offset 0, but will
check that later.)

Thanks for the question and reviewing the code! Based on this
I'm removing the "XXX" note about scanning in both directions, and
I also see where I forgot to properly cast the initialization of
u8ptr.

Pm
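
For anyone following the thread without the diff at hand, here is a
minimal sketch of the behaviour discussed above: resume counting from
the iterator's current position when the target is at or ahead of it,
and restart from the beginning of the string otherwise. This is not
the actual Parrot code. The struct sketch_iter, its field names, and
the helpers utf8_seq_len and sketch_set_position are all hypothetical,
and the sketch assumes valid UTF-8 and a target position inside the
string.

```c
#include <stddef.h>

/* Hypothetical iterator, loosely modeled on the String_iter mentioned
 * above. The field names are assumptions, not Parrot's real layout. */
typedef struct {
    const unsigned char *start; /* first byte of the UTF-8 buffer */
    size_t bytepos;             /* byte offset of the current character */
    size_t charpos;             /* character index of the current character */
} sketch_iter;

/* Length in bytes of the UTF-8 sequence introduced by lead byte c.
 * Assumes c is a valid lead byte (no validation is done). */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1; /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    return 4;                         /* 11110xxx */
}

/* Move the iterator to character index pos.
 * The caller must ensure pos does not exceed the string length. */
static void sketch_set_position(sketch_iter *i, size_t pos)
{
    if (pos < i->charpos) {
        /* Target is behind us: fall back to rescanning from the start. */
        i->bytepos = 0;
        i->charpos = 0;
    }
    /* Target is at or ahead of us: walk forward one character at a time. */
    while (i->charpos < pos) {
        i->bytepos += utf8_seq_len(i->start[i->bytepos]);
        i->charpos++;
    }
}
```

With this shape, a move from offset 1 back to offset 0 costs nothing
beyond resetting the two counters, which matches the observation above
that a true backwards scan would not help in that case.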