On Sun, Feb 12, 2012 at 04:31:49AM +0100, Xavier Noria wrote:
> On Sun, Feb 12, 2012 at 4:12 AM, Xavier Noria <f...@hashref.com> wrote:
> > Nowadays long strings get a performance boost. That does not make sense;
> > statistically speaking, English words should be the fast ones.
>
> Indeed, running the benchmark against /usr/share/dict/words gives an
> overall speedup of more than 7x:
>
> https://gist.github.com/1806049
>
> The bigger the sample, the greater the speedup, because inflections are the
> exception. The majority of words are inflected using the last rule, so the
> difference in the looping technique is bigger.
>
> This should happen in real life too: exceptions are rare, so in general
> most words will apply the last rule.
Interesting. Have you investigated expanding the regular expressions and doing hash-based replacement via gsub!? Since we know the replacements in advance, it's possible to compile a hash and use it for the replacement. If the hash misses, we can fall back to a linear scan.

Here's a quick implementation as an example. We could probably optimize more of the expressions:

https://gist.github.com/1806575

--
Aaron Patterson
http://tenderlovemaking.com/
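[Editor's note: a minimal sketch of the hash-based replacement idea being discussed. The words and fallback rules below are illustrative, not taken from the actual gist. It relies on the fact that Ruby's String#gsub accepts a Hash as its second argument: each match is replaced by looking it up in the hash, so known inflections become one regexp pass plus a hash lookup, with a linear scan of generic rules only on a miss.]

```ruby
# Hypothetical irregular inflections, compiled into a lookup hash.
IRREGULARS = {
  "person" => "people",
  "child"  => "children",
  "man"    => "men"
}.freeze

# One alternation matching every irregular word, anchored to word boundaries.
IRREGULAR_RE = /\b(?:#{Regexp.union(IRREGULARS.keys).source})\b/

# Generic fallback rules, scanned linearly only when the hash misses.
FALLBACK_RULES = [
  [/(ch|sh|ss|x|z)$/i, '\1es'],
  [/s$/i, "s"],
  [/$/, "s"]
].freeze

def pluralize(word)
  if IRREGULARS.key?(word)
    # gsub with a Hash: each match is replaced by IRREGULARS[match].
    word.gsub(IRREGULAR_RE, IRREGULARS)
  else
    rule, replacement = FALLBACK_RULES.find { |re, _| word =~ re }
    word.sub(rule, replacement)
  end
end

pluralize("person") # => "people"
pluralize("box")    # => "boxes"
pluralize("word")   # => "words"
```

Because most real-world words take the generic last rule, the win in the thread comes from the irregular cases resolving in constant time instead of walking the full rule list.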