Raphael Geissert <atom...@gmail.com> writes:

> Anyway, I have written several different implementations; one is similar
> to the one I previously wrote but turns the whole list of known bad words
> into a big ORed regex and, as expected, it turned out a lot faster than my
> first one. But the vast majority of the time it was still slower than the
> current algorithm.
>
> These are the benchmark results of several methods, all dropping the
> regex that strips most non-word characters.
>
> On the output of strings /usr/bin/php5 (50 times):
>        Rate  bts orig  new
>  bts 7.74/s   -- -44% -61%
> orig 13.7/s  77%   -- -30%
>  new 19.7/s 154%  43%   --
>
> On /usr/share/common-licenses/GPL-3 (1000 times):
>        Rate  bts orig  new
>  bts 58.6/s   -- -60% -76%
> orig  146/s 148%   -- -40%
>  new  242/s 312%  66%   --
>
> bts:  the one I first submitted on this bug report
> orig: the current one
> new:  the proposed one
>
> The idea behind removing the regex that removes all non-alphabetic
> characters is that the likelihood of the resulting "word" being an
> actual match should be extremely remote. Instead, the replacement takes
> care of removing dots, commas, and other symbols that are commonly used
> in sentences.

Yeah, this looks much better. Applied with one change: keeping hyphens to
match the behavior of the previous code.

-- 
Russ Allbery (r...@debian.org) <http://www.eyrie.org/~eagle/>
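
For anyone reading the thread later, here is a rough sketch of the two strategies being compared. It is written in Python purely for illustration and is not the code that was applied. The word list, the function names, and the exact set of stripped symbols are all assumptions; the only details taken from the thread are that the bad words can be joined into one ORed regex, that only common sentence punctuation is removed, and that hyphens are kept.

```python
import re

# Hypothetical sample of known misspellings; the real list is much larger.
CORRECTIONS = {
    "likelyhood": "likelihood",
    "recieve": "receive",
    "seperate": "separate",
}

# Symbols commonly used in sentences (assumed set). The hyphen is
# deliberately absent, so hyphenated words are left intact.
_STRIP = str.maketrans("", "", ".,;:!?()[]\"'")


def check_by_lookup(text):
    """Split on whitespace, strip sentence punctuation, look each word up."""
    found = []
    for token in text.split():
        word = token.lower().translate(_STRIP)
        if word in CORRECTIONS:
            found.append((word, CORRECTIONS[word]))
    return found


# Alternative: every known bad word joined into one big alternation.
_BIG_RE = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in CORRECTIONS) + r")\b",
    re.IGNORECASE,
)


def check_by_regex(text):
    """Scan the whole text once with the combined regex."""
    return [
        (m.group(1).lower(), CORRECTIONS[m.group(1).lower()])
        for m in _BIG_RE.finditer(text)
    ]
```

The two functions are not equivalent on hyphenated input: the lookup version treats "re-seperate" as a single word and finds nothing, while the regex version sees a word boundary at the hyphen and reports "seperate". Relative speed will depend on the language and the regex engine, so the numbers quoted above should not be expected to carry over to this sketch.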