Raphael Geissert <atom...@gmail.com> writes:

> Anyway, I have written several different implementations. One is similar
> to the one I previously wrote, but turns the whole list of known bad words
> into one big ORed regex; as expected, it is a lot faster than my first
> one. But the vast majority of the time it was still slower than the
> current algorithm.
>
> These are the benchmark results of several methods, all dropping the
> regex that strips most non-word characters.
>
> On the output of strings /usr/bin/php5 (50 times):
>         Rate   bts  orig   new
> bts   7.74/s    --  -44%  -61%
> orig  13.7/s   77%    --  -30%
> new   19.7/s  154%   43%    --
>
> On /usr/share/common-licenses/GPL-3 (1000 times):
>         Rate   bts  orig   new
> bts   58.6/s    --  -60%  -76%
> orig   146/s  148%    --  -40%
> new    242/s  312%   66%    --
>
> bts: the one I first submitted on this bug report
> orig: the current one
> new: the proposed one
>
> The idea behind removing the regex that strips all non-alphabetic
> characters is that the likelihood of the resulting "word" being an
> actual match should be extremely remote. Instead, the replacement takes
> care of removing dots, commas, and other symbols that are commonly used
> in sentences.

Yeah, this looks much better.  Applied with one change: keeping hyphens to
match the behavior of the previous code.
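
For anyone reading along, here is a rough sketch of the general idea as
described above: strip only common sentence punctuation, keep hyphens, and
match against one alternation built from the list of known bad words. It
is written in Python purely for illustration and is not the applied patch.
The word list, the punctuation set, and the names CORRECTIONS and
find_misspellings are all made up for the example, and treating a
hyphenated compound as a single token is an assumption about what
"keeping hyphens" implies.

```python
import re

# Hypothetical sample of known misspellings; the real list is not part
# of this thread.
CORRECTIONS = {
    "accomodate": "accommodate",
    "recieve": "receive",
    "seperate": "separate",
}

# One big alternation of every known bad word. The lookarounds treat
# hyphens as part of a word, so "pre-recieve" is one token and is not
# reported (an assumption, see above).
BAD_WORDS = re.compile(
    r"(?<![\w-])(?:"
    + "|".join(re.escape(word) for word in CORRECTIONS)
    + r")(?![\w-])"
)

# Symbols commonly used in sentences. Hyphens are deliberately absent.
PUNCTUATION = re.compile(r"[.,;:!?()\"]")


def find_misspellings(text):
    """Return (bad word, correction) pairs found in text, in order."""
    cleaned = PUNCTUATION.sub(" ", text.lower())
    return [
        (match.group(0), CORRECTIONS[match.group(0)])
        for match in BAD_WORDS.finditer(cleaned)
    ]
```

With this sample list, find_misspellings("Please recieve, then seperate.")
reports both words, while "pre-recieve" is left alone because the hyphen
stays attached to the token.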

-- 
Russ Allbery (r...@debian.org)               <http://www.eyrie.org/~eagle/>


