I'm afraid your message didn't make it to the newsgroup provided by gmane, so it is only now that I'm reading your email.
On Sunday 08 February 2009 21:11:20 Russ Allbery wrote:
> Raphael Geissert <atomo64+deb...@gmail.com> writes:
> > There might indeed be a degradation on those texts where the number of
> > characters is near the limit of the methods switcher. Other than those
> > cases, when the number of words in the text is greater than the number
> > of words in the hash there should be a performance improvement when
> > searching for every known word in the text (there is indeed a
> > performance penalty because of the regex and the text traversal, but it
> > is not enough to make it slower than the other method).
>
> But why would that be the case?

I don't remember my reasoning behind the new implementation very well, but it was along the lines of it being faster because it only looks up known words, ignoring duplicates.

> There are two scaling factors here. Let's say n is the length of the text
> and w is the number of corrections we have. The current algorithm walks
> the text once and does a hash table lookup for each word. (Well,
> technically it probably walks the text once to split it on whitespace and
> then a second time to do the hash table lookups.) So, we do work
> proportional to the length of the text, or O(n). The factor is roughly
> 2 plus the length of time it takes to hash a word.

Plus the lc and regex operations on every word (a word being whatever lies between two spaces, or between a space and the start or end of a line, or any combination of those).

> Here's the new code:
[...]
> The first branch is the current algorithm. This code chooses that
> algorithm if n is small. If n is large, it switches to doing a scan of
> the full text for each correction that we have, which is work proportional
> to w, the number of corrections.
>
> I was wrong originally -- this isn't O(n^2). It's still O(n), but with a
> different factor, this time proportional to w. It's O(wn), essentially.
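To make the trade-off concrete, here is a minimal sketch of the two strategies under discussion (lintian itself is Perl; this is Python for illustration, and the correction table is a toy stand-in I made up, not lintian's real list):

```python
import re

# Toy stand-in for the corrections hash (the real list has 100+ entries).
CORRECTIONS = {"teh": "the", "recieve": "receive", "occured": "occurred"}

def check_per_word(text):
    """Current algorithm: split once, one hash lookup per word -- O(n),
    plus the per-word lc and stripping-regex cost discussed above."""
    found = []
    for word in text.split():
        word = re.sub(r"[^a-z]", "", word.lower())  # lc + strip, per word
        if word in CORRECTIONS:
            found.append((word, CORRECTIONS[word]))
    return found

def check_per_correction(text):
    """Proposed algorithm for large texts: one full-text scan per known
    misspelling -- O(w*n), though each scan stops at its first match."""
    text = text.lower()  # lc once, over the whole text
    found = []
    for bad, good in CORRECTIONS.items():
        if re.search(r"\b%s\b" % re.escape(bad), text):
            found.append((bad, good))
    return found
```

Which one wins depends on whether w full-text regex scans cost less than n hash lookups plus the per-word lc/strip work, which is exactly the point of contention here.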
There are two fewer operations being performed in the new implementation: lc and the stripping regex are applied directly to the whole text, not once per word. And I wouldn't consider it O(wn); it is more like O(wn/p), where p is the probability that a known word appears in the text: if a known word appears, say, at the beginning of the text, the search doesn't have to scan the rest of it.

> So, for this to be faster, the time it takes to check a portion of the
> text against every correction we have using a regex must be less than the
> time it takes to hash a word.
>
> This makes no sense to me. Any reasonable hash algorithm should be much
> faster than more than 100 regex matches.
>
> The benchmarking looks like it may just be in the noise for me:
[...]

Agreed.

[...]
> So I'm getting basically similar results either way.

I would expect that. I believe it's because the version of lintian you used for your tests didn't include the binary spell checking patch; my results did. Running lintian without that patch makes it spell check only very small texts, IOW: the new algorithm is never used without the patch.

Anyway, I have written several different implementations. One is similar to the one I previously wrote, but it turns the whole list of known bad words into one big ORed regex; as expected, it turned out a lot faster than my first attempt, yet in the vast majority of cases it was still slower than the current algorithm.

These are the benchmark results of several methods, all of which drop the regex that strips most non-word characters.
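For reference, the "one big ORed regex" variant could look roughly like this (again a Python sketch with an invented toy correction table, not the actual patch):

```python
import re

CORRECTIONS = {"teh": "the", "recieve": "receive", "occured": "occurred"}

# Join every known misspelling into a single alternation, so the regex
# engine walks the text once instead of once per correction. Sorting by
# length (longest first) avoids shorter alternatives shadowing longer ones.
BAD_WORDS_RE = re.compile(
    r"\b(?:%s)\b"
    % "|".join(map(re.escape, sorted(CORRECTIONS, key=len, reverse=True)))
)

def check_combined(text):
    """One pass over the text, matching all known misspellings at once."""
    return [(bad, CORRECTIONS[bad]) for bad in BAD_WORDS_RE.findall(text.lower())]
```

With w alternatives folded into one pattern, the text is traversed once rather than w times, which is consistent with this variant being much faster than the per-correction loop while still carrying per-position matching overhead that a plain hash lookup avoids.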
On the output of strings /usr/bin/php5 (50 times):

        Rate   bts  orig   new
bts   7.74/s    --  -44%  -61%
orig  13.7/s   77%    --  -30%
new   19.7/s  154%   43%    --

On /usr/share/common-licenses/GPL-3 (1000 times):

        Rate   bts  orig   new
bts   58.6/s    --  -60%  -76%
orig   146/s  148%    --  -40%
new    242/s  312%   66%    --

bts: the one I first submitted in this bug report
orig: the current one
new: the proposed one

The idea behind removing the regex that strips all non-alphabetic characters is that the likelihood of the resulting "word" being an actual match should be extremely remote. Instead, the replacement takes care of removing dots, commas, and other symbols that are commonly used in sentences.

Cheers,
--
Raphael Geissert - Debian Maintainer
www.debian.org - get.debian.net
lintian_spelling-optim.mbox
Description: application/mbox