On Sun, 09 May 2010 02:11:21 -0400, Lionello Lunesu <l...@lunesu.remove.com> wrote:

I'm in the middle of moving from one city to another so don't wait for
me. I have attached the D version of the code in the wikipedia article
(including the patch for transpositions.)

It's not straightforward to drop this in (apart from it being in D),
since speller.c creates all variations on a string (=inefficient) and
uses a callback function to check if a variation is a valid symbol.

I'll have more time to look at this next week.

Several others have privately brought up this problem to Walter. He does not want to change how the symbol lookup tables work, and there is no way to iterate them. Therefore, without a way to iterate current symbols, this is the only way the algo can be written.

However, according to reports on the latest beta, he's sped up the lookup times for symbols significantly enough to perceptually reduce the problem.

Because of the nature of the algorithm, the longer the invalid symbol, the slower the algorithm. It would be interesting to see a comparison between the current beta code and code that does a full iteration with very long symbols. I don't know if anyone wants to look at modifying the symbol lookup data structures to allow iteration, it may be perceptually insignificant, and unimportant for most developers.

A quick test on a long symbol name:


void s023456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567891() {}
void main()
{
     
s123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890();
}



And on the various compilers:

2.043: 0.052s, does not suggest
2.045: 5.4s, suggests the alternative
beta: 0.8s, does not suggest

2.043 only does spell checking with a Levenshtein distance of 1, 2.045 does 2, but is extremely slow. The beta does a distance of 2, but only if the errors are close together (Walter added this as an optimization to remove one factor from the runtime complexity).

So clearly, there is still room for improvement, I think with a proper symbol iteration, we could get the time down to be as quick as 2.043 or faster *and* provide the ability to check for a complete LD of 2 where the errors are not close together. We might be able to even push the LD to 3 or 4.

I've thought about the optimization of errors close together being checked, and I think the counter case is the case which takes the longest -- a long symbol. Typically people use camelCase to denote symbols, what if two of the words in the symbol were misspelled by one character (or a capitalization was forgotten)?

It may not be an issue, the spell checker is simply a nice hint, but isn't essential to determine errors.

-Steve

Reply via email to