David Spencer wrote:
To restate the question for a second.

The misspelled word is: "conts".
The sugggestion expected is "const", which seems reasonable enough as it's just a transposition away, thus the string distance is low.


But - I guess the problem w/ the algorithm is that for short words like this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...

conts -> con ont nts
const -> con ons nst

Thus they just share 1 3gram, thus this is why it scores so low. This is an interesting issue, how to tune the algorithm so that it might return words this close higher.

If you added 2-grams, then it would look like this (constructing also special start/end grams):


conts -> _c co on nt ts s_
const -> _c co on ns st t_

which gives 50% of overlap.

In another system that I designed we were using a combination of 2-4 grams, albeit for a slightly different purpose, so in this case it would be:

conts:
        _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_

const:
        _c co on ns st t_, _co con ons nst st_, _con cons onst nst_

and the overlap is 40%.



I guess one way is to add all simple transpositions to the lookup table (the "ngram index") so that these could easily be found, with the heuristic that "a frequent way of misspelling words is to transpose two adjacent letters".

Yes, sounds like a good idea. Even though it increases the size of the lookup index, it still beats using the linear search...


--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to