On Dec 20, 7:07 pm, Øyvind <oyvin...@gmail.com> wrote: > Thanks for the useful comments. > > On 20 Des, 01:38, John Machin <sjmac...@lexicon.net> wrote: > > > On Dec 20, 10:02 am, Øyvind <oyvin...@gmail.com> wrote: > > > > Based on examples and formulas > > > fromhttp://en.wikipedia.org/wiki/Jaro-Winkler. > > > For another Python implementation, google "febrl". > > > > Useful for measuring similarity between two strings. For example if > > > you want to detect that the user did a typo. > > > You mean like comparing the user's input word with some collection of > > valid words? You would need to be using something else as a quick-and- > > dirty filter ... Jaro-Winkler is relatively slow. > > Do you have any alternative suggestions?
Use a Levenshtein or Damerau edit distance. The latter handles adjacent transpositions (...AB...->...BA...) which covers well over 90% of the transposition errors I've ever seen. The remainder e.g. ...AXB...->...BXA... are given the same weight by Jaro-Winkler; this is IMHO silly. Read a little bit further than the CS101 level on Levenshtein edit distance and ignore any implementation which calculates all N*M cells of a dynamic programming matrix -- in the case where the objective is to calculate the distance except that once the distance is known to be over a threshold you don't care how big it is (you will reject a match if the distance is too big), then you only need to fill in the entries in a narrow band astride the trailing diagonal, and you can reliably give up if the current distance on the diagonal becomes too high. Ukkonen's the man in this field. It gets better. You can forget the matrix when the word lengths are less than the number of bits in your computer word (even 32 bits is enough for most words in most languages). The bit-parallel techniques include Damerau distance. This paper by Heikki Hyyrö is well worth reading, and refers to a whole lot of previous work, including Ukkonen's: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.2242 And you can forget about all the fancy techniques when the word lengths are so different that they can't be the same (e.g. Fu and Featherstonehaugh) -- or the shorter should perhaps be checked to see if it's a plausible abbreviation or contraction of the longer. > > Also as the Wikipedia article says, > > it's not a metric. I.e. it doesn't satisfy dist(a, c) <= dist(a, b) + > > dist(b, c). > > Its not a mathematical correct metric, but it is a useful heuristical > metric. Useful for what? IMHO: If it doesn't satisfy the symmetry condition and the triangle rule, it's just plain dodgy. J-W is dodgy and slow. Might as well use soundex. End of story. > > > The above code is not symmetrical; jarow_m(s1, s2) does not > > necessarily equal jarow_m(s2, s1). The article talks about one "m", > > not two of them. > > Hmm.. also a good point. I will make it count all the matches, not > just the matches in s1. The whole Jaro definition is woolly ... what is a match when you have more than one occurrence of the same letter? Have a look at what the matches might be between MISSISSIPPI and a variant with an extra I in the middle e.g MISSISISIPPI. Cheers, John -- http://mail.python.org/mailman/listinfo/python-list