>
> We can have an information retrieval API for approximate string matching, e.g.
> Levenshtein distance (already implemented, in various versions) and Hamming
> distance; these two are the simplest and most widely used edit distances.
> Then you have longest common subsequence and longest common substring (they are
> implemented in a package called "Fuzz", #longestCommonSubsequenceWith: ).
> There is also shift-or adapted for approximate matches (also
> implemented); fuzzy phrasing is another world again. Many applications use
> Damerau edit distance. Bioinformatics uses Needleman-Wunsch and
> Smith-Waterman, but they call them "aligners" :) but you don't want to code
> the optimized version in Smalltalk, some say it could take years.
> All edit distances out there have specific requirements and no one is better
> than another for all cases. For example, Jaro-Winkler is useful for one-word
> short strings.
>
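To make the two simplest distances named above concrete, here is a minimal sketch. It is written in Python rather than Smalltalk purely for illustration. It is not the code from the "Fuzz" package or from any existing Pharo implementation, and the function names `hamming` and `levenshtein` are made up for this example.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (two-row dynamic programming)."""
    # Keep the shorter string in b so each row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]
```

Hamming distance only compares strings of equal length position by position, which is why it is cheaper but less general than Levenshtein distance. The unoptimized Levenshtein version above runs in time proportional to the product of the two string lengths.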
I’m not sure that all these edit distances should be part of the String core API.
What would be good, though, is to have a chapter describing them.
This chapter would work well with the bioSmalltalk one :)

> You have a lot of options for research. Smalltalkers here are very
> experienced and clever, and always give cool advice, so don't be afraid to ask.
>
> Cheers,
>
> Hernán
>
>
> --
> Cheers,
> Daniela Meneses
>