On 03.06.2013 17:29, h...@tbbs.net wrote: > I wish to join two tables on likeness, not equality, of character strings. > Soundex does not work. I am using the Levenstein edit distance, written in > SQL, a very costly test, and I am in no position to write it in C and link it > to MySQL--and joining on equality takes a fraction of a second, and this > takes hours. Any good ideas?
equality checks have a linear cost of O(min(len1,len2)) and can make use of indexes, too, while Levenshtein cost is is almost quadratic O(len1*len2) and can't make any good use of indexes ... even using a C UDF would help only so far with this kind of complexity. It will increase performance by a constant factor, but given long enough input strings the len1*len2 factor will still account for the majority of the run time increase over simple equality comparions there are a few possible points of optimization though, first of all you can cut off equal start and end sequences (linear complexity for that part instead of quadratic). You can also add a few more tricks if you are only interested in matches below a certain distance threshold: * if string lengths differ by more than the threshold value you can rule out this pair of strings as being "similar" right away * while iterating over the distance array keep track of the min. distance value of the current row ... if at the end of a row is larger than the threshold distance you can terminate right away * only calculate operation cost, not operation type * do not maintain a full len1*len2 array, having only the previous and current row in two one dimensional arrays is sufficient (this esp. helps in C implementation as the functions working set is more likely to fit into CPU caches) ... -- Hartmut Holzgraefe <hart...@skysql.com> Principal Support Engineer (EMEA) SkySQL AB - http://www.skysql.com/ -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/mysql