Re: string-likeness
I will second Rick's approach and have implemented something very similar for a client when soundex feel short of expectation. It worked very well. On Mon, Jun 3, 2013 at 5:43 PM, Rick James wrote: > Soundex is the 'right' approach, but it needs improvement. So, find an > improvement, then do something like this... > Store the Soundex value in a column of its own, INDEX that column, and JOIN > on that column using "=". Thus, ... > * You have spent the effort to convert to Soundex once, not on every call. > * Multiple strings will have the same Soundex, but generally not many will > have the same. Hence, the JOIN won't be 1:1, but rather some small number. > > Other approaches (eg, Levenshtein) need both strings in the computation. It > _may_ be possible to work around that by the following. > Let's say you wanted to a "match" if > * one letter was dropped or added or changed, or > * one pair of adjacent letters was swapped. > Then... For a N-letter word, store N+1 rows: > * The word, as is, > * The N words, each shortened by one letter. > Then an equal match on that hacked column will catch single > dropped/added/changed letter with only N+1 matches. > (Minor note: doubled letters make the count less than N+1.) > >> -Original Message- >> From: h...@tbbs.net [mailto:h...@tbbs.net] >> Sent: Monday, June 03, 2013 8:30 AM >> To: mysql@lists.mysql.com >> Subject: string-likeness >> >> I wish to join two tables on likeness, not equality, of character strings. >> Soundex does not work. I am using the Levenstein edit distance, written in >> SQL, a very costly test, and I am in no position to write it in C and link >> it to MySQL--and joining on equality takes a fraction of a second, and >> this takes hours. Any good ideas? >> >> >> -- >> MySQL General Mailing List >> For list archives: http://lists.mysql.com/mysql >> To unsubscribe:http://lists.mysql.com/mysql > > > -- > MySQL General Mailing List > For list archives: http://lists.mysql.com/mysql > To unsubscribe:http://lists.mysql.com/mysql > -- - michael dykman - mdyk...@gmail.com May the Source be with you. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql
RE: string-likeness
Soundex is the 'right' approach, but it needs improvement. So, find an improvement, then do something like this... Store the Soundex value in a column of its own, INDEX that column, and JOIN on that column using "=". Thus, ... * You have spent the effort to convert to Soundex once, not on every call. * Multiple strings will have the same Soundex, but generally not many will have the same. Hence, the JOIN won't be 1:1, but rather some small number. Other approaches (eg, Levenshtein) need both strings in the computation. It _may_ be possible to work around that by the following. Let's say you wanted to a "match" if * one letter was dropped or added or changed, or * one pair of adjacent letters was swapped. Then... For a N-letter word, store N+1 rows: * The word, as is, * The N words, each shortened by one letter. Then an equal match on that hacked column will catch single dropped/added/changed letter with only N+1 matches. (Minor note: doubled letters make the count less than N+1.) > -Original Message- > From: h...@tbbs.net [mailto:h...@tbbs.net] > Sent: Monday, June 03, 2013 8:30 AM > To: mysql@lists.mysql.com > Subject: string-likeness > > I wish to join two tables on likeness, not equality, of character strings. > Soundex does not work. I am using the Levenstein edit distance, written in > SQL, a very costly test, and I am in no position to write it in C and link > it to MySQL--and joining on equality takes a fraction of a second, and > this takes hours. Any good ideas? > > > -- > MySQL General Mailing List > For list archives: http://lists.mysql.com/mysql > To unsubscribe:http://lists.mysql.com/mysql -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql
Re: string-likeness
On 03.06.2013 17:29, h...@tbbs.net wrote: > I wish to join two tables on likeness, not equality, of character strings. > Soundex does not work. I am using the Levenstein edit distance, written in > SQL, a very costly test, and I am in no position to write it in C and link it > to MySQL--and joining on equality takes a fraction of a second, and this > takes hours. Any good ideas? equality checks have a linear cost of O(min(len1,len2)) and can make use of indexes, too, while Levenshtein cost is is almost quadratic O(len1*len2) and can't make any good use of indexes ... even using a C UDF would help only so far with this kind of complexity. It will increase performance by a constant factor, but given long enough input strings the len1*len2 factor will still account for the majority of the run time increase over simple equality comparions there are a few possible points of optimization though, first of all you can cut off equal start and end sequences (linear complexity for that part instead of quadratic). You can also add a few more tricks if you are only interested in matches below a certain distance threshold: * if string lengths differ by more than the threshold value you can rule out this pair of strings as being "similar" right away * while iterating over the distance array keep track of the min. distance value of the current row ... if at the end of a row is larger than the threshold distance you can terminate right away * only calculate operation cost, not operation type * do not maintain a full len1*len2 array, having only the previous and current row in two one dimensional arrays is sufficient (this esp. helps in C implementation as the functions working set is more likely to fit into CPU caches) ... -- Hartmut Holzgraefe Principal Support Engineer (EMEA) SkySQL AB - http://www.skysql.com/ -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql
Re: string-likeness
- Original Message - > From: h...@tbbs.net > > I wish to join two tables on likeness, not equality, of character > strings. Soundex does not work. I am using the Levenstein edit > distance, written in SQL, a very costly test, and I am in no > position to write it in C and link it to MySQL--and joining on > equality takes a fraction of a second, and this takes hours. Any > good ideas? Looks like Sphinx might be a helpful addition to your stack. Way too much to explain how to do so here, but it's a proper fulltext engine, and has a MySQL-compatible interface. -- Unhappiness is discouraged and will be corrected with kitten pictures. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql
string-likeness
I wish to join two tables on likeness, not equality, of character strings. Soundex does not work. I am using the Levenstein edit distance, written in SQL, a very costly test, and I am in no position to write it in C and link it to MySQL--and joining on equality takes a fraction of a second, and this takes hours. Any good ideas? -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql
ANN: Hopper - Stored Routine debugger - v1.3.0 released
ANN: Hopper - Stored Routine debugger - v1.3.0 released Dear ladies and gentlemen, Upscene Productions is proud to announce version 1.3.0 of our product called "Hopper". Hopper is a Windows-based Stored Routine and Trigger Debugger, available for InterBase, Firebird and MySQL. For more information, see http://www.upscene.com/go/?go=hopper This release contains bugfixes and improvements, see http://www.upscene.com/go/?go=news&id=20130603 Bugfixes include, but not limited to: - possible "data type not supported" on specific (param) IS NULL constructor in Firebird or InterBase - (INSERT INTO ...) RETURNING-clause not working in Firebird version - assigning values to output parameters fails using Trace Into in MySQL - possible errors on using built in function in SQL queries in MySQL With regards, Martijn Tonies Upscene Productions http://www.upscene.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql