string-likeness
I wish to join two tables on likeness, not equality, of character strings. Soundex does not work. I am using the Levenshtein edit distance, written in SQL, a very costly test, and I am in no position to write it in C and link it to MySQL--and joining on equality takes a fraction of a second, and this takes hours. Any good ideas?

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql
Re: string-likeness
----- Original Message -----
> From: h...@tbbs.net
>
> I wish to join two tables on likeness, not equality, of character
> strings. Soundex does not work. I am using the Levenshtein edit
> distance, written in SQL, a very costly test, and I am in no
> position to write it in C and link it to MySQL--and joining on
> equality takes a fraction of a second, and this takes hours. Any
> good ideas?

Looks like Sphinx might be a helpful addition to your stack. Way too much to explain how to do so here, but it's a proper fulltext engine, and it has a MySQL-compatible interface.

--
Unhappiness is discouraged and will be corrected with kitten pictures.
Re: string-likeness
On 03.06.2013 17:29, h...@tbbs.net wrote:
> I wish to join two tables on likeness, not equality, of character strings.
> Soundex does not work. I am using the Levenshtein edit distance, written in
> SQL, a very costly test, and I am in no position to write it in C and link it
> to MySQL--and joining on equality takes a fraction of a second, and this
> takes hours. Any good ideas?

Equality checks have a linear cost of O(min(len1,len2)) and can make use of indexes, too, while Levenshtein cost is almost quadratic, O(len1*len2), and can't make any good use of indexes. Even using a C UDF would help only so far with this kind of complexity: it will increase performance by a constant factor, but given long enough input strings the len1*len2 factor will still account for the majority of the run-time increase over simple equality comparisons.

There are a few possible points of optimization, though. First of all, you can cut off equal start and end sequences (linear complexity for that part instead of quadratic). You can also add a few more tricks if you are only interested in matches below a certain distance threshold:

* if string lengths differ by more than the threshold value, you can rule out this pair of strings as being "similar" right away
* while iterating over the distance array, keep track of the minimum distance value of the current row; if at the end of a row it is larger than the threshold distance, you can terminate right away
* only calculate operation cost, not operation type
* do not maintain a full len1*len2 array; having only the previous and current row in two one-dimensional arrays is sufficient (this especially helps in a C implementation, as the function's working set is more likely to fit into CPU caches)

--
Hartmut Holzgraefe, Principal Support Engineer (EMEA)
SkySQL AB - http://www.skysql.com/
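[Editor's note: the optimizations Hartmut lists can be combined in one routine. The following is an illustrative sketch in Python rather than SQL or a C UDF; the function name and threshold convention are made up for the example.]

```python
def bounded_levenshtein(a, b, threshold):
    """Levenshtein distance with the cutoffs described above; returns
    threshold + 1 as soon as the strings cannot be within `threshold`."""
    # Cut off equal start and end sequences (linear, not quadratic).
    while a and b and a[0] == b[0]:
        a, b = a[1:], b[1:]
    while a and b and a[-1] == b[-1]:
        a, b = a[:-1], b[:-1]
    # Length difference alone is a lower bound on the distance.
    if abs(len(a) - len(b)) > threshold:
        return threshold + 1
    # Two one-dimensional rows instead of a full len(a)*len(b) array.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        if min(cur) > threshold:    # row minimum exceeds threshold: give up
            return threshold + 1
        prev = cur
    return prev[-1]
```

For instance, `bounded_levenshtein("Galveston", "Galvstn", 3)` trims the common "Galv" prefix and "n" suffix first, so the quadratic part only sees "esto" versus "st".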
RE: string-likeness
Soundex is the 'right' approach, but it needs improvement. So, find an improvement, then do something like this... Store the Soundex value in a column of its own, INDEX that column, and JOIN on that column using "=". Thus, ...
* You have spent the effort to convert to Soundex once, not on every call.
* Multiple strings will have the same Soundex, but generally not many will have the same. Hence, the JOIN won't be 1:1, but rather some small number.

Other approaches (eg, Levenshtein) need both strings in the computation. It _may_ be possible to work around that by the following. Let's say you wanted a "match" if
* one letter was dropped or added or changed, or
* one pair of adjacent letters was swapped.
Then... For an N-letter word, store N+1 rows:
* The word, as is,
* The N words, each shortened by one letter.
Then an equal match on that hacked column will catch a single dropped/added/changed letter with only N+1 matches. (Minor note: doubled letters make the count less than N+1.)

> -----Original Message-----
> From: h...@tbbs.net [mailto:h...@tbbs.net]
> Sent: Monday, June 03, 2013 8:30 AM
> To: mysql@lists.mysql.com
> Subject: string-likeness
>
> I wish to join two tables on likeness, not equality, of character strings.
> Soundex does not work. I am using the Levenshtein edit distance, written in
> SQL, a very costly test, and I am in no position to write it in C and link
> it to MySQL--and joining on equality takes a fraction of a second, and
> this takes hours. Any good ideas?
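[Editor's note: Rick's N+1-row trick amounts to indexing each word's one-letter-deletion neighborhood. A minimal sketch of the idea, in Python for illustration; the function names are invented here, and in MySQL the variants would live in an indexed column joined with "=".]

```python
def deletion_variants(word):
    """The word itself plus every string obtained by dropping one letter:
    at most N+1 distinct values for an N-letter word (doubled letters
    collapse, per the minor note above)."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def fuzzy_equal(a, b):
    """True when a and b match exactly or differ by one dropped, added,
    or changed letter, or one adjacent swap: any shared variant will do."""
    return bool(deletion_variants(a.upper()) & deletion_variants(b.upper()))
```

An adjacent swap is caught because deleting either swapped letter from both words yields the same string; a changed letter is caught by deleting the differing position from both.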
Re: string-likeness
I will second Rick's approach; I have implemented something very similar for a client when Soundex fell short of expectations. It worked very well.

On Mon, Jun 3, 2013 at 5:43 PM, Rick James wrote:
> Soundex is the 'right' approach, but it needs improvement. So, find an
> improvement, then do something like this...
> Store the Soundex value in a column of its own, INDEX that column, and JOIN
> on that column using "=". Thus, ...
> * You have spent the effort to convert to Soundex once, not on every call.
> * Multiple strings will have the same Soundex, but generally not many will
> have the same. Hence, the JOIN won't be 1:1, but rather some small number.
>
> Other approaches (eg, Levenshtein) need both strings in the computation. It
> _may_ be possible to work around that by the following.
> Let's say you wanted a "match" if
> * one letter was dropped or added or changed, or
> * one pair of adjacent letters was swapped.
> Then... For an N-letter word, store N+1 rows:
> * The word, as is,
> * The N words, each shortened by one letter.
> Then an equal match on that hacked column will catch a single
> dropped/added/changed letter with only N+1 matches.
> (Minor note: doubled letters make the count less than N+1.)

--
- michael dykman
- mdyk...@gmail.com

May the Source be with you.
Re: string-likeness
2013/06/03 18:38 +0200, Hartmut Holzgraefe:
> equality checks have a linear cost of O(min(len1,len2)) and can make use
> of indexes, too, while Levenshtein cost is almost quadratic, O(len1*len2),
> and can't make any good use of indexes ... even using a C UDF would help
> only so far with this kind of complexity. It will increase performance by
> a constant factor, but given long enough input strings the len1*len2
> factor will still account for the majority of the run time increase over
> simple equality comparisons

My set isn't that big (not the hundreds of thousands to which many on this list refer), only big enough to be a pain, and here the constant, between implementing in interpreted SQL with no array, only a temporary table, and compiled C, with a real array, probably matters--except that my C implementation won't happen.

> there are a few possible points of optimization though, first of all you
> can cut off equal start and end sequences (linear complexity for that
> part instead of quadratic). You can also add a few more tricks if you are
> only interested in matches below a certain distance threshold:
>
> * if string lengths differ by more than the threshold value you can rule
>   out this pair of strings as being "similar" right away
> * while iterating over the distance array keep track of the min. distance
>   value of the current row ... if at the end of a row it is larger than
>   the threshold distance you can terminate right away

Didn't think of these ... will have to find a threshold.

> * only calculate operation cost, not operation type
> * do not maintain a full len1*len2 array, having only the previous and
>   current row in two one-dimensional arrays is sufficient (this esp.
>   helps in a C implementation as the function's working set is more
>   likely to fit into CPU caches)

I already do this, because MySQL has no arrays, and I use a small temporary table instead of one linear array.
RE: string-likeness
2013/06/03 21:43 +0000, Rick James:
> Soundex is the 'right' approach, but it needs improvement. So, find an
> improvement, then do something like this...

Hashing involves some kind of normalizing, and in my case I see no means to it; otherwise I would not have considered something so costly. On the other hand, maybe I am comparing lists of place-names, and I want to match, say, any of "Mount Saint Francis" or "MT ST FRANCIS" or "MOUNT ST FRANCIS" or "MT SAINT FRANCIS"--but it is not all standard abbreviations. Sometimes there is "Galvestn" or "Galvston" or "Galvstn" for "Galveston", and it is not always vowel-letter deletion, either: "Ft Benj Harrison", "FT BENJAMIN HARRISON", "Ft Benj Harsn"; "CLVR MIL ACAD", "Culver Milt Acad".

Anyhow, I gave up on a perfect solution, and instead added to each name the name padded with '%'s. On joining, the longer name is used, but instead of the shorter the padded is used after "LIKE", if "LOCATE" also fails to match, and overall the Levenshtein edit distance is used only for a check, with short-circuit "AND" and "OR" supposed (and the timing is such that I believe it is):

    ON (LOCATE(Bookk.Burgh, PO.Burgh) > 0
        OR LOCATE(PO.Burgh, Bookk.Burgh) > 0
        OR CHAR_LENGTH(Bookk.Burgh) > CHAR_LENGTH(PO.Burgh)
           AND Bookk.Burgh LIKE PO.pBurgh
        OR CHAR_LENGTH(Bookk.Burgh) < CHAR_LENGTH(PO.Burgh)
           AND PO.Burgh LIKE Bookk.pBurgh)
       AND mismatch(Bookk.Burgh, PO.Burgh, 1, 2, 1) < 8 IS NOT FALSE

It does not match "MOUNT ST FRANCIS" and "MT SAINT FRANCIS". At least for LOCATE and LIKE there are linear-time algorithms. All along I assumed that in the end some of the mismatching will be handled by hand. It is not that big a list, but doing all by hand is far too much.
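[Editor's note: the poster does not spell out the padding, but one plausible reading is '%' interleaved between all characters, so that the LIKE becomes a subsequence test of the shorter name against the longer. Under that assumption, the LOCATE/LIKE part of the ON clause can be sketched as follows; all function names here are invented for the illustration.]

```python
import re

def pad(name):
    """'%' before, between, and after every character: the padded column.
    Assumption: LIKE on this pattern is effectively a subsequence test."""
    return "%" + "%".join(name) + "%"

def like(value, pattern):
    """Minimal emulation of SQL LIKE for patterns containing only '%'."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value) is not None

def likeness_prefilter(a, b):
    """The LOCATE/LIKE arms of the ON clause: substring either way, or the
    shorter name as a subsequence of the longer via the padded pattern."""
    if a.find(b) >= 0 or b.find(a) >= 0:    # LOCATE(...) > 0, either way
        return True
    if len(a) > len(b):
        return like(a, pad(b))              # longer LIKE padded shorter
    if len(a) < len(b):
        return like(b, pad(a))
    return False                            # equal lengths: no LIKE arm fires
```

This reading is consistent with the poster's observation: "MT ST FRANCIS" is a subsequence of "MOUNT ST FRANCIS" and passes, while "MOUNT ST FRANCIS" and "MT SAINT FRANCIS" are the same length, so neither LIKE arm applies and the pair falls through to hand-checking.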