string-likeness
I wish to join two tables on likeness, not equality, of character strings. Soundex does not work. I am using the Levenshtein edit distance, written in SQL, a very costly test, and I am in no position to write it in C and link it to MySQL--and joining on equality takes a fraction of a second, and this takes hours. Any good ideas?

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql
Re: string-likeness
----- Original Message -----
> From: h...@tbbs.net
>
> I wish to join two tables on likeness, not equality, of character
> strings. Soundex does not work. I am using the Levenshtein edit
> distance, written in SQL, a very costly test, and I am in no
> position to write it in C and link it to MySQL--and joining on
> equality takes a fraction of a second, and this takes hours. Any
> good ideas?

Looks like Sphinx might be a helpful addition to your stack. Way too much to explain how to do so here, but it's a proper fulltext engine, and it has a MySQL-compatible interface.

--
Unhappiness is discouraged and will be corrected with kitten pictures.
Re: string-likeness
On 03.06.2013 17:29, h...@tbbs.net wrote:
> I wish to join two tables on likeness, not equality, of character strings.
> Soundex does not work. I am using the Levenshtein edit distance, written in
> SQL, a very costly test, and I am in no position to write it in C and link it
> to MySQL--and joining on equality takes a fraction of a second, and this
> takes hours. Any good ideas?

Equality checks have a linear cost of O(min(len1,len2)) and can make use of indexes, too, while Levenshtein cost is almost quadratic, O(len1*len2), and can't make any good use of indexes. Even using a C UDF would help only so far with this kind of complexity: it will increase performance by a constant factor, but given long enough input strings the len1*len2 factor will still account for the majority of the run-time increase over simple equality comparisons.

There are a few possible points of optimization, though. First of all, you can cut off equal start and end sequences (linear complexity for that part instead of quadratic). You can also add a few more tricks if you are only interested in matches below a certain distance threshold:

* if string lengths differ by more than the threshold value, you can rule out this pair of strings as being "similar" right away
* while iterating over the distance array, keep track of the minimum distance value of the current row; if at the end of a row it is larger than the threshold distance, you can terminate right away
* only calculate operation cost, not operation type
* do not maintain a full len1*len2 array; having only the previous and current row in two one-dimensional arrays is sufficient (this especially helps in a C implementation, as the function's working set is more likely to fit into CPU caches)

--
Hartmut Holzgraefe, Principal Support Engineer (EMEA)
SkySQL AB - http://www.skysql.com/
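[Editor's note: the optimizations Hartmut lists can be combined in one routine. The following is an illustrative sketch in Python rather than SQL or a C UDF; the function name and threshold convention are made up for the example.]

```python
def bounded_levenshtein(a, b, threshold):
    """Levenshtein distance with the cutoffs described above; returns
    threshold + 1 as soon as the strings cannot be within `threshold`."""
    # Cut off equal start and end sequences (linear, not quadratic).
    while a and b and a[0] == b[0]:
        a, b = a[1:], b[1:]
    while a and b and a[-1] == b[-1]:
        a, b = a[:-1], b[:-1]
    # Length difference alone is a lower bound on the distance.
    if abs(len(a) - len(b)) > threshold:
        return threshold + 1
    # Two one-dimensional rows instead of a full len(a)*len(b) array.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        if min(cur) > threshold:    # row minimum exceeds threshold: give up
            return threshold + 1
        prev = cur
    return prev[-1]
```

For instance, `bounded_levenshtein("Galveston", "Galvstn", 3)` trims the common "Galv" prefix and "n" suffix first, so the quadratic part only sees "esto" versus "st".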
RE: string-likeness
Soundex is the 'right' approach, but it needs improvement. So, find an improvement, then do something like this... Store the Soundex value in a column of its own, INDEX that column, and JOIN on that column using "=". Thus, ...
* You have spent the effort to convert to Soundex once, not on every call.
* Multiple strings will have the same Soundex, but generally not many will have the same. Hence, the JOIN won't be 1:1, but rather some small number.

Other approaches (eg, Levenshtein) need both strings in the computation. It _may_ be possible to work around that by the following. Let's say you wanted a "match" if
* one letter was dropped or added or changed, or
* one pair of adjacent letters was swapped.
Then... For an N-letter word, store N+1 rows:
* The word, as is,
* The N words, each shortened by one letter.
Then an equal match on that hacked column will catch a single dropped/added/changed letter with only N+1 matches. (Minor note: doubled letters make the count less than N+1.)

> -----Original Message-----
> From: h...@tbbs.net [mailto:h...@tbbs.net]
> Sent: Monday, June 03, 2013 8:30 AM
> To: mysql@lists.mysql.com
> Subject: string-likeness
>
> I wish to join two tables on likeness, not equality, of character strings.
> Soundex does not work. I am using the Levenshtein edit distance, written in
> SQL, a very costly test, and I am in no position to write it in C and link
> it to MySQL--and joining on equality takes a fraction of a second, and
> this takes hours. Any good ideas?
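[Editor's note: Rick's N+1-row trick amounts to indexing each word's one-letter-deletion neighborhood. A minimal sketch of the idea, in Python for illustration; the function names are invented here, and in MySQL the variants would live in an indexed column joined with "=".]

```python
def deletion_variants(word):
    """The word itself plus every string obtained by dropping one letter:
    at most N+1 distinct values for an N-letter word (doubled letters
    collapse, per the minor note above)."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def fuzzy_equal(a, b):
    """True when a and b match exactly or differ by one dropped, added,
    or changed letter, or one adjacent swap: any shared variant will do."""
    return bool(deletion_variants(a.upper()) & deletion_variants(b.upper()))
```

An adjacent swap is caught because deleting either swapped letter from both words yields the same string; a changed letter is caught by deleting the differing position from both.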
Re: string-likeness
I will second Rick's approach; I have implemented something very similar for a client when Soundex fell short of expectations. It worked very well.

On Mon, Jun 3, 2013 at 5:43 PM, Rick James wrote:
> Soundex is the 'right' approach, but it needs improvement. So, find an
> improvement, then do something like this...
> Store the Soundex value in a column of its own, INDEX that column, and JOIN
> on that column using "=". Thus, ...
> * You have spent the effort to convert to Soundex once, not on every call.
> * Multiple strings will have the same Soundex, but generally not many will
> have the same. Hence, the JOIN won't be 1:1, but rather some small number.
>
> Other approaches (eg, Levenshtein) need both strings in the computation. It
> _may_ be possible to work around that by the following.
> Let's say you wanted a "match" if
> * one letter was dropped or added or changed, or
> * one pair of adjacent letters was swapped.
> Then... For an N-letter word, store N+1 rows:
> * The word, as is,
> * The N words, each shortened by one letter.
> Then an equal match on that hacked column will catch a single
> dropped/added/changed letter with only N+1 matches.
> (Minor note: doubled letters make the count less than N+1.)

--
- michael dykman
- mdyk...@gmail.com

May the Source be with you.
Re: string-likeness
2013/06/03 18:38 +0200, Hartmut Holzgraefe:
> equality checks have a linear cost of O(min(len1,len2)) and can make use
> of indexes, too, while Levenshtein cost is almost quadratic, O(len1*len2),
> and can't make any good use of indexes ... even using a C UDF would help
> only so far with this kind of complexity. It will increase performance by
> a constant factor, but given long enough input strings the len1*len2
> factor will still account for the majority of the run time increase over
> simple equality comparisons

My set isn't that big (not the hundreds of thousands to which many on this list refer), only big enough to be a pain, and here the constant, between implementing in interpreted SQL with no array, only a temporary table, and compiled C, with a real array, probably matters--except that my C implementation won't happen.

> there are a few possible points of optimization though, first of all you
> can cut off equal start and end sequences (linear complexity for that
> part instead of quadratic). You can also add a few more tricks if you are
> only interested in matches below a certain distance threshold:
>
> * if string lengths differ by more than the threshold value you can rule
>   out this pair of strings as being "similar" right away
> * while iterating over the distance array keep track of the min. distance
>   value of the current row ... if at the end of a row it is larger than
>   the threshold distance you can terminate right away

Didn't think of these ... will have to find a threshold.

> * only calculate operation cost, not operation type
> * do not maintain a full len1*len2 array, having only the previous and
>   current row in two one-dimensional arrays is sufficient (this esp.
>   helps in a C implementation as the function's working set is more
>   likely to fit into CPU caches)

I already do this, because MySQL has no arrays, and I use a small temporary table instead of one linear array.
RE: string-likeness
2013/06/03 21:43 +0000, Rick James:
> Soundex is the 'right' approach, but it needs improvement. So, find an
> improvement, then do something like this...

Hashing involves some kind of normalizing, and in my case I see no means to it; otherwise I would not have considered something so costly. On the other hand, maybe I am comparing lists of place-names, and I want to match, say, any of "Mount Saint Francis" or "MT ST FRANCIS" or "MOUNT ST FRANCIS" or "MT SAINT FRANCIS"--but it is not all standard abbreviations. Sometimes there is "Galvestn" or "Galvston" or "Galvstn" for "Galveston", and it is not always vowel-letter deletion, either: "Ft Benj Harrison", "FT BENJAMIN HARRISON", "Ft Benj Harsn"; "CLVR MIL ACAD", "Culver Milt Acad".

Anyhow, I gave up on a perfect solution, and instead added to each name the name padded with '%'s. On joining, the longer name is used, but instead of the shorter the padded is used after "LIKE", if "LOCATE" also fails to match, and overall the Levenshtein edit distance is used only for a check, with short-circuit "AND" and "OR" supposed (and the timing is such that I believe it is):

    ON (LOCATE(Bookk.Burgh, PO.Burgh) > 0
        OR LOCATE(PO.Burgh, Bookk.Burgh) > 0
        OR CHAR_LENGTH(Bookk.Burgh) > CHAR_LENGTH(PO.Burgh)
           AND Bookk.Burgh LIKE PO.pBurgh
        OR CHAR_LENGTH(Bookk.Burgh) < CHAR_LENGTH(PO.Burgh)
           AND PO.Burgh LIKE Bookk.pBurgh)
       AND mismatch(Bookk.Burgh, PO.Burgh, 1, 2, 1) < 8 IS NOT FALSE

It does not match "MOUNT ST FRANCIS" and "MT SAINT FRANCIS". At least for LOCATE and LIKE there are linear-time algorithms. All along I assumed that in the end some of the mismatching will be handled by hand. It is not that big a list, but doing all by hand is far too much.
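[Editor's note: the poster does not spell out the padding, but one plausible reading is '%' interleaved between all characters, so that the LIKE becomes a subsequence test of the shorter name against the longer. Under that assumption, the LOCATE/LIKE part of the ON clause can be sketched as follows; all function names here are invented for the illustration.]

```python
import re

def pad(name):
    """'%' before, between, and after every character: the padded column.
    Assumption: LIKE on this pattern is effectively a subsequence test."""
    return "%" + "%".join(name) + "%"

def like(value, pattern):
    """Minimal emulation of SQL LIKE for patterns containing only '%'."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value) is not None

def likeness_prefilter(a, b):
    """The LOCATE/LIKE arms of the ON clause: substring either way, or the
    shorter name as a subsequence of the longer via the padded pattern."""
    if a.find(b) >= 0 or b.find(a) >= 0:    # LOCATE(...) > 0, either way
        return True
    if len(a) > len(b):
        return like(a, pad(b))              # longer LIKE padded shorter
    if len(a) < len(b):
        return like(b, pad(a))
    return False                            # equal lengths: no LIKE arm fires
```

This reading is consistent with the poster's observation: "MT ST FRANCIS" is a subsequence of "MOUNT ST FRANCIS" and passes, while "MOUNT ST FRANCIS" and "MT SAINT FRANCIS" are the same length, so neither LIKE arm applies and the pair falls through to hand-checking.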