Re: [HACKERS] multibyte charater set in levenshtein function

Robert Haas Mon, 02 Aug 2010 13:53:31 -0700

2010/8/2 Alexander Korotkov <[email protected]>:
> The dump of the table with russian dictionary is in attachment.
>
> I use following tests:
> SELECT SUM(levenshtein(a, 'foo')) from words;
> SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
> SELECT SUM(levenshtein(a, 'ańs')) FROM words;
> SELECT SUM(levenshtein(a, 'foo')) from words2;
> SELECT SUM(levenshtein(a, 'дом')) FROM words2;
> SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
>
> With your last version of patch results are:
> 63ms 94ms 61ms 100ms 121ms 217ms.
>
> But I found some other optimization. With this condition:
> if (x[x_char_len-1] == y[y_char_len-1] && x_char_len == y_char_len &&
>     (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
> results become:
> 58ms 96ms 63ms 102ms 107ms 171ms
>
> On single-byte characters results almost didn't change, but they come better
> with multi-byte characters. Generally, this improvement is because first
> bytes of different characters are frequently same (for example, almost all
> Cyrillic characters start from same byte, and I think that there is similar
> situation in other alphabets), but match of last bytes is just a
> coincidence.


Hey, that's really cool.  It helps a lot here, too:

My previous patch, with your 5 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms

Your original patch, same 5 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms

My patch, with your proposed change to compare the last character
first, same 5 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms

Updated patch attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

mb_levenshtein_rmh-v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multibyte charater set in levenshtein function

Reply via email to