2010/8/2 Alexander Korotkov <aekorot...@gmail.com>: > The dump of the table with russian dictionary is in attachment. > > I use following tests: > SELECT SUM(levenshtein(a, 'foo')) from words; > SELECT SUM(levenshtein(a, 'Urbański')) FROM words; > SELECT SUM(levenshtein(a, 'ańs')) FROM words; > SELECT SUM(levenshtein(a, 'foo')) from words2; > SELECT SUM(levenshtein(a, 'дом')) FROM words2; > SELECT SUM(levenshtein(a, 'компьютер')) FROM words2; > > With your last version of patch results are: > 63ms 94ms 61ms 100ms 121ms 217ms. > > But I found some other optimization. With this condition: > if (x[x_char_len-1] == y[y_char_len-1] && x_char_len == y_char_len && > (x_char_len == 1 || char_same(x, y, x_char_len - 1))) > results become: > 58ms 96ms 63ms 102ms 107ms 171ms > > On single-byte characters results almost didn't change, but they come better > with multi-byte characters. Generally, this improvement is because first > bytes of different characters are frequently same (for example, almost all > Cyrillic characters start from same byte, and I think that there is similar > situation in other alphabets), but match of last bytes is just a > coincidence.
Hey, that's really cool. It helps a lot here, too: My previous patch, with your 5 test cases: Time: 100.556 ms Time: 205.254 ms Time: 127.070 ms Time: 77.734 ms Time: 90.345 ms Time: 166.954 ms Your original patch, same 5 test cases: Time: 131.489 ms Time: 215.048 ms Time: 125.471 ms Time: 80.068 ms Time: 87.110 ms Time: 166.918 ms My patch, with your proposed change to compare the last character first, same 5 test cases: Time: 96.489 ms Time: 201.497 ms Time: 121.876 ms Time: 75.005 ms Time: 80.219 ms Time: 142.844 ms Updated patch attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
mb_levenshtein_rmh-v3.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers