wade <[EMAIL PROTECTED]> writes:
> I redid my trials with the same data set on 7.2.3 --with-multibyte and I
> get the same brutal performance hit, so it is definitely a
> multibyte-specific problem.
>
> There are only about 1000 words that appear more than once (2 or 3 times)
> in 27k rows.

Right, so the caching of compiled regexps that regexp.c does is of no
help, and any change in its behavior in 7.3 wouldn't have made any
difference anyway. I leapt to a conclusion after reviewing the CVS logs
for pertinent changes, but it was the wrong conclusion.

The true problem is that MULTIBYTE is now forced on, and that causes
some loops in the regexp compiler to change from 256 to 65536
iterations. I believe if you change NC in src/include/regex/utils.h
from its new value of 65536 back to 256, performance will go back where
it was. This will *not* do if you run any multibyte character sets ---
but as long as the database is all ASCII or ISO-8859-whatever, it should
be a safe hack that will let you use 7.3.*.

Rather than trying to band-aid a solution like this in the main sources,
I think I shall go investigate Spencer's new regexp code in Tcl, which
reputedly is designed for wider-than-8-bit chars from the get-go. We've
had it on the TODO list for a long time to assimilate that code; it's
probably time to make it happen.

			regards, tom lane
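
To make the cost of the NC change concrete, here is a toy sketch in C. It is
not code from regexp.c or utils.h: the names NC_SINGLE_BYTE, NC_MULTIBYTE,
fill_charset_map and total_steps are hypothetical, and the loop only models
the general idea of visiting every possible character code once per compiled
pattern. Under that assumption, a workload where almost every pattern is
distinct (so the compiled-regexp cache rarely hits) pays the full loop on
nearly every row.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two values of NC discussed above. */
#define NC_SINGLE_BYTE 256      /* one slot per 8-bit character code */
#define NC_MULTIBYTE   65536    /* one slot per 16-bit character code */

/*
 * Toy model of a "visit every possible character code" loop, such as a
 * regexp compiler might run while building a character-set map.
 * Marks codes in [lo, hi] as members and returns the iteration count.
 */
static size_t
fill_charset_map(unsigned char *map, size_t nc, unsigned lo, unsigned hi)
{
    size_t steps = 0;

    for (size_t c = 0; c < nc; c++) {
        map[c] = (c >= lo && c <= hi) ? 1 : 0;
        steps++;
    }
    return steps;
}

/*
 * Total loop iterations when npatterns patterns are each compiled once,
 * i.e. assuming no hits in any compiled-pattern cache.
 */
static size_t
total_steps(size_t nc, size_t npatterns)
{
    static unsigned char map[NC_MULTIBYTE];   /* large enough for either NC */
    size_t total = 0;

    for (size_t i = 0; i < npatterns; i++)
        total += fill_charset_map(map, nc, 'a', 'z');
    return total;
}
```

Calling total_steps(NC_SINGLE_BYTE, n) and total_steps(NC_MULTIBYTE, n) from a
main() for the same n gives a ratio of 65536 / 256 = 256 in loop iterations,
which is the kind of blow-up the NC change would produce in any loop of this
shape.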