Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

Heikki Linnakangas Fri, 17 Feb 2012 00:49:33 -0800

On 16.02.2012 01:06, Tom Lane wrote:

In bug #6457 it's pointed out that we *still* don't have full
functionality for locale-dependent regexp behavior with UTF8 encoding.
The reason is that there's old crufty code in regc_locale.c that only
considers character codes up to 255 when searching for characters that
should be considered "letters", "digits", etc.  We could fix that, for
some value of "fix", by iterating up to perhaps 0xFFFF when dealing with
UTF8 encoding, but the time that would take is unappealing.  Especially
so considering that this code is executed afresh anytime we compile a
regex that requires locale knowledge.


I looked into the upstream Tcl code and observed that they deal with
this by having hard-wired tables of which Unicode code points are to be
considered letters etc.  The tables are directly traceable to the
Unicode standard (they provide a script to regenerate them from files
available from unicode.org).  Nonetheless, I do not find that approach
appealing, mainly because we'd be risking deviating from the libc locale
code's behavior within regexes when we follow it everywhere else.
It seems entirely likely to me that a particular locale setting might
consider only some of what Unicode says are letters to be letters.

However, we could possibly compromise by using Unicode-derived tables
as a guide to which code points are worth probing libc for.  That is,
assume that a utf8-based locale will never claim that some code is a
letter that unicode.org doesn't think is a letter.  That would cut the
number of required probes by a pretty large factor.

The other thing that seems worth doing is to install some caching.
We could presumably assume that the behavior of iswupper() et al are
fixed for the duration of a database session, so that we only need to
run the probe loop once when first asked to create a cvec for a
particular category.
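
The two ideas above (probe libc only for Unicode-suggested candidates, and cache the result per session) might be combined roughly as follows. This is only a sketch under stated assumptions: `cp_range`, `alpha_candidates`, `class_cache`, `get_alpha_class` and `probe_runs` are hypothetical names, the tiny range table stands in for one that would be generated from unicode.org data, and it assumes `wint_t` values are Unicode code points, as they are with glibc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <wctype.h>

/* Hypothetical candidate table: ranges that Unicode considers letters.
 * A real table would be generated from unicode.org data files. */
typedef struct { wint_t lo, hi; } cp_range;

static const cp_range alpha_candidates[] = {
    {0x0041, 0x005A},   /* A-Z */
    {0x0061, 0x007A},   /* a-z */
    {0x00C0, 0x00D6},   /* Latin-1 letters, skipping U+00D7 */
    {0x00D8, 0x00F6},   /* Latin-1 letters, skipping U+00F7 */
    {0x00F8, 0x00FF},
};

/* Per-category cache, filled on first use and kept for the session. */
typedef struct {
    bool    built;
    size_t  nchars;
    wint_t *chars;      /* code points that libc confirmed as letters */
} class_cache;

static class_cache alpha_cache;
static int probe_runs = 0;      /* how many times the probe loop actually ran */

static const class_cache *get_alpha_class(void)
{
    if (!alpha_cache.built) {
        size_t nranges = sizeof(alpha_candidates) / sizeof(alpha_candidates[0]);
        size_t cap = 0;

        for (size_t i = 0; i < nranges; i++)
            cap += alpha_candidates[i].hi - alpha_candidates[i].lo + 1;

        alpha_cache.chars = malloc(cap * sizeof(wint_t));
        assert(alpha_cache.chars != NULL);

        /* Probe libc only for the candidates; libc has the final say. */
        for (size_t i = 0; i < nranges; i++)
            for (wint_t c = alpha_candidates[i].lo; c <= alpha_candidates[i].hi; c++)
                if (iswalpha(c))
                    alpha_cache.chars[alpha_cache.nchars++] = c;

        alpha_cache.built = true;
        probe_runs++;
    }
    return &alpha_cache;
}
```

Calling `get_alpha_class()` a second time returns the same cached structure without re-running the probe loop. Which of the Latin-1 candidates survive depends on the active locale.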

Thoughts, better ideas?

Here's a wild idea: keep the class of each codepoint in a hash table.
Initialize it with all codepoints up to 0xFFFF. After that, whenever a
string contains a character that's not in the hash table yet, query the
class of that character, and add it to the hash table. Then recompile
the whole regex and restart the matching engine.

Recompiling is expensive, but if you cache the results for the session,
it would probably be acceptable.
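
To make the idea a bit more concrete, here is a rough sketch of the table part only. Everything in it is an assumption for illustration, not existing code: the names (`table_init`, `table_note_string`, `table_generation`, and so on), the fixed-size open-addressing table, the three-bit class mask, and the premise that `wint_t` carries the Unicode code point. A compiled regex would remember the generation it was built against and be recompiled when that number changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

#define TABLE_SIZE (1u << 18)   /* power of two, well above the 65536 preloaded entries */
#define EMPTY      UINT32_MAX   /* never a valid code point */

enum { CLS_ALPHA = 1, CLS_DIGIT = 2, CLS_SPACE = 4 };

static uint32_t keys[TABLE_SIZE];
static uint8_t  classes[TABLE_SIZE];
static size_t   table_count;
static unsigned table_generation;   /* bumped when a string adds new entries */

/* Ask libc for the class of one code point. */
static uint8_t query_class(uint32_t cp)
{
    uint8_t cls = 0;
    if (iswalpha((wint_t) cp)) cls |= CLS_ALPHA;
    if (iswdigit((wint_t) cp)) cls |= CLS_DIGIT;
    if (iswspace((wint_t) cp)) cls |= CLS_SPACE;
    return cls;
}

/* Linear probing: returns the slot holding cp, or the empty slot where it belongs. */
static size_t slot_for(uint32_t cp)
{
    size_t i = (cp * 2654435761u) & (TABLE_SIZE - 1);
    while (keys[i] != EMPTY && keys[i] != cp)
        i = (i + 1) & (TABLE_SIZE - 1);
    return i;
}

/* Returns true if cp was not in the table yet. */
static bool table_add(uint32_t cp)
{
    size_t i = slot_for(cp);
    if (keys[i] == cp)
        return false;
    assert(table_count < TABLE_SIZE - 1);   /* a real table would grow instead */
    keys[i] = cp;
    classes[i] = query_class(cp);
    table_count++;
    return true;
}

static void table_init(void)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        keys[i] = EMPTY;
    table_count = 0;
    for (uint32_t cp = 0; cp <= 0xFFFF; cp++)
        table_add(cp);
    table_generation = 0;
}

/* Returns true if the string contained unseen code points; regexes compiled
 * against an older generation are then stale and need recompiling. */
static bool table_note_string(const uint32_t *s, size_t len)
{
    bool grew = false;
    for (size_t i = 0; i < len; i++)
        if (table_add(s[i]))
            grew = true;
    if (grew)
        table_generation++;
    return grew;
}
```

With this, a string made only of codepoints up to 0xFFFF never triggers a recompile, and a given higher codepoint triggers one only the first time it is seen in the session.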


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
