Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

Andrew Dunstan Fri, 17 Feb 2012 06:56:59 -0800


On 02/17/2012 09:39 AM, Tom Lane wrote:

Heikki Linnakangas <[email protected]> writes:

Here's a wild idea: keep the class of each codepoint in a hash table.
Initialize it with all codepoints up to 0xFFFF. After that, whenever a
string contains a character that's not in the hash table yet, query the
class of that character, and add it to the hash table. Then recompile
the whole regex and restart the matching engine.
Recompiling is expensive, but if you cache the results for the session,
it would probably be acceptable.
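
A minimal sketch of the lookup side of this idea, in C. Everything named here (ClassCache, cache_lookup, the CC_* bits) is hypothetical and not taken from the regex code. It assumes a 32-bit wchar_t and uses the <wctype.h> functions as a stand-in for whatever locale query would really be used. For brevity it keeps the preloaded range up to 0xFFFF in a flat array and uses a small open-addressing hash table only for code points above that. The "added" flag marks the point where a caller would recompile the regex and restart matching; that step is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical class bits; not the regex engine's own representation. */
#define CC_ALPHA 0x01
#define CC_DIGIT 0x02
#define CC_PUNCT 0x04
#define CC_SPACE 0x08
#define CC_KNOWN 0x80           /* set on every filled entry */

#define BMP_SIZE    0x10000     /* code points 0x0000..0xFFFF */
#define EXTRA_SLOTS 1024        /* power of two, for code points above 0xFFFF */

typedef struct
{
    uint8_t  bmp[BMP_SIZE];             /* preloaded at init time */
    uint32_t extra_cp[EXTRA_SLOTS];     /* keys of the hash table */
    uint8_t  extra_class[EXTRA_SLOTS];  /* 0 means "empty slot" */
    int      extra_used;
} ClassCache;

/* Ask the C library for the class of one code point (assumes 32-bit wchar_t). */
static uint8_t
query_class(uint32_t cp)
{
    uint8_t cls = CC_KNOWN;

    if (iswalpha((wint_t) cp)) cls |= CC_ALPHA;
    if (iswdigit((wint_t) cp)) cls |= CC_DIGIT;
    if (iswpunct((wint_t) cp)) cls |= CC_PUNCT;
    if (iswspace((wint_t) cp)) cls |= CC_SPACE;
    return cls;
}

/* Initialize the cache with all code points up to 0xFFFF. */
static void
cache_init(ClassCache *cache)
{
    memset(cache, 0, sizeof(*cache));
    for (uint32_t cp = 0; cp < BMP_SIZE; cp++)
        cache->bmp[cp] = query_class(cp);
}

/*
 * Return the class of cp.  *added is set to true when cp was not cached
 * yet; that is where the caller would recompile and restart.
 */
static uint8_t
cache_lookup(ClassCache *cache, uint32_t cp, bool *added)
{
    uint32_t slot;

    *added = false;
    if (cp < BMP_SIZE)
        return cache->bmp[cp];

    /* Linear probing; the multiplier is just a common hash constant. */
    slot = (cp * 2654435761u) & (EXTRA_SLOTS - 1);
    while (cache->extra_class[slot] != 0)
    {
        if (cache->extra_cp[slot] == cp)
            return cache->extra_class[slot];
        slot = (slot + 1) & (EXTRA_SLOTS - 1);
    }

    /* Always keep one slot empty so the probe loop terminates. */
    assert(cache->extra_used < EXTRA_SLOTS - 1);
    cache->extra_cp[slot] = cp;
    cache->extra_class[slot] = query_class(cp);
    cache->extra_used++;
    *added = true;
    return cache->extra_class[slot];
}
```

The first lookup of a code point above 0xFFFF sets "added", and later lookups of the same code point do not, so the recompile cost would be paid at most once per new character per session.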

Dunno ... recompiling is so expensive that I can't see this being a win;
not to mention that it would require fundamental surgery on the regex
code.

In the Tcl implementation, no codepoints above U+FFFF have any locale
properties (alpha/digit/punct/etc), period.  Personally I'd not have a
problem imposing the same limitation, so that dealing with stuff above
that range isn't really a consideration anyway.

Up to U+FFFF is the BMP, which is described as containing "characters for almost all modern languages, and a large number of special characters." It seems very likely to be acceptable not to bother about the locale of code points in the supplementary planes.

See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions of which sets of characters are involved.
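
A rough sketch in C of what that limitation could look like. The names (cp_in_bmp, cp_isalpha, cp_isdigit) are made up for illustration and are not from the regex code or from Tcl. It assumes a 32-bit wchar_t and that <wctype.h> is an acceptable source of the properties. Anything above U+FFFF simply reports no locale properties.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <wctype.h>

#define MAX_LOCALE_CODEPOINT 0xFFFF     /* last code point of the BMP */
#define MAX_UNICODE_CODEPOINT 0x10FFFF  /* last valid Unicode code point */

/* True if cp lies in the Basic Multilingual Plane. */
static bool
cp_in_bmp(uint32_t cp)
{
    assert(cp <= MAX_UNICODE_CODEPOINT);
    return cp <= MAX_LOCALE_CODEPOINT;
}

/* Alphabetic per the current locale, but only inside the BMP. */
static bool
cp_isalpha(uint32_t cp)
{
    return cp_in_bmp(cp) && iswalpha((wint_t) cp) != 0;
}

/* Digit per the current locale, but only inside the BMP. */
static bool
cp_isdigit(uint32_t cp)
{
    return cp_in_bmp(cp) && iswdigit((wint_t) cp) != 0;
}
```

With this shape, a character such as U+1D400 (a mathematical bold letter in the supplementary planes) is never alpha, whatever the locale says about it.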



cheers

andrew



--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
