On 02/17/2012 09:39 AM, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakan...@enterprisedb.com>  writes:
Here's a wild idea: keep the class of each codepoint in a hash table.
Initialize it with all codepoints up to 0xFFFF. After that, whenever a
string contains a character that's not in the hash table yet, query the
class of that character, and add it to the hash table. Then recompile
the whole regex and restart the matching engine.
Recompiling is expensive, but if you cache the results for the session,
it would probably be acceptable.
Dunno ... recompiling is so expensive that I can't see this being a win;
not to mention that it would require fundamental surgery on the regex
code.

In the Tcl implementation, no codepoints above U+FFFF have any locale
properties (alpha/digit/punct/etc), period.  Personally I'd not have a
problem imposing the same limitation, so that dealing with stuff above
that range isn't really a consideration anyway.


up to U+FFFF is the BMP which is described as containing "characters for almost all modern languages, and a large number of special characters." It seems very likely to be acceptable not to bother about the locale of code points in the supplementary planes.

See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions of which sets of characters are involved.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to