On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut <pete...@gmx.net> wrote:
> On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
>> Umm, but isn't that because your encoding is using one code point?
>>
>> See the OP's explanation w.r.t. canonical equivalence.
>>
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> Here is a test case that shows the problem (if your terminal can display
> combining characters (xterm appears to work)):
>
> SELECT U&'\00E9', char_length(U&'\00E9');
>  ?column? | char_length
> ----------+-------------
>  é        |           1
> (1 row)
>
> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>  ?column? | char_length
> ----------+-------------
>  é        |           2
> (1 row)
What's really at issue is "what is a string?" That is, is it a sequence of characters or a sequence of code points? If it's the former then we would also have to prohibit certain strings such as U&'\0301' entirely. And we would have to make substr() pick out the right number of code points, etc.

-- 
greg
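For what it's worth, the same distinction is easy to demonstrate outside SQL. A small Python sketch (stdlib only; the variable names are just for illustration) mirrors the char_length() results above, and shows how NFC normalization collapses the combining sequence to a single code point:

```python
import unicodedata

# Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE
precomposed = "\u00e9"
# Canonically equivalent decomposed form: 'e' (U+0065)
# followed by COMBINING ACUTE ACCENT (U+0301)
decomposed = "\u0065\u0301"

# Both display as the same user-perceived character, but the
# code-point counts differ -- the same behavior char_length() shows.
print(len(precomposed))  # 1
print(len(decomposed))   # 2

# NFC normalization maps the decomposed sequence onto the
# precomposed code point, so the counts agree afterward.
normalized = unicodedata.normalize("NFC", decomposed)
print(len(normalized))   # 1
```

If strings are defined as sequences of characters rather than code points, something like this normalization step would have to happen (or be assumed) before length and substring operations make sense.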