On 05/16/2014 06:05 PM, Tom Lane wrote:
> Quite some time ago, we made the chr() function accept Unicode code points up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8 string. It was pointed out to me though that RFC3629 restricted the original definition of UTF8 to only allow code points up to U+10FFFF (for compatibility with UTF16). While that might not be something we feel we need to follow exactly, pg_utf8_islegal implements the checking algorithm specified by RFC3629, and will therefore reject points above U+10FFFF.
>
> This means you can use chr() to create values that will be rejected on dump and reload:
>
> u8=# create table tt (f1 text);
> CREATE TABLE
> u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
> INSERT 0 1
> u8=# select * from tt;
>  f1
> ----
>
> (1 row)
>
> u8=# \copy tt to 'junk'
> COPY 1
> u8=# \copy tt from 'junk'
> ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
> CONTEXT:  COPY tt, line 1
> LOCATION:  report_invalid_encoding, wchar.c:2011
>
> I think this probably means we need to change chr() to reject code points above 10ffff. Should we back-patch that, or just do it in HEAD?
+1 for back-patching. A value that cannot be restored is bad, and I can't imagine any legitimate use case for producing a Unicode character larger than U+10FFFF with chr(x), when the rest of the system doesn't handle it. Fully supporting such values might be useful, but that's a different story.
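To make the mismatch concrete, here is a rough Python sketch of the arithmetic involved. It is not PostgreSQL code: the helper names (encode_4byte, is_legal_rfc3629) are made up for illustration, and Python's strict UTF-8 decoder is used as a stand-in for an RFC 3629 legality check, on the assumption that it rejects the same out-of-range sequences. It shows that U+1FFFFF fits the 4-byte bit pattern and yields exactly the bytes from the error message, while a validator limited to U+10FFFF refuses them.

```python
# Illustration only: mirrors the arithmetic discussed above,
# not the actual code in PostgreSQL's wchar.c.

MAX_4BYTE = 0x1FFFFF    # largest value that fits the 4-byte UTF-8 bit pattern
MAX_RFC3629 = 0x10FFFF  # upper limit imposed by RFC 3629


def encode_4byte(cp: int) -> bytes:
    """Pack cp into the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx."""
    if not 0x10000 <= cp <= MAX_4BYTE:
        raise ValueError("code point does not use the 4-byte form")
    return bytes([
        0xF0 | (cp >> 18),           # top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # next 6 bits
        0x80 | ((cp >> 6) & 0x3F),   # next 6 bits
        0x80 | (cp & 0x3F),          # low 6 bits
    ])


def is_legal_rfc3629(seq: bytes) -> bool:
    """Stand-in check: Python's strict decoder rejects values above U+10FFFF."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for cp in (MAX_RFC3629, MAX_RFC3629 + 1, MAX_4BYTE):
        seq = encode_4byte(cp)
        print(f"U+{cp:06X} -> {seq.hex(' ')} legal={is_legal_rfc3629(seq)}")
```

Anything chr() emits above U+10FFFF falls in the gap between the two constants: it can be encoded and stored, but fails validation on the way back in.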
- Heikki
