On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:
On 05/16/2014 06:05 PM, Tom Lane wrote:
Quite some time ago, we made the chr() function accept Unicode code points up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string.  It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16).  While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.
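
For illustration, here is a minimal standalone sketch of the kind of
per-sequence check RFC3629 specifies -- this is not PostgreSQL's actual
pg_utf8_islegal, just the same rules written out. The rule that matters
for this thread: a 4-byte sequence may not begin with a byte above 0xF4,
which is what caps the encodable range at U+10FFFF.

#include <stdbool.h>

static bool
utf8_sequence_is_legal(const unsigned char *s, int len)
{
    switch (len)
    {
        case 1:
            return s[0] <= 0x7F;
        case 2:
            return s[0] >= 0xC2 && s[0] <= 0xDF &&
                   s[1] >= 0x80 && s[1] <= 0xBF;
        case 3:
            if (s[1] < 0x80 || s[1] > 0xBF ||
                s[2] < 0x80 || s[2] > 0xBF)
                return false;
            if (s[0] == 0xE0)
                return s[1] >= 0xA0;    /* no overlong forms */
            if (s[0] == 0xED)
                return s[1] <= 0x9F;    /* no UTF-16 surrogates */
            return s[0] >= 0xE1 && s[0] <= 0xEF;
        case 4:
            if (s[1] < 0x80 || s[1] > 0xBF ||
                s[2] < 0x80 || s[2] > 0xBF ||
                s[3] < 0x80 || s[3] > 0xBF)
                return false;
            if (s[0] == 0xF0)
                return s[1] >= 0x90;    /* no overlong forms */
            if (s[0] == 0xF4)
                return s[1] <= 0x8F;    /* cap at U+10FFFF */
            return s[0] >= 0xF1 && s[0] <= 0xF3;   /* 0xF5..0xF7 illegal */
        default:
            return false;
    }
}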

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr(x'001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
  f1
----

(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011
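
Those rejected bytes are exactly the 4-byte UTF8 encoding of U+1FFFFF.
A quick standalone C sketch (an illustration of the standard encoding
arithmetic, not PostgreSQL source) shows why the lead byte comes out as
0xF7, above the 0xF4 ceiling RFC3629 allows:

#include <stdio.h>
#include <stdint.h>

/* Standard 4-byte UTF-8 encoding for U+10000..U+1FFFFF (the old,
 * pre-RFC3629 ceiling); illustration only. */
static void
utf8_encode4(uint32_t cp, unsigned char out[4])
{
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
}

int
main(void)
{
    unsigned char buf[4];

    utf8_encode4(0x1FFFFF, buf);
    /* Prints "f7 bf bf bf": the lead byte is above 0xF4, so the
     * RFC3629 check rejects the sequence on reload. */
    printf("%02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);
    return 0;
}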

I think this probably means we need to change chr() to reject code points
above 10ffff.  Should we back-patch that, or just do it in HEAD?
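
For concreteness, the guard being proposed amounts to something like the
following hypothetical sketch -- not the actual patch, and the real
change would go into chr()'s UTF8 path. RFC3629 also excludes the UTF-16
surrogate range, so a chr() tightened to match would presumably reject
that range too:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical range check for a code point chr() is asked to encode. */
static bool
codepoint_ok_for_utf8(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;               /* beyond the RFC3629 ceiling */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;               /* UTF-16 surrogates */
    return true;
}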

+1 for back-patching. A value that cannot be restored is bad, and I can't imagine any legitimate use case for producing a Unicode character larger than U+10FFFF with chr(x), when the rest of the system doesn't handle it. Fully supporting such values might be useful, but that's a different story.



My understanding is that U+10FFFF is the highest legal Unicode code point anyway. So this is really just tightening our routines to make sure we don't produce an invalid value. We won't be disallowing anything that is legal Unicode.

cheers

andrew

