On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:
On 05/16/2014 06:05 PM, Tom Lane wrote:
Quite some time ago, we made the chr() function accept Unicode code points up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string.  It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16).  While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.
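
For illustration, here is a minimal standalone sketch of the kind of
per-sequence check RFC3629 specifies -- this is not PostgreSQL's actual
pg_utf8_islegal, just the same rules written out. The rule that matters
for this thread: a 4-byte sequence may not begin with a byte above 0xF4,
which is what caps the encodable range at U+10FFFF.

#include <stdbool.h>

static bool
utf8_sequence_is_legal(const unsigned char *s, int len)
{
    switch (len)
    {
        case 1:
            return s[0] <= 0x7F;
        case 2:
            return s[0] >= 0xC2 && s[0] <= 0xDF &&
                   s[1] >= 0x80 && s[1] <= 0xBF;
        case 3:
            if (s[1] < 0x80 || s[1] > 0xBF ||
                s[2] < 0x80 || s[2] > 0xBF)
                return false;
            if (s[0] == 0xE0)
                return s[1] >= 0xA0;    /* no overlong forms */
            if (s[0] == 0xED)
                return s[1] <= 0x9F;    /* no UTF-16 surrogates */
            return s[0] >= 0xE1 && s[0] <= 0xEF;
        case 4:
            if (s[1] < 0x80 || s[1] > 0xBF ||
                s[2] < 0x80 || s[2] > 0xBF ||
                s[3] < 0x80 || s[3] > 0xBF)
                return false;
            if (s[0] == 0xF0)
                return s[1] >= 0x90;    /* no overlong forms */
            if (s[0] == 0xF4)
                return s[1] <= 0x8F;    /* cap at U+10FFFF */
            return s[0] >= 0xF1 && s[0] <= 0xF3;   /* 0xF5..0xF7 illegal */
        default:
            return false;
    }
}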

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr(x'001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
  f1
----

(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011
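
Those rejected bytes are exactly the 4-byte UTF8 encoding of U+1FFFFF.
A quick standalone C sketch (an illustration of the standard encoding
arithmetic, not PostgreSQL source) shows why the lead byte comes out as
0xF7, above the 0xF4 ceiling RFC3629 allows:

#include <stdio.h>
#include <stdint.h>

/* Standard 4-byte UTF-8 encoding for U+10000..U+1FFFFF (the old,
 * pre-RFC3629 ceiling); illustration only. */
static void
utf8_encode4(uint32_t cp, unsigned char out[4])
{
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
}

int
main(void)
{
    unsigned char buf[4];

    utf8_encode4(0x1FFFFF, buf);
    /* Prints "f7 bf bf bf": the lead byte is above 0xF4, so the
     * RFC3629 check rejects the sequence on reload. */
    printf("%02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);
    return 0;
}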

I think this probably means we need to change chr() to reject code points
above 10ffff.  Should we back-patch that, or just do it in HEAD?
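
For concreteness, the guard being proposed amounts to something like the
following hypothetical sketch -- not the actual patch, and the real
change would go into chr()'s UTF8 path. RFC3629 also excludes the UTF-16
surrogate range, so a chr() tightened to match would presumably reject
that range too:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical range check for a code point chr() is asked to encode. */
static bool
codepoint_ok_for_utf8(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;               /* beyond the RFC3629 ceiling */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;               /* UTF-16 surrogates */
    return true;
}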

+1 for back-patching. A value that cannot be restored is bad, and I can't imagine any legitimate use case for producing a Unicode character larger than U+10FFFF with chr(x), when the rest of the system doesn't handle it. Fully supporting such values might be useful, but that's a different story.



My understanding is that U+10FFFF is the highest legal Unicode code point anyway. So this is really just tightening our routines to make sure we don't produce an invalid value. We won't be disallowing anything that is legal Unicode.

cheers

andrew

