Re: Unicode escapes with any backend encoding
On 3/6/20 2:19 PM, Tom Lane wrote:
>> Maybe Chapman has a use case in mind he can test with? Barring that,
>> the patch seems ready for commit.
>
> I went ahead and pushed this, just to get it out of my queue.
> Chapman's certainly welcome to kibitz some more of course.

Sorry, yeah, I don't think I had any kibitzing to do.

My use case was for an automated SQL generator to confidently emit
Unicode-escaped forms with few required assumptions about the database
they'll be loaded in, subject of course to the natural limitation that
its encoding contain the characters being used, but not to arbitrary
other limits.

And unless I misunderstand the patch, it accomplishes that, thereby
depriving me of stuff to kibitz about.

Regards,
-Chap
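A generator like the one Chapman describes can emit the 6-digit escape form unconditionally, since with the patch such escapes are legal in any server encoding that can represent the character. A minimal sketch (the function name and buffer handling are ours, not from the thread):

```c
#include <stdio.h>

/* Hypothetical helper for an SQL generator: render one code point as a
 * 6-digit escape usable inside a U&'...' literal.  With the patch, the
 * result loads into any database whose encoding contains the
 * character, UTF8 or not. */
static void
emit_unicode_escape(char *buf, size_t buflen, unsigned int codepoint)
{
    snprintf(buf, buflen, "\\+%06X", codepoint);
}
```

Usage: `emit_unicode_escape(buf, sizeof buf, 0x00E9)` yields `\+0000E9`, so the generator can write `U&'\+0000E9'` instead of a literal é.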
Re: Unicode escapes with any backend encoding
John Naylor writes:
> Not this patch's job perhaps, but now that check_unicode_value() only
> depends on the input, maybe it can be put into pg_wchar.h with other
> static inline helper functions? That test is duplicated in
> addunicode() and pg_unicode_to_server(). Maybe:
>
> static inline bool
> codepoint_is_valid(pg_wchar c)
> {
>     return (c > 0 && c <= 0x10FFFF);
> }

Seems reasonable, done.

> Maybe Chapman has a use case in mind he can test with? Barring that,
> the patch seems ready for commit.

I went ahead and pushed this, just to get it out of my queue.
Chapman's certainly welcome to kibitz some more of course.

			regards, tom lane
Re: Unicode escapes with any backend encoding
On Tue, Feb 25, 2020 at 1:49 AM Tom Lane wrote:
>
> I wrote:
> > [ unicode-escapes-with-other-server-encodings-2.patch ]
>
> I see this patch got sideswiped by the recent refactoring of JSON
> lexing.  Here's an attempt at fixing it up.  Since the frontend
> code isn't going to have access to encoding conversion facilities,
> this creates a difference between frontend and backend handling
> of JSON Unicode escapes, which is mildly annoying but probably
> isn't going to bother anyone in the real world.  Outside of
> jsonapi.c, there are no changes from v2.

With v3, I successfully converted escapes using a database with EUC-KR
encoding, from strings, json, and jsonpath expressions. Then I ran a
raw parsing microbenchmark with ASCII unicode escapes in UTF-8 to
verify no significant regression. I also tried the same with EUC-KR,
even though that's not really apples-to-apples since it doesn't work on
HEAD. It seems to give the same numbers.

(median of 3, done 3 times with postmaster restart in between)

master, UTF-8 ascii      1.390s  1.405s  1.406s
v3,     UTF-8 ascii      1.396s  1.388s  1.390s
v3,     EUC-KR non-ascii 1.382s  1.401s  1.394s

Not this patch's job perhaps, but now that check_unicode_value() only
depends on the input, maybe it can be put into pg_wchar.h with other
static inline helper functions? That test is duplicated in
addunicode() and pg_unicode_to_server(). Maybe:

static inline bool
codepoint_is_valid(pg_wchar c)
{
	return (c > 0 && c <= 0x10FFFF);
}

Maybe Chapman has a use case in mind he can test with? Barring that,
the patch seems ready for commit.

--
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
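For reference, the suggested helper compiles as-is once given the pg_wchar typedef; a self-contained sketch with the boundary cases spelled out (the typedef here is a stand-in for the real one in pg_wchar.h):

```c
#include <stdbool.h>

typedef unsigned int pg_wchar;  /* stand-in for the PostgreSQL typedef */

/* Valid Unicode code points run from U+0001 through U+10FFFF; zero is
 * excluded because the server cannot store NUL in text values. */
static inline bool
codepoint_is_valid(pg_wchar c)
{
    return (c > 0 && c <= 0x10FFFF);
}
```

Hoisting this into pg_wchar.h lets addunicode() and pg_unicode_to_server() share one definition of validity instead of duplicating the range check.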
Re: Unicode escapes with any backend encoding
On Mon, Feb 24, 2020 at 11:19 PM Tom Lane wrote:
> I see this patch got sideswiped by the recent refactoring of JSON
> lexing.  Here's an attempt at fixing it up.  Since the frontend
> code isn't going to have access to encoding conversion facilities,
> this creates a difference between frontend and backend handling
> of JSON Unicode escapes, which is mildly annoying but probably
> isn't going to bother anyone in the real world.  Outside of
> jsonapi.c, there are no changes from v2.

For the record, as far as JSON goes, I think I'm responsible for the
current set of restrictions, and I'm not attached to them. I believe I
was uncertain of my ability to implement anything better than what we
have now and also slightly unclear on what the semantics ought to be.
I'm happy to see it improved, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Unicode escapes with any backend encoding
I wrote:
> [ unicode-escapes-with-other-server-encodings-2.patch ]

I see this patch got sideswiped by the recent refactoring of JSON
lexing.  Here's an attempt at fixing it up.  Since the frontend
code isn't going to have access to encoding conversion facilities,
this creates a difference between frontend and backend handling
of JSON Unicode escapes, which is mildly annoying but probably
isn't going to bother anyone in the real world.  Outside of
jsonapi.c, there are no changes from v2.

			regards, tom lane

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 1b6aaf0..a9c68c7 100644
--- a/doc/src/sgml/json.sgml
+++ b/doc/src/sgml/json.sgml
@@ -61,8 +61,8 @@
-  PostgreSQL allows only one character set
-  encoding per database.  It is therefore not possible for the JSON
+  RFC 7159 specifies that JSON strings should be encoded in UTF8.
+  It is therefore not possible for the JSON
   types to conform rigidly to the JSON specification unless the database
   encoding is UTF8.  Attempts to directly include characters that
   cannot be represented in the database encoding will fail; conversely,
@@ -77,13 +77,13 @@
   regardless of the database encoding, and are checked only for syntactic
   correctness (that is, that four hex digits follow \u).
   However, the input function for jsonb is stricter: it disallows
-  Unicode escapes for non-ASCII characters (those above U+007F)
-  unless the database encoding is UTF8.  The jsonb type also
+  Unicode escapes for characters that cannot be represented in the database
+  encoding.  The jsonb type also
   rejects \u0000 (because that cannot be represented in
   PostgreSQL's text type), and it insists that any use of Unicode
   surrogate pairs to designate characters outside the Unicode Basic
   Multilingual Plane be correct.  Valid Unicode escapes
-  are converted to the equivalent ASCII or UTF8 character for storage;
+  are converted to the equivalent single character for storage;
   this includes folding surrogate pairs into a single character.
@@ -96,9 +96,8 @@
   not jsonb.  The fact that the json input function does
   not make these checks may be considered a historical artifact, although
   it does allow for simple storage (without processing) of JSON Unicode
-  escapes in a non-UTF8 database encoding.  In general, it is best to
-  avoid mixing Unicode escapes in JSON with a non-UTF8 database encoding,
-  if possible.
+  escapes in a database encoding that does not support the represented
+  characters.
@@ -144,8 +143,8 @@
   string
   text
-  \u0000 is disallowed, as are non-ASCII Unicode
-  escapes if database encoding is not UTF8
+  \u0000 is disallowed, as are Unicode escapes
+  representing characters not available in the database encoding
   number
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index c908e0b..e134877 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -189,6 +189,23 @@ UPDATE "my_table" SET "a" = 5;
   ampersands.  The length limitation still applies.
+
+  Quoting an identifier also makes it case-sensitive, whereas
+  unquoted names are always folded to lower case.  For example, the
+  identifiers FOO, foo, and "foo" are considered the same by
+  PostgreSQL, but "Foo" and "FOO" are different from these three and
+  each other.  (The folding of unquoted names to lower case in
+  PostgreSQL is incompatible with the SQL standard, which says that
+  unquoted names should be folded to upper case.  Thus, foo should be
+  equivalent to "FOO" not "foo" according to the standard.  If you
+  want to write portable applications you are advised to always quote
+  a particular name or never quote it.)
+
   Unicode escape in identifiers
@@ -230,7 +247,8 @@ U&"d!0061t!+000061" UESCAPE '!'
   The escape character can be any single character other than a
   hexadecimal digit, the plus sign, a single quote, a double quote, or
   a whitespace character.  Note that the escape character is
-  written in single quotes, not double quotes.
+  written in single quotes, not double quotes,
+  after UESCAPE.
@@ -239,32 +257,18 @@ U&"d!0061t!+000061" UESCAPE '!'
-  The Unicode escape syntax works only when the server encoding is
-  UTF8.  When other server encodings are used, only code
-  points in the ASCII range (up to \007F) can be
-  specified.  Both the 4-digit and the 6-digit form can be used to
+  Either the 4-digit or the 6-digit escape form can be used to
   specify UTF-16 surrogate pairs to compose characters with code
   points larger than U+FFFF, although the availability of the 6-digit
   form technically makes this unnecessary.  (Surrogate
-  pairs are not stored directly, but combined into a single
-  code point that is then encoded in UTF-8.)
+  pairs are n
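The surrogate-pair folding the hunk above describes — combining a high/low pair of 4-digit escapes into the single code point they designate — follows the standard UTF-16 formula; a self-contained sketch (the function name here is ours):

```c
/* Combine a UTF-16 surrogate pair (first in U+D800..U+DBFF, second in
 * U+DC00..U+DFFF) into the single code point it designates.  Each half
 * contributes 10 bits; the result is offset past the BMP by 0x10000. */
static unsigned int
combine_surrogates(unsigned int first, unsigned int second)
{
    return ((first & 0x3FF) << 10) + (second & 0x3FF) + 0x10000;
}
```

For example, the pair \D83D \DE00 folds to U+1F600, which is then encoded in the server encoding rather than stored as two escapes.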
Re: Unicode escapes with any backend encoding
I wrote:
> Andrew Dunstan writes:
>> Perhaps I expressed myself badly. What I meant was that we should keep
>> the json and text escape rules in sync, as they are now. Since we're
>> changing the text rules to allow resolvable non-ascii unicode escapes
>> in non-utf8 locales, we should do the same for json.

> Got it.  I'll make the patch do that in a little bit.

OK, here's v2, which brings JSONB into the fold and also makes some
effort to produce an accurate error cursor for invalid Unicode escapes.
As it's set up, we only pay the extra cost of setting up an error
context callback when we're actually processing a Unicode escape, so I
think that's an acceptable cost.  (It's not much of a cost, anyway.)

The callback support added here is pretty much a straight
copy-and-paste of the existing functions
setup_parser_errposition_callback() and friends.  That's slightly
annoying --- we could perhaps merge those into one.  But I didn't see
a good common header to put such a thing into, so I just did it like
this.

Another note is that we could use the additional scanner
infrastructure to produce more accurate error pointers for other cases
where we're whining about a bad escape sequence, or some other
sub-part of a lexical token.  I think that'd likely be a good idea,
since the existing cursor placement at the start of the token isn't
too helpful if e.g. you're dealing with a very long string constant.
But to keep this focused, I only touched the behavior for Unicode
escapes.  The rest could be done as a separate patch.

This also mops up after 7f380c59 by making use of the new pg_wchar.c
exports is_utf16_surrogate_first() etc everyplace that they're
relevant (which is just the JSON code I was touching anyway, as it
happens).  I also made a bit of an effort to ensure test coverage of
all the code touched in that patch and this one.
			regards, tom lane

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 6ff8751..0f0d0c6 100644
--- a/doc/src/sgml/json.sgml
+++ b/doc/src/sgml/json.sgml
@@ -61,8 +61,8 @@
-  PostgreSQL allows only one character set
-  encoding per database.  It is therefore not possible for the JSON
+  RFC 7159 specifies that JSON strings should be encoded in UTF8.
+  It is therefore not possible for the JSON
   types to conform rigidly to the JSON specification unless the database
   encoding is UTF8.  Attempts to directly include characters that
   cannot be represented in the database encoding will fail; conversely,
@@ -77,13 +77,13 @@
   regardless of the database encoding, and are checked only for syntactic
   correctness (that is, that four hex digits follow \u).
   However, the input function for jsonb is stricter: it disallows
-  Unicode escapes for non-ASCII characters (those above U+007F)
-  unless the database encoding is UTF8.  The jsonb type also
+  Unicode escapes for characters that cannot be represented in the database
+  encoding.  The jsonb type also
   rejects \u0000 (because that cannot be represented in
   PostgreSQL's text type), and it insists that any use of Unicode
   surrogate pairs to designate characters outside the Unicode Basic
   Multilingual Plane be correct.  Valid Unicode escapes
-  are converted to the equivalent ASCII or UTF8 character for storage;
+  are converted to the equivalent single character for storage;
   this includes folding surrogate pairs into a single character.
@@ -96,9 +96,8 @@
   not jsonb.  The fact that the json input function does
   not make these checks may be considered a historical artifact, although
   it does allow for simple storage (without processing) of JSON Unicode
-  escapes in a non-UTF8 database encoding.  In general, it is best to
-  avoid mixing Unicode escapes in JSON with a non-UTF8 database encoding,
-  if possible.
+  escapes in a database encoding that does not support the represented
+  characters.
@@ -144,8 +143,8 @@
   string
   text
-  \u0000 is disallowed, as are non-ASCII Unicode
-  escapes if database encoding is not UTF8
+  \u0000 is disallowed, as are Unicode escapes
+  representing characters not available in the database encoding
   number
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index c908e0b..e134877 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -189,6 +189,23 @@ UPDATE "my_table" SET "a" = 5;
   ampersands.  The length limitation still applies.
+
+  Quoting an identifier also makes it case-sensitive, whereas
+  unquoted names are always folded to lower case.  For example, the
+  identifiers FOO, foo, and "foo" are considered the same by
+  PostgreSQL, but "Foo" and "FOO" are different from these three and
+  each other.  (The folding of unquoted names to lower case in
+  PostgreSQL is incompatible with the SQL standard, which says that
+  unquoted names should be folded to upper case.
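The save-and-restore callback pattern the v2 message describes copying from setup_parser_errposition_callback() amounts to pushing an entry onto a callback stack for the duration of the escape processing and popping it afterward. A simplified sketch (the struct and names are stand-ins, not the real elog.h machinery):

```c
#include <stddef.h>

typedef struct ErrorContextCallback
{
    struct ErrorContextCallback *previous;
    void (*callback)(void *arg);    /* reports the escape's location */
    void *arg;
} ErrorContextCallback;

/* Global chain consulted when an error is raised. */
static ErrorContextCallback *error_context_stack = NULL;

/* Install a callback for the duration of Unicode-escape processing,
 * so an error raised inside it carries an accurate cursor position. */
static void
setup_escape_error_callback(ErrorContextCallback *entry,
                            void (*fn)(void *arg), void *arg)
{
    entry->previous = error_context_stack;
    entry->callback = fn;
    entry->arg = arg;
    error_context_stack = entry;
}

/* Pop it again once the escape has been handled. */
static void
cancel_escape_error_callback(ErrorContextCallback *entry)
{
    error_context_stack = entry->previous;
}
```

Because setup and cancel bracket only the escape-processing path, the cost is paid solely when an escape is actually seen, which is the property the message highlights.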
Re: Unicode escapes with any backend encoding
Andrew Dunstan writes:
> Perhaps I expressed myself badly. What I meant was that we should keep
> the json and text escape rules in sync, as they are now. Since we're
> changing the text rules to allow resolvable non-ascii unicode escapes
> in non-utf8 locales, we should do the same for json.

Got it.  I'll make the patch do that in a little bit.

			regards, tom lane
Re: Unicode escapes with any backend encoding
On Wed, Jan 15, 2020 at 7:55 AM Tom Lane wrote:
>
> Andrew Dunstan writes:
> > On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack wrote:
> >> On 1/14/20 10:10 AM, Tom Lane wrote:
> >>> to me that this error is just useless pedantry.  As long as the DB
> >>> encoding can represent the desired character, it should be transparent
> >>> to users.
>
> >> That's my position too.
>
> > and mine.
>
> I'm confused --- yesterday you seemed to be against this idea.
> Have you changed your mind?
>
> I'll gladly go change the patch if people are on board with this.
>

Perhaps I expressed myself badly. What I meant was that we should keep
the json and text escape rules in sync, as they are now. Since we're
changing the text rules to allow resolvable non-ascii unicode escapes
in non-utf8 locales, we should do the same for json.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Unicode escapes with any backend encoding
On 1/14/20 4:25 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack wrote:
>>> On 1/14/20 10:10 AM, Tom Lane wrote:
>>>> to me that this error is just useless pedantry.  As long as the DB
>>>> encoding can represent the desired character, it should be transparent
>>>> to users.
>
>>> That's my position too.
>
>> and mine.
>
> I'm confused --- yesterday you seemed to be against this idea.
> Have you changed your mind?
>
> I'll gladly go change the patch if people are on board with this.

Hmm, well, let me clarify for my own part what I think I'm agreeing
with ... perhaps it's misaligned with something further upthread.

In an ideal world (which may be ideal in more ways than are in scope
for the present discussion) I would expect to see these principles:

1. On input, whether a Unicode escape is or isn't allowed should not
depend on any encoding settings. It should be lexically allowed always,
and if it represents a character that exists in the server encoding, it
should mean that character. If it's not representable in the storage
format, it should produce an error that says that.

2. If it happens that the character is representable in both the
storage encoding and the client encoding, it shouldn't matter whether
it arrives literally as an é or as an escape. Either should get stored
on disk as the same bytes.

3. On output, as long as the character is representable in the client
encoding, there is nothing to worry about. It will be sent as its
representation in the client encoding (which may be different bytes
than its representation in the server encoding).

4. If a character to be output isn't in the client encoding, it will
be datatype-dependent whether there is any way to escape. For example,
xml_out could produce &#nnnn; forms, and json_out could produce \unnnn
forms.

5. If the datatype being output has no escaping rules available (as
would be the case for an ordinary text column, say), then the
unrepresentable character has to be reported in an error. (Encoding
conversions often have the option of substituting a replacement
character like ? but I don't believe a DBMS has any business making
such changes to data, unless by explicit opt-in. If it can't give you
the data you wanted, it should say "here's why I can't give you
that.")

6. While 'text' in general provides no escaping mechanism, some
functions that produce text may still have that option. For example,
quote_literal and quote_ident could conceivably produce the U&'...' or
U&"..." forms, respectively, if the argument contains characters that
won't go in the client encoding.

I understand that on the way from 1 to 6 I will have drifted further
from what's discussed in this thread; for example, I bet that
quote_literal/quote_ident never produce U& forms now, and that no one
is proposing to change that, and I'm pretending not to notice the
question of how astonishing such behavior could be.

(Not to mention, how would they know whether they are returning a
value that's destined to go across the client encoding, rather than to
be used in a purely server-side expression? Maybe distinct versions of
those functions could take an encoding argument, and produce the U&
forms when the content won't go in the specified encoding. That would
avoid astonishing changes to existing functions.)

Regards,
-Chap
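Chapman's point 6 could be sketched as a quoting function that falls back to the U&'...' escape form exactly where some representability test fails. Everything here is hypothetical — the thread itself notes that quote_literal/quote_ident do no such thing today — and for simplicity the sketch only writes representable characters when they are ASCII:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical: quote a sequence of code points as a U&'...' literal,
 * escaping exactly those characters the supplied predicate rejects
 * (e.g. "not representable in the client encoding").  Representable
 * characters are assumed ASCII here to keep the sketch single-byte. */
static void
quote_literal_u(char *out, size_t outlen,
                const unsigned int *cps, size_t ncps,
                bool (*representable)(unsigned int))
{
    size_t used = 0;

    used += snprintf(out + used, outlen - used, "U&'");
    for (size_t i = 0; i < ncps && used < outlen; i++)
    {
        if (representable(cps[i]))
            used += snprintf(out + used, outlen - used, "%c", (char) cps[i]);
        else
            used += snprintf(out + used, outlen - used, "\\+%06X", cps[i]);
    }
    snprintf(out + used, outlen - used, "'");
}
```

With a predicate meaning "is ASCII", the code points for déjà would come out as `U&'d\+0000E9j\+0000E0'` — same data, but transportable across any client encoding.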
Re: Unicode escapes with any backend encoding
Andrew Dunstan writes:
> On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack wrote:
>> On 1/14/20 10:10 AM, Tom Lane wrote:
>>> to me that this error is just useless pedantry.  As long as the DB
>>> encoding can represent the desired character, it should be transparent
>>> to users.

>> That's my position too.

> and mine.

I'm confused --- yesterday you seemed to be against this idea.
Have you changed your mind?

I'll gladly go change the patch if people are on board with this.

			regards, tom lane
Re: Unicode escapes with any backend encoding
On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack wrote:
>
> On 1/14/20 10:10 AM, Tom Lane wrote:
> > to me that this error is just useless pedantry.  As long as the DB
> > encoding can represent the desired character, it should be transparent
> > to users.
>
> That's my position too.
>

and mine.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Unicode escapes with any backend encoding
On 1/14/20 10:10 AM, Tom Lane wrote:
> to me that this error is just useless pedantry.  As long as the DB
> encoding can represent the desired character, it should be transparent
> to users.

That's my position too.

Regards,
-Chap
Re: Unicode escapes with any backend encoding
I wrote:
> Andrew Dunstan writes:
>> On Tue, Jan 14, 2020 at 10:02 AM Tom Lane wrote:
>>> Grepping for other direct uses of unicode_to_utf8(), I notice that
>>> there are a couple of places in the JSON code where we have a similar
>>> restriction that you can only write a Unicode escape in UTF8 server
>>> encoding.  I'm not sure whether these same semantics could be
>>> applied there, so I didn't touch that.

>> Off the cuff I'd be inclined to say we should keep the text escape
>> rules the same.  We've already extended the JSON standard by allowing
>> non-UTF8 encodings.

> Right.  I'm just thinking though that if you can write "é" literally
> in a JSON string, even though you're using LATIN1 not UTF8, then why
> not allow writing that as "\u00E9" instead?  The latter is arguably
> truer to spec.

> However, if JSONB collapses "\u00E9" to LATIN1 "é", that would be bad,
> unless we have a way to undo it on printout.  So there might be
> some more moving parts here than I thought.

On third thought, what would be so bad about that?  Let's suppose
I write:

	INSERT ... values('{"x": "\u00E9"}'::jsonb);

and the jsonb parsing logic chooses to collapse the backslash to the
represented character, i.e., "é".  Why should it matter whether the
database encoding is UTF8 or LATIN1?

If I am using UTF8 client encoding, I will see the "é" in UTF8
encoding either way, because of output encoding conversion.

If I am using LATIN1 client encoding, I will see the "é" in LATIN1
either way --- or at least, I will if the database encoding is UTF8.
Right now I get an error for that when the database encoding is LATIN1
... but if I store the "é" as literal "é", it works, either way.

So it seems to me that this error is just useless pedantry.  As long
as the DB encoding can represent the desired character, it should be
transparent to users.

			regards, tom lane
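The transparency Tom argues for falls out of the conversion pipeline: the escape is decoded to a code point, encoded to UTF-8, and then passed through the ordinary UTF8-to-server-encoding conversion, so any encoding containing the character works. A minimal encoder for the middle step, analogous to (but not a copy of) pg_wchar.c's unicode_to_utf8():

```c
/* Encode one Unicode code point (assumed <= U+10FFFF) as UTF-8.
 * Returns the number of bytes written to out (1..4). */
static int
codepoint_to_utf8(unsigned int c, unsigned char *out)
{
    if (c <= 0x7F)
    {
        out[0] = (unsigned char) c;
        return 1;
    }
    if (c <= 0x7FF)
    {
        out[0] = 0xC0 | (c >> 6);
        out[1] = 0x80 | (c & 0x3F);
        return 2;
    }
    if (c <= 0xFFFF)
    {
        out[0] = 0xE0 | (c >> 12);
        out[1] = 0x80 | ((c >> 6) & 0x3F);
        out[2] = 0x80 | (c & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (c >> 18);
    out[1] = 0x80 | ((c >> 12) & 0x3F);
    out[2] = 0x80 | ((c >> 6) & 0x3F);
    out[3] = 0x80 | (c & 0x3F);
    return 4;
}
```

So \u00E9 becomes the UTF-8 bytes C3 A9, and the subsequent server-encoding conversion maps that to LATIN1 byte E9 — the same byte a literal "é" from a LATIN1 client would have produced, which is exactly why the escape restriction buys nothing.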
Re: Unicode escapes with any backend encoding
Andrew Dunstan writes:
> On Tue, Jan 14, 2020 at 10:02 AM Tom Lane wrote:
>> Grepping for other direct uses of unicode_to_utf8(), I notice that
>> there are a couple of places in the JSON code where we have a similar
>> restriction that you can only write a Unicode escape in UTF8 server
>> encoding.  I'm not sure whether these same semantics could be
>> applied there, so I didn't touch that.

> Off the cuff I'd be inclined to say we should keep the text escape
> rules the same.  We've already extended the JSON standard by allowing
> non-UTF8 encodings.

Right.  I'm just thinking though that if you can write "é" literally
in a JSON string, even though you're using LATIN1 not UTF8, then why
not allow writing that as "\u00E9" instead?  The latter is arguably
truer to spec.

However, if JSONB collapses "\u00E9" to LATIN1 "é", that would be bad,
unless we have a way to undo it on printout.  So there might be
some more moving parts here than I thought.

			regards, tom lane
Re: Unicode escapes with any backend encoding
On Tue, Jan 14, 2020 at 10:02 AM Tom Lane wrote:
>
> Grepping for other direct uses of unicode_to_utf8(), I notice that
> there are a couple of places in the JSON code where we have a similar
> restriction that you can only write a Unicode escape in UTF8 server
> encoding.  I'm not sure whether these same semantics could be
> applied there, so I didn't touch that.

Off the cuff I'd be inclined to say we should keep the text escape
rules the same. We've already extended the JSON standard by allowing
non-UTF8 encodings.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services