Re: [HACKERS] Unicode escapes in literals
I wrote: SQL has the following escape syntax for it: U&'special character: \' [ UESCAPE '\' ] Here is an in-progress patch for this. It still needs updates in the psql scanner and possibly other scanners. But the server-side functionality works. Index: doc/src/sgml/syntax.sgml === RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v retrieving revision 1.123 diff -u -3 -p -c -r1.123 syntax.sgml *** doc/src/sgml/syntax.sgml26 Jun 2008 22:24:42 - 1.123 --- doc/src/sgml/syntax.sgml27 Oct 2008 16:54:26 - *** UPDATE "my_table" SET "a" = 5; *** 190,195 --- 190,247 + A variant of quoted identifiers allows including escaped Unicode + characters identified by their code points. This variant starts + with U& (upper or lower case U followed by + ampersand) immediately before the opening double quote, without + any spaces in between, for example U&"foo". + (Note that this creates an ambiguity with the + operator &. Use spaces around the operator to + avoid this problem.) Inside the quotes, Unicode characters can be + specified in escaped form by writing a backslash followed by the + four-digit hexadecimal code point number or alternatively a + backslash followed by a plus sign followed by a six-digt + hexadecimal code point number. For example, the + identifier "data" could be written as + + U&"d\0061t\0061" + + or equivalently + + U&"d\+61t\+61" + + The following less trivial example writes the Russian + word slon (elephant) in Cyrillic letters: + + U&"\0441\043B\043E\043D" + + + + + If a different escape character than backslash is desired, it can + be specified using the UESCAPE clause after the + string, for example: + + U&"d!0061t!0061" UESCAPE '!' + + The escape character can be any single character other than a + hexadecimal digit, the plus sign, a single quote, a double quote, + or a whitespace character. Note that the escape character is + written in single quotes, not double quotes. + + + + To include the escape character in the identifier literally, write + it twice. + + + + The Unicode escape syntax works only when the server encoding is + UTF8. When other server encodings are used, only code points in + the ASCII range (up to \007F) can be specified. + + + Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers FOO, foo, and *** UPDATE "my_table" SET "a" = 5; *** 245,251 write two adjacent single quotes, e.g. 'Dianne''s horse'. Note that this is not the same as a double-quote ! character ("). --- 297,303 write two adjacent single quotes, e.g. 'Dianne''s horse'. Note that this is not the same as a double-quote ! character ("). *** SELECT 'foo' 'bar'; *** 269,282 by SQL; PostgreSQL is following the standard.) - escape string syntax backslash escapes PostgreSQL also accepts escape string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter --- 321,339 by SQL; PostgreSQL is following the standard.) + + + + String Constants with C-Style Escapes escape string syntax backslash escapes + + PostgreSQL also accepts escape string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter *** SELECT 'foo' 'bar'; *** 287,293 Within an escape string, a backslash character (\) begins a C-like backslash escape sequence, in which the combination of backslash and following character(s) represent a special byte ! value: Backslash Escape Sequences --- 344,351 Within an escape string, a backslash character (\) begins a C-like backslash escape sequence, in which the combination of backslash and following character(s) represent a special byte ! value, shown in ! Backslash Escape Sequences *** SELECT 'foo' 'bar'; *** 341,354 ! It is your responsibility that the byte sequences you create are ! valid characters in the server character set encoding. Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (\\). Also, a single quote can be included in an escape string by writing \', in addition to the normal w
Re: [HACKERS] Unicode escapes in literals
Andrew Sullivan <[EMAIL PROTECTED]> writes: > On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote: >> Yeah, excellent question. It seems completely unnecessary, but it is >> surely there in the syntax diagram. > Probably because many Unicode representations are done with "U+" > followed by 4-6 hexadecimal units, but "+" is problematic for other > reasons (in some vendor's implementation)? They could hardly ignore the conflict with the operator interpretation for +. The committee has now cut themselves off from ever having a standard operator named &, but I suppose they didn't think ahead to that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote: >> Man that's ugly. Why the ampersand? > > Yeah, excellent question. It seems completely unnecessary, but it is > surely there in the syntax diagram. Probably because many Unicode representations are done with "U+" followed by 4-6 hexadecimal units, but "+" is problematic for other reasons (in some vendor's implementation)? A -- Andrew Sullivan [EMAIL PROTECTED] +1 503 667 4564 x104 http://www.commandprompt.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Peter Eisentraut <[EMAIL PROTECTED]> writes: > There are some other disadvantages for making a function call. You > couldn't use that kind of literal in any other place where the parser > calls for a string constant: role names, tablespace locations, > passwords, copy delimiters, enum values, function body, file names. Good point. I'm okay with supporting the feature only when database encoding is UTF8. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Tom Lane wrote: Peter Eisentraut <[EMAIL PROTECTED]> writes: SQL has the following escape syntax for it: U&'special character: \' [ UESCAPE '\' ] Man that's ugly. Why the ampersand? Yeah, excellent question. It seems completely unnecessary, but it is surely there in the syntax diagram. How do you propose to distinguish this from a perfectly legitimate use of the & operator? Well, technically, there is going to be some conflict, but the practical impact should be minimal because: - There are no spaces allowed between U&' . We typically suggest spaces around binary operators. - Naming a column "u" might not be terribly common. - Binary-and with an undecorated string literal is not very common. Of course, I have no data for these assertions. An inquiry on -general might give more insight. 2. Convert this syntax to a function call. But that would then create a lot of inconsistencies, such as needing functional indexes for matches against what should really be a literal. Uh, why do you think that? The function could surely be stable, even immutable if you grant that a database's encoding can't change. Yeah, true, that would work. There are some other disadvantages for making a function call. You couldn't use that kind of literal in any other place where the parser calls for a string constant: role names, tablespace locations, passwords, copy delimiters, enum values, function body, file names. There is also a related feature for Unicode escapes in identifiers, and it might be good to keep the door open on that. We could to a dual approach: Convert in the scanner when server encoding is UTF8, and pass on as function call otherwise. Surely ugly though. Or pass it on as a separate token type to the analyze phase, but that is a lot more work. Others: What use cases do you envision, and what requirements would they create for this feature? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Peter Eisentraut <[EMAIL PROTECTED]> writes: > SQL has the following escape syntax for it: > U&'special character: \' [ UESCAPE '\' ] Man that's ugly. Why the ampersand? How do you propose to distinguish this from a perfectly legitimate use of the & operator? > 2. Convert this syntax to a function call. But that would then create a > lot of inconsistencies, such as needing functional indexes for matches > against what should really be a literal. Uh, why do you think that? The function could surely be stable, even immutable if you grant that a database's encoding can't change. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers