Re: [HACKERS] Unicode escapes in literals
I wrote: SQL has the following escape syntax for it: U'special character: \' [ UESCAPE '\' ] Here is an in-progress patch for this. It still needs updates in the psql scanner and possibly other scanners. But the server-side functionality works. Index: doc/src/sgml/syntax.sgml === RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v retrieving revision 1.123 diff -u -3 -p -c -r1.123 syntax.sgml *** doc/src/sgml/syntax.sgml26 Jun 2008 22:24:42 - 1.123 --- doc/src/sgml/syntax.sgml27 Oct 2008 16:54:26 - *** UPDATE my_table SET a = 5; *** 190,195 --- 190,247 /para para + A variant of quoted identifiers allows including escaped Unicode + characters identified by their code points. This variant starts + with literalU/literal (upper or lower case U followed by + ampersand) immediately before the opening double quote, without + any spaces in between, for example literalUfoo/literal. + (Note that this creates an ambiguity with the + operator literal/literal. Use spaces around the operator to + avoid this problem.) Inside the quotes, Unicode characters can be + specified in escaped form by writing a backslash followed by the + four-digit hexadecimal code point number or alternatively a + backslash followed by a plus sign followed by a six-digt + hexadecimal code point number. For example, the + identifier literaldata/literal could be written as + programlisting + Ud\0061t\0061 + /programlisting + or equivalently + programlisting + Ud\+61t\+61 + /programlisting + The following less trivial example writes the Russian + word quoteslon/quote (elephant) in Cyrillic letters: + programlisting + U\0441\043B\043E\043D + /programlisting +/para + +para + If a different escape character than backslash is desired, it can + be specified using the literalUESCAPE/literal clause after the + string, for example: + programlisting + Ud!0061t!0061 UESCAPE '!' + /programlisting + The escape character can be any single character other than a + hexadecimal digit, the plus sign, a single quote, a double quote, + or a whitespace character. Note that the escape character is + written in single quotes, not double quotes. +/para + +para + To include the escape character in the identifier literally, write + it twice. +/para + +para + The Unicode escape syntax works only when the server encoding is + UTF8. When other server encodings are used, only code points in + the ASCII range (up to literal\007F/literal) can be specified. +/para + +para Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers literalFOO/literal, literalfoo/literal, and *** UPDATE my_table SET a = 5; *** 245,251 write two adjacent single quotes, e.g. literal'Dianne''s horse'/literal. Note that this is emphasisnot/ the same as a double-quote ! character (literal/). /para para --- 297,303 write two adjacent single quotes, e.g. literal'Dianne''s horse'/literal. Note that this is emphasisnot/ the same as a double-quote ! character (literal/). !-- font-lock sanity: -- /para para *** SELECT 'foo' 'bar'; *** 269,282 by acronymSQL/acronym; productnamePostgreSQL/productname is following the standard.) /para - para indexterm primaryescape string syntax/primary /indexterm indexterm primarybackslash escapes/primary /indexterm productnamePostgreSQL/productname also accepts quoteescape/ string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter --- 321,339 by acronymSQL/acronym; productnamePostgreSQL/productname is following the standard.) /para +/sect3 + +sect3 id=sql-syntax-strings-escape + titleString Constants with C-Style Escapes/title indexterm primaryescape string syntax/primary /indexterm indexterm primarybackslash escapes/primary /indexterm + + para productnamePostgreSQL/productname also accepts quoteescape/ string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter *** SELECT 'foo' 'bar'; *** 287,293 Within an escape string, a backslash character (literal\/) begins a C-like firsttermbackslash escape/ sequence, in which the combination of backslash and following character(s) represent a special byte ! value: table id=sql-backslash-table titleBackslash Escape Sequences/title
Re: [HACKERS] Unicode escapes in literals
Peter Eisentraut [EMAIL PROTECTED] writes: SQL has the following escape syntax for it: U'special character: \' [ UESCAPE '\' ] Man that's ugly. Why the ampersand? How do you propose to distinguish this from a perfectly legitimate use of the operator? 2. Convert this syntax to a function call. But that would then create a lot of inconsistencies, such as needing functional indexes for matches against what should really be a literal. Uh, why do you think that? The function could surely be stable, even immutable if you grant that a database's encoding can't change. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Tom Lane wrote: Peter Eisentraut [EMAIL PROTECTED] writes: SQL has the following escape syntax for it: U'special character: \' [ UESCAPE '\' ] Man that's ugly. Why the ampersand? Yeah, excellent question. It seems completely unnecessary, but it is surely there in the syntax diagram. How do you propose to distinguish this from a perfectly legitimate use of the operator? Well, technically, there is going to be some conflict, but the practical impact should be minimal because: - There are no spaces allowed between U' . We typically suggest spaces around binary operators. - Naming a column u might not be terribly common. - Binary-and with an undecorated string literal is not very common. Of course, I have no data for these assertions. An inquiry on -general might give more insight. 2. Convert this syntax to a function call. But that would then create a lot of inconsistencies, such as needing functional indexes for matches against what should really be a literal. Uh, why do you think that? The function could surely be stable, even immutable if you grant that a database's encoding can't change. Yeah, true, that would work. There are some other disadvantages for making a function call. You couldn't use that kind of literal in any other place where the parser calls for a string constant: role names, tablespace locations, passwords, copy delimiters, enum values, function body, file names. There is also a related feature for Unicode escapes in identifiers, and it might be good to keep the door open on that. We could to a dual approach: Convert in the scanner when server encoding is UTF8, and pass on as function call otherwise. Surely ugly though. Or pass it on as a separate token type to the analyze phase, but that is a lot more work. Others: What use cases do you envision, and what requirements would they create for this feature? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Peter Eisentraut [EMAIL PROTECTED] writes: There are some other disadvantages for making a function call. You couldn't use that kind of literal in any other place where the parser calls for a string constant: role names, tablespace locations, passwords, copy delimiters, enum values, function body, file names. Good point. I'm okay with supporting the feature only when database encoding is UTF8. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote: Man that's ugly. Why the ampersand? Yeah, excellent question. It seems completely unnecessary, but it is surely there in the syntax diagram. Probably because many Unicode representations are done with U+ followed by 4-6 hexadecimal units, but + is problematic for other reasons (in some vendor's implementation)? A -- Andrew Sullivan [EMAIL PROTECTED] +1 503 667 4564 x104 http://www.commandprompt.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Andrew Sullivan [EMAIL PROTECTED] writes: On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote: Yeah, excellent question. It seems completely unnecessary, but it is surely there in the syntax diagram. Probably because many Unicode representations are done with U+ followed by 4-6 hexadecimal units, but + is problematic for other reasons (in some vendor's implementation)? They could hardly ignore the conflict with the operator interpretation for +. The committee has now cut themselves off from ever having a standard operator named , but I suppose they didn't think ahead to that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers