Re: [HACKERS] Unicode escapes in literals
I wrote: SQL has the following escape syntax for it: U'special character: \' [ UESCAPE '\' ] Here is an in-progress patch for this. It still needs updates in the psql scanner and possibly other scanners. But the server-side functionality works. Index: doc/src/sgml/syntax.sgml === RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v retrieving revision 1.123 diff -u -3 -p -c -r1.123 syntax.sgml *** doc/src/sgml/syntax.sgml26 Jun 2008 22:24:42 - 1.123 --- doc/src/sgml/syntax.sgml27 Oct 2008 16:54:26 - *** UPDATE my_table SET a = 5; *** 190,195 --- 190,247 /para para + A variant of quoted identifiers allows including escaped Unicode + characters identified by their code points. This variant starts + with literalU/literal (upper or lower case U followed by + ampersand) immediately before the opening double quote, without + any spaces in between, for example literalUfoo/literal. + (Note that this creates an ambiguity with the + operator literal/literal. Use spaces around the operator to + avoid this problem.) Inside the quotes, Unicode characters can be + specified in escaped form by writing a backslash followed by the + four-digit hexadecimal code point number or alternatively a + backslash followed by a plus sign followed by a six-digt + hexadecimal code point number. For example, the + identifier literaldata/literal could be written as + programlisting + Ud\0061t\0061 + /programlisting + or equivalently + programlisting + Ud\+61t\+61 + /programlisting + The following less trivial example writes the Russian + word quoteslon/quote (elephant) in Cyrillic letters: + programlisting + U\0441\043B\043E\043D + /programlisting +/para + +para + If a different escape character than backslash is desired, it can + be specified using the literalUESCAPE/literal clause after the + string, for example: + programlisting + Ud!0061t!0061 UESCAPE '!' + /programlisting + The escape character can be any single character other than a + hexadecimal digit, the plus sign, a single quote, a double quote, + or a whitespace character. Note that the escape character is + written in single quotes, not double quotes. +/para + +para + To include the escape character in the identifier literally, write + it twice. +/para + +para + The Unicode escape syntax works only when the server encoding is + UTF8. When other server encodings are used, only code points in + the ASCII range (up to literal\007F/literal) can be specified. +/para + +para Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers literalFOO/literal, literalfoo/literal, and *** UPDATE my_table SET a = 5; *** 245,251 write two adjacent single quotes, e.g. literal'Dianne''s horse'/literal. Note that this is emphasisnot/ the same as a double-quote ! character (literal/). /para para --- 297,303 write two adjacent single quotes, e.g. literal'Dianne''s horse'/literal. Note that this is emphasisnot/ the same as a double-quote ! character (literal/). !-- font-lock sanity: -- /para para *** SELECT 'foo' 'bar'; *** 269,282 by acronymSQL/acronym; productnamePostgreSQL/productname is following the standard.) /para - para indexterm primaryescape string syntax/primary /indexterm indexterm primarybackslash escapes/primary /indexterm productnamePostgreSQL/productname also accepts quoteescape/ string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter --- 321,339 by acronymSQL/acronym; productnamePostgreSQL/productname is following the standard.) /para +/sect3 + +sect3 id=sql-syntax-strings-escape + titleString Constants with C-Style Escapes/title indexterm primaryescape string syntax/primary /indexterm indexterm primarybackslash escapes/primary /indexterm + + para productnamePostgreSQL/productname also accepts quoteescape/ string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter *** SELECT 'foo' 'bar'; *** 287,293 Within an escape string, a backslash character (literal\/) begins a C-like firsttermbackslash escape/ sequence, in which the combination of backslash and following character(s) represent a special byte ! value: table id=sql-backslash-table titleBackslash Escape Sequences/title
[HACKERS] Unicode escapes in literals
I would like to add an escape mechanism to PostgreSQL for entering arbitrary Unicode characters into string literals. We currently only have the option of entering the character directly via the keyboard or cut-and-paste, which is difficult for a number of reasons, such as when the font doesn't have the character, and entering the UTF8-encoded bytes using the E'...' strings, which is hardly usable. SQL has the following escape syntax for it: U'special character: \' [ UESCAPE '\' ] where is the hexadecimal Unicode codepoint. So this is pretty much just another variant on what the E'...' syntax does. The trick is that since we have user-definable encoding conversion routines, we can't convert the Unicode codepoint to the server encoding in the scanner stage. I imagine there are two ways to address this: 1. Only support this syntax when the server encoding is UTF8. This would probably cover most use cases anyway. We could have limited support for characters in the ASCII range for all server encodings. 2. Convert this syntax to a function call. But that would then create a lot of inconsistencies, such as needing functional indexes for matches against what should really be a literal. I'd be happy to start with UTF8 support only. Other ideas? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Peter Eisentraut [EMAIL PROTECTED] writes: SQL has the following escape syntax for it: U'special character: \' [ UESCAPE '\' ] Man that's ugly. Why the ampersand? How do you propose to distinguish this from a perfectly legitimate use of the operator? 2. Convert this syntax to a function call. But that would then create a lot of inconsistencies, such as needing functional indexes for matches against what should really be a literal. Uh, why do you think that? The function could surely be stable, even immutable if you grant that a database's encoding can't change. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Tom Lane wrote: Peter Eisentraut [EMAIL PROTECTED] writes: SQL has the following escape syntax for it: U'special character: \' [ UESCAPE '\' ] Man that's ugly. Why the ampersand? Yeah, excellent question. It seems completely unnecessary, but it is surely there in the syntax diagram. How do you propose to distinguish this from a perfectly legitimate use of the operator? Well, technically, there is going to be some conflict, but the practical impact should be minimal because: - There are no spaces allowed between U' . We typically suggest spaces around binary operators. - Naming a column u might not be terribly common. - Binary-and with an undecorated string literal is not very common. Of course, I have no data for these assertions. An inquiry on -general might give more insight. 2. Convert this syntax to a function call. But that would then create a lot of inconsistencies, such as needing functional indexes for matches against what should really be a literal. Uh, why do you think that? The function could surely be stable, even immutable if you grant that a database's encoding can't change. Yeah, true, that would work. There are some other disadvantages for making a function call. You couldn't use that kind of literal in any other place where the parser calls for a string constant: role names, tablespace locations, passwords, copy delimiters, enum values, function body, file names. There is also a related feature for Unicode escapes in identifiers, and it might be good to keep the door open on that. We could to a dual approach: Convert in the scanner when server encoding is UTF8, and pass on as function call otherwise. Surely ugly though. Or pass it on as a separate token type to the analyze phase, but that is a lot more work. Others: What use cases do you envision, and what requirements would they create for this feature? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Peter Eisentraut [EMAIL PROTECTED] writes: There are some other disadvantages for making a function call. You couldn't use that kind of literal in any other place where the parser calls for a string constant: role names, tablespace locations, passwords, copy delimiters, enum values, function body, file names. Good point. I'm okay with supporting the feature only when database encoding is UTF8. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote: Man that's ugly. Why the ampersand? Yeah, excellent question. It seems completely unnecessary, but it is surely there in the syntax diagram. Probably because many Unicode representations are done with U+ followed by 4-6 hexadecimal units, but + is problematic for other reasons (in some vendor's implementation)? A -- Andrew Sullivan [EMAIL PROTECTED] +1 503 667 4564 x104 http://www.commandprompt.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode escapes in literals
Andrew Sullivan [EMAIL PROTECTED] writes: On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote: Yeah, excellent question. It seems completely unnecessary, but it is surely there in the syntax diagram. Probably because many Unicode representations are done with U+ followed by 4-6 hexadecimal units, but + is problematic for other reasons (in some vendor's implementation)? They could hardly ignore the conflict with the operator interpretation for +. The committee has now cut themselves off from ever having a standard operator named , but I suppose they didn't think ahead to that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers