Re: [HACKERS] Unicode escapes in literals

2008-10-27 Thread Peter Eisentraut

I wrote:

SQL has the following escape syntax for it:

   U'special character: \' [ UESCAPE '\' ]


Here is an in-progress patch for this.  It still needs updates in the 
psql scanner and possibly other scanners.  But the server-side 
functionality works.


Index: doc/src/sgml/syntax.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v
retrieving revision 1.123
diff -u -3 -p -c -r1.123 syntax.sgml
*** doc/src/sgml/syntax.sgml26 Jun 2008 22:24:42 -  1.123
--- doc/src/sgml/syntax.sgml27 Oct 2008 16:54:26 -
*** UPDATE my_table SET a = 5;
*** 190,195 
--- 190,247 
 /para
  
 para
+ A variant of quoted identifiers allows including escaped Unicode
+ characters identified by their code points.  This variant starts
+ with literalU/literal (upper or lower case U followed by
+ ampersand) immediately before the opening double quote, without
+ any spaces in between, for example literalUfoo/literal.
+ (Note that this creates an ambiguity with the
+ operator literal/literal.  Use spaces around the operator to
+ avoid this problem.)  Inside the quotes, Unicode characters can be
+ specified in escaped form by writing a backslash followed by the
+ four-digit hexadecimal code point number or alternatively a
+ backslash followed by a plus sign followed by a six-digt
+ hexadecimal code point number.  For example, the
+ identifier literaldata/literal could be written as
+ programlisting
+ Ud\0061t\0061
+ /programlisting
+ or equivalently
+ programlisting
+ Ud\+61t\+61
+ /programlisting
+ The following less trivial example writes the Russian
+ word quoteslon/quote (elephant) in Cyrillic letters:
+ programlisting
+ U\0441\043B\043E\043D
+ /programlisting
+/para
+ 
+para
+ If a different escape character than backslash is desired, it can
+ be specified using the literalUESCAPE/literal clause after the
+ string, for example:
+ programlisting
+ Ud!0061t!0061 UESCAPE '!'
+ /programlisting
+ The escape character can be any single character other than a
+ hexadecimal digit, the plus sign, a single quote, a double quote,
+ or a whitespace character.  Note that the escape character is
+ written in single quotes, not double quotes.
+/para
+ 
+para
+ To include the escape character in the identifier literally, write
+ it twice.
+/para
+ 
+para
+ The Unicode escape syntax works only when the server encoding is
+ UTF8.  When other server encodings are used, only code points in
+ the ASCII range (up to literal\007F/literal) can be specified.
+/para
+ 
+para
  Quoting an identifier also makes it case-sensitive, whereas
  unquoted names are always folded to lower case.  For example, the
  identifiers literalFOO/literal, literalfoo/literal, and
*** UPDATE my_table SET a = 5;
*** 245,251 
   write two adjacent single quotes, e.g.
   literal'Dianne''s horse'/literal.
   Note that this is emphasisnot/ the same as a double-quote
!  character (literal/).
  /para
  
  para
--- 297,303 
   write two adjacent single quotes, e.g.
   literal'Dianne''s horse'/literal.
   Note that this is emphasisnot/ the same as a double-quote
!  character (literal/). !-- font-lock sanity:  --
  /para
  
  para
*** SELECT 'foo'  'bar';
*** 269,282 
   by acronymSQL/acronym; productnamePostgreSQL/productname is
   following the standard.)
  /para
  
- para
   indexterm
primaryescape string syntax/primary
   /indexterm
   indexterm
primarybackslash escapes/primary
   /indexterm
   productnamePostgreSQL/productname also accepts quoteescape/
   string constants, which are an extension to the SQL standard.
   An escape string constant is specified by writing the letter
--- 321,339 
   by acronymSQL/acronym; productnamePostgreSQL/productname is
   following the standard.)
  /para
+/sect3
+ 
+sect3 id=sql-syntax-strings-escape
+ titleString Constants with C-Style Escapes/title
  
   indexterm
primaryescape string syntax/primary
   /indexterm
   indexterm
primarybackslash escapes/primary
   /indexterm
+ 
+ para
   productnamePostgreSQL/productname also accepts quoteescape/
   string constants, which are an extension to the SQL standard.
   An escape string constant is specified by writing the letter
*** SELECT 'foo'  'bar';
*** 287,293 
   Within an escape string, a backslash character (literal\/) begins a
   C-like firsttermbackslash escape/ sequence, in which the combination
   of backslash and following character(s) represent a special byte
!  value:
  
   table id=sql-backslash-table
titleBackslash Escape Sequences/title

Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 SQL has the following escape syntax for it:
 U'special character: \' [ UESCAPE '\' ]

Man that's ugly.  Why the ampersand?  How do you propose to distinguish
this from a perfectly legitimate use of the  operator?

 2. Convert this syntax to a function call.  But that would then create a 
 lot of inconsistencies, such as needing functional indexes for matches 
 against what should really be a literal.

Uh, why do you think that?  The function could surely be stable, even
immutable if you grant that a database's encoding can't change.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Peter Eisentraut

Tom Lane wrote:

Peter Eisentraut [EMAIL PROTECTED] writes:

SQL has the following escape syntax for it:
U'special character: \' [ UESCAPE '\' ]


Man that's ugly.  Why the ampersand?


Yeah, excellent question.  It seems completely unnecessary, but it is 
surely there in the syntax diagram.



How do you propose to distinguish
this from a perfectly legitimate use of the  operator?


Well, technically, there is going to be some conflict, but the practical 
impact should be minimal because:


- There are no spaces allowed between U' .  We typically suggest spaces 
around binary operators.


- Naming a column u might not be terribly common.

- Binary-and with an undecorated string literal is not very common.

Of course, I have no data for these assertions.  An inquiry on -general 
might give more insight.


2. Convert this syntax to a function call.  But that would then create a 
lot of inconsistencies, such as needing functional indexes for matches 
against what should really be a literal.


Uh, why do you think that?  The function could surely be stable, even
immutable if you grant that a database's encoding can't change.


Yeah, true, that would work.

There are some other disadvantages for making a function call.  You 
couldn't use that kind of literal in any other place where the parser 
calls for a string constant: role names, tablespace locations, 
passwords, copy delimiters, enum values, function body, file names.


There is also a related feature for Unicode escapes in identifiers, and 
it might be good to keep the door open on that.


We could to a dual approach: Convert in the scanner when server encoding 
 is UTF8, and pass on as function call otherwise.  Surely ugly though.


Or pass it on as a separate token type to the analyze phase, but that is 
a lot more work.



Others: What use cases do you envision, and what requirements would they 
create for this feature?


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 There are some other disadvantages for making a function call.  You 
 couldn't use that kind of literal in any other place where the parser 
 calls for a string constant: role names, tablespace locations, 
 passwords, copy delimiters, enum values, function body, file names.

Good point.  I'm okay with supporting the feature only when database
encoding is UTF8.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Andrew Sullivan
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
 Man that's ugly.  Why the ampersand?

 Yeah, excellent question.  It seems completely unnecessary, but it is 
 surely there in the syntax diagram.

Probably because many Unicode representations are done with U+
followed by 4-6 hexadecimal units, but + is problematic for other
reasons (in some vendor's implementation)?

A

-- 
Andrew Sullivan
[EMAIL PROTECTED]
+1 503 667 4564 x104
http://www.commandprompt.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Andrew Sullivan [EMAIL PROTECTED] writes:
 On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
 Yeah, excellent question.  It seems completely unnecessary, but it is 
 surely there in the syntax diagram.

 Probably because many Unicode representations are done with U+
 followed by 4-6 hexadecimal units, but + is problematic for other
 reasons (in some vendor's implementation)?

They could hardly ignore the conflict with the operator interpretation
for +.  The committee has now cut themselves off from ever having a
standard operator named , but I suppose they didn't think ahead to that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers