Re: [HACKERS] Unicode escapes in literals

2008-10-27 Thread Peter Eisentraut

I wrote:
> SQL has the following escape syntax for it:
>
>    U&'special character: \' [ UESCAPE '\' ]


Here is an in-progress patch for this.  It still needs updates in the 
psql scanner and possibly other scanners.  But the server-side 
functionality works.


Index: doc/src/sgml/syntax.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v
retrieving revision 1.123
diff -u -3 -p -c -r1.123 syntax.sgml
*** doc/src/sgml/syntax.sgml  26 Jun 2008 22:24:42 -  1.123
--- doc/src/sgml/syntax.sgml  27 Oct 2008 16:54:26 -
*************** UPDATE "my_table" SET "a" = 5;
*** 190,195 ****
--- 190,247 ----
 
  
 
+ A variant of quoted identifiers allows including escaped Unicode
+ characters identified by their code points.  This variant starts
+ with U& (upper or lower case U followed by
+ ampersand) immediately before the opening double quote, without
+ any spaces in between, for example U&"foo".
+ (Note that this creates an ambiguity with the
+ operator &.  Use spaces around the operator to
+ avoid this problem.)  Inside the quotes, Unicode characters can be
+ specified in escaped form by writing a backslash followed by the
+ four-digit hexadecimal code point number or alternatively a
+ backslash followed by a plus sign followed by a six-digit
+ hexadecimal code point number.  For example, the
+ identifier "data" could be written as
+ 
+ U&"d\0061t\0061"
+ 
+ or equivalently
+ 
+ U&"d\+61t\+61"
+ 
+ The following less trivial example writes the Russian
+ word slon (elephant) in Cyrillic letters:
+ 
+ U&"\0441\043B\043E\043D"
+ 
+
+ 
+
+ If a different escape character than backslash is desired, it can
+ be specified using the UESCAPE clause after the
+ string, for example:
+ 
+ U&"d!0061t!0061" UESCAPE '!'
+ 
+ The escape character can be any single character other than a
+ hexadecimal digit, the plus sign, a single quote, a double quote,
+ or a whitespace character.  Note that the escape character is
+ written in single quotes, not double quotes.
+
+ 
+
+ To include the escape character in the identifier literally, write
+ it twice.
+
+ 
+
+ The Unicode escape syntax works only when the server encoding is
+ UTF8.  When other server encodings are used, only code points in
+ the ASCII range (up to \007F) can be specified.
+
+ 
+
  Quoting an identifier also makes it case-sensitive, whereas
  unquoted names are always folded to lower case.  For example, the
  identifiers FOO, foo, and
*************** UPDATE "my_table" SET "a" = 5;
*** 245,251 ****
   write two adjacent single quotes, e.g.
   'Dianne''s horse'.
   Note that this is not the same as a double-quote
!  character (").
  
  
  
--- 297,303 ----
   write two adjacent single quotes, e.g.
   'Dianne''s horse'.
   Note that this is not the same as a double-quote
!  character ("). 
  
  
  
*************** SELECT 'foo'  'bar';
*** 269,282 ****
   by SQL; PostgreSQL is
   following the standard.)
  
  
- 
   
escape string syntax
   
   
backslash escapes
   
   PostgreSQL also accepts escape
   string constants, which are an extension to the SQL standard.
   An escape string constant is specified by writing the letter
--- 321,339 ----
   by SQL; PostgreSQL is
   following the standard.)
  
+
+ 
+
+ String Constants with C-Style Escapes
  
   
escape string syntax
   
   
backslash escapes
   
+ 
+ 
   PostgreSQL also accepts escape
   string constants, which are an extension to the SQL standard.
   An escape string constant is specified by writing the letter
*************** SELECT 'foo'  'bar';
*** 287,293 ****
   Within an escape string, a backslash character (\) begins a
   C-like backslash escape sequence, in which the combination
   of backslash and following character(s) represent a special byte
!  value:
  
   
Backslash Escape Sequences
--- 344,351 ----
   Within an escape string, a backslash character (\) begins a
   C-like backslash escape sequence, in which the combination
   of backslash and following character(s) represent a special byte
!  value, shown in 
! 
  
   
Backslash Escape Sequences
*************** SELECT 'foo'  'bar';
*** 341,354 ****

   
  
!  It is your responsibility that the byte sequences you create are
!  valid characters in the server character set encoding. Any other
   character following a backslash is taken literally. Thus, to
   include a backslash character, write two backslashes (\\).
   Also, a single quote can be included in an escape string by writing
   \', in addition to the normal w
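
[The patch text is cut off here in the archive.]

The escape rules described in the documentation hunk above (backslash plus four hex digits, backslash plus "+" plus six hex digits, a doubled escape character standing for itself, an optional UESCAPE character, and the ASCII-only restriction for non-UTF8 servers) can be sketched outside the server as follows. This is only an illustration in Python, not the scanner code from the patch; the names decode_unicode_escapes, check_escape_char and the utf8_server flag are made up for the example.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def check_escape_char(esc: str) -> None:
    """Reject escape characters that the documentation text rules out."""
    if len(esc) != 1 or esc in HEX_DIGITS or esc in "+'\"" or esc.isspace():
        raise ValueError(f"invalid Unicode escape character: {esc!r}")


def decode_unicode_escapes(body: str, esc: str = "\\", utf8_server: bool = True) -> str:
    """Decode the text between the quotes of a U&'...' or U&"..." token.

    Hypothetical helper: `body` is the raw text inside the quotes,
    `esc` is the UESCAPE character (backslash by default).
    """
    check_escape_char(esc)
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != esc:
            out.append(ch)
            i += 1
            continue
        if body[i + 1:i + 2] == esc:
            # A doubled escape character stands for itself.
            out.append(esc)
            i += 2
            continue
        if body[i + 1:i + 2] == "+":
            # Long form: escape, plus sign, six hex digits.
            digits, width, step = body[i + 2:i + 8], 6, 8
        else:
            # Short form: escape, four hex digits.
            digits, width, step = body[i + 1:i + 5], 4, 5
        if len(digits) != width or any(d not in HEX_DIGITS for d in digits):
            raise ValueError(f"invalid Unicode escape at offset {i}")
        code_point = int(digits, 16)
        if not utf8_server and code_point > 0x7F:
            # Assumed reading of the docs: non-UTF8 servers accept ASCII only.
            raise ValueError("only ASCII code points allowed unless server encoding is UTF8")
        out.append(chr(code_point))
        i += step
    return "".join(out)


print(decode_unicode_escapes(r"d\0061t\0061"))            # data
print(decode_unicode_escapes("d!0061t!0061", esc="!"))    # data
```

The sketch treats the four-digit and six-digit forms strictly by width, which is one reading of the documentation text; what the real scanner does with short or malformed escapes is not shown in this thread.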

Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Andrew Sullivan <[EMAIL PROTECTED]> writes:
> On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
>> Yeah, excellent question.  It seems completely unnecessary, but it is 
>> surely there in the syntax diagram.

> Probably because many Unicode representations are done with "U+"
> followed by 4-6 hexadecimal units, but "+" is problematic for other
> reasons (in some vendor's implementation)?

They could hardly ignore the conflict with the operator interpretation
for +.  The committee has now cut themselves off from ever having a
standard operator named &, but I suppose they didn't think ahead to that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Andrew Sullivan
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
>> Man that's ugly.  Why the ampersand?
>
> Yeah, excellent question.  It seems completely unnecessary, but it is 
> surely there in the syntax diagram.

Probably because many Unicode representations are done with "U+"
followed by 4-6 hexadecimal units, but "+" is problematic for other
reasons (in some vendor's implementation)?

A

-- 
Andrew Sullivan
[EMAIL PROTECTED]
+1 503 667 4564 x104
http://www.commandprompt.com/



Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Peter Eisentraut <[EMAIL PROTECTED]> writes:
> There are some other disadvantages for making a function call.  You 
> couldn't use that kind of literal in any other place where the parser 
> calls for a string constant: role names, tablespace locations, 
> passwords, copy delimiters, enum values, function body, file names.

Good point.  I'm okay with supporting the feature only when database
encoding is UTF8.

regards, tom lane



Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Peter Eisentraut

Tom Lane wrote:
> Peter Eisentraut <[EMAIL PROTECTED]> writes:
>> SQL has the following escape syntax for it:
>> U&'special character: \' [ UESCAPE '\' ]
>
> Man that's ugly.  Why the ampersand?

Yeah, excellent question.  It seems completely unnecessary, but it is 
surely there in the syntax diagram.

> How do you propose to distinguish
> this from a perfectly legitimate use of the & operator?


Well, technically, there is going to be some conflict, but the practical 
impact should be minimal because:


- There are no spaces allowed between U&' .  We typically suggest spaces 
around binary operators.


- Naming a column "u" might not be terribly common.

- Binary-and with an undecorated string literal is not very common.

Of course, I have no data for these assertions.  An inquiry on -general 
might give more insight.
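
To make the conflict concrete, here is a toy tokenizer in Python (not the flex rules from the patch; the token names and the operator subset are invented for the illustration). It treats U&' as the start of a Unicode string only when the three characters are adjacent, which is the rule described in the first point above.

```python
import re

# Order matters: the U&'...' alternative is tried before identifiers,
# so "u&'x'" is read as one Unicode string token.
TOKEN_RE = re.compile(r"""
    (?P<ustring>[uU]&'(?:[^']|'')*')     # U&'...' with no spaces inside the prefix
  | (?P<string>'(?:[^']|'')*')           # ordinary string constant
  | (?P<ident>[A-Za-z_][A-Za-z_0-9]*)    # unquoted identifier or keyword
  | (?P<op>[&+\-*/<>=]+)                 # a small subset of operator characters
  | (?P<space>\s+)
""", re.VERBOSE)


def tokenize(sql: str) -> list[tuple[str, str]]:
    """Split `sql` into (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# With spaces, "u" stays a column name and "&" stays an operator.
print(tokenize("SELECT u & 'x'"))
# Without spaces, the same characters become a single Unicode string token.
print(tokenize("SELECT u&'x'"))
```

Under this rule only an expression written exactly as u&'...' (a column named u, the & operator, and a bare string literal with no spaces) changes meaning, which is the combination the three points above argue is rare.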


>> 2. Convert this syntax to a function call.  But that would then create a 
>> lot of inconsistencies, such as needing functional indexes for matches 
>> against what should really be a literal.
>
> Uh, why do you think that?  The function could surely be stable, even
> immutable if you grant that a database's encoding can't change.


Yeah, true, that would work.

There are some other disadvantages for making a function call.  You 
couldn't use that kind of literal in any other place where the parser 
calls for a string constant: role names, tablespace locations, 
passwords, copy delimiters, enum values, function body, file names.


There is also a related feature for Unicode escapes in identifiers, and 
it might be good to keep the door open on that.


We could do a dual approach: Convert in the scanner when server encoding 
is UTF8, and pass on as function call otherwise.  Surely ugly though.


Or pass it on as a separate token type to the analyze phase, but that is 
a lot more work.



Others: What use cases do you envision, and what requirements would they 
create for this feature?




Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Peter Eisentraut <[EMAIL PROTECTED]> writes:
> SQL has the following escape syntax for it:
> U&'special character: \' [ UESCAPE '\' ]

Man that's ugly.  Why the ampersand?  How do you propose to distinguish
this from a perfectly legitimate use of the & operator?

> 2. Convert this syntax to a function call.  But that would then create a 
> lot of inconsistencies, such as needing functional indexes for matches 
> against what should really be a literal.

Uh, why do you think that?  The function could surely be stable, even
immutable if you grant that a database's encoding can't change.

regards, tom lane
