Re: [HACKERS] Unicode escapes in literals

2008-10-27 Thread Peter Eisentraut

I wrote:

SQL has the following escape syntax for it:

   U'special character: \' [ UESCAPE '\' ]


Here is an in-progress patch for this.  It still needs updates in the 
psql scanner and possibly other scanners.  But the server-side 
functionality works.


Index: doc/src/sgml/syntax.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v
retrieving revision 1.123
diff -u -3 -p -c -r1.123 syntax.sgml
*** doc/src/sgml/syntax.sgml26 Jun 2008 22:24:42 -  1.123
--- doc/src/sgml/syntax.sgml27 Oct 2008 16:54:26 -
*** UPDATE my_table SET a = 5;
*** 190,195 
--- 190,247 
 /para
  
 para
+ A variant of quoted identifiers allows including escaped Unicode
+ characters identified by their code points.  This variant starts
+ with literalU/literal (upper or lower case U followed by
+ ampersand) immediately before the opening double quote, without
+ any spaces in between, for example literalUfoo/literal.
+ (Note that this creates an ambiguity with the
+ operator literal/literal.  Use spaces around the operator to
+ avoid this problem.)  Inside the quotes, Unicode characters can be
+ specified in escaped form by writing a backslash followed by the
+ four-digit hexadecimal code point number or alternatively a
+ backslash followed by a plus sign followed by a six-digt
+ hexadecimal code point number.  For example, the
+ identifier literaldata/literal could be written as
+ programlisting
+ Ud\0061t\0061
+ /programlisting
+ or equivalently
+ programlisting
+ Ud\+61t\+61
+ /programlisting
+ The following less trivial example writes the Russian
+ word quoteslon/quote (elephant) in Cyrillic letters:
+ programlisting
+ U\0441\043B\043E\043D
+ /programlisting
+/para
+ 
+para
+ If a different escape character than backslash is desired, it can
+ be specified using the literalUESCAPE/literal clause after the
+ string, for example:
+ programlisting
+ Ud!0061t!0061 UESCAPE '!'
+ /programlisting
+ The escape character can be any single character other than a
+ hexadecimal digit, the plus sign, a single quote, a double quote,
+ or a whitespace character.  Note that the escape character is
+ written in single quotes, not double quotes.
+/para
+ 
+para
+ To include the escape character in the identifier literally, write
+ it twice.
+/para
+ 
+para
+ The Unicode escape syntax works only when the server encoding is
+ UTF8.  When other server encodings are used, only code points in
+ the ASCII range (up to literal\007F/literal) can be specified.
+/para
+ 
+para
  Quoting an identifier also makes it case-sensitive, whereas
  unquoted names are always folded to lower case.  For example, the
  identifiers literalFOO/literal, literalfoo/literal, and
*** UPDATE my_table SET a = 5;
*** 245,251 
   write two adjacent single quotes, e.g.
   literal'Dianne''s horse'/literal.
   Note that this is emphasisnot/ the same as a double-quote
!  character (literal/).
  /para
  
  para
--- 297,303 
   write two adjacent single quotes, e.g.
   literal'Dianne''s horse'/literal.
   Note that this is emphasisnot/ the same as a double-quote
!  character (literal/). !-- font-lock sanity:  --
  /para
  
  para
*** SELECT 'foo'  'bar';
*** 269,282 
   by acronymSQL/acronym; productnamePostgreSQL/productname is
   following the standard.)
  /para
  
- para
   indexterm
primaryescape string syntax/primary
   /indexterm
   indexterm
primarybackslash escapes/primary
   /indexterm
   productnamePostgreSQL/productname also accepts quoteescape/
   string constants, which are an extension to the SQL standard.
   An escape string constant is specified by writing the letter
--- 321,339 
   by acronymSQL/acronym; productnamePostgreSQL/productname is
   following the standard.)
  /para
+/sect3
+ 
+sect3 id=sql-syntax-strings-escape
+ titleString Constants with C-Style Escapes/title
  
   indexterm
primaryescape string syntax/primary
   /indexterm
   indexterm
primarybackslash escapes/primary
   /indexterm
+ 
+ para
   productnamePostgreSQL/productname also accepts quoteescape/
   string constants, which are an extension to the SQL standard.
   An escape string constant is specified by writing the letter
*** SELECT 'foo'  'bar';
*** 287,293 
   Within an escape string, a backslash character (literal\/) begins a
   C-like firsttermbackslash escape/ sequence, in which the combination
   of backslash and following character(s) represent a special byte
!  value:
  
   table id=sql-backslash-table
titleBackslash Escape Sequences/title

[HACKERS] Unicode escapes in literals

2008-10-23 Thread Peter Eisentraut
I would like to add an escape mechanism to PostgreSQL for entering 
arbitrary Unicode characters into string literals.  We currently only 
have the option of entering the character directly via the keyboard or 
cut-and-paste, which is difficult for a number of reasons, such as when 
the font doesn't have the character, and entering the UTF8-encoded bytes 
using the E'...' strings, which is hardly usable.


SQL has the following escape syntax for it:

   U'special character: \' [ UESCAPE '\' ]

where  is the hexadecimal Unicode codepoint.  So this is pretty much 
just another variant on what the E'...' syntax does.


The trick is that since we have user-definable encoding conversion 
routines, we can't convert the Unicode codepoint to the server encoding 
in the scanner stage.  I imagine there are two ways to address this:


1. Only support this syntax when the server encoding is UTF8.  This 
would probably cover most use cases anyway.  We could have limited 
support for characters in the ASCII range for all server encodings.


2. Convert this syntax to a function call.  But that would then create a 
lot of inconsistencies, such as needing functional indexes for matches 
against what should really be a literal.


I'd be happy to start with UTF8 support only.  Other ideas?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 SQL has the following escape syntax for it:
 U'special character: \' [ UESCAPE '\' ]

Man that's ugly.  Why the ampersand?  How do you propose to distinguish
this from a perfectly legitimate use of the  operator?

 2. Convert this syntax to a function call.  But that would then create a 
 lot of inconsistencies, such as needing functional indexes for matches 
 against what should really be a literal.

Uh, why do you think that?  The function could surely be stable, even
immutable if you grant that a database's encoding can't change.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Peter Eisentraut

Tom Lane wrote:

Peter Eisentraut [EMAIL PROTECTED] writes:

SQL has the following escape syntax for it:
U'special character: \' [ UESCAPE '\' ]


Man that's ugly.  Why the ampersand?


Yeah, excellent question.  It seems completely unnecessary, but it is 
surely there in the syntax diagram.



How do you propose to distinguish
this from a perfectly legitimate use of the  operator?


Well, technically, there is going to be some conflict, but the practical 
impact should be minimal because:


- There are no spaces allowed between U' .  We typically suggest spaces 
around binary operators.


- Naming a column u might not be terribly common.

- Binary-and with an undecorated string literal is not very common.

Of course, I have no data for these assertions.  An inquiry on -general 
might give more insight.


2. Convert this syntax to a function call.  But that would then create a 
lot of inconsistencies, such as needing functional indexes for matches 
against what should really be a literal.


Uh, why do you think that?  The function could surely be stable, even
immutable if you grant that a database's encoding can't change.


Yeah, true, that would work.

There are some other disadvantages for making a function call.  You 
couldn't use that kind of literal in any other place where the parser 
calls for a string constant: role names, tablespace locations, 
passwords, copy delimiters, enum values, function body, file names.


There is also a related feature for Unicode escapes in identifiers, and 
it might be good to keep the door open on that.


We could to a dual approach: Convert in the scanner when server encoding 
 is UTF8, and pass on as function call otherwise.  Surely ugly though.


Or pass it on as a separate token type to the analyze phase, but that is 
a lot more work.



Others: What use cases do you envision, and what requirements would they 
create for this feature?


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 There are some other disadvantages for making a function call.  You 
 couldn't use that kind of literal in any other place where the parser 
 calls for a string constant: role names, tablespace locations, 
 passwords, copy delimiters, enum values, function body, file names.

Good point.  I'm okay with supporting the feature only when database
encoding is UTF8.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Andrew Sullivan
On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
 Man that's ugly.  Why the ampersand?

 Yeah, excellent question.  It seems completely unnecessary, but it is 
 surely there in the syntax diagram.

Probably because many Unicode representations are done with U+
followed by 4-6 hexadecimal units, but + is problematic for other
reasons (in some vendor's implementation)?

A

-- 
Andrew Sullivan
[EMAIL PROTECTED]
+1 503 667 4564 x104
http://www.commandprompt.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode escapes in literals

2008-10-23 Thread Tom Lane
Andrew Sullivan [EMAIL PROTECTED] writes:
 On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:
 Yeah, excellent question.  It seems completely unnecessary, but it is 
 surely there in the syntax diagram.

 Probably because many Unicode representations are done with U+
 followed by 4-6 hexadecimal units, but + is problematic for other
 reasons (in some vendor's implementation)?

They could hardly ignore the conflict with the operator interpretation
for +.  The committee has now cut themselves off from ever having a
standard operator named , but I suppose they didn't think ahead to that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers