Re: [HACKERS] Unicode string literals versus the world

2009-05-29 Thread Peter Eisentraut
On Friday 29 May 2009 06:31:23 Bruce Momjian wrote:
 Peter Eisentraut wrote:
  On Tuesday 05 May 2009 03:01:05 Tom Lane wrote:
   Peter Eisentraut pete...@gmx.net writes:
On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
I think we can handle that and the cases Tom presents by erroring
out when the U syntax is used with stdstr off.
   
Proposed patch for that attached.
  
   I have not been able to think of any security hole in that proposal,
   so this patch seems acceptable to me.  I wonder though whether any
   corresponding change is needed in psql's lexer, and if so how should
   it react exactly to the rejection case.
 
  I had thought about that as well, but concluded that no additional change
  is necessary.
 
  Note that the *corresponding* change would be psql complaining I don't
  like what you entered, versus the just-committed behavior that psql is
  indifferent and the server complains I don't like what you sent me.
 
  In any case, the point of the change is to prevent confusion in client
  programs, so if we had to patch psql to make sense, then the change would
  have been pointless in the first place.

 I assume there is no TODO here.

No, it should be fine.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-05-28 Thread Bruce Momjian
Peter Eisentraut wrote:
 On Tuesday 05 May 2009 03:01:05 Tom Lane wrote:
  Peter Eisentraut pete...@gmx.net writes:
   On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
   I think we can handle that and the cases Tom presents by erroring out
   when the U syntax is used with stdstr off.
  
   Proposed patch for that attached.
 
  I have not been able to think of any security hole in that proposal,
  so this patch seems acceptable to me.  I wonder though whether any
  corresponding change is needed in psql's lexer, and if so how should
  it react exactly to the rejection case.
 
 I had thought about that as well, but concluded that no additional change is 
 necessary.
 
 Note that the *corresponding* change would be psql complaining I don't like 
 what you entered, versus the just-committed behavior that psql is 
 indifferent 
 and the server complains I don't like what you sent me.
 
 In any case, the point of the change is to prevent confusion in client 
 programs, so if we had to patch psql to make sense, then the change would 
 have 
 been pointless in the first place.

I assume there is no TODO here.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-05-05 Thread Peter Eisentraut
On Tuesday 05 May 2009 03:01:05 Tom Lane wrote:
 Peter Eisentraut pete...@gmx.net writes:
  On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
  I think we can handle that and the cases Tom presents by erroring out
  when the U syntax is used with stdstr off.
 
  Proposed patch for that attached.

 I have not been able to think of any security hole in that proposal,
 so this patch seems acceptable to me.  I wonder though whether any
 corresponding change is needed in psql's lexer, and if so how should
 it react exactly to the rejection case.

I had thought about that as well, but concluded that no additional change is 
necessary.

Note that the *corresponding* change would be psql complaining I don't like 
what you entered, versus the just-committed behavior that psql is indifferent 
and the server complains I don't like what you sent me.

In any case, the point of the change is to prevent confusion in client 
programs, so if we had to patch psql to make sense, then the change would have 
been pointless in the first place.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-05-04 Thread Peter Eisentraut
On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
 On Tuesday 14 April 2009 17:13:00 Marko Kreen wrote:
  If the parsing does not happen in 2 passes and it does not take account
  of stdstr setting then the  default breakage would be:
 
 stdstr=off, U' \' UESCAPE '!'.

 I think we can handle that and the cases Tom presents by erroring out when
 the U syntax is used with stdstr off.

Proposed patch for that attached.
Index: doc/src/sgml/syntax.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v
retrieving revision 1.131
diff -u -3 -p -r1.131 syntax.sgml
--- doc/src/sgml/syntax.sgml	27 Apr 2009 16:27:36 -	1.131
+++ doc/src/sgml/syntax.sgml	4 May 2009 22:08:27 -
@@ -500,6 +500,17 @@ Uamp;'d!0061t!+61' UESCAPE '!'
 /para
 
 para
+ Also, the Unicode escape syntax for string constants only works
+ when the configuration
+ parameter xref linkend=guc-standard-conforming-strings is
+ turned on.  This is because otherwise this syntax could confuse
+ clients that parse the SQL statements to the point that it could
+ lead to SQL injections and similar security issues.  If the
+ parameter is set to off, this syntax will be rejected with an
+ error message.
+/para
+
+para
  To include the escape character in the string literally, write it
  twice.
 /para
Index: src/backend/parser/scan.l
===
RCS file: /cvsroot/pgsql/src/backend/parser/scan.l,v
retrieving revision 1.151
diff -u -3 -p -r1.151 scan.l
--- src/backend/parser/scan.l	19 Apr 2009 21:08:54 -	1.151
+++ src/backend/parser/scan.l	4 May 2009 22:08:27 -
@@ -469,6 +469,11 @@ other			.
 	startlit();
 }
 {xusstart}		{
+	if (!standard_conforming_strings)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(unsafe use of string constant with Unicode escapes),
+ errdetail(String constants with Unicode escapes cannot be used when standard_conforming_strings is off.)));
 	SET_YYLLOC();
 	BEGIN(xus);
 	startlit();
Index: src/test/regress/expected/strings.out
===
RCS file: /cvsroot/pgsql/src/test/regress/expected/strings.out,v
retrieving revision 1.35
diff -u -3 -p -r1.35 strings.out
--- src/test/regress/expected/strings.out	29 Oct 2008 08:04:54 -	1.35
+++ src/test/regress/expected/strings.out	4 May 2009 22:08:27 -
@@ -22,6 +22,7 @@ ERROR:  syntax error at or near ' - thi
 LINE 3: ' - third line'
 ^
 -- Unicode escapes
+SET standard_conforming_strings TO on;
 SELECT U'd\0061t\+61' AS Ud\0061t\+61;
  data 
 --
@@ -34,6 +35,18 @@ SELECT U'd!0061t\+61' UESCAPE '!' A
  dat\+61
 (1 row)
 
+SELECT U' \' UESCAPE '!' AS tricky;
+ tricky 
+
+  \
+(1 row)
+
+SELECT 'tricky' AS U\ UESCAPE '!';
+   \
+
+ tricky
+(1 row)
+
 SELECT U'wrong: \061';
 ERROR:  invalid Unicode escape value at or near \061'
 LINE 1: SELECT U'wrong: \061';
@@ -46,6 +59,32 @@ SELECT U'wrong: +0061' UESCAPE '+';
 ERROR:  invalid Unicode escape character at or near +'
 LINE 1: SELECT U'wrong: +0061' UESCAPE '+';
  ^
+SET standard_conforming_strings TO off;
+SELECT U'd\0061t\+61' AS Ud\0061t\+61;
+ERROR:  unsafe use of string constant with Unicode escapes
+DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U'd!0061t\+61' UESCAPE '!' AS Ud*0061t\+61 UESCAPE '*';
+ERROR:  unsafe use of string constant with Unicode escapes
+DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U' \' UESCAPE '!' AS tricky;
+ERROR:  unsafe use of string constant with Unicode escapes
+DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT 'tricky' AS U\ UESCAPE '!';
+   \
+
+ tricky
+(1 row)
+
+SELECT U'wrong: \061';
+ERROR:  unsafe use of string constant with Unicode escapes
+DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U'wrong: \+0061';
+ERROR:  unsafe use of string constant with Unicode escapes
+DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U'wrong: +0061' UESCAPE '+';
+ERROR:  unsafe use of string constant with Unicode escapes
+DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+RESET standard_conforming_strings;
 --
 -- test conversions between various string types
 -- E021-10 implicit casting among the character data types
Index: src/test/regress/sql/strings.sql
===
RCS file: /cvsroot/pgsql/src/test/regress/sql/strings.sql,v
retrieving revision 

Re: [HACKERS] Unicode string literals versus the world

2009-05-04 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
 I think we can handle that and the cases Tom presents by erroring out when
 the U syntax is used with stdstr off.

 Proposed patch for that attached.

I have not been able to think of any security hole in that proposal,
so this patch seems acceptable to me.  I wonder though whether any
corresponding change is needed in psql's lexer, and if so how should
it react exactly to the rejection case.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-05-04 Thread Hiroshi Saito

Hi.

quick test for great patch. !

== SCRIPT ==
set CLIENT_ENCODING to 'UTF-8';
DROP TABLE ucheck CASCADE;
CREATE TABLE ucheck (key VARCHAR(10) PRIMARY KEY, data NCHAR(50));
set STANDARD_CONFORMING_STRINGS to on;
INSERT INTO ucheck VALUES('ucheck1',u'\68ee\9dd7\5916');
SELECT * FROM ucheck;
set CLIENT_ENCODING to 'SHIFTJIS2004';
SELECT * FROM ucheck;
set STANDARD_CONFORMING_STRINGS to off;
INSERT INTO ucheck VALUES('ucheck2',u'\68ee\9dd7\5916');

=== As for psql ===
This should notice the console of Japanese windows-XP about code can't  to display. 


C:\workpsql
psql (8.4beta1)
help でヘルプを表示します.

HIROSHI=# set CLIENT_ENCODING to 'UTF-8';
SET
HIROSHI=# DROP TABLE ucheck CASCADE;
DROP TABLE
HIROSHI=# CREATE TABLE ucheck (key VARCHAR(10) PRIMARY KEY, data NCHAR(50));
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index ucheck_pkey for
table ucheck
CREATE TABLE
HIROSHI=# set STANDARD_CONFORMING_STRINGS to on;
SET
HIROSHI=# INSERT INTO ucheck VALUES('ucheck',u'\68ee\9dd7\5916');
INSERT 0 1
HIROSHI=# SELECT * FROM ucheck;
 key   | data
+---
ucheck | 譽ョ鮃怜、・
(1 行)


HIROSHI=# set CLIENT_ENCODING to 'SHIFTJIS2004';
SET
HIROSHI=# SELECT * FROM ucheck;
 key   | data
+---
ucheck | 森・外
(1 行)

HIROSHI=# set STANDARD_CONFORMING_STRINGS to off;
SET
HIROSHI=# INSERT INTO ucheck VALUES('ucheck2',u'\68ee\9dd7\5916');
ERROR:  unsafe use of string constant with Unicode escapes
DETAIL:  String constants with Unicode escapes cannot be used when standard_conf
orming_strings is off.

=== As for pgAdminIII ===
pgadmin is shown very well.!!:-)
http://winpg.jp/~saito/pg_work/ucheck/jisx0213.png

P.S)
This test obtains the same result by MimerDB.

Regards,
Hiroshi Saito

- Original Message - 
From: Tom Lane t...@sss.pgh.pa.us




Peter Eisentraut pete...@gmx.net writes:

On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:

I think we can handle that and the cases Tom presents by erroring out when
the U syntax is used with stdstr off.



Proposed patch for that attached.


I have not been able to think of any security hole in that proposal,
so this patch seems acceptable to me.  I wonder though whether any
corresponding change is needed in psql's lexer, and if so how should
it react exactly to the rejection case.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-17 Thread Sam Mason
On Thu, Apr 16, 2009 at 12:08:37PM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  I've failed to keep up with the discussion so I'm not sure where this
  conversation has got to!  Is the consensus for 8.4 to enable SQL2003
  style Ulit escaped literals if and only if standard_conforming_strings
  is set?
 
 That was Peter's proposal, and no one's shot a hole in it yet ...

Just noticed that the spec only supports four hex digits; this would
imply that support for anything outside the BMP would have to be done
by encoding the character as a surrogate pair.  If the code doesn't
do this already (the original patch didn't seem to) these should be
normalised back to a single character in a similar manner to Marko's
recent patch[1].

 I think the discussion about whether/how to add a Unicode extension to
 E''-style literals is 8.5 material.  We are in beta so now is not
 the time to add new features, especially ones that weren't even on the
 TODO list before.

OK, sounds reasonable.

-- 
  Sam  http://samason.me.uk/

 [1] http://archives.postgresql.org//pgsql-hackers/2009-04/msg00904.php

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-17 Thread Tom Lane
Sam Mason s...@samason.me.uk writes:
 Just noticed that the spec only supports four hex digits;

Better read it again.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-17 Thread Sam Mason
On Fri, Apr 17, 2009 at 10:15:57AM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  Just noticed that the spec only supports four hex digits;
 
 Better read it again.

You're right of course.  My ability to read patches seems not to be very
good.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Sam Mason
On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
 On 4/15/09, Tom Lane t...@sss.pgh.pa.us wrote:
  Given Martijn's complaint about more-than-16-bit code points, I think
   the \u proposal is not mature enough to go into 8.4.  We can think
   about some version of that later, if there's enough interest.
 
 I think it would be good idea. Basically we should pick one from
 couple of pre-existing sane schemes.  Here is quick summary
 of Python, Perl and Java:
 
 Python [1]:
 
   \u - 16-bit codepoint
   \U - 32-bit codepoint
   \N{char-name}  - Characted by name

Microsoft have also gone this way in C#, named code points are not
supported however.

 Perl [2]:
 
   \x{..} - {} contains hexadecimal codepoint
   \N{char-name}  - Unicode char name

Looks OK, but the 'x' seems somewhat redundant.  Why not just:

  \{}

This would be following the BitC[2] project, especially if it was more
like:

  \{U+}

e.g.

  \{U+03BB}

would be the lowercase lambda character.  Added appeal is in the fact
that this (i.e. U+03BB) is how the Unicode consortium spells code
points.

 Java [3]:
 
   \u - 16-bit codepoint

AFAIK, Java isn't the best reference to choose; it assumed from an early
point in its design that Unicode characters were at most 16bits and
hence had to switch its internal representation to UTF-16.  I don't
program much Java these days to know how it's all worked out, but it
would be interesting to hear from people who regularly have to deal with
characters outside the BMP (i.e. code points greater than 65535).

-- 
  Sam  http://samason.me.uk/

 [1] http://msdn.microsoft.com/en-us/library/aa664669(VS.71).aspx
 [2] http://www.bitc-lang.org/docs/bitc/spec.html#stringlit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Marko Kreen
On 4/16/09, Sam Mason s...@samason.me.uk wrote:
 On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
   On 4/15/09, Tom Lane t...@sss.pgh.pa.us wrote:

   Given Martijn's complaint about more-than-16-bit code points, I think
 the \u proposal is not mature enough to go into 8.4.  We can think
 about some version of that later, if there's enough interest.
  
   I think it would be good idea. Basically we should pick one from
   couple of pre-existing sane schemes.  Here is quick summary
   of Python, Perl and Java:
  
   Python [1]:
  
 \u - 16-bit codepoint
 \U - 32-bit codepoint
 \N{char-name}  - Characted by name


 Microsoft have also gone this way in C#, named code points are not
  supported however.

And it handles also non-BMP codepoints with \u escape similarly:

  http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences

This makes it even more standard.

   Perl [2]:
  
 \x{..} - {} contains hexadecimal codepoint
 \N{char-name}  - Unicode char name


 Looks OK, but the 'x' seems somewhat redundant.  Why not just:

   \{}

  This would be following the BitC[2] project, especially if it was more
  like:

   \{U+}

  e.g.

   \{U+03BB}

  would be the lowercase lambda character.  Added appeal is in the fact
  that this (i.e. U+03BB) is how the Unicode consortium spells code
  points.

We already got yet-another-unique-way-of-escaping-unicode with U.

Now let's try to support some actual standard also.

   Java [3]:
  
 \u - 16-bit codepoint


 AFAIK, Java isn't the best reference to choose; it assumed from an early
  point in its design that Unicode characters were at most 16bits and
  hence had to switch its internal representation to UTF-16.  I don't
  program much Java these days to know how it's all worked out, but it
  would be interesting to hear from people who regularly have to deal with
  characters outside the BMP (i.e. code points greater than 65535).

You did not read my mail carefully enough - the Java and also Python/C#
already support non-BMP chars with '\u' and exactly the same (utf16) way.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Andrew Dunstan



Tatsuo Ishii wrote:
I could live with either. Wikipedia says: The characters outside the 
first plane usually have very specialized or rare use. For years we 
rejected all characters beyond the first plane, and while that's fixed 
now, the volume of complaints wasn't huge.
  

I you mean first plane as BMP (i.e. 16bit range), above is not true
for PostgreSQL 7.3 or later at least.



Oops. I meant 8.2 or later.

  


Umm, that's what I said. We used to do it like that, but we don't any more.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote:
 On 4/16/09, Sam Mason s...@samason.me.uk wrote:
  Microsoft have also gone this way in C#, named code points are not
  supported however.
 
 And it handles also non-BMP codepoints with \u escape similarly:
 
   http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
 
 This makes it even more standard.

I fail to see what you're pointing out here; as far as I understand it,
\u is for BMP code points and \U extends the range out to 32bit code
points.  I can't see anything about non-BMP and \u in the above link,
you appear free to write your own surrogate pairs but that seems like an
independent issue.

I'd not realised before that C# is specified to use UTF-16 as its
internal encoding.

   This would be following the BitC[2] project, especially if it was more
   like:
 
\{U+}
 
 We already got yet-another-unique-way-of-escaping-unicode with U.
 
 Now let's try to support some actual standard also.

That comes across *very* negatively; I hope it's just a language issue.

I read your parent post as soliciting opinions on possible ways to
encode Unicode characters in PG's literals.  The U'lit' was criticised,
you posted some suggestions, I followed up with what I hoped to be a
useful addition.  It seems useful here to separate de jure from de
facto standards; implementing U'lit' would be following the de jure
standard, anything else would be de facto.

A survey of existing SQL implementations would seem to be more appropriate
as well:

Oracle: UNISTR(string-literal) and \

  It looks as though Oracle originally used UCS-2 internally (i.e. BMP
  only) but more recently Unicode support has been improved to allow
  other planes.

MS-SQL Server: 

  can't find anything remotely useful; best seems to be to use
  NCHAR(integer-expression) which looks somewhat unmaintainable.

DB2: Ustring-literal and \xx

  i.e. it follows the SQL-2003 spec

FireBird:

  can't find much either; support looks somewhat low on the ground

MySQL:

  same again, seems to assume query is encoded in UTF-8

Summary seems to be that either I'm bad at searching or support for
Unicode doesn't seem very complete in the database world and people work
around it somehow.

 You did not read my mail carefully enough - the Java and also Python/C#
 already support non-BMP chars with '\u' and exactly the same (utf16) way.

Again, I think this may be a language issue; if not then more verbose
explanations help, maybe something like sorry, I obviously didn't
explain that very well.  You will of course felt you explained it
perfectly well, but everybody enters a discussion with different
intuitions and biases, email has a nasty habit of accentuating these
differences and compounding them with language problems.

I'd never heard of UTF-16 surrogate pairs before this discussion and
hence didn't realise that it's valid to have a surrogate pair in place
of a single code point.  The docs say that D800 DF02 corresponds to
U+10302, Python would appear to follow my intuitions in that:

  ord(u'\uD800\uDF02')

results in an error instead of giving back 66306, as I'd expect.  Is
this a bug in Python, my understanding, or something else?

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Tom Lane
Sam Mason s...@samason.me.uk writes:
 I'd never heard of UTF-16 surrogate pairs before this discussion and
 hence didn't realise that it's valid to have a surrogate pair in place
 of a single code point.  The docs say that D800 DF02 corresponds to
 U+10302, Python would appear to follow my intuitions in that:

   ord(u'\uD800\uDF02')

 results in an error instead of giving back 66306, as I'd expect.  Is
 this a bug in Python, my understanding, or something else?

I might be wrong, but I think surrogate pairs are expressly forbidden in
all representations other than UTF16/UCS2.  We definitely forbid them
when validating UTF-8 strings --- that's per an RFC recommendation.
It sounds like Python is doing the same.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 10:54:16AM -0400, Tom Lane wrote:
 Sam Mason s...@samason.me.uk writes:
  I'd never heard of UTF-16 surrogate pairs before this discussion and
  hence didn't realise that it's valid to have a surrogate pair in place
  of a single code point.  The docs say that D800 DF02 corresponds to
  U+10302, Python would appear to follow my intuitions in that:
 
ord(u'\uD800\uDF02')
 
  results in an error instead of giving back 66306, as I'd expect.  Is
  this a bug in Python, my understanding, or something else?
 
 I might be wrong, but I think surrogate pairs are expressly forbidden in
 all representations other than UTF16/UCS2.  We definitely forbid them
 when validating UTF-8 strings --- that's per an RFC recommendation.
 It sounds like Python is doing the same.

OK, that's good.  I thought I was missing something.  A minor point is
that in UCS2 each 16bit value is exactly one character and characters
outside the BMP aren't supported, hence the need for UTF-16.

I've failed to keep up with the discussion so I'm not sure where this
conversation has got to!  Is the consensus for 8.4 to enable SQL2003
style Ulit escaped literals if and only if standard_conforming_strings
is set?  This seems easiest for client code as it can use this
exclusively for knowing what to do with backslashes.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Marko Kreen
On 4/16/09, Sam Mason s...@samason.me.uk wrote:
 On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote:
   On 4/16/09, Sam Mason s...@samason.me.uk wrote:
   Microsoft have also gone this way in C#, named code points are not
supported however.
  
   And it handles also non-BMP codepoints with \u escape similarly:
  
 http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
  
   This makes it even more standard.


 I fail to see what you're pointing out here; as far as I understand it,
  \u is for BMP code points and \U extends the range out to 32bit code
  points.  I can't see anything about non-BMP and \u in the above link,
  you appear free to write your own surrogate pairs but that seems like an
  independent issue.

Ok, maybe I glanced too quickly over that page.

I can't find definite deference only hint on several pages:

  \U \U Unicode escape sequence for surrogate pairs.

Which hints that you can aswell enter the pairs directly: \uxx\uxx.
If I'd be language designer, I would not see any reason to disallow it.

And anyway, at least mono seems to support it:

using System;
public class HelloWorld {
public static void Main() {
Console.WriteLine(\uD800\uDF02\n);
}
}

It will output single UTF8 character.  I think this should settle it.

  I'd not realised before that C# is specified to use UTF-16 as its
  internal encoding.

 This would be following the BitC[2] project, especially if it was more
 like:
   
  \{U+}
  

  We already got yet-another-unique-way-of-escaping-unicode with U.
  
   Now let's try to support some actual standard also.


 That comes across *very* negatively; I hope it's just a language issue.

  I read your parent post as soliciting opinions on possible ways to
  encode Unicode characters in PG's literals.  The U'lit' was criticised,
  you posted some suggestions, I followed up with what I hoped to be a
  useful addition.  It seems useful here to separate de jure from de
  facto standards; implementing U'lit' would be following the de jure
  standard, anything else would be de facto.

  A survey of existing SQL implementations would seem to be more appropriate
  as well:

  Oracle: UNISTR(string-literal) and \

   It looks as though Oracle originally used UCS-2 internally (i.e. BMP
   only) but more recently Unicode support has been improved to allow
   other planes.

  MS-SQL Server:

   can't find anything remotely useful; best seems to be to use
   NCHAR(integer-expression) which looks somewhat unmaintainable.

  DB2: Ustring-literal and \xx

   i.e. it follows the SQL-2003 spec

  FireBird:

   can't find much either; support looks somewhat low on the ground

  MySQL:

   same again, seems to assume query is encoded in UTF-8

  Summary seems to be that either I'm bad at searching or support for
  Unicode doesn't seem very complete in the database world and people work
  around it somehow.

The de-facto about Postgres is stdstr=off.  Even if not, E'' strings
are still better for various things, so it would be good if they also
aquired unicode-capabilities.

   You did not read my mail carefully enough - the Java and also Python/C#
   already support non-BMP chars with '\u' and exactly the same (utf16) way.


 Again, I think this may be a language issue; if not then more verbose
  explanations help, maybe something like sorry, I obviously didn't
  explain that very well.  You will of course felt you explained it
  perfectly well, but everybody enters a discussion with different
  intuitions and biases, email has a nasty habit of accentuating these
  differences and compounding them with language problems.

  I'd never heard of UTF-16 surrogate pairs before this discussion and
  hence didn't realise that it's valid to have a surrogate pair in place
  of a single code point.  The docs say that D800 DF02 corresponds to
  U+10302, Python would appear to follow my intuitions in that:

   ord(u'\uD800\uDF02')

  results in an error instead of giving back 66306, as I'd expect.  Is
  this a bug in Python, my understanding, or something else?

Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
that is - plain 16 or 32-bit values.  Seems your python is compiled with
UCS2, not UCS4.  As I understand, in UCS2 mode it simply takes surrogate
values as-is.  From ord() docs:

  If a unicode argument is given and Python was built with UCS2 Unicode,
  then the character’s code point must be in the range [0..65535]
  inclusive; otherwise the string length is two, and a TypeError will
  be raised.

So only in UCS4 mode it detects surrogates and converts them to internal
representation.  (Which in Postgres case would be UTF8.)

Or perhaps it is partially UTF16 aware - eg. I/O routines do unterstand
UTF16 but low-level string routines do not:

  print %s % u'\uD800\uDF02'

seems to handle it properly.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to 

Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Andrew Dunstan



Tom Lane wrote:

Sam Mason s...@samason.me.uk writes:
  

I'd never heard of UTF-16 surrogate pairs before this discussion and
hence didn't realise that it's valid to have a surrogate pair in place
of a single code point.  The docs say that D800 DF02 corresponds to
U+10302, Python would appear to follow my intuitions in that:



  

  ord(u'\uD800\uDF02')



  

results in an error instead of giving back 66306, as I'd expect.  Is
this a bug in Python, my understanding, or something else?



I might be wrong, but I think surrogate pairs are expressly forbidden in
all representations other than UTF16/UCS2.  We definitely forbid them
when validating UTF-8 strings --- that's per an RFC recommendation.
It sounds like Python is doing the same.


  


You mustn't encode the surrogate, but it's up to us how we allow people 
to designate a given code point.


Frankly, I think we shouldn't provide for using surrogates at all. I 
would prefer something like \u for BMP items and \U as the 
straight 32bit designation of a higher codepoint.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Marko Kreen
On 4/16/09, Tom Lane t...@sss.pgh.pa.us wrote:
 Sam Mason s...@samason.me.uk writes:
   I'd never heard of UTF-16 surrogate pairs before this discussion and
   hence didn't realise that it's valid to have a surrogate pair in place
   of a single code point.  The docs say that D800 DF02 corresponds to
   U+10302, Python would appear to follow my intuitions in that:

 ord(u'\uD800\uDF02')

   results in an error instead of giving back 66306, as I'd expect.  Is
   this a bug in Python, my understanding, or something else?


 I might be wrong, but I think surrogate pairs are expressly forbidden in
  all representations other than UTF16/UCS2.  We definitely forbid them
  when validating UTF-8 strings --- that's per an RFC recommendation.
  It sounds like Python is doing the same.

The point here is that Python/Java/C# allow them for escaping non-BMP
unicode values, irrespective of their interal encoding.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 06:34:06PM +0300, Marko Kreen wrote:
 Which hints that you can aswell enter the pairs directly: \uxx\uxx.
 If I'd be language designer, I would not see any reason to disallow it.
 
 And anyway, at least mono seems to support it:
 
 using System;
 public class HelloWorld {
 public static void Main() {
 Console.WriteLine(\uD800\uDF02\n);
 }
 }
 
 It will output single UTF8 character.  I think this should settle it.

I don't have any .net stuff installed so can't test; but C# is defined
to use UTF-16 as its internal representation so it would make sense if
the above gets treated as a single character internally.  However, if it
used any other encoding the above should be treated as an error.

 The de-facto about Postgres is stdstr=off.  Even if not, E'' strings
 are still better for various things, so it would be good if they also
 aquired unicode-capabilities.

OK, this seems independent of the U'lit' discussion that started the
thread.  Note that PG already supports UTF8; if you want the character
I've been using in my examples up-thread, you can do:

  SELECT E'\xF0\x90\x8C\x82';

I have a feeling that this is predicated on the server_encoding being
set to utf8 and this can only be done at database creation time.
Another alternative would be to use the convert_from function, i.e:

  SELECT convert_from(E'\xF0\x90\x8C\x82', 'UTF8');

Never had to do this though, so there may be better options available.

 Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
 that is - plain 16 or 32-bit values.  Seems your python is compiled with
 UCS2, not UCS4.

Cool, I didn't know that.  I believe mine is UCS4 as I can do:

  ord(u'\U00010302')

and I get 66306 back rather than an error.

 As I understand, in UCS2 mode it simply takes surrogate
 values as-is.

UCS2 doesn't have surrogate pairs, or at least I believe it's considered
a bug if you don't get an error when you present it with one.

 From ord() docs:
 
   If a unicode argument is given and Python was built with UCS2 Unicode,
   then the character’s code point must be in the range [0..65535]
   inclusive; otherwise the string length is two, and a TypeError will
   be raised.
 
 So only in UCS4 mode it detects surrogates and converts them to internal
 representation.  (Which in Postgres case would be UTF8.)

I think you mean UTF-16 instead of UCS4; but otherwise, yes.

 Or perhaps it is partially UTF16 aware - eg. I/O routines do unterstand
 UTF16 but low-level string routines do not:
 
   print %s % u'\uD800\uDF02'
 
 seems to handle it properly.

Yes, I get this as well.  It's all a bit weird, which is why I was
asking if this a bug in Python, my understanding, or something else.

When I do:

  python EOF | hexdump -C
  print u\uD800\uDF02
  EOF

to see what it's doing I get an error which I'm not expecting, hence I
think it's probably my understanding.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-16 Thread Tom Lane
Sam Mason s...@samason.me.uk writes:
 I've failed to keep up with the discussion so I'm not sure where this
 conversation has got to!  Is the consensus for 8.4 to enable SQL2003
 style Ulit escaped literals if and only if standard_conforming_strings
 is set?

That was Peter's proposal, and no one's shot a hole in it yet ...

I think the discussion about whether/how to add a Unicode extension to
E''-style literals is 8.5 material.  We are in beta so now is not
the time to add new features, especially ones that weren't even on the
TODO list before.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Sam Mason
On Tue, Apr 14, 2009 at 04:01:48PM +0300, Peter Eisentraut wrote:
 On Saturday 11 April 2009 18:20:47 Sam Mason wrote:
  I can't see much support in the other database engines; searched for
  Oracle, MS-SQL, DB2 and Firebird.  MySQL has it planned for 7.1, so not
  for a while.
 
 DB2 supports it, as far as I know.

Doh, yes it does doesn't it.  Sorry I searched for a bit and failed to
find anything before.  Looks as though the signal to noise ratio was far
too low as I've just searched again and found a (single) reference to
their docs describing the feature[1].

I've also just noticed that the MySQL todo item points to several other
implementations and how they handle Unicode escape sequences.  The most
common option (bearing in mind that this is a sample of mainly FOSS
databases) seems to be doing some variant of '\u0123', as in the style
of Python.  This is only supported for literals and no support for
identifiers appears to be provided.

-- 
  Sam  http://samason.me.uk/

 [1] 
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.sql.ref.doc/doc/r731.html

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread David E. Wheeler

On Apr 15, 2009, at 4:45 AM, Sam Mason wrote:


Doh, yes it does doesn't it.  Sorry I searched for a bit and failed to
find anything before.  Looks as though the signal to noise ratio was  
far

too low as I've just searched again and found a (single) reference to
their docs describing the feature[1].


This is ugly, but not completely unpalatable:

U'\0141ód\017A is a city in Poland'   U'c:\\temp'   U'@+01D11E'  
UESCAPE '@'


Wouldn't we just then say that U'' strings are always standard- 
conforming?


Best,

David
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Martijn van Oosterhout
On Tue, Apr 14, 2009 at 08:10:54AM -0400, Andrew Dunstan wrote:
 Marko Kreen wrote:
 I still stand on my proposal, how about extending E'' strings with
 unicode escapes (eg. \u)?  The E'' strings are already more
 clearly defined than '' and they are our own, we don't need to
 consider random standards, but can consider our sanity.
 
 I suspect there would be lots more support in the user community, where 
 \u is well understood in a number of contexts (Java and ECMAScript, 
 for example). It's also tolerably sane.

By the way, that's an example of how to do it wrong, there are more
than 2^16 unicode characters, you want to be able to support the full
21-bit range if you're going to do it right.

FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which
makes it clear it's hex and doesn't make assumptions as to how many
characters are used.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 Wouldn't we just then say that U'' strings are always standard- 
 conforming?

That's exactly what's causing the problem --- they are, but there
is lots of software that won't know it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Greg Stark
On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Wouldn't we just then say that U'' strings are always standard-
 conforming?

 That's exactly what's causing the problem --- they are, but there
 is lots of software that won't know it.


We could say U'' escapes only work if you have
standards_conforming_strings set to true.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Greg Stark
On Wed, Apr 15, 2009 at 6:52 PM, Greg Stark st...@enterprisedb.com wrote:
 On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Wouldn't we just then say that U'' strings are always standard-
 conforming?

 That's exactly what's causing the problem --- they are, but there
 is lots of software that won't know it.


 We could say U'' escapes only work if you have
 standards_conforming_strings set to true.

Or say that if you have standards_conforming_strings false then any
string which contains a literal \ or ' is an error. You shouldn't ever
really need either since you could use the unicode escape for either
after all.



-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Marko Kreen
On 4/15/09, Greg Stark st...@enterprisedb.com wrote:
 On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   Wouldn't we just then say that U'' strings are always standard-
   conforming?
  
   That's exactly what's causing the problem --- they are, but there
   is lots of software that won't know it.

 We could say U'' escapes only work if you have
  standards_conforming_strings set to true.

Whats wrong with requiring U to conform with stdstr=off quoting rules?

You can use UESCAPE if you dont want to double backslashes.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 Whats wrong with requiring U to conform with stdstr=off quoting rules?

The sole and only excuse for that misbegotten syntax is to be exactly
SQL spec compliant --- otherwise we might as well pick something saner.
So it needs to work like stdstr=on.  I thought Peter's proposal of
rejecting it altogether when stdstr=off might be reasonable.  The space
sensitivity around the  still sucks, but I have not (yet) thought of
a credible security exploit for that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Marko Kreen
On 4/15/09, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
  Whats wrong with requiring U to conform with stdstr=off quoting rules?

  The sole and only excuse for that misbegotten syntax is to be exactly
  SQL spec compliant --- otherwise we might as well pick something saner.
  So it needs to work like stdstr=on.  I thought Peter's proposal of
  rejecting it altogether when stdstr=off might be reasonable.  The space
  sensitivity around the  still sucks, but I have not (yet) thought of
  a credible security exploit for that.

So the U syntax is only available if stdstr=on?  Sort of makes sense.

As both this and the doubling-\\ way would mean we should have usable
alternative in case of stdstr=off also, so in the end we have agreed
to accept \u also?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Andrew Dunstan



Martijn van Oosterhout wrote:

On Tue, Apr 14, 2009 at 08:10:54AM -0400, Andrew Dunstan wrote:
  

Marko Kreen wrote:


I still stand on my proposal, how about extending E'' strings with
unicode escapes (eg. \u)?  The E'' strings are already more
clearly defined than '' and they are our own, we don't need to
consider random standards, but can consider our sanity.
  
I suspect there would be lots more support in the user community, where 
\u is well understood in a number of contexts (Java and ECMAScript, 
for example). It's also tolerably sane.



By the way, that's an example of how to do it wrong, there are more
than 2^16 unicode characters, you want to be able to support the full
21-bit range if you're going to do it right.

FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which
makes it clear it's hex and doesn't make assumptions as to how many
characters are used.
  


I could live with either. Wikipedia says: The characters outside the 
first plane usually have very specialized or rare use. For years we 
rejected all characters beyond the first plane, and while that's fixed 
now, the volume of complaints wasn't huge.


cheers

andrew


Have a nice day,
  


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 As both this and the doubling-\\ way would mean we should have usable
 alternative in case of stdstr=off also, so in the end we have agreed
 to accept \u also?

Given Martijn's complaint about more-than-16-bit code points, I think
the \u proposal is not mature enough to go into 8.4.  We can think
about some version of that later, if there's enough interest.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Marko Kreen
On 4/15/09, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
  As both this and the doubling-\\ way would mean we should have usable
   alternative in case of stdstr=off also, so in the end we have agreed
   to accept \u also?

 Given Martijn's complaint about more-than-16-bit code points, I think
  the \u proposal is not mature enough to go into 8.4.  We can think
  about some version of that later, if there's enough interest.

I think it would be good idea. Basically we should pick one from
couple of pre-existing sane schemes.  Here is quick summary
of Python, Perl and Java:

Python [1]:

  \u - 16-bit codepoint
  \U - 32-bit codepoint
  \N{char-name}  - Characted by name

Perl [2]:

  \x{..} - {} contains hexadecimal codepoint
  \N{char-name}  - Unicode char name

Java [3]:

  \u - 16-bit codepoint

Perl is OK, but the \x makes think of literal hex-decimal, and thats
because they have extented their literal byte-escapes to support unicode.
So I doubt we should promote it more.  \u{} would be nicer, but that
would not be an established standard.

Both Python and Java allow using \u to encode higher codepoints with
surrogate pairs.  Which sort of makes it standard and Python superset
of Java.  (Obviously that does not mean you need to store them
as surrogate pairs.)

Problem with having only \u would be that this would make hard to enter
higher codepoints manually.  So \U would also be good to have.

There is no hurry with \N{} but if it would be possible, it would be
main reason to have custom unicode escaping.

So my proposal would be Python escapes without \N{}.

Whether we should have it in 8.4, I don't know.  If we will have U,
but it does not work with stdstr=off, this seems to hint we should have
some other escaping method available in 8.4 for stdstd=off users?

-- 
marko

[1] http://docs.python.org/reference/lexical_analysis.html#string-literals
[2] http://perldoc.perl.org/perluniintro.html
[3] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Tatsuo Ishii
  I still stand on my proposal, how about extending E'' strings with
  unicode escapes (eg. \u)?  The E'' strings are already more
  clearly defined than '' and they are our own, we don't need to
  consider random standards, but can consider our sanity.

  I suspect there would be lots more support in the user community, where 
  \u is well understood in a number of contexts (Java and ECMAScript, 
  for example). It's also tolerably sane.
  
 
  By the way, that's an example of how to do it wrong, there are more
  than 2^16 unicode characters, you want to be able to support the full
  21-bit range if you're going to do it right.
 
  FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which
  makes it clear it's hex and doesn't make assumptions as to how many
  characters are used.

 
 I could live with either. Wikipedia says: The characters outside the 
 first plane usually have very specialized or rare use. For years we 
 rejected all characters beyond the first plane, and while that's fixed 
 now, the volume of complaints wasn't huge.

I you mean first plane as BMP (i.e. 16bit range), above is not true
for PostgreSQL 7.3 or later at least.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-15 Thread Tatsuo Ishii
   I still stand on my proposal, how about extending E'' strings with
   unicode escapes (eg. \u)?  The E'' strings are already more
   clearly defined than '' and they are our own, we don't need to
   consider random standards, but can consider our sanity.
 
   I suspect there would be lots more support in the user community, where 
   \u is well understood in a number of contexts (Java and ECMAScript, 
   for example). It's also tolerably sane.
   
  
   By the way, that's an example of how to do it wrong, there are more
   than 2^16 unicode characters, you want to be able to support the full
   21-bit range if you're going to do it right.
  
   FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which
   makes it clear it's hex and doesn't make assumptions as to how many
   characters are used.
 
  
  I could live with either. Wikipedia says: The characters outside the 
  first plane usually have very specialized or rare use. For years we 
  rejected all characters beyond the first plane, and while that's fixed 
  now, the volume of complaints wasn't huge.
 
 I you mean first plane as BMP (i.e. 16bit range), above is not true
 for PostgreSQL 7.3 or later at least.

Oops. I meant 8.2 or later.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Saturday 11 April 2009 00:54:25 Tom Lane wrote:
 It gets worse though: I have seldom seen such a badly designed piece of
 syntax as the Unicode string syntax --- see
 http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL
-SYNTAX-STRINGS-UESCAPE

 You scan the string, and then after that they tell you what the escape
 character is!?  Not to mention the obvious ambiguity with  as an
 operator.

 If we let this go into 8.4, our previous rounds with security holes
 caused by careless string parsing will look like a day at the beach.
 No frontend that isn't fully cognizant of the Unicode string syntax is
 going to parse such things correctly --- it's going to be trivial for
 a bad guy to confuse a quoting mechanism as to what's an escape and what
 isn't.

Note that the escape character marks the Unicode escapes; it doesn't affect the 
quote characters that delimit the string.  So offhand I can't see any potential 
for quote confusion/SQL injection type problems.  Please elaborate if you see 
a problem.

If there are problems, we could consider getting rid of the UESCAPE clause.  
Without it, the U'' strings would behave much like the E'' strings.  But I'd 
like to understand the problem first.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Saturday 11 April 2009 21:50:29 Josh Berkus wrote:
 On 4/11/09 11:47 AM, Marko Kreen wrote:
  On 4/11/09, Tom Lanet...@sss.pgh.pa.us  wrote:
It gets worse though: I have seldom seen such a badly designed piece
  of syntax as the Unicode string syntax --- see
   
  http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#
 SQL-SYNTAX-STRINGS-UESCAPE

 WTF?  Whose feature is this?  What's the use case?

The use case is approximately the same as #; in HTML: entering Unicode 
characters that your screen or keyboard cannot easily produce.  It's a 
desperately needed feature for me.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Marko Kreen
On 4/14/09, Peter Eisentraut pete...@gmx.net wrote:
 On Saturday 11 April 2009 00:54:25 Tom Lane wrote:
   It gets worse though: I have seldom seen such a badly designed piece of
   syntax as the Unicode string syntax --- see
   http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL
  -SYNTAX-STRINGS-UESCAPE
  
   You scan the string, and then after that they tell you what the escape
   character is!?  Not to mention the obvious ambiguity with  as an
   operator.
  
   If we let this go into 8.4, our previous rounds with security holes
   caused by careless string parsing will look like a day at the beach.
   No frontend that isn't fully cognizant of the Unicode string syntax is
   going to parse such things correctly --- it's going to be trivial for
   a bad guy to confuse a quoting mechanism as to what's an escape and what
   isn't.


 Note that the escape character marks the Unicode escapes; it doesn't affect 
 the
  quote characters that delimit the string.  So offhand I can't see any 
 potential
  for quote confusion/SQL injection type problems.  Please elaborate if you see
  a problem.

  If there are problems, we could consider getting rid of the UESCAPE clause.
  Without it, the U'' strings would behave much like the E'' strings.  But I'd
  like to understand the problem first.

I think the problem is that they should not act like E'' strings, but they
should act like plain '' strings - they should follow stdstr setting.

That way existing tools that may (or may not..) understand E'' and stdstr
settings, but definitely have not heard about U'' strings can still
parse the SQL without new surprises.

If they already act that way then keeping U should be fine.

And if UESCAPE does not affect main string parsing, but is handled in
second pass going over parsed string - like bytea \ - then that should
also be fine and should not cause any new surprises.

But if not, it must go.

I would prefer that such quoting extensions would wait until
stdstr=on setting is the only mode Postgres will operate.
Fitting new quoting ways to environment with flippable stdstr setting
will be rather painful for everyone.

I still stand on my proposal, how about extending E'' strings with
unicode escapes (eg. \u)?  The E'' strings are already more
clearly defined than '' and they are our own, we don't need to
consider random standards, but can consider our sanity.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Marko Kreen
On 4/14/09, Peter Eisentraut pete...@gmx.net wrote:
 On Tuesday 14 April 2009 14:38:38 Marko Kreen wrote:
   I think the problem is that they should not act like E'' strings, but they
   should act like plain '' strings - they should follow stdstr setting.
  
   That way existing tools that may (or may not..) understand E'' and stdstr
   settings, but definitely have not heard about U'' strings can still
   parse the SQL without new surprises.


 Can you be more specific in what surprises you expect?  What algorithms do
  you suppose those existing tools use and what expectations do they have?

If the parsing does not happen in 2 passes and it does not take account
of stdstr setting then the  default breakage would be:

   stdstr=off, U' \' UESCAPE '!'.

And anything, whose security or functionality depends on parsing SQL
can be broken that way.

Broken functionality would be eg. Slony (or other replication solution)
distributing developer-written SQL code to bunch of nodes.  It needs to
parse text file to SQL statements and execute them separately.

There are probably other solutions who expect to understand SQL
at least token level to function correctly.  (pgpool, java has
probably something depending on it, etc.)

   I still stand on my proposal, how about extending E'' strings with
   unicode escapes (eg. \u)?  The E'' strings are already more
   clearly defined than '' and they are our own, we don't need to
   consider random standards, but can consider our sanity.


 This doesn't excite me.  I think the tendency should be to get rid of E''
  usage, because its definition of escape sequences is single-byte and ASCII
  centric and thus overall a legacy construct.

Why are you concentrating only on \0xx escapes?  The \\, \n, etc
seem standard and forward looking enough.  Yes, unicode escapes are
missing but we can add them without breaking anything.

  Certainly, we will want to keep
  around E'' for a long time or forever, but it is a legitimate goal for
  application writers to not use it, which is after all the reason behind this
  whole standards-conforming strings project.  I wouldn't want to have a
  forward-looking feature such as the Unicode escapes be burdened with that 
 kind
  of legacy behavior.

  Also note that Unicode escapes are also available for identifiers, for which
  there is no existing E that you can add it to.

Well, I was not rejecting the standard quoting, but suggesting
postponing until the stdstr mess is sorted out.  We can use \uXX
in meantime and I think most Postgres users would prefer to keep
using it...

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On Saturday 11 April 2009 00:54:25 Tom Lane wrote:
 If we let this go into 8.4, our previous rounds with security holes
 caused by careless string parsing will look like a day at the beach.

 Note that the escape character marks the Unicode escapes; it doesn't
 affect the quote characters that delimit the string.  So offhand I
 can't see any potential for quote confusion/SQL injection type
 problems.  Please elaborate if you see a problem.

The problem is the interaction with non-standard-conforming strings.

Case 1:

select u'foo\' uescape ',' ...

The backend will see the backslash as just a data character, and
will think that ... is live SQL text.  A non-Unicode-literal-aware
frontend will think that the backslash escapes the second quote, the
comma is live SQL text, and the ... is quoted material.  Construction
of an actual SQL injection attack is left as an exercise for the reader,
but certainly the raw material is here.

Case 2:

select u'foo' uescape '\' ...

Again, any existing frontend code will think that the backslash quotes
the final quote and the ... is quoted material.  This one is
particularly nasty because we allow arbitrary amounts of whitespace and
commenting on either side of uescape:


select u'foo' 
/* hello joe, do you /* understand nested comments today?
*/
-- yes, this one too */
uescape
-- but not this one /*
'\' ...

I suspect that it's actually impossible to parse such a thing correctly
without a full-fledged flex lexer or something of equivalent complexity.
Certainly it's a couple of orders of magnitude harder than it is for
either standard-conforming or E'' literals.

Case 3:

select u'foo\' uescape ',' ...

select u  'foo\' uescape ',' ...

In the first form the ... is live SQL, in the second form it is quoted
material.  This means that you might correctly validate a query and then
have your results invalidated by later processing that innocently adds
or removes whitespace.  (This is particularly nasty in a standard that
demands we parse x/-1 and x / -1 the same ...)

So what we've got here is a whole new set of potential SQL injection
attacks by confusing frontend literal-syntax checking, plus a truly
staggering increase in the minimum *required* complexity of such
checking.  I understand the usefulness of being able to write Unicode
code points, but they're not useful enough to justify this syntax.

This thread has already mentioned a couple of ways we could add the
same facility without making any lexical-level changes, at least for
data values.  I admit that the SQL:2008 way also covers Unicode code
points in identifiers, which we can't emulate without a lexical change;
but frankly I think the use-case for that is so thin as to be almost
nonexistent.  Who is going to choose identifiers that they cannot easily
type on their keyboards?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Andrew Dunstan



Marko Kreen wrote:

I still stand on my proposal, how about extending E'' strings with
unicode escapes (eg. \u)?  The E'' strings are already more
clearly defined than '' and they are our own, we don't need to
consider random standards, but can consider our sanity.

  


I suspect there would be lots more support in the user community, where 
\u is well understood in a number of contexts (Java and ECMAScript, 
for example). It's also tolerably sane.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 I would prefer that such quoting extensions would wait until
 stdstr=on setting is the only mode Postgres will operate.
 Fitting new quoting ways to environment with flippable stdstr setting
 will be rather painful for everyone.

It would certainly be a lot safer to wait until non-standard-conforming
strings don't exist anymore.  The problem is that that may never happen,
and is certainly not on the roadmap to happen in the foreseeable future.

 I still stand on my proposal, how about extending E'' strings with
 unicode escapes (eg. \u)?  The E'' strings are already more
 clearly defined than '' and they are our own, we don't need to
 consider random standards, but can consider our sanity.

That's one way we could proceed.  The other proposal that seemed
attractive to me was a decode-like function:

uescape('foo\00e9bar')
uescape('foo\00e9bar', '\')

(double all the backslashes if you assume not
standard_conforming_strings).  The arguments in favor of this one
are (1) you can apply it to the result of an expression, it's not
strictly tied to literals; and (2) it's a lot lower-footprint solution
since it doesn't affect basic literal handling.  If you wish to suppose
that this is only a stopgap until someday when we can implement the SQL
standard syntax more safely, then low footprint is good.  One could
even imagine back-porting this into existing releases as a user-defined
function.

The solution with \u in extended literals is probably workable too.
I'm slightly worried about the possibility of issues with code that
thinks it knows what an E-literal means but doesn't really.  In
particular something might think it knows that \u just means u,
and proceed to strip the backslash.  I don't see a path for that to
become a security hole though, only a garden-variety bug.  So I could
live with that one on the grounds of being easier to use (which it
would be, because of less typing compared to uescape()).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 14:38:38 Marko Kreen wrote:
 I think the problem is that they should not act like E'' strings, but they
 should act like plain '' strings - they should follow stdstr setting.

 That way existing tools that may (or may not..) understand E'' and stdstr
 settings, but definitely have not heard about U'' strings can still
 parse the SQL without new surprises.

Can you be more specific in what surprises you expect?  What algorithms do 
you suppose those existing tools use and what expectations do they have?

 I still stand on my proposal, how about extending E'' strings with
 unicode escapes (eg. \u)?  The E'' strings are already more
 clearly defined than '' and they are our own, we don't need to
 consider random standards, but can consider our sanity.

This doesn't excite me.  I think the tendency should be to get rid of E'' 
usage, because its definition of escape sequences is single-byte and ASCII 
centric and thus overall a legacy construct.  Certainly, we will want to keep 
around E'' for a long time or forever, but it is a legitimate goal for 
application writers to not use it, which is after all the reason behind this 
whole standards-conforming strings project.  I wouldn't want to have a 
forward-looking feature such as the Unicode escapes be burdened with that kind 
of legacy behavior.

Also note that Unicode escapes are also available for identifiers, for which 
there is no existing E that you can add it to.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Saturday 11 April 2009 18:20:47 Sam Mason wrote:
 I can't see much support in the other database engines; searched for
 Oracle, MS-SQL, DB2 and Firebird.  MySQL has it planned for 7.1, so not
 for a while.

DB2 supports it, as far as I know.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Robert Haas
On Tue, Apr 14, 2009 at 8:53 AM, Peter Eisentraut pete...@gmx.net wrote:
 This doesn't excite me.  I think the tendency should be to get rid of E''
 usage, because its definition of escape sequences is single-byte and ASCII
 centric and thus overall a legacy construct.  Certainly, we will want to keep
 around E'' for a long time or forever, but it is a legitimate goal for
 application writers to not use it, which is after all the reason behind this
 whole standards-conforming strings project.  I wouldn't want to have a
 forward-looking feature such as the Unicode escapes be burdened with that kind
 of legacy behavior.

 Also note that Unicode escapes are also available for identifiers, for which
 there is no existing E that you can add it to.

Maybe I've just got my head deeply in the sand, but I don't understand
what the alternative to E'' supposedly is.  How am I supposed to write
the equivalent of E'\t\n\f' without using E''?  The
standard_conforming_strings syntax apparently supports no escapes of
any kind, which seems so hideously inconvenient that I can't even
imagine why someone wants that behavior.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Maybe I've just got my head deeply in the sand, but I don't understand
 what the alternative to E'' supposedly is.  How am I supposed to write
 the equivalent of E'\t\n\f' without using E''?  The
 standard_conforming_strings syntax apparently supports no escapes of
 any kind, which seems so hideously inconvenient that I can't even
 imagine why someone wants that behavior.

Well, quite aside from issues of compatibility with standards and other
databases, I'm sure there are lots of Windows users who are more
interested in being able to store a Windows pathname without doubling
their backslashes than they are in being able to type readable names
for ASCII control characters.  After all, in most cases you can get
those characters into a string just by typing them (especially if you
aren't using readline or something like it).

BTW, does anyone know whether Unicode includes the ASCII control
characters ... ie, is \u0009 a name for tab?  If so, maybe this
syntax is in part an attempt to cover that use-case in the standard.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 20:35:21 Robert Haas wrote:
 Maybe I've just got my head deeply in the sand, but I don't understand
 what the alternative to E'' supposedly is.  How am I supposed to write
 the equivalent of E'\t\n\f' without using E''?

Well, the first alternative is to type those characters in literally.  The 
second alternative is the U'' syntax. ;-)  The third alternative is to design 
applications that don't need this, because the processing behavior of those 
characters is quite unportable.  But of course in some cases using the E'' 
syntax is the most convenient.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread David E. Wheeler

On Apr 14, 2009, at 11:22 AM, Tom Lane wrote:


BTW, does anyone know whether Unicode includes the ASCII control
characters ... ie, is \u0009 a name for tab?  If so, maybe this
syntax is in part an attempt to cover that use-case in the standard.


Yes, you can use, e.g., #x0009; in HTML to represent a tab character.

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 17:13:00 Marko Kreen wrote:
 If the parsing does not happen in 2 passes and it does not take account
 of stdstr setting then the  default breakage would be:

stdstr=off, U' \' UESCAPE '!'.

I think we can handle that and the cases Tom presents by erroring out when the 
U syntax is used with stdstr off.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 21:22:29 Tom Lane wrote:
 BTW, does anyone know whether Unicode includes the ASCII control
 characters ... ie, is \u0009 a name for tab?  If so, maybe this
 syntax is in part an attempt to cover that use-case in the standard.

Yes on both.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 18:54:33 Tom Lane wrote:
 The other proposal that seemed
 attractive to me was a decode-like function:

 uescape('foo\00e9bar')
 uescape('foo\00e9bar', '\')

This was discussed previously, but rejected with the following argument:

There are some other disadvantages for making a function call.  You 
couldn't use that kind of literal in any other place where the parser 
calls for a string constant: role names, tablespace locations, 
passwords, copy delimiters, enum values, function body, file names.

Some of those appear to be very plausible use cases.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 17:32:00 Tom Lane wrote:
  I admit that the SQL:2008 way also covers Unicode code
 points in identifiers, which we can't emulate without a lexical change;
 but frankly I think the use-case for that is so thin as to be almost
 nonexistent.  Who is going to choose identifiers that they cannot easily
 type on their keyboards?

For example, table names are automatically generated, or you write a test case 
for weird looking names, or you want to add special characters in an 
identifier that will later be displayed somewhere, or in general you are 
writing an application for a foreign language.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 I think we can handle that and the cases Tom presents by erroring out
 when the U syntax is used with stdstr off.

I think you're missing the point --- this is not about whether the
syntax is unambiguous (it is already) but about whether a frontend that
doesn't understand it 100% will be secure against subversion.  I have no
confidence in the latter assumption.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Robert Haas
On Tue, Apr 14, 2009 at 2:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Maybe I've just got my head deeply in the sand, but I don't understand
 what the alternative to E'' supposedly is.  How am I supposed to write
 the equivalent of E'\t\n\f' without using E''?  The
 standard_conforming_strings syntax apparently supports no escapes of
 any kind, which seems so hideously inconvenient that I can't even
 imagine why someone wants that behavior.

 Well, quite aside from issues of compatibility with standards and other
 databases, I'm sure there are lots of Windows users who are more
 interested in being able to store a Windows pathname without doubling
 their backslashes than they are in being able to type readable names
 for ASCII control characters.  After all, in most cases you can get
 those characters into a string just by typing them (especially if you
 aren't using readline or something like it).

Well, that's fine, but that's a long way from Peter's statement that
I think the tendency should be to get rid of E'' usage.  It's only
been a minor inconvenience for me to change my applications to use
E'', but I certainly don't see how I could get by without it, and it's
far more like other programming languages that I use (e.g. C) than the
standard syntax is.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On Tuesday 14 April 2009 18:54:33 Tom Lane wrote:
 The other proposal that seemed
 attractive to me was a decode-like function:
 
 uescape('foo\00e9bar')
 uescape('foo\00e9bar', '\')

 This was discussed previously, but rejected with the following argument:

 There are some other disadvantages for making a function call.  You 
 couldn't use that kind of literal in any other place where the parser 
 calls for a string constant: role names, tablespace locations, 
 passwords, copy delimiters, enum values, function body, file names.

I'm less than convinced that those are really plausible use-cases for
characters that one is unable to type directly.  However, I'll grant the
point.  So that narrows us down to considering the \u extension to E''
strings as a saner and safer alternative to the spec's syntax.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 21:48:12 Tom Lane wrote:
 Peter Eisentraut pete...@gmx.net writes:
  I think we can handle that and the cases Tom presents by erroring out
  when the U syntax is used with stdstr off.

 I think you're missing the point --- this is not about whether the
 syntax is unambiguous (it is already) but about whether a frontend that
 doesn't understand it 100% will be secure against subversion.  I have no
 confidence in the latter assumption.

I think I am getting the point quite well.  Do you have an example how this 
can be subverted?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Tom Lane
Meredith L. Patterson m...@osogato.com writes:
 Tom Lane wrote:
 I suspect that it's actually impossible to parse such a thing correctly
 without a full-fledged flex lexer or something of equivalent complexity.

 Is there a reason not to use a full-fledged flex lexer?

The point is that that's a pretty large imposition on client code that
we don't control or maintain, in order to get a feature that could be
gotten in much less dangerous ways that don't impact any code outside
PG.

 I'd be willing to take a crack at such a thing, but I'm working 80-hour
 weeks through the end of June and likely wouldn't be able to put in any
 time on it till then. So I definitely couldn't promise anything for 8.4,
 but if putting it off till 8.5 works, sign me up.

Shall we pass your name on to every package using Postgres, then?  This
is *not* about code within Postgres.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Meredith L. Patterson
Tom Lane wrote:
 I suspect that it's actually impossible to parse such a thing correctly
 without a full-fledged flex lexer or something of equivalent complexity.
 Certainly it's a couple of orders of magnitude harder than it is for
 either standard-conforming or E'' literals.

Is there a reason not to use a full-fledged flex lexer?

I'd be willing to take a crack at such a thing, but I'm working 80-hour
weeks through the end of June and likely wouldn't be able to put in any
time on it till then. So I definitely couldn't promise anything for 8.4,
but if putting it off till 8.5 works, sign me up.

Cheers,
--mlp
_
Meredith L. Patterson
Founder and CTO
Osogato, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Meredith L. Patterson
Tom Lane wrote:
 This is *not* about code within Postgres.

One typically provides libraries for this sort of thing, but your point
is taken; suggestion withdrawn.

--mlp
_
Meredith L. Patterson
Founder and CTO
Osogato, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Marko Kreen
On 4/14/09, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Eisentraut pete...@gmx.net writes:
   On Tuesday 14 April 2009 18:54:33 Tom Lane wrote:
   The other proposal that seemed
   attractive to me was a decode-like function:
  
   uescape('foo\00e9bar')
   uescape('foo\00e9bar', '\')

   This was discussed previously, but rejected with the following argument:

   There are some other disadvantages for making a function call.  You
   couldn't use that kind of literal in any other place where the parser
   calls for a string constant: role names, tablespace locations,
   passwords, copy delimiters, enum values, function body, file names.


 I'm less than convinced that those are really plausible use-cases for
  characters that one is unable to type directly.  However, I'll grant the
  point.  So that narrows us down to considering the \u extension to E''
  strings as a saner and safer alternative to the spec's syntax.

My vote would go to \u.  The U may be sql standard but it's different
from any established practical standard.


Alternative would be to make U follow stdstr setting:

stdstr=on - you get fully standard-conforming syntax:

  U'\xxx' UESCAPE '\'

stdstr=off - you need to follow old quoting rules:

  U'\\xxx' UESCAPE '\\'

This would result in safe, and when stdstr=on, fully standard compliant
syntax.  Only downside would be that in practice - stdstr=off - it would
be unusable.


Third alternative would be to do both of them - \u as a usable method
and safe-U to mark the checkbox for SQL-standard compliance.
If we do want U, I would prefer that to U-only syntax.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-14 Thread Robert Haas
On Tue, Apr 14, 2009 at 2:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Well, that's fine, but that's a long way from Peter's statement that
 I think the tendency should be to get rid of E'' usage.

 Bear in mind that that's Peter's opinion; it's not necessarily shared
 by anyone else.  I was just responding to your assertion of the
 diametrically opposed position that non-E strings are useless (which
 I don't share either).

Useless might be an overstatement, but I certainly have gotten no
benefit out of them.  It seems decidedly odd to me to propose that
users embed literal control characters in their code/SQL scripts in
lieu of using escape sequences.  If that were a normal and reasonable
thing to do then I expect C, C++, Perl, Python, Ruby, Javascript, and
countless other languages wouldn't provide this functionality.  In
reality, most of them do provide it, sometimes as the only option.
And as to your point about Windows pathnames, if you're trying to
avoid escaping lots of backslashes, dollar-quoting is a perfectly
adequate solution.

The real motivation for standard_conforming_strings is the one
embedded in the name.  And I'm fine with that.  But I don't see E''
disappearing from my code any time soon.  I periodically have a need
for things like foo || E'\n' || bar and it screws up the formatting of
the code to write foo || '
' || bar.  Writing it that way, it's also entirely non-obvious whether
there are any spaces or tabs after the opening quote and before the
newline, and if so whether or not they are intentional.  But I don't
mind if someone ELSE likes the new, standard strings - it just isn't
me.  :-)

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-11 Thread Sam Mason
On Fri, Apr 10, 2009 at 05:54:25PM -0400, Tom Lane wrote:
 It gets worse though: I have seldom seen such a badly designed piece of
 syntax as the Unicode string syntax --- see
 http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE
 
 I think we need to give very serious consideration to ripping out that
 feature.

I'd agree it's comically bad syntax; I just had a look in the archives
and it was only put in a few months ago:

  http://archives.postgresql.org/pgsql-hackers/2008-10/msg01169.php

I can't see much support in the other database engines; searched for
Oracle, MS-SQL, DB2 and Firebird.  MySQL has it planned for 7.1, so not
for a while.

  http://forge.mysql.com/worklog/task.php?id=3529

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-11 Thread Andrew Dunstan


Tom Lane wrote:


It gets worse though: I have seldom seen such a badly designed piece of
syntax as the Unicode string syntax --- see
http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

You scan the string, and then after that they tell you what the escape
character is!?  Not to mention the obvious ambiguity with  as an
operator.

If we let this go into 8.4, our previous rounds with security holes
caused by careless string parsing will look like a day at the beach.
No frontend that isn't fully cognizant of the Unicode string syntax is
going to parse such things correctly --- it's going to be trivial for
a bad guy to confuse a quoting mechanism as to what's an escape and what
isn't.

I think we need to give very serious consideration to ripping out that
feature.


  


+1

I don't recall a great deal of discussion about it, and it certainly 
looks pretty horrible now you point it out.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-11 Thread Marko Kreen
On 4/11/09, Tom Lane t...@sss.pgh.pa.us wrote:
  It gets worse though: I have seldom seen such a badly designed piece of
  syntax as the Unicode string syntax --- see
  
 http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

  You scan the string, and then after that they tell you what the escape
  character is!?  Not to mention the obvious ambiguity with  as an
  operator.

  If we let this go into 8.4, our previous rounds with security holes
  caused by careless string parsing will look like a day at the beach.
  No frontend that isn't fully cognizant of the Unicode string syntax is
  going to parse such things correctly --- it's going to be trivial for
  a bad guy to confuse a quoting mechanism as to what's an escape and what
  isn't.

  I think we need to give very serious consideration to ripping out that
  feature.

Ugh, it's rather dubious indeed.  Especially when we are already in
the middle of seriously confusing conversion from stdstr=off - on.
Is it really OK to introduce even more complexity in the mix?

Alternative proposal - maybe it would be saner to introduce \u
escape to E'' strings as a non-standard way for quoting unicode.

Later when the standard quoting is our only quoting method we can play
with standard extensions?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-11 Thread Josh Berkus

On 4/11/09 11:47 AM, Marko Kreen wrote:

On 4/11/09, Tom Lanet...@sss.pgh.pa.us  wrote:

  It gets worse though: I have seldom seen such a badly designed piece of
  syntax as the Unicode string syntax --- see
  
http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE


WTF?  Whose feature is this?  What's the use case?

Speak up, submitter, please.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-11 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 On 4/11/09, Tom Lanet...@sss.pgh.pa.us  wrote:
 http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

 WTF?  Whose feature is this?  What's the use case?

Peter put it in, I think.  It is in the SQL:2008 spec, but that doesn't
change the fact that it's a horribly bad piece of design.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode string literals versus the world

2009-04-11 Thread Josh Berkus



Peter put it in, I think.  It is in the SQL:2008 spec, but that doesn't
change the fact that it's a horribly bad piece of design.


Hmmm.  We're not going to implement *everything* in the spec; nobody 
does, even IBM.  I think maybe these kinds of additions need to be 
hashed out for value so we don't waste Peter's work in the future.  Peter?


--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers