Re: [HACKERS] UTF8 or Unicode

2005-02-25 Thread Karel Zak
On Thu, 2005-02-24 at 23:51 -0500, Bruce Momjian wrote:
 Tatsuo Ishii wrote:
  I do not object to changing UNICODE to UTF-8, but all these discussions
  sound a little bit funny to me.
  
  If you want to blame UNICODE, you should blame LATIN1 etc. as
  well. LATIN1(ISO-8859-1) is actually a character set name, not an
  encoding name. ISO-8859-1 can be encoded in 8-bit single byte
  stream. But it can be encoded in 7-bit too. So when we refer to
  LATIN1(ISO-8859-1), it's not clear if it's encoded in 7/8-bit.
 
 Wow, Tatsuo has a point here.  Looking at encnames.c, I see:
 
 UNICODE, PG_UTF8
 
 but also:
 
 WIN, PG_WIN1251
 LATIN1, PG_LATIN1

 so I see what he is saying.  We are not consistent in favoring the
 official names vs. the common names.

Yes, I said that already. For example, WIN is an extremely bad alias. It is
all a heritage from old versions.

 I will work on a patch that people can review and test.

Thanks.

Karel

-- 
Karel Zak [EMAIL PROTECTED]


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] UTF8 or Unicode

2005-02-25 Thread Peter Eisentraut
Am Freitag, 25. Februar 2005 05:51 schrieb Bruce Momjian:
 so I see what he is saying.  We are not consistent in favoring the
 official names vs. the common names.

 I will work on a patch that people can review and test.

I think this is what we should do:

UNICODE = UTF8
ALT = WIN866
WIN = WIN1251
TCVN = WIN1258

That should clear it up.
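
To make the proposal concrete, here is a minimal Python sketch that reads the list above as "legacy name = new canonical name" and keeps the old names working as aliases. The names `LEGACY_TO_CANONICAL` and `canonical_name` are hypothetical, and this is only an illustration of the idea, not how encnames.c is written.

```python
import warnings

# Legacy name -> proposed canonical name, taken from the list above.
LEGACY_TO_CANONICAL = {
    "UNICODE": "UTF8",
    "ALT": "WIN866",
    "WIN": "WIN1251",
    "TCVN": "WIN1258",
}


def canonical_name(name: str, warn: bool = True) -> str:
    """Return the canonical spelling for an encoding name.

    Legacy aliases still resolve, optionally with a DeprecationWarning.
    Names that are not in the table are returned upper-cased and unchanged.
    """
    upper = name.upper()
    canonical = LEGACY_TO_CANONICAL.get(upper)
    if canonical is None:
        return upper
    if warn:
        warnings.warn(
            f"encoding name {upper!r} is a legacy alias for {canonical!r}",
            DeprecationWarning,
            stacklevel=2,
        )
    return canonical
```

With a table like this, existing clients that still send UNICODE or WIN keep working, while listings and documentation can show only the canonical spelling.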

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] UTF8 or Unicode

2005-02-25 Thread Bruce Momjian
Peter Eisentraut wrote:
 Am Freitag, 25. Februar 2005 05:51 schrieb Bruce Momjian:
  so I see what he is saying.  We are not consistent in favoring the
  official names vs. the common names.
 
  I will work on a patch that people can review and test.
 
 I think this is what we should do:
 
 UNICODE = UTF8
 ALT = WIN866
 WIN = WIN1251
 TCVN = WIN1258

OK, but what about latin1?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] UTF8 or Unicode

2005-02-25 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 Peter Eisentraut wrote:
 I think this is what we should do:
 
 UNICODE = UTF8
 ALT = WIN866
 WIN = WIN1251
 TCVN = WIN1258

 OK, but what about latin1?

I think LATIN1 is fine as-is.  It's a reasonably popular name for the
character set, and despite Tatsuo's complaint, it's not going to confuse
anyone in practice --- the 7-bit version of that standard has no traction.
The reason UNICODE is a bad name for UTF8 is exactly that there are
multiple physical encodings of Unicode that are in common use.
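
A small Python illustration of that last point, using only the standard codecs: one Unicode character has a different byte sequence under each encoding form, so the name UNICODE alone does not say which bytes are stored. The choice of character and the variable names are arbitrary.

```python
# One character, U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
text = "\u00e9"

# Several physical encodings of the same Unicode code point.
unicode_forms = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le"]
encoded = {name: text.encode(name) for name in unicode_forms}

# LATIN1 maps the same character to exactly one 8-bit byte.
latin1_bytes = text.encode("latin-1")

for name, data in encoded.items():
    print(f"{name:10s} {data.hex(' ')}")
print(f"{'latin-1':10s} {latin1_bytes.hex(' ')}")
```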

regards, tom lane



Re: [HACKERS] UTF8 or Unicode

2005-02-25 Thread Peter Eisentraut
Am Freitag, 25. Februar 2005 16:26 schrieb Bruce Momjian:
 OK, but what about latin1?

The following character set names are specified in the SQL standard and 
therefore somewhat non-negotiable:

SQL_CHARACTER
GRAPHIC_IRV
LATIN1
ISO8BIT
UTF16
UTF8
UCS2
SQL_TEXT
SQL_IDENTIFIER

So we have to use LATIN1, even though it creates an inconsistency.  We 
discussed this a while ago during the last great renaming, I think.

Btw., I think ISO8BIT is the correct name for what we call SQL_ASCII, but I 
haven't analyzed that in detail, yet.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] UTF8 or Unicode

2005-02-25 Thread Bruce Momjian
Peter Eisentraut wrote:
 Am Freitag, 25. Februar 2005 16:26 schrieb Bruce Momjian:
  OK, but what about latin1?
 
 The following character set names are specified in the SQL standard and 
 therefore somewhat non-negotiable:
 
 SQL_CHARACTER
 GRAPHIC_IRV
 LATIN1
 ISO8BIT
 UTF16
 UTF8
 UCS2
 SQL_TEXT
 SQL_IDENTIFIER
 
 So we have to use LATIN1, even though it creates an inconsistency.  We 
 discussed this a while ago during the last great renaming, I think.
 

Oh, UTF8 and not UTF-8?  I thought UTF-8 was the standard name, but if
ANSI uses UTF8 we will have to use that.

 Btw., I think ISO8BIT is the correct name for what we call SQL_ASCII, but I 
 haven't analyzed that in detail, yet.

OK, please let us know.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] UTF8 or Unicode

2005-02-24 Thread Bruce Momjian
Tatsuo Ishii wrote:
 I do not object to changing UNICODE to UTF-8, but all these discussions
 sound a little bit funny to me.
 
 If you want to blame UNICODE, you should blame LATIN1 etc. as
 well. LATIN1(ISO-8859-1) is actually a character set name, not an
 encoding name. ISO-8859-1 can be encoded in 8-bit single byte
 stream. But it can be encoded in 7-bit too. So when we refer to
 LATIN1(ISO-8859-1), it's not clear if it's encoded in 7/8-bit.

Wow, Tatsuo has a point here.  Looking at encnames.c, I see:

UNICODE, PG_UTF8

but also:

WIN, PG_WIN1251
LATIN1, PG_LATIN1

and I see conversions for those:

iso88591, PG_LATIN1
win, PG_WIN1251

so I see what he is saying.  We are not consistent in favoring the
official names vs. the common names.

I will work on a patch that people can review and test.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] UTF8 or Unicode

2005-02-24 Thread Peter Eisentraut
Bruce Momjian wrote:
 We are not consistent in favoring the
 official names vs. the common names.

The problem is rather that there are too many standards and conventions 
to choose from.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] UTF8 or Unicode

2005-02-22 Thread Tatsuo Ishii
I do not object to changing UNICODE to UTF-8, but all these discussions
sound a little bit funny to me.

If you want to blame UNICODE, you should blame LATIN1 etc. as
well. LATIN1(ISO-8859-1) is actually a character set name, not an
encoding name. ISO-8859-1 can be encoded in 8-bit single byte
stream. But it can be encoded in 7-bit too. So when we refer to
LATIN1(ISO-8859-1), it's not clear if it's encoded in 7/8-bit.
--
Tatsuo Ishii

From: Bruce Momjian pgman@candle.pha.pa.us
Subject: Re: [HACKERS] UTF8 or Unicode
Date: Mon, 21 Feb 2005 22:08:25 -0500 (EST)
Message-ID: [EMAIL PROTECTED]

 Tom Lane wrote:
  Bruce Momjian pgman@candle.pha.pa.us writes:
   I think we just need to _favor_ UTF8.
  
  I agree.
  
   The question is where are we
   favoring Unicode rather than UTF8?
  
  It's the canonical name of the encoding, both in the code and the docs.
  
  regression=# create database e encoding 'utf-8';
  CREATE DATABASE
  regression=# \l
          List of databases
      Name    |  Owner   | Encoding
  ------------+----------+-----------
   e          | postgres | UNICODE
   regression | postgres | SQL_ASCII
   template0  | postgres | SQL_ASCII
   template1  | postgres | SQL_ASCII
  (5 rows)
  
  As soon as we decide whether the canonical name is UTF8 or UTF-8
  ;-) we can fix it.
 
 I checked and it looks like UTF-8 is the correct usage:
 
   http://www.unicode.org/glossary/
 
 -- 
   Bruce Momjian|  http://candle.pha.pa.us
   pgman@candle.pha.pa.us   |  (610) 359-1001
   +  If your life is a hard drive, |  13 Roberts Road
   +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
 
 



Re: [HACKERS] UTF8 or Unicode

2005-02-21 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  I think we just need to _favor_ UTF8.
 
 I agree.
 
  The question is where are we
  favoring Unicode rather than UTF8?
 
 It's the canonical name of the encoding, both in the code and the docs.
 
 regression=# create database e encoding 'utf-8';
 CREATE DATABASE
 regression=# \l
         List of databases
     Name    |  Owner   | Encoding
 ------------+----------+-----------
  e          | postgres | UNICODE
  regression | postgres | SQL_ASCII
  template0  | postgres | SQL_ASCII
  template1  | postgres | SQL_ASCII
 (5 rows)
 
 As soon as we decide whether the canonical name is UTF8 or UTF-8
 ;-) we can fix it.

I checked and it looks like UTF-8 is the correct usage:

http://www.unicode.org/glossary/

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] UTF8 or Unicode

2005-02-18 Thread Karel Zak
On Tue, 2005-02-15 at 14:33 +0100, Peter Eisentraut wrote:
 Am Dienstag, 15. Februar 2005 10:22 schrieb Karel Zak:
  in PG: unicode = utf8 = utf-8
 
  Our internal routines in src/backend/utils/mb/encnames.c accept all
  synonyms. The official internal PG name for UTF-8 is UNICODE :-(
 
 I think in the SQL standard the official name is UTF8.  If someone wants to 
 verify that this is the case and is exactly the encoding we offer (perhaps 
 modulo the 0x1 issue), then it might make sense to change the canonical 
 form to UTF8.

Yes, I think we should fix it and remove UNICODE and WIN encoding names
from PG code.

Karel

-- 
Karel Zak [EMAIL PROTECTED]




Re: [HACKERS] UTF8 or Unicode

2005-02-18 Thread Oliver Jowett
Karel Zak wrote:
 Yes, I think we should fix it and remove UNICODE and WIN encoding names
 from PG code.

The JDBC driver asks for a UNICODE client encoding before it knows the 
server version it is talking to. How do you avoid breaking this?

-O


Re: [HACKERS] UTF8 or Unicode

2005-02-18 Thread Karel Zak
On Sat, 2005-02-19 at 00:27 +1300, Oliver Jowett wrote:
 Karel Zak wrote:
 
  Yes, I think we should fix it and remove UNICODE and WIN encoding names
  from PG code.
 
 The JDBC driver asks for a UNICODE client encoding before it knows the 
 server version it is talking to. How do you avoid breaking this?

Fix JDBC driver as soon as possible.

Add to 8.1 release notes: encoding names 'UNICODE' and 'WIN' are
deprecated and will be removed in the next release. Please use the correct
names UTF-8 and WIN1251.

8.2: remove it.

OK?

Karel

-- 
Karel Zak [EMAIL PROTECTED]




Re: [HACKERS] UTF8 or Unicode

2005-02-18 Thread Christopher Kings-Lynne
 Add to 8.1 release notes: encoding names 'UNICODE' and 'WIN' are
 deprecated and will be removed in the next release. Please use the correct
 names UTF-8 and WIN1251.

 8.2: remove it.

 OK?

Why on earth remove it?  Just leave it in as an alias to UTF8.

Chris


Re: [HACKERS] UTF8 or Unicode

2005-02-18 Thread Oliver Jowett
Karel Zak wrote:
 On Sat, 2005-02-19 at 00:27 +1300, Oliver Jowett wrote:
  Karel Zak wrote:

   Yes, I think we should fix it and remove UNICODE and WIN encoding names
   from PG code.

  The JDBC driver asks for a UNICODE client encoding before it knows the 
  server version it is talking to. How do you avoid breaking this?

 Fix JDBC driver as soon as possible.

How, exactly? Ask for a 'utf8' client encoding instead of 'UNICODE'? 
Will this work if the driver is connecting to an older server?

 Add to 8.1 release notes: encoding names 'UNICODE' and 'WIN' are
 deprecated and will be removed in the next release. Please use the correct
 names UTF-8 and WIN1251.

8.0 appears to spell it 'utf8'.

Removing the existing aliases seems like a fairly gratuitous 
incompatibility to introduce to me.
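
For illustration only, this is roughly what a client would need if a server might reject either spelling: try the preferred name first and fall back to the old one. It is a Python sketch under stated assumptions, not JDBC driver code; `try_name` is a hypothetical callback standing in for whatever a driver does to request a client encoding, and `negotiate_client_encoding` is a made-up name.

```python
from typing import Callable, Iterable, Optional


def negotiate_client_encoding(
    try_name: Callable[[str], bool],
    candidates: Iterable[str] = ("UTF8", "UNICODE"),
) -> Optional[str]:
    """Return the first encoding name the server accepts, or None.

    try_name is a hypothetical callback: it asks the server for the given
    client encoding and returns True if the request succeeded.
    """
    for name in candidates:
        if try_name(name):
            return name
    return None
```

Every fallback attempt costs an extra round trip to the server, which is the kind of burden on drivers that keeping the existing aliases would avoid.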

-O


Re: [HACKERS] UTF8 or Unicode

2005-02-18 Thread Dave Page



-Original Message-
From: [EMAIL PROTECTED] on behalf of Oliver Jowett
Sent: Fri 2/18/2005 11:27 AM
To: Karel Zak
Cc: List pgsql-hackers
Subject: Re: [HACKERS] UTF8 or Unicode
 
Karel Zak wrote:

 Yes, I think we should fix it and remove UNICODE and WIN encoding names
 from PG code.

 The JDBC driver asks for a UNICODE client encoding before it knows the 
 server version it is talking to. How do you avoid breaking this?

So does pgAdmin.

Regards, Dave



Re: [HACKERS] UTF8 or Unicode

2005-02-18 Thread Bruce Momjian
Dave Page wrote:
 Karel Zak wrote:
 
  Yes, I think we should fix it and remove UNICODE and WIN encoding names
  from PG code.
 
  The JDBC driver asks for a UNICODE client encoding before it knows the 
  server version it is talking to. How do you avoid breaking this?
 
 So does pgAdmin.

I think we just need to _favor_ UTF8.  The question is where are we
favoring Unicode rather than UTF8?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] UTF8 or Unicode

2005-02-18 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 I think we just need to _favor_ UTF8.

I agree.

 The question is where are we
 favoring Unicode rather than UTF8?

It's the canonical name of the encoding, both in the code and the docs.

regression=# create database e encoding 'utf-8';
CREATE DATABASE
regression=# \l
        List of databases
    Name    |  Owner   | Encoding
------------+----------+-----------
 e          | postgres | UNICODE
 regression | postgres | SQL_ASCII
 template0  | postgres | SQL_ASCII
 template1  | postgres | SQL_ASCII
(5 rows)

As soon as we decide whether the canonical name is UTF8 or UTF-8
;-) we can fix it.

regards, tom lane



Re: [HACKERS] UTF8 or Unicode

2005-02-16 Thread Agent M
On Feb 14, 2005, at 9:27 PM, Abhijit Menon-Sen wrote:

  I know UTF8 is a type of unicode but do we need to rename anything
  from Unicode to UTF8?

 I don't know. I'll go through the documentation to see if I can find
 anything that needs changing.

It's not the documentation that is wrong. Specifying the database 
encoding as Unicode is simply a bug (see initdb). What if 
PostgreSQL supports UTF-16 in the future? What would you call it?

The backend protocol also uses UNICODE when specifying the 
encoding. All the other encoding names are specified correctly AFAICS.

I brought this up before:
http://archives.postgresql.org/pgsql-hackers/2004-10/msg00811.php

 We could make UTF8 the canonical form in the aliasing mechanism, but
 beta 4 is a bit late to come up with this kind of idea.
 --
 Peter Eisentraut
 http://developer.postgresql.org/~petere/




Re: [HACKERS] UTF8 or Unicode

2005-02-15 Thread Karel Zak
On Mon, 2005-02-14 at 22:05 -0500, Bruce Momjian wrote:
 Abhijit Menon-Sen wrote:
  At 2005-02-14 21:14:54 -0500, pgman@candle.pha.pa.us wrote:
  
   Should our multi-byte encoding be referred to as UTF8 or Unicode?
  
  The *encoding* should certainly be referred to as UTF-8. Unicode is a
  character set, not an encoding; Unicode characters may be encoded with
  UTF-8, among other things.
  
  (One might think of a charset as being a set of integers representing
  characters, and an encoding as specifying how those integers may be
  converted to bytes.)
  
   I know UTF8 is a type of unicode but do we need to rename anything
   from Unicode to UTF8?
  
  I don't know. I'll go through the documentation to see if I can find
  anything that needs changing.
 
 I looked at encoding.sgml and that mentions Unicode, and then UTF8 as an
 acronym. I am wondering if we need to make UTF8 first and Unicode
 second.  Does initdb accept UTF8 as an encoding?

in PG: unicode = utf8 = utf-8 

Our internal routines in src/backend/utils/mb/encnames.c accept all
synonyms. The official internal PG name for UTF-8 is UNICODE :-(

UTF8 = UNICODE for historical reasons: UNICODE was there first. It is the
same as WIN for WIN1251 (in the sources it is marked as a _dirty_ alias)...

I think initdb uses pg_char_to_encoding() from
src/backend/utils/mb/encnames.c, so it should accept all aliases.
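
A rough Python sketch of this kind of alias lookup. It assumes names are matched after lowercasing and dropping every character that is not a letter or digit, which would explain why unicode, utf8 and utf-8 all resolve to the same encoding. The table holds only pairs mentioned in this thread, and the names `ALIASES`, `normalize` and `lookup_encoding` are made up for illustration; the real table in encnames.c is larger.

```python
import re
from typing import Optional

# Normalized alias -> internal encoding id (only pairs quoted in this thread).
ALIASES = {
    "unicode": "PG_UTF8",
    "utf8": "PG_UTF8",
    "win": "PG_WIN1251",
    "win1251": "PG_WIN1251",
    "latin1": "PG_LATIN1",
    "iso88591": "PG_LATIN1",
}


def normalize(name: str) -> str:
    """Lowercase the name and drop anything that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def lookup_encoding(name: str) -> Optional[str]:
    """Resolve a user-supplied encoding name, or return None if unknown."""
    return ALIASES.get(normalize(name))
```

Under this assumption 'utf-8', 'UTF8' and 'Unicode' all map to PG_UTF8, and 'ISO-8859-1' maps to PG_LATIN1.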

Karel

-- 
Karel Zak [EMAIL PROTECTED]




Re: [HACKERS] UTF8 or Unicode

2005-02-15 Thread Peter Eisentraut
Am Dienstag, 15. Februar 2005 10:22 schrieb Karel Zak:
 in PG: unicode = utf8 = utf-8

 Our internal routines in src/backend/utils/mb/encnames.c accept all
 synonyms. The official internal PG name for UTF-8 is UNICODE :-(

I think in the SQL standard the official name is UTF8.  If someone wants to 
verify that this is the case and is exactly the encoding we offer (perhaps 
modulo the 0x1 issue), then it might make sense to change the canonical 
form to UTF8.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



[HACKERS] UTF8 or Unicode

2005-02-14 Thread Bruce Momjian
Should our multi-byte encoding be referred to as UTF8 or Unicode?
I know UTF8 is a type of unicode but do we need to rename anything from
Unicode to UTF8?

Someone asked me via private email.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] UTF8 or Unicode

2005-02-14 Thread Abhijit Menon-Sen
At 2005-02-14 21:14:54 -0500, pgman@candle.pha.pa.us wrote:

 Should our multi-byte encoding be referred to as UTF8 or Unicode?

The *encoding* should certainly be referred to as UTF-8. Unicode is a
character set, not an encoding; Unicode characters may be encoded with
UTF-8, among other things.

(One might think of a charset as being a set of integers representing
characters, and an encoding as specifying how those integers may be
converted to bytes.)
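
To make the distinction concrete, here is a short Python sketch that turns one such integer (a Unicode code point) into bytes using the UTF-8 bit layout from RFC 3629. The function name `utf8_encode` is chosen for this example; in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])
```

The character set fixes the integer (for example 0x20AC for the euro sign); the encoding decides that it is stored as the three bytes e2 82 ac.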

 I know UTF8 is a type of unicode but do we need to rename anything
 from Unicode to UTF8?

I don't know. I'll go through the documentation to see if I can find
anything that needs changing.

-- ams



Re: [HACKERS] UTF8 or Unicode

2005-02-14 Thread Bruce Momjian
Abhijit Menon-Sen wrote:
 At 2005-02-14 21:14:54 -0500, pgman@candle.pha.pa.us wrote:
 
  Should our multi-byte encoding be referred to as UTF8 or Unicode?
 
 The *encoding* should certainly be referred to as UTF-8. Unicode is a
 character set, not an encoding; Unicode characters may be encoded with
 UTF-8, among other things.
 
 (One might think of a charset as being a set of integers representing
 characters, and an encoding as specifying how those integers may be
 converted to bytes.)
 
  I know UTF8 is a type of unicode but do we need to rename anything
  from Unicode to UTF8?
 
 I don't know. I'll go through the documentation to see if I can find
 anything that needs changing.

I looked at encoding.sgml and that mentions Unicode, and then UTF8 as an
acronym. I am wondering if we need to make UTF8 first and Unicode
second.  Does initdb accept UTF8 as an encoding?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
