Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 11/12/13, 1:57 AM, Tatsuo Ishii wrote:
> Currently there is no such universal encoding, so I think the only way is to invent one ourselves.

I think ISO 2022 is something in that direction, but it's not ASCII-safe, AFAICT.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote:
> I have been thinking about this for years, and I think the key idea is implementing a universal encoding. The universal encoding should have the following characteristics to support N>2 encodings in a database:
>
> 1) no loss in round-trip encoding conversion
> 2) no mapping table is necessary to convert from/to existing encodings
>
> Once we implement the universal encoding, other problems, such as allowing multiple encodings in pg_database, can be solved easily.

Isn't this essentially what the MULE internal encoding is?

> Currently there is no such universal encoding, so I think the only way is to invent one ourselves.

This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions, and indexing, and inventing encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit.

A better idea, it seems to me, is to (if Postgres is configured appropriately) embed the non-round-trippable characters in the custom (private use) part of the Unicode character set. In other words, adjust the mapping tables on demand and voila.

Have a nice day,
--
Martijn van Oosterhout <klep...@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Martijn van Oosterhout <klep...@svana.org> writes:
> On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote:
>> Once we implement the universal encoding, other problems, such as allowing multiple encodings in pg_database, can be solved easily.
> Isn't this essentially what the MULE internal encoding is?

MULE is completely evil. It has N different encodings for the same character, not to mention no support code available.

>> Currently there is no such universal encoding, so I think the only way is to invent one ourselves.
> This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions, and indexing, and inventing encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit.

Agreed.

> A better idea, it seems to me, is to (if Postgres is configured appropriately) embed the non-round-trippable characters in the custom (private use) part of the Unicode character set. In other words, adjust the mapping tables on demand and voila.

From the standpoint of what will happen with existing library code (like strcoll), I'm not sure it's all that easy.

regards, tom lane
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> Isn't this essentially what the MULE internal encoding is?

No. MULE is not powerful enough, and it is overly complicated to deal with different encodings (character sets).

> This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions, and indexing, and inventing encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit.

That is probably a misunderstanding. We don't need to modify existing text-handling modules such as text functions, regular expressions, indexing, etc. We just convert from the universal encoding X to the original encoding before calling them. The conversion is easy and fast because it only requires skipping the encoding identifier and encoding length parts. Basically, encoding X should be used in the lower layers of PostgreSQL; higher-layer modules such as those living in src/backend/utils/adt should not be aware of it.

> A better idea, it seems to me, is to (if Postgres is configured appropriately) embed the non-round-trippable characters in the custom (private use) part of the Unicode character set. In other words, adjust the mapping tables on demand and voila.

Using Unicode incurs encoding-conversion overhead because it needs to look up mapping tables. That would be a huge handicap for large data, and that is what I want to avoid in the first place.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> MULE is completely evil. It has N different encodings for the same character,

What's wrong with that? That is what it aims at in the first place.

> not to mention no support code available.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Tatsuo Ishii <is...@postgresql.org> writes:
>> MULE is completely evil. It has N different encodings for the same character,
> What's wrong with that? That is what it aims at in the first place.

It greatly complicates comparisons --- at least, if you'd like to preserve the principle that strings that appear the same are equal.

regards, tom lane
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> Tatsuo Ishii <is...@postgresql.org> writes:
>>> MULE is completely evil. It has N different encodings for the same character,
>> What's wrong with that? That is what it aims at in the first place.
> It greatly complicates comparisons --- at least, if you'd like to preserve the principle that strings that appear the same are equal.

You don't need to consider that, because as far as I know there is no place in PostgreSQL where a MULE-encoded text value consists of multiple encodings. BTW, cases where the same character is assigned different code points are pretty common in many character sets (Unicode, for example).

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Tatsuo Ishii <is...@postgresql.org> writes:
> BTW, cases where the same character is assigned different code points are pretty common in many character sets (Unicode, for example).

This is widely considered a security bug; read section 10 of RFC 3629 (the definition of UTF-8) [1], and search the CVE database a bit if you still doubt it's a threat. I'm going to push back very hard on any suggestion that Postgres should build itself around a text representation with that kind of weakness designed in.

regards, tom lane

[1] http://tools.ietf.org/html/rfc3629#section-10
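The concrete weakness Tom refers to is the overlong-encoding problem: RFC 3629 section 10 requires UTF-8 decoders to reject non-shortest-form sequences, because otherwise one character (say '/') has several byte representations and naive security filters can be bypassed. A minimal sketch of the lead-byte part of such a check — not PostgreSQL's actual validator, and deliberately ignoring continuation-byte and range checks:

```c
#include <assert.h>
#include <stddef.h>

/* Reject non-shortest-form ("overlong") UTF-8 sequences.
 * E.g. 0xC0 0xAF is an overlong encoding of '/' (U+002F): a naive
 * decoder maps it to the same character as the single byte 0x2F. */
static int
utf8_is_shortest_form(const unsigned char *s, size_t len)
{
    if (len == 1)                 /* 0xxxxxxx: ASCII */
        return s[0] < 0x80;
    if (len == 2)                 /* 110xxxxx: code point must be >= U+0080 */
        return (s[0] & 0xE0) == 0xC0 && s[0] >= 0xC2;  /* C0, C1 overlong */
    if (len == 3)                 /* 1110xxxx: code point must be >= U+0800 */
        return (s[0] & 0xF0) == 0xE0 &&
               !(s[0] == 0xE0 && (s[1] & 0xE0) == 0x80);  /* E0 80..9F overlong */
    if (len == 4)                 /* 11110xxx: code point must be >= U+10000 */
        return (s[0] & 0xF8) == 0xF0 &&
               !(s[0] == 0xF0 && (s[1] & 0xF0) == 0x80);  /* F0 80..8F overlong */
    return 0;
}
```

A full validator must additionally verify every continuation byte and the U+10FFFF upper bound; this sketch only demonstrates why "same character, different bytes" is rejected by design in UTF-8.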
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> I'd be much more impressed by seeing a road map for how we get to a useful amount of added functionality --- which, to my mind, would be the ability to support N different encodings in one database, for N>2. But even if you think N=2 is sufficient, we haven't got a road map, and commandeering spec-mandated syntax for an inadequate feature doesn't seem like a good first step. It'll just make our backwards-compatibility problems even worse when somebody does come up with a real solution.

I have been thinking about this for years, and I think the key idea is implementing a universal encoding. The universal encoding should have the following characteristics to support N>2 encodings in a database:

1) no loss in round-trip encoding conversion
2) no mapping table is necessary to convert from/to existing encodings

Once we implement the universal encoding, other problems, such as allowing multiple encodings in pg_database, can be solved easily. Currently there is no such universal encoding, so I think the only way is to invent one ourselves. At this point, the design of the encoding I have in mind is:

1) A 1-byte encoding identifier plus a 7-byte body (8 bytes in total). The encoding identifier's value is between 0x80 and 0xff, and each value is assigned to an existing encoding such as UTF-8, ASCII, EUC-JP and so on. The encodings should be limited to database-safe encodings. The encoding body is the raw character as represented in the existing encoding. This form is called a "word".

2) We also have a multibyte representation of the universal encoding. The first byte gives the length of the multibyte character (similar to the first byte of UTF-8). The second byte is the encoding identifier explained above. The rest of the character is the same as above.

#1 and #2 are logically the same and can be converted to each other, and we can use whichever we like. Form #1 is easy to handle because each word has a fixed length (8 bytes), so it would probably be used for temporary data in memory. The second form saves space and would be used in the stored data itself.

If we want a table encoded differently from the database encoding, the table is stored in the universal encoding. pg_class should record that fact to avoid confusion about which encoding a table is using. I think the majority of tables in a database use the same encoding as the database; only a few tables want a different one. This design pushes the penalty onto that minority. If we need to join two tables with different encodings, we convert them to the same encoding (this succeeds if the encodings are compatible); if the conversion fails, the join fails too. We could extend this technique to a design that allows each column to have a different encoding.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
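As a concrete illustration of form #1, here is a sketch in C. The identifier values and all names are hypothetical, since this encoding does not exist yet; the point is that packing keeps the original bytes verbatim (property 1, lossless round trips) and needs no mapping table (property 2):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoding identifiers in the 0x80..0xff range. */
#define UNIV_ENC_UTF8   0x80
#define UNIV_ENC_EUC_JP 0x81

/* Form #1: one fixed 8-byte word per character --
 * a 1-byte encoding identifier followed by up to 7 bytes of the
 * character in its original encoding, zero-padded. */
typedef struct
{
    uint8_t enc_id;      /* 0x80..0xff, names the source encoding */
    uint8_t body[7];     /* raw bytes in that encoding, zero-padded */
} UnivWord;

/* Packing needs no mapping table: the original bytes are kept
 * verbatim, which is what makes round trips lossless. */
static UnivWord
univ_pack(uint8_t enc_id, const uint8_t *ch, size_t ch_len)
{
    UnivWord w = {0};

    w.enc_id = enc_id;
    memcpy(w.body, ch, ch_len > 7 ? 7 : ch_len);
    return w;
}

/* Recovering the original representation is just skipping the
 * identifier byte -- the "easy and fast" conversion claimed above. */
static const uint8_t *
univ_body(const UnivWord *w)
{
    return w->body;
}
```

The variable-length form #2 would prepend a length byte and use only as many body bytes as the character needs; the fixed 8-byte word trades space for simple pointer arithmetic in memory.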
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Albe Laurenz <laurenz.a...@wien.gv.at>
> In a way, it is similar to using the data type serial. The column will be displayed as integer, and the information that it was a serial can only be inferred from the DEFAULT value. It seems that this is working fine and does not cause many problems, so I don't see why things should be different here.

Yes, I agree with you that serial being a synonym causes almost no problems. But that's because serial is not an SQL-standard data type but a type unique to PostgreSQL. On the other hand, nchar is an established data type in the SQL standard. I think most people will expect psql \d and pg_dump to output nchar, as they specified it in their DDL. If they get char as output for nchar columns from pg_dump, wouldn't they get in trouble when importing the schema/data from PostgreSQL into other database products? The documentation for pg_dump says that pg_dump pays attention to easing migration to other DBMSs. I like this idea and want to respect it.

http://www.postgresql.org/docs/current/static/app-pgdump.html
--
Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products. ...
--use-set-session-authorization
Output SQL-standard SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to determine object ownership. This makes the dump more standards-compatible, ...
--

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
MauMau <maumau...@gmail.com> writes:
> On the other hand, nchar is an established data type in the SQL standard. I think most people will expect psql \d and pg_dump to output nchar, as they specified it in DDL.

This argument seems awfully weak. You've been able to say

    create table nt (nf national character varying(22));

in Postgres since around 1997, but I don't recall one single bug report about how that is displayed as just character varying(22).

The other big problem with this line of argument is that you're trying to claim better spec compliance for what is at best a rather narrow interpretation with really minimal added functionality. (In fact, until you have a solution for the problem that incoming and outgoing data must be in the database's primary encoding, you don't actually have *any* added functionality, just syntactic sugar that does nothing useful.) Unless you can demonstrate by lawyerly reading of the spec that the spec requires exactly the behavior this patch implements, you do not have a leg to stand on here. But you can't demonstrate that, because it doesn't.

I'd be much more impressed by seeing a road map for how we get to a useful amount of added functionality --- which, to my mind, would be the ability to support N different encodings in one database, for N>2. But even if you think N=2 is sufficient, we haven't got a road map, and commandeering spec-mandated syntax for an inadequate feature doesn't seem like a good first step. It'll just make our backwards-compatibility problems even worse when somebody does come up with a real solution.

regards, tom lane
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
MauMau wrote:
> Let me repeat myself: I think the biggest and most immediate issue is that PostgreSQL does not support national character types, at least officially. "Officially" means being described in the manual. So I don't have a strong objection to the current (hidden) implementation of nchar types in PostgreSQL as mere synonyms, as long as the official support is documented. Serious users don't want to depend on hidden features.

I agree with you there. Actually it is somewhat documented in http://www.postgresql.org/docs/9.3/static/features-sql-standard.html as F421, but that requires that you read the SQL standard.

> However, doesn't the current synonym approach have any problems? Wouldn't it produce any trouble in the future? If we treat nchar as char, we lose the fact that the user requested nchar. Can we lose that fact so easily and produce an irreversible result as below?

I don't think that it is a problem. According to the SQL standard, the user requested a CHAR or VARCHAR with an encoding of the DBMS's choice. PostgreSQL chooses the database encoding.

In a way, it is similar to using the data type serial. The column will be displayed as integer, and the information that it was a serial can only be inferred from the DEFAULT value. It seems that this is working fine and does not cause many problems, so I don't see why things should be different here. Again, for serial the behaviour is well documented, which seconds your request for more documentation. Would you like to write a patch for that?

Yours,
Laurenz Albe
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Tue, Nov 5, 2013 at 5:15 PM, Peter Eisentraut <pete...@gmx.net> wrote:
> On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote:
>> Implements NCHAR/NVARCHAR as distinct data types, not as synonyms
> If, per SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET cs, then for some cs, NCHAR(x) must be the same as CHAR(x). Therefore, an implementation as separate data types is wrong.

Interesting.

Since the point doesn't seem to be getting through, let me try to be more clear: we're not going to accept any form of this patch. A patch that makes some progress toward actually coping with multiple encodings in the same database would be very much worth considering, but adding compatible syntax with incompatible semantics is not of interest to the PostgreSQL project. We have had this debate on many other topics in the past and will no doubt have it again in the future, but the outcome is always the same.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas <robertmh...@gmail.com>
> Since the point doesn't seem to be getting through, let me try to be more clear: we're not going to accept any form of this patch. A patch that makes some progress toward actually coping with multiple encodings in the same database would be very much worth considering, but adding compatible syntax with incompatible semantics is not of interest to the PostgreSQL project. We have had this debate on many other topics in the past and will no doubt have it again in the future, but the outcome is always the same.

It doesn't seem that there are any semantics incompatible with the SQL standard, as follows:

- In the first step, cs is the database encoding, which is used for char/varchar/text.
- In the second (or final) step, where multiple encodings per database are supported, cs is the national character encoding, which is specified with CREATE DATABASE ... NATIONAL CHARACTER ENCODING cs. If the NATIONAL CHARACTER ENCODING clause is omitted, cs is the database encoding, as in step 1.

Let me repeat myself: I think the biggest and most immediate issue is that PostgreSQL does not support national character types, at least officially. "Officially" means being described in the manual. So I don't have a strong objection to the current (hidden) implementation of nchar types in PostgreSQL as mere synonyms, as long as the official support is documented. Serious users don't want to depend on hidden features.

However, doesn't the current synonym approach have any problems? Wouldn't it produce any trouble in the future? If we treat nchar as char, we lose the fact that the user requested nchar. Can we lose that fact so easily and produce an irreversible result as below?

--
Maybe so. I guess the distinct type for NCHAR is for future extension and user friendliness. As one user, I expect to get national character instead of char character set xxx as the output of psql \d and pg_dump when I specified national character in DDL. In addition, that makes it easy to use the pg_dump output for importing data into other DBMSs.
--

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Arul Shaji Arulappan wrote:
> Attached is a patch that implements the first set of changes discussed in this thread originally. They are:
> (i) Implements NCHAR/NVARCHAR as distinct data types, not as synonyms, so that:
>     - psql \d can display the user-specified data types.
>     - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR.
>     - Groundwork to implement additional features for NCHAR/NVARCHAR in the future (for example, a separate encoding for nchar columns).
> (ii) Support for NCHAR/NVARCHAR in ECPG
> (iii) Documentation changes to reflect the new data types

If I understood the discussion correctly, the use case is that there are advantages to having a database encoding different from UTF-8, but you'd still want some UTF-8 columns. Wouldn't it be a better design to allow specifying the encoding per column? That would give you more flexibility. I know that NCHAR/NVARCHAR is in the SQL standard, but I still think it is a wart.

Yours,
Laurenz Albe
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Albe Laurenz <laurenz.a...@wien.gv.at>
> If I understood the discussion correctly, the use case is that there are advantages to having a database encoding different from UTF-8, but you'd still want some UTF-8 columns. Wouldn't it be a better design to allow specifying the encoding per column? That would give you more flexibility.

Yes, you are right. In the previous discussion:

- That would be nice if available, but it is hard to implement multiple encodings in one database.
- Some people (I'm not sure how many) are using NCHAR/NVARCHAR in other DBMSs. To invite them to PostgreSQL, it's important to support the national character feature syntactically and document it in the manual. This is the first step.
- As the second step, we can implement multiple encodings in one database. According to the SQL standard, NCHAR(n) is equivalent to CHAR(n) CHARACTER SET cs, where cs is an implementation-defined character set.

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
MauMau wrote:
> From: Albe Laurenz <laurenz.a...@wien.gv.at>
>> If I understood the discussion correctly, the use case is that there are advantages to having a database encoding different from UTF-8, but you'd still want some UTF-8 columns. Wouldn't it be a better design to allow specifying the encoding per column? That would give you more flexibility.
> Yes, you are right. In the previous discussion:
> - That would be nice if available, but it is hard to implement multiple encodings in one database.

Granted.

> - Some people (I'm not sure how many) are using NCHAR/NVARCHAR in other DBMSs. To invite them to PostgreSQL, it's important to support the national character feature syntactically and document it in the manual. This is the first step.

I looked into the Standard, and it does not have NVARCHAR. The type is called NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING or NCHAR VARYING.

I guess that the goal of this patch is to support Oracle syntax. But anybody trying to port CREATE TABLE statements from Oracle is already exposed to enough incompatibilities that the difference between NVARCHAR and NCHAR VARYING will not be the reason to reject PostgreSQL. In other words, I doubt that introducing the nonstandard NVARCHAR will have more benefits than drawbacks (a new reserved word).

Regarding the standard-compliant names of these data types, PostgreSQL already supports those. Maybe some documentation would help.

> - As the second step, we can implement multiple encodings in one database. According to the SQL standard, NCHAR(n) is equivalent to CHAR(n) CHARACTER SET cs, where cs is an implementation-defined character set.

That second step would definitely have benefits. But I don't think that it requires the first step that your patch implements; the two are in fact orthogonal. I don't think that there is any need to change NCHAR even if we get per-column encoding; it is just syntactic sugar to support SQL feature F421. Why not tackle the second step first?

Yours,
Laurenz Albe
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Albe Laurenz <laurenz.a...@wien.gv.at>
> I looked into the Standard, and it does not have NVARCHAR. The type is called NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING or NCHAR VARYING.

Ouch, that's just a mistake in my mail. You are correct.

> I guess that the goal of this patch is to support Oracle syntax. But anybody trying to port CREATE TABLE statements from Oracle is already exposed to enough incompatibilities that the difference between NVARCHAR and NCHAR VARYING will not be the reason to reject PostgreSQL. In other words, I doubt that introducing the nonstandard NVARCHAR will have more benefits than drawbacks (a new reserved word).

Agreed. But I'm in favor of supporting other DBMSs' syntax if it doesn't complicate the spec or implementation too much, because it can help migration to PostgreSQL. I understand PostgreSQL has made such efforts before, like PL/pgSQL (which is similar to PL/SQL), the text data type, optional AS in SELECT statements, etc.

> But I don't think that this requires the first step that your patch implements; the two are in fact orthogonal.

(It's not my patch.)

> Regarding the standard-compliant names of these data types, PostgreSQL already supports those. Maybe some documentation would help.
> I don't think that there is any need to change NCHAR even if we get per-column encoding; it is just syntactic sugar to support SQL feature F421.

Maybe so. I guess the distinct type for NCHAR is for future extension and user friendliness. As one user, I expect to get national character instead of char character set xxx as the output of psql \d and pg_dump when I specified national character in DDL. In addition, that makes it easy to use the pg_dump output for importing data into other DBMSs.

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote:
> Implements NCHAR/NVARCHAR as distinct data types, not as synonyms

If, per the SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET cs, then for some cs, NCHAR(x) must be the same as CHAR(x). Therefore, an implementation as separate data types is wrong.
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Peter Eisentraut <pete...@gmx.net>
> On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote:
>> 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for the database encoding, and use NCHAR for a limited set of columns to store international text in Unicode:
>> - to avoid code conversion between the server and the client, for performance
>> - because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes)
>> This use case is described in chapter 6 of the Oracle Database Globalization Support Guide.
> But your proposal wouldn't address the first point, because data would have to go client -> server -> NCHAR. The second point is valid, but it's going to be an awful amount of work for that limited result.

I (or, Oracle's use case) meant the following, for example:

    initdb -E EUC_JP

    CREATE DATABASE mydb ENCODING EUC_JP NATIONAL ENCODING UTF-8;

    CREATE TABLE mytable (
        col1 char(10),   -- EUC_JP text
        col2 nchar(10)   -- UTF-8 text
    );

    client encoding = EUC_JP

That is:

1. Currently, the user is handling only Japanese text. To avoid unnecessary conversion, he uses EUC_JP for both client and server.
2. He needs to store a limited amount of international (non-Japanese) text in a few columns for a new feature of the system. Because that international text is limited, he is willing to accept the code-conversion cost and the extra bytes per character for those few columns, while keeping most of his text efficient.

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Greg Stark <st...@mit.edu>
> If it's not lossy then what's the point? From the client's point of view it'll be functionally equivalent to text then.

Sorry, what Tatsuo-san suggested was "same or compatible", not lossy. I quote the relevant part below. This is enough for the use case I mentioned in my previous mail several hours ago (actually, it is what the Oracle manual describes...).

http://www.postgresql.org/message-id/20130920.085853.1628917054830864151.t-is...@sraoss.co.jp

[Excerpt]
> What about limiting NCHAR to databases whose encoding is the same or compatible (i.e. an encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding on the server side, and thus we can treat NCHAR exactly the same as CHAR afterward. I suppose the encoding used for NCHAR should be defined at initdb time or at creation of the database (if we allow the latter, we need to add a new column to record which encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if the NCHAR encoding is UTF-8 and the database encoding is UTF-8. It will also succeed if the NCHAR encoding is Shift JIS and the database encoding is UTF-8, because there is a conversion between UTF-8 and Shift JIS. However, it will not succeed if the NCHAR encoding is Shift JIS and the database encoding is ISO-8859-1, because there is no conversion between them.

Regards
MauMau
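Tatsuo Ishii's rule quoted above — allow an NCHAR encoding only if it equals the database encoding or a conversion between the two is defined — can be sketched as follows. The encoding IDs and the conversion matrix here are made up for illustration; the real server would consult its conversion catalog (pg_conversion) instead:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding IDs for the sketch. */
enum enc { ENC_UTF8, ENC_SJIS, ENC_EUC_JP, ENC_LATIN1 };

/* Sketch of "is a conversion defined?": assume conversions exist
 * among the Japanese encodings and UTF-8, but not between the
 * Japanese encodings and LATIN1 (matching the examples above). */
static bool
conversion_exists(enum enc from, enum enc to)
{
    static const bool conv[4][4] = {
        /* to:  UTF8   SJIS   EUCJP  LATIN1 */
        {       true,  true,  true,  true  },   /* from UTF8 */
        {       true,  true,  true,  false },   /* from SJIS */
        {       true,  true,  true,  false },   /* from EUC_JP */
        {       true,  false, false, true  },   /* from LATIN1 */
    };
    return conv[from][to];
}

/* Ishii's rule: an NCHAR column is permitted only when its encoding
 * is the database encoding itself, or can be converted to it. */
static bool
nchar_allowed(enum enc nchar_enc, enum enc db_enc)
{
    return nchar_enc == db_enc || conversion_exists(nchar_enc, db_enc);
}
```

Under this rule, NCHAR in Shift JIS is accepted in a UTF-8 database but rejected in an ISO-8859-1 one, exactly as in the quoted examples.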
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Peter Eisentraut pete...@gmx.net That assumes that the conversion client encoding -> server encoding -> NCHAR encoding is not lossy. Yes, so Tatsuo san suggested restricting the server encoding / NCHAR encoding combination to those with lossless conversion. I thought one main point of this exercise was to avoid these conversions and be able to go straight from the client encoding into NCHAR. It's slightly different. Please see the following excerpt: http://www.postgresql.org/message-id/B1A7485194DE4FDAB8FA781AFB570079@maumau 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for the database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client, for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of the Oracle Database Globalization Support Guide. Regards MauMau
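Tatsuo san's restriction amounts to requiring that a round trip through the other encoding is lossless. A hedged Python sketch of such a check (the function name is hypothetical; a real implementation would consult PostgreSQL's conversion catalog rather than test sample text):

```python
# Hypothetical losslessness check: can `text` be stored in `target_enc`
# and recovered intact? PostgreSQL would instead look up pg_conversion;
# this only illustrates the "same or compatible encoding" restriction.

def converts_losslessly(text: str, target_enc: str) -> bool:
    """True if `text` survives a round trip through `target_enc`."""
    try:
        return text.encode(target_enc).decode(target_enc) == text
    except (UnicodeEncodeError, LookupError):
        return False
```

Japanese text round-trips through Shift-JIS, but not through ISO-8859-1, mirroring the CREATE TABLE examples in the excerpt.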
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote: 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for the database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client, for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of the Oracle Database Globalization Support Guide. But your proposal wouldn't address the first point, because data would have to go client -> server -> NCHAR. The second point is valid, but it's going to be an awful amount of work for that limited result.
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Tatsuo Ishii is...@postgresql.org I don't think the bind placeholder is a problem. That is processed by exec_bind_message() in postgres.c. It has enough info about the type of the placeholder, and I think we can easily deal with NCHAR. The same can be said for the COPY case. Yes, I've learned that. Agreed. If we allow an encoding for NCHAR different from the database encoding, we can convert text from the client encoding to the NCHAR encoding in nchar_in(), for example. We can retrieve the NCHAR encoding from pg_database and store it in a global variable at session start. The problem is an ordinary query (simple protocol Q message), as you pointed out. Encoding conversion happens at a very early stage (note that the fast-path case has the same issue). If a query message contains, say, both SHIFT-JIS and EUC-JP, then we are in trouble, because the encoding conversion routine (pg_client_to_server) assumes that the message from the client contains only one encoding. However, my question is: does that really happen? I don't think there's any text editor that can create mixed SHIFT-JIS and EUC-JP text. So my guess is that when a user wants to use NCHAR for SHIFT-JIS text, the rest of the query consists of either SHIFT-JIS or plain ASCII. If so, all the user needs to do is set the client encoding to SHIFT-JIS and everything should be fine. MauMau, is my guess correct? Yes, I believe you are right. Regardless of whether we support multiple encodings in one database or not, a single client encoding will be sufficient for one session. When the Q message is received, the whole SQL text is converted from the client encoding to the database encoding. This part needs no modification. During execution of the Q message, NCHAR values are converted from the database encoding to the NCHAR encoding. Thank you very much, Tatsuo san. Everybody, are there any other challenges we should consider to support NCHAR/NVARCHAR types as distinct types?
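The single-encoding assumption in pg_client_to_server can be illustrated with a small Python sketch (the function below is a hypothetical stand-in, not PostgreSQL's implementation). Note that bytes sent in a different Japanese encoding than the declared client encoding often do not even raise an error; they silently become mojibake:

```python
# Hypothetical stand-in for pg_client_to_server: the entire message is
# converted using the single declared client encoding.

def client_to_server(message: bytes, client_enc: str,
                     server_enc: str = "utf-8") -> bytes:
    """Convert a wire message from the client encoding to the server encoding."""
    return message.decode(client_enc).encode(server_enc)

if __name__ == "__main__":
    kanji = "漢".encode("euc_jp")                # EUC-JP bytes b'\xb4\xc1'
    ok = client_to_server(kanji, "euc_jp")       # correct declaration
    bad = client_to_server(kanji, "shift_jis")   # wrong declaration: no error,
    print(ok, bad)                               # but half-width katakana mojibake
```

This is why a mixed SHIFT-JIS/EUC-JP message cannot be handled: the routine has no way to tell which bytes belong to which encoding.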
Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Fri, Sep 20, 2013 at 8:32 PM, MauMau maumau...@gmail.com wrote: I don't think that you'll be able to get consensus around that path on this mailing list. I agree that the fact we have both varchar and text feels like a wart. Is that right? I don't feel varchar/text case is a wart. I think text was introduced for a positive reason to ease migration from other DBMSs. The manual says: http://www.postgresql.org/docs/current/static/datatype-character.html Although the type text is not in the SQL standard, several other SQL database management systems have it as well. And isn't EnterpriseDB doing similar things for Oracle compatibility, although I'm not sure about the details? Could you share your idea why we won't get consensus? Sure, it's EnterpriseDB's policy to add features that facilitate migrations from other databases - particularly Oracle - to our product, Advanced Server, even if those features don't otherwise add any value. However, the community is usually reluctant to add such features to PostgreSQL. Also, at least up until now, the existing aliasing of nchar and nchar varying to other data types has been adequate for the needs of our customers, and we've handled a bunch of other type-name incompatibilities with similar tricks. What you are proposing goes off in a different direction from both PostgreSQL and Advanced Server, and that's why I'm skeptical. If you were proposing something that we were doing in Advanced Server with great success, it would be a bit disingenuous of me to argue against doing the same thing in PostgreSQL, but that's not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 9/23/13 2:53 AM, MauMau wrote: Yes, I believe you are right. Regardless of whether we support multiple encodings in one database or not, a single client encoding will be sufficient for one session. When receiving the Q message, the whole SQL text is converted from the client encoding to the database encoding. This part needs no modification. During execution of the Q message, NCHAR values are converted from the database encoding to the NCHAR encoding. That assumes that the conversion client encoding -> server encoding -> NCHAR encoding is not lossy. I thought one main point of this exercise was to avoid these conversions and be able to go straight from the client encoding into NCHAR.
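The lossiness concern can be made concrete with a hedged Python sketch of the two-hop chain Peter describes (the helper name is hypothetical). A character that exists in the client and server encodings but not in the NCHAR encoding is lost at the second hop:

```python
# Hypothetical sketch of the chain: client encoding -> server encoding ->
# NCHAR encoding. Returns the NCHAR bytes, or None if the final hop
# cannot represent the text (i.e. the chain is lossy).

def through_chain(text: str, client_enc: str, server_enc: str, nchar_enc: str):
    wire = text.encode(client_enc)                       # client -> wire
    stored = wire.decode(client_enc).encode(server_enc)  # wire -> server
    try:
        return stored.decode(server_enc).encode(nchar_enc)  # server -> NCHAR
    except UnicodeEncodeError:
        return None  # lossy: the NCHAR encoding lacks this character
```

A kanji survives UTF-8 -> UTF-8 -> Shift-JIS, but an emoji does not, which is exactly why the server/NCHAR combinations would have to be restricted to lossless pairs.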
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com Sure, it's EnterpriseDB's policy to add features that facilitate migrations from other databases - particularly Oracle - to our product, Advanced Server, even if those features don't otherwise add any value. However, the community is usually reluctant to add such features to PostgreSQL. Also, at least up until now, the existing aliasing of nchar and nchar varying to other data types has been adequate for the needs of our customers, and we've handled a bunch of other type-name incompatibilities with similar tricks. What you are proposing goes off in a different direction from both PostgreSQL and Advanced Server, and that's why I'm skeptical. If you were proposing something that we were doing in Advanced Server with great success, it would be a bit disingenuous of me to argue against doing the same thing in PostgreSQL, but that's not the case. Sorry, I didn't mean to imitate EnterpriseDB. My intent is just to increase the popularity of PostgreSQL (or prevent a drop in popularity?). NCHAR is so basic that we can and should accept proper support for it. Aliasing would be nice to some extent, if its official support were documented in the PG manual. However, mere aliasing loses the NCHAR type information through pg_dump. This works against one benefit of pg_dump -- allowing migration from PG to other DBMSs, possibly for performance comparison: http://www.postgresql.org/docs/current/static/app-pgdump.html Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products. In addition, distinct types for NCHAR/NVARCHAR allow future extensions such as a separate encoding for NCHAR, e.g. UTF-16. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
PostgreSQL has very powerful possibilities for storing any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added? Some other people in this community have suggested that. And the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARACTER SET cs. Text operations should work automatically, as in memory all strings will be converted to the database encoding. This approach would also open the possibility of implementing custom ENCODINGs for column data storage, like snappy compression or even BSON, gobs or protobufs for much more compact type storage. Thanks for your idea; it sounds interesting, although I don't fully understand it yet. The idea is very simple:

CREATE DATABASE utf8_database ENCODING 'utf8';
\c utf8_database
CREATE TABLE a(
  id serial,
  ascii_data text ENCODING 'ascii', -- will use ascii_to_utf8 to read and utf8_to_ascii to write
  koi8_data text ENCODING 'koi8_r', -- will use koi8_r_to_utf8 to read and utf8_to_koi8_r to write
  json_data json ENCODING 'bson'    -- will use bson_to_json to read and json_to_bson to write
);

The problem with bson_to_json here is that it will probably not be possible to write JSON in koi8_r, for example. But that is not even being considered in these discussions yet. If the ENCODING machinery received not only the encoding name but also the type OID, it would be possible to write encoders for TYPEs and arrays of TYPEs (I had to do this using casts to bytea and protobuf to minimize the storage size for an array of types when writing a lot of data, which could be unpacked afterwards directly in the DB as normal database types). I hope I made my point a little bit clearer. Regards, Valentine Gogichashvili
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
I think the point here is that, at least as I understand it, encoding conversion and sanitization happens at a very early stage right now, when we first receive the input from the client. If the user sends a string of bytes as part of a query or bind placeholder that's not valid in the database encoding, it's going to error out before any type-specific code has an opportunity to get control. Look at textin(), for example. There's no encoding check there. That means it's already been done at that point. To make this work, someone's going to have to figure out what to do about *that*. Until we have a sketch of what the design for that looks like, I don't see how we can credibly entertain more specific proposals. I don't think the bind placeholder is a problem. That is processed by exec_bind_message() in postgres.c. It has enough info about the type of the placeholder, and I think we can easily deal with NCHAR. The same can be said for the COPY case. The problem is an ordinary query (simple protocol Q message), as you pointed out. Encoding conversion happens at a very early stage (note that the fast-path case has the same issue). If a query message contains, say, both SHIFT-JIS and EUC-JP, then we are in trouble, because the encoding conversion routine (pg_client_to_server) assumes that the message from the client contains only one encoding. However, my question is: does that really happen? I don't think there's any text editor that can create mixed SHIFT-JIS and EUC-JP text. So my guess is that when a user wants to use NCHAR for SHIFT-JIS text, the rest of the query consists of either SHIFT-JIS or plain ASCII. If so, all the user needs to do is set the client encoding to SHIFT-JIS and everything should be fine. MauMau, is my guess correct? -- Tatsuo Ishii SRA OSS, Inc.
Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Fri, Sep 20, 2013 at 08:58:53AM +0900, Tatsuo Ishii wrote: For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and the database encoding is UTF-8. It will even succeed if NCHAR is SHIFT-JIS and the database encoding is UTF-8, because there is a conversion between UTF-8 and SHIFT-JIS. However, it will not succeed if NCHAR is SHIFT-JIS and the database encoding is ISO-8859-1, because there is no conversion between them. As far as I can tell, the whole reason for introducing NCHAR is to support SHIFT-JIS; there hasn't been a call for any other encoding, that I can remember anyway. So rather than this whole NCHAR thing, why not just add a type sjistext, and a few type casts, and call it a day... Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii is...@postgresql.org wrote: What about limiting to use NCHAR with a database which has same encoding or compatible encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. I suppose what encoding is used for NCHAR should be defined in initdb time or creation of the database (if we allow this, we need to add a new column to know what encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. I think the point here is that, at least as I understand it, encoding conversion and sanitization happens at a very early stage right now, when we first receive the input from the client. If the user sends a string of bytes as part of a query or bind placeholder that's not valid in the database encoding, it's going to error out before any type-specific code has an opportunity to get control. Look at textin(), for example. There's no encoding check there. That means it's already been done at that point. To make this work, someone's going to have to figure out what to do about *that*. Until we have a sketch of what the design for that looks like, I don't see how we can credibly entertain more specific proposals. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
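Robert's point that validation happens at message receipt, before textin() ever runs, can be illustrated with a hedged Python sketch (the helper name is hypothetical): bytes that are perfectly valid EUC-JP are rejected outright when the database encoding is UTF-8, so no type input function gets a chance to interpret them.

```python
# Hypothetical sketch of the early sanitization step Robert describes:
# input bytes are checked against the database encoding on receipt, so
# type input functions like textin() never see invalid byte sequences.

def is_valid_in(raw: bytes, db_encoding: str) -> bool:
    """Would these bytes pass the database-encoding validity check?"""
    try:
        raw.decode(db_encoding)
        return True
    except UnicodeDecodeError:
        return False
```

An NCHAR column holding a different encoding would need this check relaxed or moved, which is the design question being raised.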
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Thu, Sep 19, 2013 at 6:42 PM, MauMau maumau...@gmail.com wrote: National character types support may be important to some potential users of PostgreSQL and the popularity of PostgreSQL, not me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria include national character support. We'd have to go back and search the archives to figure out why that item was added to the TODO, but I'd be surprised if anyone ever had it in mind to create additional types that behave just like existing types but with different names. I don't think that you'll be able to get consensus around that path on this mailing list. I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. Similar examples already exist: - varchar and text: the only difference is the existence of an explicit length limit - numeric and decimal - int and int4, smallint and int2, bigint and int8 - real/double precision and float I agree that the fact we have both varchar and text feels like a wart. The other examples mostly involve different names for the same underlying type, and so are different from what you are asking for here. I understand your feeling. The concern about incompatibility can be eliminated by thinking the following way. How about this? - NCHAR can be used with any database encoding. - At first, NCHAR is exactly the same as CHAR. That is, the implementation-defined character set described in the SQL standard is the database character set. - In the future, the character set for NCHAR can be selected at database creation, like Oracle's CREATE DATABASE NATIONAL CHARACTER SET AL16UTF16. The default is the database character set. Hmm. So under that design, a database could support up to a total of two character sets, the one that you get when you say 'foo' and the other one that you get when you say n'foo'. I guess we could do that, but it seems a bit limited.
If we're going to go to the trouble of supporting multiple character sets, why not support an arbitrary number instead of just two? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 9/20/13 2:22 PM, Robert Haas wrote: I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. Similar examples already exist: - varchar and text: the only difference is the existence of an explicit length limit - numeric and decimal - int and int4, smallint and int2, bigint and int8 - real/double precision and float I agree that the fact we have both varchar and text feels like a wart. The other examples mostly involve different names for the same underlying type, and so are different from what you are asking for here. Also note that we already have NCHAR [VARYING]. It's mapped to char or varchar, respectively, in the parser, just like int, real, etc. are handled.
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Tatsuo Ishii is...@postgresql.org What about limiting NCHAR to a database which has the same or a compatible encoding (one for which an encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding on the server side, so we can treat NCHAR exactly the same as CHAR afterward. I suppose the encoding used for NCHAR should be defined at initdb time or at creation of the database (if we allow this, we need to add a new column to record which encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and the database encoding is UTF-8. It will even succeed if NCHAR is SHIFT-JIS and the database encoding is UTF-8, because there is a conversion between UTF-8 and SHIFT-JIS. However, it will not succeed if NCHAR is SHIFT-JIS and the database encoding is ISO-8859-1, because there is no conversion between them. Thanks for the idea; it sounds flexible enough for wider use. Your cooperation in devising an implementation with as little code as possible would be much appreciated. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Valentine Gogichashvili val...@gmail.com the whole NCHAR appeared as a hack for systems that did not have it from the beginning. It would not be needed if all text were magically stored in UNICODE or UTF from the beginning and the idea of a character were the same as the idea of a rune, not a byte. I guess so, too. PostgreSQL has very powerful possibilities for storing any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added? Some other people in this community have suggested that. And the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARACTER SET cs. Text operations should work automatically, as in memory all strings will be converted to the database encoding. This approach would also open the possibility of implementing custom ENCODINGs for column data storage, like snappy compression or even BSON, gobs or protobufs for much more compact type storage. Thanks for your idea; it sounds interesting, although I don't fully understand it yet. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Martijn van Oosterhout klep...@svana.org As far as I can tell the whole reason for introducing NCHAR is to support SHIFT-JIS, there hasn't been call for any other encodings, that I can remember anyway. Could you elaborate on this and give some sources? So rather than this whole NCHAR thing, why not just add a type sjistext, and a few type casts and call it a day... The main reason for supporting NCHAR types is to ease migration from other DBMSs without requiring DDL changes, so sjistext would not serve that purpose. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com I don't think that you'll be able to get consensus around that path on this mailing list. I agree that the fact we have both varchar and text feels like a wart. Is that right? I don't feel the varchar/text case is a wart. I think text was introduced for a positive reason: to ease migration from other DBMSs. The manual says: http://www.postgresql.org/docs/current/static/datatype-character.html Although the type text is not in the SQL standard, several other SQL database management systems have it as well. And isn't EnterpriseDB doing similar things for Oracle compatibility, although I'm not sure about the details? Could you share your idea of why we won't get consensus? I understand your feeling. The concern about incompatibility can be eliminated by thinking the following way. How about this? - NCHAR can be used with any database encoding. - At first, NCHAR is exactly the same as CHAR. That is, the implementation-defined character set described in the SQL standard is the database character set. - In the future, the character set for NCHAR can be selected at database creation, like Oracle's CREATE DATABASE NATIONAL CHARACTER SET AL16UTF16. The default is the database character set. Hmm. So under that design, a database could support up to a total of two character sets, the one that you get when you say 'foo' and the other one that you get when you say n'foo'. I guess we could do that, but it seems a bit limited. If we're going to go to the trouble of supporting multiple character sets, why not support an arbitrary number instead of just two? I agree with you about the arbitrary number. Tatsuo san gave us a good suggestion. Let's consider how to implement that. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii is...@postgresql.org wrote: What about limiting to use NCHAR with a database which has same encoding or compatible encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. I suppose what encoding is used for NCHAR should be defined in initdb time or creation of the database (if we allow this, we need to add a new column to know what encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. I think the point here is that, at least as I understand it, encoding conversion and sanitization happens at a very early stage right now, when we first receive the input from the client. If the user sends a string of bytes as part of a query or bind placeholder that's not valid in the database encoding, it's going to error out before any type-specific code has an opportunity to get control. Look at textin(), for example. There's no encoding check there. That means it's already been done at that point. To make this work, someone's going to have to figure out what to do about *that*. Until we have a sketch of what the design for that looks like, I don't see how we can credibly entertain more specific proposals. OK, I see your point. Let's consider that design. I'll learn the code regarding this. Does anybody, especially Tatsuo san, Tom san, Peter san, have any good idea? 
Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Wed, Sep 18, 2013 at 6:42 PM, MauMau maumau...@gmail.com wrote: It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. No, those are desirable if possible features. What's important is to declare in the manual that PostgreSQL officially supports national character types, as I stated below. That may be what's important to you, but it's not what's important to me. I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. And I think it's an even worse idea to introduce them now, making them work one way, and then later change the behavior in a backward-incompatible fashion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com That may be what's important to you, but it's not what's important to me. National character types support may be important to some potential users of PostgreSQL and the popularity of PostgreSQL, not me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria include national character support. I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. Similar examples already exist: - varchar and text: the only difference is the existence of an explicit length limit - numeric and decimal - int and int4, smallint and int2, bigint and int8 - real/double precision and float In addition, the SQL standard itself admits: The key words NATIONAL CHARACTER are used to specify the character type with an implementation-defined character set. Special syntax (N'string') is provided for representing literals in that character set. ... NATIONAL CHARACTER is equivalent to the corresponding character string type with a specification of CHARACTER SET CSN, where CSN is an implementation-defined character set name. A national character string literal is equivalent to a character string literal with the N replaced by introducer <character set specification>, where <character set specification> is an implementation-defined character set name. And I think it's an even worse idea to introduce them now, making them work one way, and then later change the behavior in a backward-incompatible fashion. I understand your feeling. The concern about incompatibility can be eliminated by thinking the following way. How about this? - NCHAR can be used with any database encoding. - At first, NCHAR is exactly the same as CHAR. That is, the implementation-defined character set described in the SQL standard is the database character set.
- In the future, the character set for NCHAR can be selected at database creation, like Oracle's CREATE DATABASE NATIONAL CHARACTER SET AL16UTF16. The default is the database character set. Could you tell me what kind of specification we should implement if we officially support national character types? Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. Agreed. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. What about limiting to use NCHAR with a database which has same encoding or compatible encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. 
I suppose the encoding used for NCHAR should be defined at initdb time or at creation of the database (if we allow the latter, we need to add a new column to record which encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and the database encoding is UTF-8. It will even succeed if NCHAR is SHIFT-JIS and the database encoding is UTF-8, because there is a conversion between UTF-8 and SHIFT-JIS. However, it will not succeed if NCHAR is SHIFT-JIS and the database encoding is ISO-8859-1, because there is no conversion between them. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
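The admission rule Tatsuo describes can be sketched in a few lines. This is a minimal illustration, not PostgreSQL internals: the function name and the table of defined conversions are hypothetical, chosen only to mirror the three examples in the mail.

```python
# Hypothetical sketch of the proposed rule: an NCHAR column may be
# created only if the NCHAR encoding equals the database encoding, or
# an encoding conversion is defined between the two.

# Pairs for which a conversion is assumed to be defined (symmetric),
# mirroring the examples above; illustrative, not an exhaustive list.
CONVERSIONS = {
    frozenset({"UTF8", "SJIS"}),
    frozenset({"UTF8", "EUC_JP"}),
    frozenset({"UTF8", "LATIN1"}),
}

def nchar_allowed(nchar_encoding: str, db_encoding: str) -> bool:
    """True if an NCHAR column may be created in this database."""
    if nchar_encoding == db_encoding:
        return True
    return frozenset({nchar_encoding, db_encoding}) in CONVERSIONS

nchar_allowed("UTF8", "UTF8")    # same encoding: allowed
nchar_allowed("SJIS", "UTF8")    # conversion exists: allowed
nchar_allowed("SJIS", "LATIN1")  # no conversion defined: rejected
```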
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Hi, That may be what's important to you, but it's not what's important to me. National character type support may be important to some potential users of PostgreSQL and to the popularity of PostgreSQL, not to me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria include national character support. The whole NCHAR business appeared as a hack for systems that did not have Unicode support from the beginning. It would not be needed if all text had magically been stored in Unicode/UTF from the beginning and the idea of a character were the same as the idea of a rune, not a byte. PostgreSQL has very powerful possibilities for storing any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added? It would make it possible to have a database that talks to the clients in UTF8 and stores text and varchar data in the encoding that is most appropriate for the situation. It would make it impossible (or complicated) to give the database a non-UTF8 default encoding (I wonder who would need that in this case), as conversions would not be possible from the broader charsets into the default database encoding. One could define an additional DATABASE property like LC_ENCODING that would work for the ENCODING property of a column the way LC_COLLATE works for the COLLATE property of a column. Text operations should work automatically, as in memory all strings will be converted to the database encoding. This approach would also open the possibility of implementing custom ENCODINGs for column data storage, like snappy compression or even BSON, gobs or protobufs for much more compact type storage. Regards, -- Valentine Gogichashvili
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
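Robert's point about the high-bit property can be checked directly with any codec library. A quick Python illustration (the thread is about server internals, not Python, so this is only an analogy):

```python
# In every supported PostgreSQL server encoding, a byte with the high
# bit clear IS the corresponding ASCII character. UTF-16 breaks that
# assumption: ordinary non-ASCII characters produce high-bit-clear
# bytes that are not ASCII text at all.

text = "あ"                    # U+3042, no ASCII characters involved
b = text.encode("utf-16-be")   # two bytes: 0x30 0x42

# Both bytes have the high bit clear...
assert all(byte < 0x80 for byte in b)
# ...so byte-oriented code scanning for ASCII would see "0B" here.
assert b == b"0B"
```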
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Robert Haas robertmh...@gmail.com writes: On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. Another point to keep in mind is that UTF16 is not really any easier to deal with than UTF8, unless you write code that fails to support characters outside the basic multilingual plane. Which is a restriction I don't believe we'd accept. But without that restriction, you're still forced to deal with variable-width characters; and there's nothing very nice about the way that's done in UTF16. So on the whole I think it makes more sense to use UTF8 for this. I share Robert's misgivings about difficulties in dealing with characters that are not representable in the database's principal encoding. Still, you probably won't find out about many of those until you try it. regards, tom lane
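Tom's surrogate-pair point is easy to demonstrate with standard codecs; a small Python check (illustrative only):

```python
# UTF-16 is only fixed-width inside the Basic Multilingual Plane.
# Characters beyond U+FFFF need a surrogate pair: two 16-bit code
# units, i.e. four bytes. So code assuming "one character = one code
# unit" silently breaks outside the BMP.

bmp_char = "漢"      # U+6F22, inside the BMP
astral_char = "𝄞"    # U+1D11E, MUSICAL SYMBOL G CLEF, outside the BMP

assert len(bmp_char.encode("utf-16-be")) == 2     # one code unit
assert len(astral_char.encode("utf-16-be")) == 4  # surrogate pair
# UTF-8 is variable-width too, but it keeps the ASCII-safety property:
assert len(astral_char.encode("utf-8")) == 4
```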
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 18.09.2013 16:16, Robert Haas wrote: On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. I'm not too thrilled about complicating the system for that, either. If you really need to deal with many different languages, you can do that today by using UTF-8 everywhere. Sure, it might not be the most efficient encoding for some characters, but it works.
There is one reason, however, that makes it a lot more compelling: we already support having databases with different encodings in the same cluster, but the encoding used in the shared catalogs, for usernames and database names for example, is not well-defined. If we dealt with different encodings in the same database, that inconsistency would go away. - Heikki
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. No, those are desirable if possible features. What's important is to declare in the manual that PostgreSQL officially supports national character types, as I stated below. 1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically. This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR, and ignores N prefix. But this is not documented. 2. Declare support for national character support in the manual. 1 is not sufficient because users don't want to depend on undocumented behavior. This is exactly what the TODO item national character support in PostgreSQL TODO wiki is about. 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so that: - psql \d can display the user-specified data types. - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR. - To implement additional features for NCHAR/NVARCHAR in the future, as described below. And when declaring that, we had better implement NCHAR types as distinct types with their own OIDs so that we can extend NCHAR behavior in the future. As the first stage, I think it's okay to treat NCHAR types exactly the same as CHAR/VARCHAR types. 
For example, in ECPG: switch (type) { case OID_FOR_CHAR: case OID_FOR_VARCHAR: case OID_FOR_TEXT: case OID_FOR_NCHAR: /* new code */ case OID_FOR_NVARCHAR: /* new code */ some processing; break; } And in JDBC, just call methods for non-national character types. Currently, those national character methods throw SQLException. public void setNString(int parameterIndex, String value) throws SQLException { setString(parameterIndex, value); } Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. Yes, you are probably right -- I'm not sure UTF-16 really has benefits that UTF-8 doesn't have. But why did Windows and Java choose UTF-16 for internal strings rather than UTF-8? Why did Oracle recommend UTF-16 for NCHAR? I have no clear idea. Anyway, I don't strongly push for UTF-16 at the cost of complicating the encoding handling. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Tom Lane t...@sss.pgh.pa.us Another point to keep in mind is that UTF16 is not really any easier to deal with than UTF8, unless you write code that fails to support characters outside the basic multilingual plane. Which is a restriction I don't believe we'd accept. But without that restriction, you're still forced to deal with variable-width characters; and there's nothing very nice about the way that's done in UTF16. So on the whole I think it makes more sense to use UTF8 for this. I feel so. I guess the reason Windows, Java, and Oracle chose UTF-16 is that it was still UCS-2, covering only the BMP, when they adopted it. So character handling was easier and faster thanks to the fixed-width encoding. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
-----Original Message----- From: pgsql-hackers-ow...@postgresql.org On Behalf Of MauMau Hello, I think it would be nice for PostgreSQL to support national character types largely because it should ease migration from other DBMSs. [Reasons why we need NCHAR] -- 1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL, etc. all have NCHAR support. PostgreSQL is probably the only major database that does not support NCHAR. Sadly, I've read a report from some Japanese government agency that the number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I wouldn't say that is due to NCHAR support, but it might be one reason. I want PostgreSQL to be more popular and regain those users. 2. Enhance the open image of PostgreSQL by implementing more features of the SQL standard. NCHAR may be a wrong and unnecessary feature of the SQL standard now that we have Unicode support, but it is defined in the standard and widely implemented. 3. I have heard that some potential customers didn't adopt PostgreSQL due to lack of NCHAR support. However, I don't know the exact reason why they need NCHAR. The use case we have is for customer(s) who are modernizing their databases on mainframes. These applications are typically written in COBOL, which does have extensive support for National Characters. Supporting National Characters as in-built data types in PostgreSQL is, not to exaggerate, an important criterion in their decision to use PostgreSQL or not. (So is Embedded COBOL. But that is a separate issue.) 4.
I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of Oracle Database Globalization Support Guide. -- I think we need to do the following: [Minimum requirements] -- 1. Accept NCHAR/NVARCHAR as data type names and N'...' syntactically. This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR, and ignores the N prefix. But this is not documented. 2. Declare support for national characters in the manual. 1 is not sufficient because users don't want to depend on undocumented behavior. This is exactly what the TODO item national character support in the PostgreSQL TODO wiki is about. 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that: - psql \d can display the user-specified data types. - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR. - To implement additional features for NCHAR/NVARCHAR in the future, as described below. -- Agreed. This is our minimum requirement too. Rgds, Arul Shaji [Optional requirements] -- 1. Implement client driver support, such as: - NCHAR host variable type (e.g. NCHAR var_name[12];) in ECPG, as specified in the SQL standard. - national character methods (e.g. setNString, getNString, setNCharacterStream) as specified in JDBC 4.0. I think at first we can treat these national-character-specific features the same as CHAR/VARCHAR. 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used in UTF-8 databases and they store UTF-8 strings.
This allows us to reuse the input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. This is a reasonable compromise to avoid duplication and minimize the first implementation of NCHAR support. 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. This would definitely be a welcome addition. -- I don't think it is good to implement NCHAR/NVARCHAR types as extensions like contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR OID values. How do you think we should implement NCHAR support? Regards MauMau
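The storage figures cited in point 4 above (2 bytes per kanji in ShiftJIS/EUC_JP versus 3 in UTF-8) can be verified with any Unicode-aware codec library; a quick Python check:

```python
# Most kanji sit in the U+0800..U+FFFF range, which UTF-8 encodes in
# three bytes, while the legacy Japanese double-byte encodings use two.

kanji = "漢"  # U+6F22
assert len(kanji.encode("shift_jis")) == 2
assert len(kanji.encode("euc_jp")) == 2
assert len(kanji.encode("utf-8")) == 3
```

So for predominantly-kanji text, UTF-8 does cost roughly 50% more storage than ShiftJIS or EUC_JP, which is the trade-off driving this use case.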
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Hello, I think it would be nice for PostgreSQL to support national character types largely because it should ease migration from other DBMSs. [Reasons why we need NCHAR] -- 1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL, etc. all have NCHAR support. PostgreSQL is probably the only major database that does not support NCHAR. Sadly, I've read a report from some Japanese government agency that the number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I wouldn't say that is due to NCHAR support, but it might be one reason. I want PostgreSQL to be more popular and regain those users. 2. Enhance the open image of PostgreSQL by implementing more features of the SQL standard. NCHAR may be a wrong and unnecessary feature of the SQL standard now that we have Unicode support, but it is defined in the standard and widely implemented. 3. I have heard that some potential customers didn't adopt PostgreSQL due to lack of NCHAR support. However, I don't know the exact reason why they need NCHAR. 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of Oracle Database Globalization Support Guide. -- I think we need to do the following: [Minimum requirements] -- 1. Accept NCHAR/NVARCHAR as data type names and N'...' syntactically. This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR, and ignores the N prefix. But this is not documented. 2. Declare support for national characters in the manual. 1 is not sufficient because users don't want to depend on undocumented behavior.
This is exactly what the TODO item national character support in the PostgreSQL TODO wiki is about. 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that: - psql \d can display the user-specified data types. - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR. - To implement additional features for NCHAR/NVARCHAR in the future, as described below. -- [Optional requirements] -- 1. Implement client driver support, such as: - NCHAR host variable type (e.g. NCHAR var_name[12];) in ECPG, as specified in the SQL standard. - national character methods (e.g. setNString, getNString, setNCharacterStream) as specified in JDBC 4.0. I think at first we can treat these national-character-specific features the same as CHAR/VARCHAR. 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used in UTF-8 databases and they store UTF-8 strings. This allows us to reuse the input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. This is a reasonable compromise to avoid duplication and minimize the first implementation of NCHAR support. 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. -- I don't think it is good to implement NCHAR/NVARCHAR types as extensions like contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR OID values. How do you think we should implement NCHAR support? Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Boguk, Maksym maks...@fast.au.fujitsu.com writes: Hi, my task is implementing ANSI NATIONAL character string types as part of PostgreSQL core. No, that's not a given. You have a problem to solve, ie store some UTF8 strings in a database that's mostly just 1-byte data. It is not clear that NATIONAL CHARACTER is the best solution to that problem. And I don't think that you're going to convince anybody that this is an improvement in spec compliance, because there's too much gap between what you're doing here and what it says in the spec. Both of these approaches require dump/restore of the whole database, which is not always an option. That's a disadvantage, agreed, but it's not a large enough one to reject the approach, because what you want to do also has very significant disadvantages. I think it is extremely likely that we will end up rejecting a patch based on NATIONAL CHARACTER altogether. It will require too much duplicative code, it requires too many application-side changes to make use of the functionality, and it will break any applications that are relying on the current behavior of that syntax. But the real problem is that you're commandeering syntax defined in the SQL spec for what is in the end quite a narrow usage. I agree that the use-case will be very handy for some applications ... but if we were ever to try to achieve real spec compliance for the SQL features around character sets, this doesn't look like a step on the way to that. I think you'd be well advised to take a hard look at the specialized-database-encoding approach. From here it looks like a 99% solution for about 1% of the effort; and since it would be quite uninvasive to the system as a whole, it's unlikely that such a patch would get rejected. regards, tom lane
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 03.09.2013 05:28, Boguk, Maksym wrote: Target usage: ability to store UTF8 national characters in some selected fields inside a single-byte encoded database. For example, if I have a ru-RU.koi8r encoded database with mostly Russian text inside, it would be nice to be able to store Japanese text in one field without converting the whole database to UTF8 (converting such a database to UTF8 could easily almost double the database size even if only one field in the whole database uses any symbols outside of the ru-RU.koi8r encoding). Ok. What has been done: 1) Addition of new string data types NATIONAL CHARACTER and NATIONAL CHARACTER VARYING. These types differ from the char/varchar data types in one important respect: NATIONAL string types always have UTF8 encoding (independent of the database encoding). I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it. What needs to be done: 1) Full set of string functions and operators for NATIONAL types (we could not use generic text functions because they assume that the strings will have the database encoding). Now only a basic set is implemented. 2) Need to implement some way to define a default collation for NATIONAL types. 3) Need to implement some way to input UTF8 characters into NATIONAL types via SQL (there is a serious open problem... it will be defined later in the text). Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text. I think we should take a completely different approach to this. Two alternatives spring to mind: 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently.
Then just use that in the whole database. Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure if SCSU can be used as a server encoding. A lot of code relies on the fact that a server encoding must have the high bit set in all bytes that are part of a multi-byte character. That's why SJIS for example can only be used as a client encoding. But surely you could come up with some subset or variant of SCSU which satisfies that requirement. 2. Compress the column. Simply do ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN. That will make Postgres compress that field. That might not be very efficient for compressing short Cyrillic text encoded in UTF-8 today, but that could be improved. There has been discussion on supporting more compression algorithms in the past, and one such algorithm could be again something like SCSU. - Heikki
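Heikki's remark that SJIS can only be a client encoding is easy to see concretely. A short Python check (purely illustrative of the byte-level property):

```python
# Some SJIS multi-byte characters have a trail byte with the high bit
# clear, so byte-oriented server code would mistake part of a character
# for an ASCII character. The classic offender is the kanji 表 (U+8868),
# whose second byte is 0x5C, the ASCII backslash.

b = "表".encode("shift_jis")
assert b == b"\x95\x5c"
assert b"\\" in b  # a naive parser would "see" an escape character here
```

This is exactly the property a server encoding must not have, and why any SCSU-like scheme would need all multi-byte bytes to keep the high bit set.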
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Heikki Linnakangas hlinnakan...@vmware.com writes: On 03.09.2013 05:28, Boguk, Maksym wrote: Target usage: ability to store UTF8 national characters in some selected fields inside a single-byte encoded database. I think we should take a completely different approach to this. Two alternatives spring to mind: 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. +1. I'm not sure that SCSU satisfies the requirement (which I read as that Russian text should be pretty much 1 byte/character). But surely we could devise a variant that does. For instance, it could look like koi8r (or any other single-byte encoding of your choice) with one byte value, say 255, reserved as a prefix. 255 means that a UTF8 character follows. The main complication here is that you don't want to allow more than one way to represent a character --- else you break text hashing, for instance. So you'd have to take care that you never emit the 255+UTF8 representation for a character that can be represented in the single-byte encoding. In particular, you'd never encode ASCII that way, and thus this would satisfy the all-multibyte-chars-must-have-all-high-bits-set rule. Ideally we could make a variant like this for each supported single-byte encoding, and thus you could optimize a database for mostly but not entirely LATIN1 text, etc. regards, tom lane
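Tom's escape-prefix scheme can be sketched in a few dozen lines. This is a minimal illustration under his stated rules (koi8r base, 0xFF reserved as the prefix, never escape a character the single-byte encoding can represent), not a proposed implementation; function names are hypothetical. Note one wrinkle the sketch must handle: in KOI8-R the byte 0xFF is itself assigned (to Ъ), so that one character must always go through the escape.

```python
ESC = 0xFF  # reserved prefix byte, per the sketch above

def encode_hybrid(text: str) -> bytes:
    """Encode as koi8r, falling back to ESC + UTF-8 per character."""
    out = bytearray()
    for ch in text:
        try:
            b = ch.encode("koi8_r")
            if b[0] != ESC:          # never emit the reserved byte bare
                out += b
                continue
        except UnicodeEncodeError:
            pass                     # not representable in koi8r
        out.append(ESC)
        out += ch.encode("utf-8")    # canonical escape form
    return bytes(out)

def decode_hybrid(data: bytes) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == ESC:
            i += 1
            lead = data[i]           # UTF-8 sequence length from lead byte
            n = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
            out.append(data[i:i + n].decode("utf-8"))
            i += n
        else:
            out.append(bytes([data[i]]).decode("koi8_r"))
            i += 1
    return "".join(out)

# Russian stays 1 byte/character; anything else costs ESC + UTF-8 bytes.
assert len(encode_hybrid("привет")) == 6
assert decode_hybrid(encode_hybrid("привет 漢字 ok")) == "привет 漢字 ok"
```

Because the encoder never uses the escape form for a koi8r-representable character (ASCII included), each character has exactly one representation, preserving the hashing property Tom mentions, and every multi-byte sequence starts with a high-bit-set byte.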
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
1) Addition of new string data types NATIONAL CHARACTER and NATIONAL CHARACTER VARYING. These types differ from the char/varchar data types in one important respect: NATIONAL string types always have UTF8 encoding (independent of the database encoding). I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it. Hi, my task is implementing ANSI NATIONAL character string types as part of PostgreSQL core. And the requirement that drivers and client applications know about it is the reason why it could not be done as an add-on (these new types need a fixed OID for most drivers, from my experience). Implementing them as a UTF8 data type is a first step which allows having NATIONAL characters with an encoding different from the database encoding (and maybe even supporting multiple encodings for common string types in the future). 1) Full set of string functions and operators for NATIONAL types (we could not use generic text functions because they assume that the strings will have the database encoding). Now only a basic set is implemented. 2) Need to implement some way to define a default collation for NATIONAL types. 3) Need to implement some way to input UTF8 characters into NATIONAL types via SQL (there is a serious open problem... it will be defined later in the text). Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text. I think we should take a completely different approach to this. Two alternatives spring to mind: 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. Then just use that in the whole database.
Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure if SCSU can be used as a server encoding. A lot of code relies on the fact that a server encoding must have the high bit set in all bytes that are part of a multi-byte character. That's why SJIS for example can only be used as a client encoding. But surely you could come up with some subset or variant of SCSU which satisfies that requirement. 2. Compress the column. Simply do ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN. That will make Postgres compress that field. That might not be very efficient for compressing short Cyrillic text encoded in UTF-8 today, but that could be improved. There has been discussion on supporting more compression algorithms in the past, and one such algorithm could be again something like SCSU. Both of these approaches require dump/restore of the whole database, which is not always an option. Implementing a UTF8 NATIONAL character as a new datatype would provide the option of using pg_upgrade to the latest version and getting the required functionality without prolonged downtime. PS: is it possible to reserve some narrow type OID range in PostgreSQL core for future use? Kind Regards, Maksym