Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-31 Thread Boguk, Maksym
Hi everyone,

I will try to answer all the questions related to the proposed National
Characters support.

>> 2) Provide support for the new GUC nchar_collation to provide the
>> database with information about the default collation that needs to be
>> used for the new data types.

> A GUC seems like completely the wrong tack to be taking.  In the first
> place, that would mandate just one value (at a time anyway) of
> collation, which is surely not much of an advance over what's already
> possible.  In the second place, what happens if you change the value?
> All your indexes on nchar columns are corrupt, that's what.  Actually
> the data itself would be corrupt, if you intend that this setting
> determines the encoding and not just the collation.  If you really are
> speaking only of collation, it's not clear to me exactly what this
> proposal offers that can't be achieved today (with greater security,
> functionality and spec compliance) by using COLLATE clauses on plain
> text columns.
> Actually, you really haven't answered at all what it is you want to do
> that COLLATE can't do.

I think I gave a wrong description there... it will not be a GUC but a
GUC-type value which will be initialized during CREATE DATABASE and will
be read-only afterwards, very similar to lc_collate.
So I think the name national_lc_collate would be better.
The function of this value is to provide information about the default
collation for NATIONAL CHARACTERS inside the database.
That does not limit the user's ability to use an alternative collation
for NATIONAL CHARACTERS during CREATE TABLE via the COLLATE keyword.

E.g. if we have a second encoding inside the database, we should keep
the information about the collation used somewhere.
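
A sketch of how that could look (NATIONAL_LC_COLLATE and the NVARCHAR
column type are the proposed additions, not syntax PostgreSQL accepts
today; the per-column COLLATE override is the existing clause):

CREATE DATABASE appdb
    ENCODING 'LATIN1'
    LC_COLLATE 'de_DE.iso88591'
    NATIONAL_LC_COLLATE 'ja_JP.utf8';  -- hypothetical: fixed at creation,
                                       -- read-only afterwards like lc_collate

CREATE TABLE t (
    name NVARCHAR(100) COLLATE "fr_FR.utf8"  -- per-column override still possible
);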

>> 4) Because all symbols from non-UTF8 encodings can be represented as
>> UTF8 (but the reverse is not true), comparisons between N* types and
>> the regular string types inside the database will be performed in UTF8
>> form.

> I believe that in some Far Eastern character sets there are some
> characters that map to the same Unicode glyph, but that some people
> would prefer to keep separate.  So transcoding to UTF8 isn't necessarily
> lossless.  This is one of the reasons why we've resisted adopting ICU or
> standardizing on UTF8 as the One True Database Encoding.  Now this may
> or may not matter for comparison to strings that were in some other
> encoding to start with --- but as soon as you base your design on the
> premise that UTF8 is a universal encoding, you are sliding down a
> slippery slope to a design that will meet resistance.

Would converting both sides to pg_wchar before comparison fix this
problem?
In any case, if the database is going to use more than one encoding,
some universal encoding has to be used to allow comparisons between
them. After some analysis I think pg_wchar is a better candidate for
this role than UTF8.
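
For illustration at the SQL level, today's conversion functions already
let you compare across encodings by transcoding to a common form (a toy
bytea-based sketch; the proposal would do the equivalent in C, on
pg_wchar rather than UTF8, and it assumes a database encoding that can
hold 'é'):

-- '\xe9' is 'é' in LATIN1; transcode both sides to a common form
-- (UTF8 here) before comparing:
SELECT convert('\xe9'::bytea, 'LATIN1', 'UTF8')
     = convert_to('é', 'UTF8');        -- true: same code point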

>> 6) Client input/output of NATIONAL strings - NATIONAL strings will
>> respect the client_encoding setting, and their values will be
>> transparently converted to the requested client_encoding before being
>> sent to (received from) the client (the same mechanics as used for the
>> usual string types).
>> So no mixed encoding in client input/output will be
>> supported/available.

> If you have this restriction, then I'm really failing to see what
> benefit there is over what can be done today with COLLATE.

There are two targets for this project:

1. Legacy databases with a non-UTF8 encoding, which should support old
non-UTF8 applications and new UTF8 applications.
In that case the old applications will use the legacy database encoding
(and because these applications are legacy, they don't work with the
new NATIONAL CHARACTERS data/tables).
The new applications will use a client-side UTF8 encoding and will be
able to store international texts in NATIONAL CHARACTER columns.
Dumping and restoring the whole database to change its encoding to UTF8
is not always possible, so there is a need for some easy-to-use
workaround.
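
For example, against a LATIN1 database (NVARCHAR is the proposed type;
the table and column names here are made up):

-- Legacy application, unchanged:
SET client_encoding = 'LATIN1';
SELECT note FROM old_table;            -- never touches N* columns

-- New application:
SET client_encoding = 'UTF8';
INSERT INTO new_table (intl_note)      -- intl_note is a NVARCHAR column
    VALUES ('многоязычный текст');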

2. Better compatibility with the ANSI SQL standard.


Kind Regards,
Maksym





Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-31 Thread Alvaro Herrera
Boguk, Maksym wrote:

> I think I gave a wrong description there... it will not be a GUC but a
> GUC-type value which will be initialized during CREATE DATABASE and
> will be read-only afterwards, very similar to lc_collate.
> So I think the name national_lc_collate would be better.
> The function of this value is to provide information about the default
> collation for NATIONAL CHARACTERS inside the database.
> That does not limit the user's ability to use an alternative collation
> for NATIONAL CHARACTERS during CREATE TABLE via the COLLATE keyword.

This seems a bit odd.  I mean, if I want the option for differing
encodings, surely I need to be able to set them for each column, not at
the database level.

Also, as far as I understand what we want to control here is the
encoding that the strings are in (the mapping of bytes to characters),
not the collation (the way a set of strings are ordered).  So it doesn't
make sense to set the NATIONAL CHARACTER option using the COLLATE
keyword.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-31 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
> Also, as far as I understand what we want to control here is the
> encoding that the strings are in (the mapping of bytes to characters),
> not the collation (the way a set of strings are ordered).  So it doesn't
> make sense to set the NATIONAL CHARACTER option using the COLLATE
> keyword.

My thought is that we should simply ignore the NATIONAL CHARACTER syntax,
which is not the first nor the last brain-damaged feature design in the
SQL standard.  It's basically useless for what we want because there's
noplace to specify which encoding you mean.  Instead, let's consider that
COLLATE can define not only the collation but also the encoding of a
string datum.  Contrary to what I think you meant above, that seems
perfectly sensible to me, because after all a collation is necessarily a
bunch of rules about how to order a particular set of characters.  If the
data representation you use is unable to represent that set of characters,
it's not a very meaningful combination is it?
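
To illustrate (the ENCODING property here is invented; CREATE COLLATION
today knows nothing about encodings):

-- Hypothetical: a collation object that also pins the storage encoding.
CREATE COLLATION sjis_japanese (
    LC_COLLATE = 'ja_JP.sjis',
    LC_CTYPE   = 'ja_JP.sjis',
    ENCODING   = 'SJIS'                -- invented attribute
);

CREATE TABLE docs (
    body text COLLATE sjis_japanese    -- stored in SJIS, sorted Japanese-style
);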

There's still the problem of how do you get a string of a nondefault
encoding into the database in the first place.  If you have to convert
to DB encoding to get it in there, then what's the use of a further
conversion?  This consideration may well kill the whole concept.
(It certainly kills NATIONAL CHARACTER syntax just as much.)

regards, tom lane




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-31 Thread Arulappan, Arul Shaji
> From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
 
> Boguk, Maksym wrote:
 
>> I think I gave a wrong description there... it will not be a GUC but a
>> GUC-type value which will be initialized during CREATE DATABASE and
>> will be read-only afterwards, very similar to lc_collate.
>> So I think the name national_lc_collate would be better.
>> The function of this value is to provide information about the default
>> collation for NATIONAL CHARACTERS inside the database.
>> That does not limit the user's ability to use an alternative collation
>> for NATIONAL CHARACTERS during CREATE TABLE via the COLLATE keyword.
 
> This seems a bit odd.  I mean, if I want the option for differing
> encodings, surely I need to be able to set them for each column, not at
> the database level.
>
> Also, as far as I understand what we want to control here is the
> encoding that the strings are in (the mapping of bytes to characters),
> not the collation

Yes, that is our idea too. For the SQL syntax

CREATE TABLE tbl1 (col1 nchar);

what should be the encoding and collation for col1? Because the idea is
to have them in a separate encoding and collation (if needed) from that
of the rest of the table. We have the following options:

a) Have GUC variables that determine the default encoding and collation
for nchar/nvarchar columns. Note that the collation variable is a default
only; users can still override it per column.
b) Have the encoding name and collation as part of the syntax, e.g.
(col1 nchar encoding UTF-8 COLLATE C). Ugly, but workable (sketched
below).
c) Be rigid and say nchar/nvarchar columns are UTF-8 (or something else)
by default. One cannot change the default, but it can be overridden when
declaring the column with a syntax similar to (b).
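
A rough sketch of (b), and of the override in (c) (all of this is
proposed syntax, none of it accepted today):

-- (b) encoding and collation spelled out per column:
CREATE TABLE tbl1 (col1 nchar(10) ENCODING 'UTF-8' COLLATE "C");

-- (c) rigid default (say UTF-8), overridable per column as in (b):
CREATE TABLE tbl2 (col1 nchar(10));                    -- UTF-8 by default
CREATE TABLE tbl3 (col1 nchar(10) ENCODING 'UTF-16');  -- explicit override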


Rgds,
Arul Shaji



> (the way a set of strings are ordered).  So it doesn't make sense to
> set the NATIONAL CHARACTER option using the COLLATE keyword.
>
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-31 Thread Arulappan, Arul Shaji
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
 
> Alvaro Herrera alvhe...@2ndquadrant.com writes:
>> Also, as far as I understand what we want to control here is the
>> encoding that the strings are in (the mapping of bytes to characters),
>> not the collation (the way a set of strings are ordered).  So it
>> doesn't make sense to set the NATIONAL CHARACTER option using the
>> COLLATE keyword.
 
> My thought is that we should simply ignore the NATIONAL CHARACTER
> syntax, which is not the first nor the last brain-damaged feature
> design in the SQL standard.  It's basically useless for what we want
> because there's noplace to specify which encoding you mean.  Instead,
> let's consider that COLLATE can define not only the collation but also
> the encoding of a string datum.

Yes, I don't have a problem with this. If I understand you correctly,
this will be simpler syntax-wise, but still gets nchar/nvarchar data
types into a table, in a different encoding from the rest of the table.

 
> There's still the problem of how do you get a string of a nondefault
> encoding into the database in the first place.

Yes, that is the bulk of the work. It will need changes in a whole lot
of places.

Is a step-by-step approach worth exploring? Something similar to:

Step 1: Support nchar/nvarchar data types. Restrict them to UTF-8
databases to begin with.
Step 2: Support multiple encodings in a database. Remove the restriction
imposed in step 1.


Rgds,
Arul Shaji






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-30 Thread Arulappan, Arul Shaji
> -----Original Message-----
> From: Tatsuo Ishii [mailto:is...@postgresql.org]
>
> Also I don't understand why you need UTF-16 support as a database
> encoding, because UTF-8 and UTF-16 are logically equivalent; they are
> just different representations (encodings) of Unicode. That means if we
> already support UTF-8 (I'm sure we already do), there's no particular
> reason we need to add UTF-16 support.
>
> Maybe you just want to support UTF-16 as a client encoding?

Given below is a design draft for this functionality:

Core new functionality (new code):
1) Create and register independent NCHAR/NVARCHAR/NTEXT data types.

2) Provide support for the new GUC nchar_collation to provide the
database with information about the default collation that needs to be
used for the new data types.

3) Create encoding conversion subroutines to convert strings between
the database encoding and UTF8 (from national strings to regular
strings and back).
PostgreSQL already has all the required support (used for conversion
between the database encoding and client_encoding), so the amount of
new code there will be minimal.

4) Because all symbols from non-UTF8 encodings can be represented as
UTF8 (but the reverse is not true), comparisons between N* types and
the regular string types inside the database will be performed in UTF8
form. To achieve this, the following new IMPLICIT casts may need to be
created:
NCHAR -> CHAR
NVARCHAR -> VARCHAR
NTEXT -> TEXT

Casting in the reverse direction will be available too, but only as
EXPLICIT.
However, these casts can fail if national strings cannot be represented
in the database encoding in use.

All these casts will use the subroutines created in 3); a sketch of the
cast setup follows below.

Casting/conversion between N* types will follow the same
rules/mechanics as used for casting/conversion between the usual
(CHAR(N)/VARCHAR(N)/TEXT) string types.
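
If the N* types land as ordinary user-visible types, the cast setup
could follow the usual CREATE CAST pattern (a sketch; the nvarchar type
and the two conversion functions are the proposed objects, nothing that
exists today):

-- The direction the draft makes IMPLICIT:
CREATE CAST (nvarchar AS varchar)
    WITH FUNCTION nvarchar_to_varchar(nvarchar) AS IMPLICIT;

-- The reverse direction, available but EXPLICIT only:
CREATE CAST (varchar AS nvarchar)
    WITH FUNCTION varchar_to_nvarchar(varchar);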


5) Comparison between NATIONAL string values will be performed via
specialized UTF8-optimized functions (with respect to the
nchar_collation setting).


6) Client input/output of NATIONAL strings - NATIONAL strings will
respect the client_encoding setting, and their values will be
transparently converted to the requested client_encoding before being
sent to (received from) the client (the same mechanics as used for the
usual string types).
So no mixed encoding in client input/output will be
supported/available.

7) Create a set of regression tests for these new data types.


Additional changes:
1) ECPG support for these new types.
2) Support in the database drivers for the new data types.

Rgds,
Arul Shaji




> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-30 Thread Tom Lane
Arulappan, Arul Shaji a...@fast.au.fujitsu.com writes:
> Given below is a design draft for this functionality:
>
> Core new functionality (new code):
> 1) Create and register independent NCHAR/NVARCHAR/NTEXT data types.
>
> 2) Provide support for the new GUC nchar_collation to provide the
> database with information about the default collation that needs to be
> used for the new data types.

A GUC seems like completely the wrong tack to be taking.  In the first
place, that would mandate just one value (at a time anyway) of
collation, which is surely not much of an advance over what's already
possible.  In the second place, what happens if you change the value?
All your indexes on nchar columns are corrupt, that's what.  Actually
the data itself would be corrupt, if you intend that this setting
determines the encoding and not just the collation.  If you really are
speaking only of collation, it's not clear to me exactly what this
proposal offers that can't be achieved today (with greater security,
functionality and spec compliance) by using COLLATE clauses on plain
text columns.

Actually, you really haven't answered at all what it is you want to do
that COLLATE can't do.
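
For reference, what per-column collation already buys you today
(assuming the named locale is installed on the server):

CREATE TABLE documents (
    title text COLLATE "ja_JP.utf8",   -- per-column collation, since 9.1
    body  text                         -- database default collation
);

SELECT title FROM documents ORDER BY title;              -- ja_JP.utf8 order
SELECT title FROM documents ORDER BY title COLLATE "C";  -- per-query override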

> 4) Because all symbols from non-UTF8 encodings can be represented as
> UTF8 (but the reverse is not true), comparisons between N* types and
> the regular string types inside the database will be performed in UTF8
> form.

I believe that in some Far Eastern character sets there are some
characters that map to the same Unicode glyph, but that some people
would prefer to keep separate.  So transcoding to UTF8 isn't necessarily
lossless.  This is one of the reasons why we've resisted adopting ICU or
standardizing on UTF8 as the One True Database Encoding.  Now this may
or may not matter for comparison to strings that were in some other
encoding to start with --- but as soon as you base your design on the
premise that UTF8 is a universal encoding, you are sliding down a
slippery slope to a design that will meet resistance.

> 6) Client input/output of NATIONAL strings - NATIONAL strings will
> respect the client_encoding setting, and their values will be
> transparently converted to the requested client_encoding before being
> sent to (received from) the client (the same mechanics as used for the
> usual string types).
> So no mixed encoding in client input/output will be
> supported/available.

If you have this restriction, then I'm really failing to see what
benefit there is over what can be done today with COLLATE.

regards, tom lane




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-16 Thread Martijn van Oosterhout
On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:
>> Does support for alternative multi-byte encodings have something to do
>> with the Han unification controversy? I don't know terribly much about
>> this, so apologies if that's just wrong.
>
> There's a famous problem regarding conversion between Unicode and other
> encodings, such as Shift JIS.
>
> There are lots of discussions on this. Here is one from Microsoft:
>
> http://support.microsoft.com/kb/170559/EN-US

Apart from Shift-JIS not being well defined (it's more a family of
encodings), it has the unusual feature of providing multiple ways to
encode the same character.  This is not even a Han unification issue;
those have largely been addressed.  For example, the square-root symbol
exists twice (0x8795 and 0x81E3), as do many other mathematical
symbols.

Here's the code page which you can browse online:

http://msdn.microsoft.com/en-us/goglobal/cc305152

Which means that, to be round-trippable, Unicode would have to double
those characters, but this would make it hard or impossible to
round-trip with any other character set that had those characters.  No
easy solution here.
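
The collapse is easy to see from SQL, assuming the server's SJIS
conversion follows the vendor tables that include the NEC rows (if it
doesn't, the first call simply errors out):

SELECT convert_from('\x8795'::bytea, 'SJIS');  -- NEC doubled square root
SELECT convert_from('\x81e3'::bytea, 'SJIS');  -- JIS X 0208 square root
-- Both return the same character, so the original byte sequence
-- cannot be recovered after a round trip through Unicode.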

Something that has been done before [1] is to map the doubles to the
private use area of the Unicode space (0xE000-0xF8FF).  It gives you
round-trip support at the expense of having to handle those characters
yourself.  But since Postgres doesn't do anything meaningful with
Unicode characters, this might be acceptable.

[1] Python does a similar trick to handle filenames coming from disk in
an unknown encoding:
http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-15 Thread Peter Geoghegan
On Mon, Jul 15, 2013 at 4:37 AM, Robert Haas robertmh...@gmail.com wrote:
> On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
>> Yes, as far as I know almost all use utf8 without problems. For a long
>> time I didn't see any request for multi-encoding support.
>
> Well, not *everything* can be represented as UTF-8; I think this is
> particularly an issue with Asian languages.

What cannot be represented as UTF-8? UTF-8 can represent every
character in the Unicode character set, whereas UTF-16 can encode
characters 0 to 0x10FFFF.

Does support for alternative multi-byte encodings have something to do
with the Han unification controversy? I don't know terribly much about
this, so apologies if that's just wrong.

-- 
Peter Geoghegan




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-15 Thread Tatsuo Ishii
> On Mon, Jul 15, 2013 at 4:37 AM, Robert Haas robertmh...@gmail.com wrote:
>> On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
>>> Yes, as far as I know almost all use utf8 without problems. For a long
>>> time I didn't see any request for multi-encoding support.
>>
>> Well, not *everything* can be represented as UTF-8; I think this is
>> particularly an issue with Asian languages.
>
> What cannot be represented as UTF-8? UTF-8 can represent every
> character in the Unicode character set, whereas UTF-16 can encode
> characters 0 to 0x10FFFF.
>
> Does support for alternative multi-byte encodings have something to do
> with the Han unification controversy? I don't know terribly much about
> this, so apologies if that's just wrong.

There's a famous problem regarding conversion between Unicode and other
encodings, such as Shift JIS.

There are lots of discussions on this. Here is one from Microsoft:

http://support.microsoft.com/kb/170559/EN-US
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-15 Thread Tatsuo Ishii
>> On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
>>> Yes, as far as I know almost all use utf8 without problems. For a long
>>> time I didn't see any request for multi-encoding support.
>>
>> Well, not *everything* can be represented as UTF-8; I think this is
>> particularly an issue with Asian languages.
>>
>> If we chose to do it, I think that per-column encoding support would
>> end up looking a lot like per-column collation support: it would be yet
>> another per-column property along with typoid, typmod, and
>> typcollation.  I'm not entirely sure it's worth it, although FWIW I do
>> believe Oracle has something like this.
>
> Yes, the idea is that users will be able to declare columns of type
> NCHAR or NVARCHAR which will use the pre-determined encoding type. If we
> say that NCHAR is UTF-8 then the NCHAR column will be of UTF-8 encoding
> irrespective of the database encoding. It will be up to us to restrict
> what Unicode encodings we want to support for NCHAR/NVARCHAR columns.
> This is based on my interpretation of the SQL standard. As you allude to
> above, Oracle has a similar behaviour (they support UTF-16 as well).
>
> Support for UTF-16 will be difficult without linking with some external
> libraries such as ICU.

Can you please elaborate more on this? Why exactly do you need ICU?

Also I don't understand why you need UTF-16 support as a database
encoding, because UTF-8 and UTF-16 are logically equivalent; they are
just different representations (encodings) of Unicode. That means if we
already support UTF-8 (I'm sure we already do), there's no particular
reason we need to add UTF-16 support.

Maybe you just want to support UTF-16 as a client encoding?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-15 Thread Peter Geoghegan
On Mon, Jul 15, 2013 at 8:58 AM, Tatsuo Ishii is...@postgresql.org wrote:
> Also I don't understand why you need UTF-16 support as a database
> encoding, because UTF-8 and UTF-16 are logically equivalent; they are
> just different representations (encodings) of Unicode. That means if we
> already support UTF-8 (I'm sure we already do), there's no particular
> reason we need to add UTF-16 support.

To be fair, there is a small reason to support UTF-16 even with UTF-8
available. I personally do not find it compelling, but perhaps I am
not best placed to judge such things. As Wikipedia says in the
English UTF-8 article:

Characters U+0800 through U+ use three bytes in UTF-8, but only
two in UTF-16. As a result, text in (for example) Chinese, Japanese or
Hindi could take more space in UTF-8 if there are more of these
characters than there are ASCII characters. This happens for pure text
but rarely for HTML documents. For example, both the Japanese UTF-8
and the Hindi Unicode articles on Wikipedia take more space in UTF-16
than in UTF-8.
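
The arithmetic is easy to check in a UTF-8 database (the UTF-16 figure
is simply two bytes per character for anything in this range):

SELECT octet_length('日本語');  -- 9 bytes in UTF-8 (3 per character)
SELECT octet_length('abc');     -- 3 bytes (1 per character)
-- The same three kanji would occupy 6 bytes in UTF-16.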

This is the only advantage of UTF-16 over UTF-8 as a server encoding.
I'm inclined to take the fact that there have been so few (no?)
complaints from PostgreSQL's large Japanese user base about the lack
of UTF-16 support as suggesting that it isn't considered to be a
compelling feature in the CJK realm.

-- 
Peter Geoghegan




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-15 Thread Peter Eisentraut
On 7/15/13 1:26 AM, Arulappan, Arul Shaji wrote:
> Yes, the idea is that users will be able to declare columns of type
> NCHAR or NVARCHAR which will use the pre-determined encoding type. If we
> say that NCHAR is UTF-8 then the NCHAR column will be of UTF-8 encoding
> irrespective of the database encoding. It will be up to us to restrict
> what Unicode encodings we want to support for NCHAR/NVARCHAR columns.

I would try implementing this as an extension at first, with a new data
type that is internally encoded differently.  We have citext as
precedent for successfully implementing text-like data types in user space.
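
For comparison, the citext precedent -- a text-like type whose special
behaviour (there, case-insensitive comparison) lives entirely in an
extension:

CREATE EXTENSION citext;
CREATE TABLE users (email citext UNIQUE);
INSERT INTO users VALUES ('Foo@Example.COM');
SELECT * FROM users WHERE email = 'foo@example.com';   -- matches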





Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-14 Thread Robert Haas
On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
> Yes, as far as I know almost all use utf8 without problems. For a long
> time I didn't see any request for multi-encoding support.

Well, not *everything* can be represented as UTF-8; I think this is
particularly an issue with Asian languages.

If we chose to do it, I think that per-column encoding support would
end up looking a lot like per-column collation support: it would be
yet another per-column property along with typoid, typmod, and
typcollation.  I'm not entirely sure it's worth it, although FWIW I do
believe Oracle has something like this.  At any rate, it seems like
quite a lot of work.

Another idea would be to do something like what we do for range types
- i.e. allow a user to declare a type that is a differently-encoded
version of some base type.  But even that seems pretty hard.
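
Something like this, perhaps (entirely invented syntax, by loose
analogy with CREATE TYPE ... AS RANGE):

-- Hypothetical: derive a type from text that stores its payload in SJIS.
CREATE TYPE sjis_text AS ENCODED (base = text, encoding = 'SJIS');

CREATE TABLE legacy_docs (body sjis_text);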

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-14 Thread Arulappan, Arul Shaji
 
> On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
>> Yes, as far as I know almost all use utf8 without problems. For a long
>> time I didn't see any request for multi-encoding support.
>
> Well, not *everything* can be represented as UTF-8; I think this is
> particularly an issue with Asian languages.
>
> If we chose to do it, I think that per-column encoding support would
> end up looking a lot like per-column collation support: it would be yet
> another per-column property along with typoid, typmod, and
> typcollation.  I'm not entirely sure it's worth it, although FWIW I do
> believe Oracle has something like this.

Yes, the idea is that users will be able to declare columns of type
NCHAR or NVARCHAR which will use the pre-determined encoding type. If we
say that NCHAR is UTF-8 then the NCHAR column will be of UTF-8 encoding
irrespective of the database encoding. It will be up to us to restrict
what Unicode encodings we want to support for NCHAR/NVARCHAR columns.
This is based on my interpretation of the SQL standard. As you allude to
above, Oracle has a similar behaviour (they support UTF-16 as well).  

Support for UTF-16 will be difficult without linking with some external
libraries such as ICU. 


> At any rate, it seems like quite a lot of work.

Thanks for putting my mind at ease ;-)

Rgds,
Arul Shaji


 
> Another idea would be to do something like what we do for range types
> - i.e. allow a user to declare a type that is a differently-encoded
> version of some base type.  But even that seems pretty hard.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-05 Thread Arulappan, Arul Shaji
Ishii san,

Thank you for your positive and early response.

> -----Original Message-----
> From: Tatsuo Ishii [mailto:is...@postgresql.org]
> Sent: Friday, 5 July 2013 3:02 PM
> To: Arulappan, Arul Shaji
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Proposal - Support for National Characters
> functionality
>
> Arul Shaji,
>
> NCHAR support has been on our TODO list for some time and I would like
> to welcome efforts trying to implement it. However I have a few
> questions:
>
>> This is a proposal to implement functionalities for the handling of
>> National Characters.
>>
>> [Introduction]
>>
>> The aim of this proposal is to eventually have a way to represent
>> 'National Characters' in a uniform way, even in non-UTF8 encoded
>> databases. Many of our customers in the Asian region are now, as part
>> of their platform modernization, moving away from mainframes where
>> they have used National Character representation in COBOL and other
>> databases. Having stronger support for national character
>> representation will also make it easier for these customers to look at
>> PostgreSQL more favourably when migrating from other well-known RDBMSs
>> which all have varying degrees of NCHAR/NVARCHAR support.
>>
>> [Specifications]
>>
>> Broadly speaking, the national characters implementation will ideally
>> include the following:
>> - Support for NCHAR/NVARCHAR data types
>> - Representing NCHAR and NVARCHAR columns in UTF-8 encoding in
>> non-UTF8 databases
>
> I think this is not trivial work because we do not have a framework to
> allow mixed encodings in a database. I'm interested in how you are
> going to solve the problem.
 

I would be lying if I said I have the design already specced out. I
will be working on this in the coming weeks and hope to design a
working solution in consultation with the community.

>> - Support for UTF16 column encoding and representing NCHAR and
>> NVARCHAR columns in UTF16 encoding in all databases.
>
> Why do you need UTF-16 as the database encoding? UTF-8 is already
> supported, and any UTF-16 character can be represented in UTF-8 as far
> as I know.
 

Yes, that's correct. However there are advantages in using UTF-16
encoding for those characters that are always going to take at least
two bytes to represent.

Having said that, my intention is to use UTF-8 for NCHAR as well.
Supporting UTF-16 would be even more complicated as it is not supported
natively on some Linux platforms. I only included it to give an option.

>> - Support for a NATIONAL_CHARACTER_SET GUC variable that will
>> determine the encoding that will be used in NCHAR/NVARCHAR columns.
>
> You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's
> encoding is fixed to UTF-8?
 

If we are going to support only UTF-8 for NCHAR, then we obviously
don't need the GUC variable.

Rgds,
Arul Shaji



>> The above points are at the moment a 'wishlist' only. Our aim is to
>> tackle them one-by-one as we progress. I will send a detailed proposal
>> later with more technical details.
>>
>> The main aim at the moment is to get some feedback on the above to
>> know if this feature is something that would benefit PostgreSQL in
>> general, and if users maintaining DBs in non-English speaking regions
>> will find this beneficial.
>>
>> Rgds,
>> Arul Shaji
>>
>> P.S.: It has been quite some time since I sent a correspondence to
>> this list. Our mail server adds a standard legal disclaimer to all
>> outgoing mails, which I know this list is not a huge fan of. I used to
>> have an exemption for the mails I send to this list. If the disclaimer
>> appears, apologies in advance. I will rectify that on the next one.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-05 Thread Arulappan, Arul Shaji


> -----Original Message-----
> From: Claudio Freire [mailto:klaussfre...@gmail.com]
> Sent: Friday, 5 July 2013 3:41 PM
> To: Tatsuo Ishii
> Cc: Arulappan, Arul Shaji; PostgreSQL-Dev
> Subject: Re: [HACKERS] Proposal - Support for National Characters
> functionality
>
> On Fri, Jul 5, 2013 at 2:02 AM, Tatsuo Ishii is...@postgresql.org wrote:
>>> - Support for a NATIONAL_CHARACTER_SET GUC variable that will
>>> determine the encoding that will be used in NCHAR/NVARCHAR columns.
>>
>> You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's
>> encoding is fixed to UTF-8?
>
> Not only that, but I don't think it can be a GUC. Maybe a compile-time
> switch, but if it were a GUC, how do you handle an existing database in
> UTF-8 when the setting is switched to UTF-16? Re-encode everything?
> Store the encoding along each value? It's a mess.
>
> Either fix it at UTF-8, or make it a compile-time thing, I'd say.

Agreed that, to begin with, we should only support UTF-8 encoding for
NCHAR columns. If that is the case, do we still need a compile-time
option to turn NCHAR functionality on/off?

Rgds,
Arul Shaji






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-05 Thread Andrew Dunstan


On 07/05/2013 02:12 AM, Arulappan, Arul Shaji wrote:

>>> - Support for UTF16 column encoding and representing NCHAR and
>>> NVARCHAR columns in UTF16 encoding in all databases.
>>
>> Why do you need UTF-16 as the database encoding? UTF-8 is already
>> supported, and any UTF-16 character can be represented in UTF-8 as far
>> as I know.
>
> Yes, that's correct. However there are advantages in using UTF-16
> encoding for those characters that are always going to take at least
> two bytes to represent.

Any suggestion to store data in UTF-16 is likely to be a complete
non-starter. I suggest you research our previously stated requirements
for server-side encodings.


cheers

andrew




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-05 Thread Peter Eisentraut
On 7/4/13 10:11 PM, Arulappan, Arul Shaji wrote:
> The main aim at the moment is to get some feedback on the above to know
> if this feature is something that would benefit PostgreSQL in general,
> and if users maintaining DBs in non-English speaking regions will find
> this beneficial.

For European languages, I think everyone has moved to using Unicode,
so the demand for supporting multiple encodings is approaching zero.
The CJK realm might have different requirements.





Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-05 Thread Pavel Stehule
Yes, as far as I know almost all use utf8 without problems. For a long
time I didn't see any request for multi-encoding support.
On 5.7.2013 20:28, Peter Eisentraut pete...@gmx.net wrote:

> On 7/4/13 10:11 PM, Arulappan, Arul Shaji wrote:
>> The main aim at the moment is to get some feedback on the above to know
>> if this feature is something that would benefit PostgreSQL in general,
>> and if users maintaining DBs in non-English speaking regions will find
>> this beneficial.
>
> For European languages, I think everyone has moved to using Unicode,
> so the demand for supporting multiple encodings is approaching zero.
> The CJK realm might have different requirements.






[HACKERS] Proposal - Support for National Characters functionality

2013-07-04 Thread Arulappan, Arul Shaji
This is a proposal to implement functionalities for the handling of
National Characters. 

[Introduction]

The aim of this proposal is to eventually have a way to represent
'National Characters' in a uniform way, even in non-UTF8 encoded
databases. Many of our customers in the Asian region are now, as part
of their platform modernization, moving away from mainframes where
they have used National Character representation in COBOL and other
databases. Having stronger support for national character
representation will also make it easier for these customers to look at
PostgreSQL more favourably when migrating from other well-known RDBMSs
which all have varying degrees of NCHAR/NVARCHAR support.

[Specifications]

Broadly speaking, the national characters implementation will ideally
include the following:
- Support for NCHAR/NVARCHAR data types
- Representing NCHAR and NVARCHAR columns in UTF-8 encoding in non-UTF8
databases
- Support for UTF16 column encoding and representing NCHAR and NVARCHAR
columns in UTF16 encoding in all databases.
- Support for a NATIONAL_CHARACTER_SET GUC variable that will determine
the encoding that will be used in NCHAR/NVARCHAR columns.

The above points are at the moment a 'wishlist' only. Our aim is to
tackle them one-by-one as we progress. I will send a detailed proposal
later with more technical details.

The main aim at the moment is to get some feedback on the above to know
if this feature is something that would benefit PostgreSQL in general,
and if users maintaining DBs in non-English speaking regions will find
this beneficial.

Rgds,
Arul Shaji



P.S.: It has been quite some time since I sent a correspondence to this
list. Our mail server adds a standard legal disclaimer to all outgoing
mails, which I know this list is not a huge fan of. I used to have an
exemption for the mails I send to this list. If the disclaimer appears,
apologies in advance. I will rectify that on the next one.


Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-04 Thread Tatsuo Ishii
Arul Shaji,

NCHAR support has been on our TODO list for some time and I would like
to welcome efforts trying to implement it. However I have a few
questions:

> This is a proposal to implement functionalities for the handling of
> National Characters.
>
> [Introduction]
>
> The aim of this proposal is to eventually have a way to represent
> 'National Characters' in a uniform way, even in non-UTF8 encoded
> databases. Many of our customers in the Asian region are now, as part
> of their platform modernization, moving away from mainframes where
> they have used National Character representation in COBOL and other
> databases. Having stronger support for national character
> representation will also make it easier for these customers to look at
> PostgreSQL more favourably when migrating from other well-known RDBMSs
> which all have varying degrees of NCHAR/NVARCHAR support.
>
> [Specifications]
>
> Broadly speaking, the national characters implementation will ideally
> include the following:
> - Support for NCHAR/NVARCHAR data types
> - Representing NCHAR and NVARCHAR columns in UTF-8 encoding in non-UTF8
> databases

I think this is not trivial work because we do not have a framework to
allow mixed encodings in a database. I'm interested in how you are
going to solve the problem.

> - Support for UTF16 column encoding and representing NCHAR and
> NVARCHAR columns in UTF16 encoding in all databases.

Why do you need UTF-16 as the database encoding? UTF-8 is already
supported, and any UTF-16 character can be represented in UTF-8 as far
as I know.

> - Support for a NATIONAL_CHARACTER_SET GUC variable that will
> determine the encoding that will be used in NCHAR/NVARCHAR columns.

You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's
encoding is fixed to UTF-8?

> The above points are at the moment a 'wishlist' only. Our aim is to
> tackle them one-by-one as we progress. I will send a detailed proposal
> later with more technical details.
>
> The main aim at the moment is to get some feedback on the above to know
> if this feature is something that would benefit PostgreSQL in general,
> and if users maintaining DBs in non-English speaking regions will find
> this beneficial.
>
> Rgds,
> Arul Shaji
>
> P.S.: It has been quite some time since I sent a correspondence to this
> list. Our mail server adds a standard legal disclaimer to all outgoing
> mails, which I know this list is not a huge fan of. I used to have an
> exemption for the mails I send to this list. If the disclaimer appears,
> apologies in advance. I will rectify that on the next one.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-04 Thread Claudio Freire
On Fri, Jul 5, 2013 at 2:02 AM, Tatsuo Ishii is...@postgresql.org wrote:
>> - Support for a NATIONAL_CHARACTER_SET GUC variable that will
>> determine the encoding that will be used in NCHAR/NVARCHAR columns.
>
> You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's
> encoding is fixed to UTF-8?


Not only that, but I don't think it can be a GUC. Maybe a compile-time
switch, but if it were a GUC, how do you handle an existing database
in UTF-8 when the setting is switched to UTF-16? Re-encode everything?
Store the encoding along each value? It's a mess.

Either fix it at UTF-8, or make it a compile-time thing, I'd say.

