Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.

2013-09-17 Thread Arulappan, Arul Shaji


-----Original Message-----
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of MauMau

Hello,

I think it would be nice for PostgreSQL to support national character types,
largely because it would ease migration from other DBMSs.

[Reasons why we need NCHAR]
--
1. Invite users of other DBMSs to PostgreSQL.  Oracle, SQL Server, MySQL,
etc. all have NCHAR support.  PostgreSQL is probably the only major database
that does not support NCHAR.
Sadly, I've read a report from a Japanese government agency that the number
of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011.  I
wouldn't say that is due to NCHAR support, but it might be one reason.  I
want PostgreSQL to be more popular and to regain those users.

2. Enhance the open image of PostgreSQL by implementing more features of the
SQL standard.  NCHAR may be a wrong and unnecessary feature of the SQL
standard now that we have Unicode support, but it is defined in the standard
and widely implemented.

3. I have heard that some potential customers didn't adopt PostgreSQL due to
lack of NCHAR support.  However, I don't know the exact reason why they need
NCHAR.

The use case we have is for customers who are modernizing their databases on
mainframes.  These applications are typically written in COBOL, which has
extensive support for national characters.  Supporting national characters
as built-in data types in PostgreSQL is, not to exaggerate, an important
criterion in their decision whether to use PostgreSQL. (So is embedded
COBOL, but that is a separate issue.)




4. I guess some users really want to continue to use Shift_JIS or EUC_JP as
the database encoding, and use NCHAR for a limited set of columns to store
international text in Unicode:
- to avoid code conversion between the server and the client, for performance
- because Shift_JIS and EUC_JP require less storage (2 bytes for most Kanji)
than UTF-8 (3 bytes)
This use case is described in chapter 6 of the Oracle Database Globalization
Support Guide.
--


I think we need to do the following:

[Minimum requirements]
--
1. Accept NCHAR/NVARCHAR as data type names and N'...' as literal syntax.
This is already implemented.  PostgreSQL treats NCHAR/NVARCHAR as synonyms
for CHAR/VARCHAR, and ignores the N prefix.  But this is not documented.

2. Declare support for national characters in the manual.
Item 1 alone is not sufficient because users don't want to depend on
undocumented behavior.  This is exactly what the TODO item "national
character support" in the PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that:
- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as
CHAR/VARCHAR.
- additional features can be implemented for NCHAR/NVARCHAR in the future,
as described below.
--
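The synonym behavior described in item 1 can be sketched like this (a sketch
only; NCHAR and NCHAR VARYING are the spellings the grammar accepts, and the
comments describe what I believe current releases do):

```sql
-- Today, the national-character spellings are parsed but immediately
-- mapped to the plain character types, and the N prefix is dropped.
CREATE TABLE nchar_demo (
    a NCHAR(10),            -- \d displays this as character(10)
    b NCHAR VARYING(20)     -- \d displays this as character varying(20)
);

INSERT INTO nchar_demo VALUES (N'abc', N'def');  -- N'...' treated as '...'
```

Item 3 is precisely about making \d and pg_dump preserve the
national-character spellings instead of silently mapping them away.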


Agreed. This is our minimum requirement too. 

Rgds,
Arul Shaji








[Optional requirements]
--
1. Implement client driver support, such as:
- an NCHAR host variable type (e.g. NCHAR var_name[12];) in ECPG, as
specified in the SQL standard.
- national character methods (e.g. setNString, getNString,
setNCharacterStream) as specified in JDBC 4.0.
I think at first we can treat these national-character-specific features the
same as CHAR/VARCHAR.

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.
I think it is sufficient at first that NCHAR/NVARCHAR columns can be used
only in UTF-8 databases and store UTF-8 strings.  This allows us to reuse
the input/output/send/recv functions and other infrastructure of
CHAR/VARCHAR.  This is a reasonable compromise to avoid duplication and
minimize the first implementation of NCHAR support.

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
A fixed-width encoding may allow faster string manipulation, as described in
Oracle's manual.  But I'm not sure about this, because UTF-16 is not a truly
fixed-width encoding due to supplementary characters.

This would definitely be a welcome addition. 
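The first-cut restriction in item 2 might behave roughly as follows
(hypothetical: NCHAR is not a distinct type today, so the error in the last
comment is invented for illustration):

```sql
-- Hypothetical first-cut behavior: distinct national-character types,
-- permitted only when the database encoding is UTF8.
CREATE DATABASE utf8db WITH ENCODING 'UTF8' TEMPLATE template0;

-- In utf8db: accepted; the column stores UTF-8 and reuses varchar I/O.
CREATE TABLE t1 (c NCHAR VARYING(10));

-- In a database created WITH ENCODING 'EUC_JP': rejected under the
-- proposal, e.g. "ERROR: national character types require UTF8 encoding".
```

This keeps the first patch small; lifting the restriction later is exactly
what item 2 proposes.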



--


I don't think it is good to implement the NCHAR/NVARCHAR types as extensions
like contrib/citext, because NCHAR/NVARCHAR are basic types and need
client-side support.  That is, client drivers need to be aware of the fixed
NCHAR/NVARCHAR OID values.

How do you think we should implement NCHAR support?

Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers





Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-31 Thread Arulappan, Arul Shaji
 From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
 
 Boguk, Maksym escribió:
 
  I think I gave a wrong description there... it will not be a GUC but a
  GUC-type value which will be initialized during CREATE DATABASE and
  will be read-only after that, very similar to lc_collate.
  So I think the name national_lc_collate will be better.
  The function of this value is to provide information about the default
  collation for NATIONAL CHARACTERS inside the database.
  That does not limit the user's ability to use an alternative collation
  for NATIONAL CHARACTERS during CREATE TABLE via the COLLATE keyword.
 
 This seems a bit odd.  I mean, if I want the option of differing encodings,
 surely I need to be able to set them for each column, not at the database
 level.
 
 Also, as far as I understand, what we want to control here is the encoding
 that the strings are in (the mapping of bytes to characters), not the collation

Yes, that is our idea too.  For the SQL syntax

Create table tbl1 (col1 nchar);

what should be the encoding and collation for col1?  The idea is to have
them in a separate encoding and collation (if needed) from the rest of the
table.  We have the following options:

a) Have GUC variables that determine the default encoding and collation for
nchar/nvarchar columns.  Note that the collation variable is only a default;
users can still override it per column.
b) Have the encoding name and collation as part of the syntax, for example
(col1 nchar encoding UTF-8 COLLATE C).  Ugly, but workable.
c) Be rigid and say nchar/nvarchar columns are by default UTF-8 (or
something else).  One cannot change the default, but it can be overridden
when declaring the column, with syntax similar to (b).
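
Hypothetical sketches of how the three options might read (the ENCODING
column option in (b) and the nchar_encoding GUC name in (a) are invented
here for illustration; none of this syntax exists today):

```sql
-- (a) defaults come from GUCs, e.g. SET nchar_encoding = 'UTF8';
--     a column can still override the collation:
CREATE TABLE tbl_a (col1 nchar(10) COLLATE "C");

-- (b) encoding and collation spelled out in the column definition:
CREATE TABLE tbl_b (col1 nchar(10) ENCODING 'UTF8' COLLATE "C");

-- (c) encoding hard-wired to UTF-8; the declaration looks like (a),
--     but there is no GUC to change the default.
CREATE TABLE tbl_c (col1 nchar(10) COLLATE "C");
```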


Rgds,
Arul Shaji



 (the way a set of strings are ordered).  So it doesn't make sense to set the
 NATIONAL CHARACTER option using the COLLATE keyword.




 
 --
 Álvaro Herrera                http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-31 Thread Arulappan, Arul Shaji
 From: Tom Lane [mailto:t...@sss.pgh.pa.us]
 
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  Also, as far as I understand what we want to control here is the
  encoding that the strings are in (the mapping of bytes to characters),
  not the collation (the way a set of strings are ordered).  So it
  doesn't make sense to set the NATIONAL CHARACTER option using the
  COLLATE keyword.
 
 My thought is that we should simply ignore the NATIONAL CHARACTER syntax,
 which is not the first nor the last brain-damaged feature design in the SQL
 standard.  It's basically useless for what we want because there's no place
 to specify which encoding you mean.  Instead, let's consider that COLLATE
 can define not only the collation but also the encoding of a string datum.

Yes, I don't have a problem with this.  If I understand you correctly, this
will be simpler syntax-wise, but still gets nchar/nvarchar data types into a
table, in a different encoding from the rest of the table.

 
 There's still the problem of how you get a string of a nondefault encoding
 into the database in the first place.

Yes, that is the bulk of the work.  It will need changes in a whole lot of
places.

Is a step-by-step approach worth exploring?  Something similar to:

Step 1: Support the nchar/nvarchar data types, restricted to UTF-8 databases
to begin with.
Step 2: Support multiple encodings in a database, and remove the restriction
imposed in step 1.


Rgds,
Arul Shaji




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-30 Thread Arulappan, Arul Shaji
 -----Original Message-----
 From: Tatsuo Ishii [mailto:is...@postgresql.org]
 
 
 Also I don't understand why you need UTF-16 support as a database encoding,
 because UTF-8 and UTF-16 are logically equivalent; they are just different
 representations (encodings) of Unicode.  That means if we already support
 UTF-8 (I'm sure we already do), there's no particular reason we need to add
 UTF-16 support.
 
 Maybe you just want to support UTF-16 as a client encoding?

Given below is a design draft for this functionality:

Core new functionality (new code):
1) Create and register independent NCHAR/NVARCHAR/NTEXT data types.

2) Provide support for a new GUC, nchar_collation, which tells the database
the default collation to use for the new data types.

3) Create encoding-conversion subroutines to convert strings between the
database encoding and UTF-8 (from national strings to regular strings and
back).
PostgreSQL already has all the required support (used for conversion between
the database encoding and client_encoding), so the amount of new code there
will be minimal.

4) Because all symbols from non-UTF-8 encodings can be represented in UTF-8
(but the reverse is not true), comparisons between the N* types and the
regular string types inside the database will be performed in UTF-8 form.
To achieve this, new IMPLICIT casts may need to be created:
NCHAR -> CHAR
NVARCHAR -> VARCHAR
NTEXT -> TEXT

Casting in the reverse direction will be available too, but only as
EXPLICIT.  These casts could fail if a national string cannot be represented
in the database encoding in use.

All these casts will use the subroutines created in 3).

Casting/conversion between the N* types will follow the same rules/mechanics
as casting/conversion between the usual string types
(CHAR(N)/VARCHAR(N)/TEXT).


5) Comparisons between NATIONAL string values will be performed via
specialized UTF-8-optimized functions (respecting the nchar_collation
setting).

6) Client input/output of NATIONAL strings: NATIONAL strings will respect
the client_encoding setting, and their values will be transparently
converted to the requested client_encoding when sent to (or received from)
the client (the same mechanics as for the usual string types).
So no mixed encodings in client input/output will be supported.

7) Create a set of regression tests for these new data types.
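
The implicit casts in item 4 might be declared roughly like this (a sketch
only: the nvarchar type and the conversion functions named here are
hypothetical, standing in for the subroutines of step 3):

```sql
-- Hypothetical: implicit cast from the proposed nvarchar type to the
-- regular varchar, via a UTF-8 -> database-encoding conversion function.
CREATE CAST (nvarchar AS varchar)
    WITH FUNCTION nvarchar_to_varchar(nvarchar)
    AS IMPLICIT;

-- The reverse direction is explicit only; it may raise an error when the
-- value cannot be represented in the database encoding.
CREATE CAST (varchar AS nvarchar)
    WITH FUNCTION varchar_to_nvarchar(varchar);
```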


Additional changes:
1) ECPG support for these new types.
2) Support in the database drivers for the new data types.

Rgds,
Arul Shaji




 --
 Tatsuo Ishii
 SRA OSS, Inc. Japan
 English: http://www.sraoss.co.jp/index_en.php
 Japanese: http://www.sraoss.co.jp






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-14 Thread Arulappan, Arul Shaji
 
 On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com
 wrote:
  Yes, as far as I know almost all use utf8 without problems.  I haven't
  seen any request for multi-encoding support for a long time.
 
 Well, not *everything* can be represented as UTF-8; I think this is
 particularly an issue with Asian languages.
 
 If we chose to do it, I think that per-column encoding support would end up
 looking a lot like per-column collation support: it would be yet another
 per-column property along with typoid, typmod, and typcollation.  I'm not
 entirely sure it's worth it, although FWIW I do believe Oracle has
 something like this.

Yes, the idea is that users will be able to declare columns of type NCHAR
or NVARCHAR which will use a predetermined encoding.  If we say that NCHAR
is UTF-8, then an NCHAR column will use UTF-8 encoding irrespective of the
database encoding.  It will be up to us to restrict which Unicode encodings
we want to support for NCHAR/NVARCHAR columns.  This is based on my
interpretation of the SQL standard.  As you allude to above, Oracle has
similar behaviour (they support UTF-16 as well).

Support for UTF-16 will be difficult without linking to an external library
such as ICU.


 At any rate, it seems like quite a lot of work.

Thanks for putting my mind at ease ;-)

Rgds,
Arul Shaji


 
 Another idea would be to do something like what we do for range types -
 i.e. allow a user to declare a type that is a differently-encoded version
 of some base type.  But even that seems pretty hard.
 
 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-05 Thread Arulappan, Arul Shaji
Ishii san,

Thank you for your positive and early response.

 -----Original Message-----
 From: Tatsuo Ishii [mailto:is...@postgresql.org]
 Sent: Friday, 5 July 2013 3:02 PM
 To: Arulappan, Arul Shaji
 Cc: pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Proposal - Support for National Characters
 functionality
 
 Arul Shaji,
 
 NCHAR support has been on our TODO list for some time, and I would like to
 welcome efforts to implement it.  However, I have a few questions:
 
  This is a proposal to implement functionalities for the handling of
  National Characters.
 
  [Introduction]
 
  The aim of this proposal is to eventually have a way to represent
  'National Characters' in a uniform way, even in non-UTF8 encoded
  databases.  Many of our customers in the Asian region who are now, as
  part of their platform modernization, moving away from mainframes have
  used National Characters representation in COBOL and other databases.
  Having stronger support for national characters representation will
  also make it easier for these customers to look at PostgreSQL more
  favourably when migrating from other well-known RDBMSs, which all have
  varying degrees of NCHAR/NVARCHAR support.
 
  [Specifications]
 
  Broadly speaking, the national characters implementation ideally will
  include the following:
  - Support for NCHAR/NVARCHAR data types
  - Representing NCHAR and NVARCHAR columns in UTF-8 encoding in
  non-UTF8 databases
 
 I think this is not trivial work, because we do not have a framework that
 allows mixed encodings in a database.  I'm interested in how you are going
 to solve the problem.
 

I would be lying if I said I had the design already specced out.  I will be
working on this in the coming weeks and hope to design a working solution in
consultation with the community.

  - Support for UTF16 column encoding and representing NCHAR and
  NVARCHAR columns in UTF16 encoding in all databases.
 
 Why do you need UTF-16 as the database encoding?  UTF-8 is already
 supported, and any UTF-16 character can be represented in UTF-8 as far as I
 know.
 

Yes, that's correct.  However, there are advantages to using UTF-16
encoding for those characters that are always going to take at least two
bytes to represent.

Having said that, my intention is to use UTF-8 for NCHAR as well.
Supporting UTF-16 would be even more complicated, as it is not supported
natively on some Linux platforms.  I only included it to give an option.

  - Support for a NATIONAL_CHARACTER_SET GUC variable that will determine
  the encoding that will be used in NCHAR/NVARCHAR columns.
 
 You said NCHAR's encoding is UTF-8.  Why do you need the GUC if NCHAR's
 encoding is fixed to UTF-8?
 

If we are going to support only UTF-8 for NCHAR, then obviously we don't
need the GUC variable.

Rgds,
Arul Shaji



  The above points are at the moment a 'wishlist' only.  Our aim is to
  tackle them one-by-one as we progress.  I will send a detailed proposal
  later with more technical details.
 
  The main aim at the moment is to get some feedback on the above, to
  know if this feature is something that would benefit PostgreSQL in
  general, and if users maintaining DBs in non-English-speaking regions
  will find it beneficial.
 
  Rgds,
  Arul Shaji
 
 
 
  P.S.: It has been quite some time since I sent a correspondence to
  this list.  Our mail server adds a standard legal disclaimer to all
  outgoing mails, which I know this list is not a huge fan of.  I used
  to have an exemption for the mails I send to this list.  If the
  disclaimer appears, apologies in advance.  I will rectify that on the
  next one.
 --
 Tatsuo Ishii
 SRA OSS, Inc. Japan
 English: http://www.sraoss.co.jp/index_en.php
 Japanese: http://www.sraoss.co.jp






Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-05 Thread Arulappan, Arul Shaji


 -----Original Message-----
 From: Claudio Freire [mailto:klaussfre...@gmail.com]
 Sent: Friday, 5 July 2013 3:41 PM
 To: Tatsuo Ishii
 Cc: Arulappan, Arul Shaji; PostgreSQL-Dev
 Subject: Re: [HACKERS] Proposal - Support for National Characters
 functionality
 
 On Fri, Jul 5, 2013 at 2:02 AM, Tatsuo Ishii is...@postgresql.org wrote:
  - Support for a NATIONAL_CHARACTER_SET GUC variable that will determine
  the encoding that will be used in NCHAR/NVARCHAR columns.
 
  You said NCHAR's encoding is UTF-8.  Why do you need the GUC if NCHAR's
  encoding is fixed to UTF-8?
 
 Not only that, but I don't think it can be a GUC.  Maybe a compile-time
 switch, but if it were a GUC, how do you handle an existing database in
 UTF-8 when the setting is switched to UTF-16?  Re-encode everything?
 Store the encoding along each value?  It's a mess.
 
 Either fix it at UTF-8, or make it a compile-time thing, I'd say.

Agreed that, to begin with, we only support UTF-8 encoding for NCHAR
columns.  If that is the case, do we still need a compile-time option to
turn NCHAR functionality on/off?

Rgds,
Arul Shaji






[HACKERS] Proposal - Support for National Characters functionality

2013-07-04 Thread Arulappan, Arul Shaji
This is a proposal to implement functionalities for the handling of
National Characters. 

[Introduction]

The aim of this proposal is to eventually have a way to represent
'National Characters' in a uniform way, even in non-UTF8 encoded
databases.  Many of our customers in the Asian region who are now, as
part of their platform modernization, moving away from mainframes have
used National Characters representation in COBOL and other databases.
Having stronger support for national characters representation will also
make it easier for these customers to look at PostgreSQL more favourably
when migrating from other well-known RDBMSs, which all have varying
degrees of NCHAR/NVARCHAR support.

[Specifications]

Broadly speaking, the national characters implementation ideally will
include the following:
- Support for NCHAR/NVARCHAR data types
- Representing NCHAR and NVARCHAR columns in UTF-8 encoding in non-UTF8
databases
- Support for UTF-16 column encoding and representing NCHAR and NVARCHAR
columns in UTF-16 encoding in all databases
- Support for a NATIONAL_CHARACTER_SET GUC variable that will determine
the encoding that will be used in NCHAR/NVARCHAR columns

The above points are at the moment a 'wishlist' only. Our aim is to
tackle them one-by-one as we progress. I will send a detailed proposal
later with more technical details.

The main aim at the moment is to get some feedback on the above, to know
if this feature is something that would benefit PostgreSQL in general,
and if users maintaining DBs in non-English-speaking regions will find
it beneficial.

Rgds,
Arul Shaji



P.S.: It has been quite some time since I sent a correspondence to this
list.  Our mail server adds a standard legal disclaimer to all outgoing
mails, which I know this list is not a huge fan of.  I used to have an
exemption for the mails I send to this list.  If the disclaimer appears,
apologies in advance.  I will rectify that on the next one.










