>> 1) Addition of new string data types NATIONAL CHARACTER and NATIONAL
>> CHARACTER VARYING.
>> These types differ from the char/varchar data types in one important
>> respect: NATIONAL string types always have UTF8 encoding,
>> independent of the database encoding in use.
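To illustrate the encoding-independence claim at the byte level, here is a minimal Python sketch (it only models the byte-level behavior described above, not the actual patch; KOI8-R stands in for an arbitrary non-UTF8 database encoding):

```python
# Sketch: a NATIONAL column always stores UTF-8 bytes, even when the
# database encoding is something else (here KOI8-R, as an example).
text = "Привет"  # six Cyrillic characters

utf8_bytes = text.encode("utf-8")   # what a NATIONAL column would hold
koi8_bytes = text.encode("koi8_r")  # what a text column holds in a KOI8-R database

print(len(utf8_bytes))  # 12 -- two bytes per Cyrillic character
print(len(koi8_bytes))  # 6  -- one byte per character

# Round-tripping through UTF-8 is lossless regardless of database encoding:
assert utf8_bytes.decode("utf-8") == text
```

The size difference (two UTF-8 bytes per Cyrillic character versus one) is also what motivates the storage-efficiency discussion further down in the thread.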
> I don't like the approach of adding a new data type for this. The
> encoding used for a text field should be an implementation detail, not
> something that's exposed to users at the schema level. A separate data
> type makes an nvarchar field behave slightly differently from text,
> for example when it's passed to and from functions. It will also
> require drivers and client applications to know about it.

Hi, my task is to implement the ANSI NATIONAL character string types as
part of PostgreSQL core. The fact that they "require drivers and client
applications to know about it" is exactly why this cannot be done as an
add-on: in my experience, most drivers need these new types to have
fixed OIDs. Implementing them as a UTF8 data type is a first step that
allows NATIONAL characters with an encoding different from the database
encoding (and might even lead to supporting multiple encodings for the
common string types in the future).

>> 1) Full set of string functions and operators for the NATIONAL types
>> (we could not use the generic text functions because they assume
>> that the strings will have the database encoding).
>> Currently only a basic set is implemented.
>> 2) Need to implement some way to define a default collation for the
>> NATIONAL types.
>> 3) Need to implement some way to input UTF8 characters into NATIONAL
>> types via SQL (there is a serious open problem here... it is
>> described later in the text).

> Yeah, all of these issues stem from the fact that the NATIONAL types
> are separate from text.
>
> I think we should take a completely different approach to this. Two
> alternatives spring to mind:
>
> 1. Implement a new encoding. The new encoding would be some variant of
> UTF-8 that encodes languages like Russian more efficiently. Then just
> use that in the whole database. Something like SCSU
> (http://www.unicode.org/reports/tr6/) should do the trick, although
> I'm not sure if SCSU can be used as a server encoding.
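Whether an encoding qualifies as a PostgreSQL server encoding hinges on the byte patterns of its multi-byte characters, which can be seen with standard Python codecs (nothing PostgreSQL-specific here; the characters are arbitrary examples):

```python
# UTF-8: every byte of a multi-byte character has the high bit set,
# so ASCII bytes like '\' or ';' can never appear inside a character.
for ch in "жソ語":
    assert all(b & 0x80 for b in ch.encode("utf-8"))

# Shift-JIS: the trail byte of a two-byte character may fall in the
# ASCII range -- katakana 'ソ' encodes as 0x83 0x5C, and 0x5C is '\'.
sjis = "ソ".encode("shift_jis")
print(sjis)             # b'\x83\\'
assert sjis[1] == 0x5C  # naive byte scanners would see a backslash here
```

A byte-at-a-time scanner (for quoting, pattern matching, etc.) can misparse such trail bytes, which is why encodings like Shift-JIS are restricted to client-side use.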
> A lot of code relies on the fact that a server encoding must have the
> high bit set in all bytes that are part of a multi-byte character.
> That's why SJIS, for example, can only be used as a client encoding.
> But surely you could come up with some subset or variant of SCSU
> which satisfies that requirement.
>
> 2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar
> SET STORAGE MAIN". That will make Postgres compress that field. That
> might not be very efficient for compressing short Cyrillic text
> encoded in UTF-8 today, but that could be improved. There has been
> discussion of supporting more compression algorithms in the past, and
> one such algorithm could again be something like SCSU.

Both of these approaches require a dump/restore of the whole database,
which is not always an option. Implementing a UTF8 NATIONAL character
type as a new data type would make it an option to run pg_upgrade to
the latest version and get the required functionality without prolonged
downtime.

PS: is it possible to reserve some narrow type OID range in PostgreSQL
core for future use?

Kind Regards,
Maksym

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers