Unicode has evolved from version 1.0 with 7,161 characters released in 1991
to version 9.0 with 128,172 characters released in June 2016. My questions
are
- which version of Unicode is supported by PostgreSQL 9.6.1?
- what does "supported" exactly mean? simply store it? comparison? sorting?
substring? etc.

Below is a test I did which reveals some unexpected behaviors.

My test database 'gsdb' is using UTF8 encoding, confirmed by

select datname, datcollate, datctype, pg_encoding_to_char(encoding)
from pg_database
where datname = 'gsdb';

which returned UTF8.

Here is a simple test table:

create table unicode (id int, string varchar(100));

Then I insert some unicode characters by referencing their code points in
hexadecimal:

/* characters from BMP, 0000 - FFFF */
insert into unicode(id, string) values(1, U&'\0041');  -- 'A'
insert into unicode(id, string) values(2, U&'\00C4');   -- 'A' with umlaut,
German
insert into unicode(id, string) values(3, U&'\03B1');  -- Greek letter alpha
insert into unicode(id, string) values(4, U&'\6211');  -- a Chinese
character, https://unicodelookup.com/#0x6211/1
insert into unicode(id, string) values(5, U&'\6211\4EEC'); -- a string of
two Chinese characters
insert into unicode(id, string) values(6, U&'\30CF');  -- a Japanese
character
insert into unicode(id, string) values(7, U&'\306F');  -- a Japanese
character
insert into unicode(id, string) values(8, U&'\2B41');  --
https://unicodelookup.com/#0x2b41/1
insert into unicode(id, string) values(9, U&'\2B44');  --
https://unicodelookup.com/#0x2b44/1
insert into unicode(id, string) values(10, U&'\2B50');  --
https://unicodelookup.com/#0x2b50/1

/* Below are unicode characters with code points beyond FFFF, aka planes 1
- F */
insert into unicode(id, string) values(100, U&'\1F478'); -- a mojo
character, https://unicodelookup.com/#0x1f478/1
insert into unicode(id, string) values(101, U&'\1F479');  -- another mojo
insert into unicode(id, string) values(102, U&'\1D11F');  -- musical symbol
g clef ottava alta
insert into unicode(id, string) values(103, U&'\26000');   -- a very
infrequently used Chinese character
insert into unicode(id, string) values(104, U&'\26001');   -- another very
infrequently used Chinese character
insert into unicode(id, string) values(105, U&'\26000\26001');  -- a string
with 2 Chinese characters in the plane 2

The SELECT below shows what PostgreSQL has recorded:

select id, string,
  char_length(string),
  octet_length(string),
  ascii(string),
  substring(string, 1, 1) as firstChar,
  ascii(substring(string, 1, 1)) as unicodeInt
from unicode
order by string;

Here are the results:

[image: Inline image 1]

Observations

   - BMP characters (id <= 10)
      -  they are stored and fetched correctly.
      - their lengths in char are correct, although some of them take 3
      bytes (id = 4, 6, 7)
      - *But their sorting order seems to be undefined. Can anyone comment
      the sorting rules?*
   - Non-BMP characters (id >= 100)
      - they take 2 - 4 bytes.
      - Their lengths in character are not correct
      - they are not retrieved correctly, judged by the their fetched ascii
      value (column 5 in the table above)
      - substring is not correct


Specifically, the lack of support for emojo characters 0x1F478, 0x1F479 is
causing a problem in my application.

My conclusion:
- PostgreSQL 9.6.1 only supports a subset of unicode characters in BMP.  Is
there any documents defining which subset is fully supported?

Are any configuration I can change so that more unicode characters are
supported?

Thanks

James

Reply via email to