Re: [HACKERS] Reducing the overhead of NUMERIC data

Gregory Maxwell Fri, 04 Nov 2005 12:03:00 -0800

On 11/4/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Yeah, and while one way of removing that dependence is to use ICU, that
> library wants everything in UTF-16. So we replace "copying to add NULL
> to string" with "converting UTF-8 to UTF-16 on each call". Ugh! The
> argument for UTF-16 is that if you're using a language that doesn't use
> ASCII at all, UTF-8 gets inefficient pretty quickly.


Is this really the case? Only Unicode code points U+0800 through U+FFFF
are smaller in UTF-16 than in UTF-8, and in their case it's three bytes
vs two. Cyrillic, Arabic, Greek, accented Latin, etc. are all two bytes
in both.

So yes, in some cases UTF-8 will use three bytes where UTF-16 would use
two, but that's less inefficient than UTF-16 is for ASCII, which many
people find acceptable.
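
To make the byte counts concrete, here is a small sketch in Python
(chosen only for brevity, it has nothing to do with the backend). The
sample characters and the helper name encoded_sizes are my own picks
for illustration; UTF-16 is measured without a BOM.

```python
# Byte counts per character in UTF-8 versus UTF-16 (little-endian, no BOM).
# The sample characters below are illustrative choices, not from the thread.
samples = {
    "ASCII a (U+0061)": "a",
    "Latin e-acute (U+00E9)": "\u00e9",
    "Greek lambda (U+03BB)": "\u03bb",
    "Cyrillic zhe (U+0436)": "\u0436",
    "Arabic beh (U+0628)": "\u0628",
    "CJK zhong (U+4E2D)": "\u4e2d",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

Only the last sample, which falls in U+0800 through U+FFFF, comes out
smaller in UTF-16; the ASCII sample is the one case where UTF-16 doubles
the size.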

> Locale sensitive, efficient storage, fast comparisons, pick any two!

I don't know that the choices are that limited. As I indicated earlier
in the thread, I think it's useful to think of all of these encodings
as just different compression algorithms. If our desire was to have
all three, the backend could be made null-safe and we could use the
locale-sensitive and fast representation (probably UTF-16 or UTF-32)
in memory, and store on disk whatever is most efficient for storage
(lz-compressed UTF-whatever for fat fields, UTF-8 for mostly-ASCII
small fields, SCSU for non-ASCII short fields
(http://www.unicode.org/reports/tr6/), etc.).
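
As a rough illustration of the "encodings as compression" idea, the
sketch below picks whichever on-disk form is smallest for a given
value. It is only a toy under several assumptions: the function names
are hypothetical, zlib stands in for the lz compression, and plain
UTF-16 stands in for SCSU because Python's standard library has no SCSU
codec. A real implementation would also have to record which form was
chosen alongside the stored bytes.

```python
import zlib


def candidate_encodings(text):
    """Return the candidate on-disk forms for text, keyed by a label."""
    raw8 = text.encode("utf-8")
    raw16 = text.encode("utf-16-le")  # stand-in for SCSU (assumption)
    return {
        "utf-8": raw8,
        "utf-16": raw16,
        "zlib(utf-8)": zlib.compress(raw8),  # stand-in for lz compression
    }


def pick_storage_form(text):
    """Return (label, size_in_bytes) of the smallest candidate form."""
    candidates = candidate_encodings(text)
    label = min(candidates, key=lambda k: len(candidates[k]))
    return label, len(candidates[label])


for value in ["hello", "\u4e2d\u6587\u5b57\u7b26", "abc " * 1000]:
    label, size = pick_storage_form(value)
    print(f"{len(value)} chars -> {label} ({size} bytes)")
```

The in-memory form (here just a Python str) stays the same no matter
which form is chosen for storage, which is the point of the argument.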

> My guess is that in the long run there would be two basic string
> datatypes, one UTF-8, null terminated string used in the backend code
> as a standard C string, default collation strcmp. The other UTF-16 for
> user data that wants to be able to collate in a locale dependent way.

So if we need locale-dependent collation we suffer 2x inflation for
many texts, and multibyte complexity is still required if we are to
collate correctly when there are characters outside of the BMP. Yuck.
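
A quick sketch of both halves of that complaint, again in Python purely
for illustration (the helper name utf16_code_units and the sample
characters are mine): ASCII text doubles in size, and a character
outside the BMP still takes two 16-bit code units, so UTF-16 is not a
fixed-width encoding either.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


ascii_text = "hello"
bmp_char = "\u4e2d"          # inside the BMP
astral_char = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

# 2x inflation for ASCII text
print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-16-le")))

# One code unit inside the BMP, a surrogate pair outside it
print(utf16_code_units(bmp_char), utf16_code_units(astral_char))
```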

Disk storage type, memory storage type, user API type, and collation
should be decoupled.
