On 11/4/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Yeah, and while one way of removing that dependance is to use ICU, that
> library wants everything in UTF-16. So we replace "copying to add NULL
> to string" with "converting UTF-8 to UTF-16 on each call. Ugh! The
> argument for UTF-16 is that if you're using a language that doesn't use
> ASCII at all, UTF-8 gets inefficient pretty quickly.

Is this really the case? Only Unicode values U+0800 - U+FFFF are smaller in UTF-16 than in UTF-8, and in their case it's three bytes vs. two. Cyrillic, Arabic, Greek, non-ASCII Latin, etc. are all two bytes in both. So yes, in some cases UTF-8 will use three bytes where UTF-16 would use two, but that's less inefficient than UTF-16 is for ASCII (two bytes where UTF-8 needs one), which many people find acceptable.

> Locale sensetive, efficient storage, fast comparisons, pick any two!

I don't know that the choices are that limited. As I indicated earlier in the thread, I think it's useful to think of all of these encodings as just different compression algorithms. If our desire was to have all three, the backend could be made null safe and we could use the locale-sensitive and fast representation (probably UTF-16 or UTF-32) in memory, and store on disk whatever is most efficient for storage (LZ-compressed UTF-whatever for fat fields, UTF-8 for mostly-ASCII small fields, SCSU (http://www.unicode.org/reports/tr6/) for non-ASCII short fields, etc.).

> My guess is that in the long run there would be two basic string
> datatypes, one UTF-8, null terminated string used in the backend code
> as a standard C string, default collation strcmp. The other UTF-16 for
> user data that wants to be able to collate in a locale dependant way.

So if we need locale-dependent collation we suffer 2x inflation for many texts, and multibyte complexity is still required if we are to collate correctly when there are characters outside of the BMP. Yuck.

Disk storage type, memory storage type, user API type, and collation should be decoupled.
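
For anyone who wants to check the byte counts discussed above, here is a minimal Python sketch. The helper name encoded_sizes and the sample characters are my own picks for illustration, not anything from the backend or from ICU.

```python
# Encoded size, in bytes, of a single character in UTF-8 and in UTF-16.
def encoded_sizes(ch):
    # "utf-16-le" is used because plain "utf-16" prepends a 2-byte
    # byte-order mark, which would skew the per-character count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# Sample characters chosen for illustration only.
samples = [
    ("ASCII      U+0061", "\u0061"),
    ("Latin      U+00E9", "\u00e9"),
    ("Greek      U+03B1", "\u03b1"),
    ("Cyrillic   U+044F", "\u044f"),
    ("Arabic     U+0628", "\u0628"),
    ("Devanagari U+0915", "\u0915"),
    ("CJK        U+4E2D", "\u4e2d"),
    ("Gothic     U+10330 (outside the BMP)", "\U00010330"),
]

for label, ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{label:40} UTF-8: {utf8_len}  UTF-16: {utf16_len}")
```

ASCII comes out at one byte vs. two, the Latin through Arabic samples at two bytes in both, and the Devanagari and CJK samples at three bytes vs. two. The non-BMP sample needs four bytes in either encoding, since UTF-16 has to use a surrogate pair there, which is why UTF-16 does not get you out of variable-width handling.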