On Fri, Nov 04, 2005 at 01:54:04PM -0500, Tom Lane wrote:
> [EMAIL PROTECTED] writes:
> > I read "the backend is by and large an ASCII, null-terminated-string
> > engine" with "we use UTF-8 [for varlena strings?]" as, a lot of the
> > code assumes varlena strings are '\0' terminated, and an assumption
> > on my part, that the varlena strings are not stored in the backend
> > with a '\0' terminator, therefore, they require being copied out,
> > terminated with a '\0', before they can be used?
>
> There are places where we have to do that, the worst from a performance
> viewpoint being in string comparison --- we have to null-terminate both
> values before we can pass them to strcoll().
>
> One of the large bits that would have to be done before we could even
> contemplate using UCS2/UCS4 is getting rid of our dependence on strcoll,
> since its API is null-terminated-string.
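For anyone following along, the copy-and-terminate step described above looks roughly like the sketch below. This is only an illustration, not the actual backend code: the function name counted_strcoll and the use of plain malloc are my own assumptions, chosen to keep the example standalone.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not a PostgreSQL function): compare two
 * length-counted byte strings that are NOT null-terminated, using the
 * collation of the current LC_COLLATE locale.
 *
 * strcoll() only accepts null-terminated strings, so both inputs are
 * copied into temporary buffers and terminated before every comparison.
 */
int
counted_strcoll(const char *a, size_t alen, const char *b, size_t blen)
{
    char *a0 = malloc(alen + 1);
    char *b0 = malloc(blen + 1);
    int   result;

    if (a0 == NULL || b0 == NULL)
    {
        free(a0);
        free(b0);
        abort();            /* out of memory; a real caller would report it */
    }

    memcpy(a0, a, alen);
    a0[alen] = '\0';
    memcpy(b0, b, blen);
    b0[blen] = '\0';

    result = strcoll(a0, b0);

    free(a0);
    free(b0);
    return result;
}
```

Two allocations and two copies per comparison is the overhead being complained about; a sort calls this O(n log n) times.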
Yeah, and while one way of removing that dependence is to use ICU, that
library wants everything in UTF-16. So we replace "copying to add a NUL
to the string" with "converting UTF-8 to UTF-16 on each call". Ugh!

The argument for UTF-16 is that if you're using a language that doesn't
use ASCII at all, UTF-8 gets inefficient pretty quickly.

Locale-sensitive, efficient storage, fast comparisons: pick any two!

My guess is that in the long run there would be two basic string
datatypes. One would be a UTF-8, null-terminated string used in the
backend code as a standard C string, with strcmp as its default
collation. The other would be UTF-16, for user data that wants to be
able to collate in a locale-dependent way.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.