Charles Mills writes:
>You could use 16 bits for every character, with some sort of
>cleverness that yielded two 16-bit words when you had a code
>point bigger than 65535 (actually somewhat less due to how the
>cleverness works). That is called UTF-16. Pretty good but
>still not very efficient.
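(As an aside, for anyone who hasn't seen the "cleverness" spelled out: below is a minimal sketch of the UTF-16 surrogate-pair arithmetic, in Python purely for illustration; the function name and example code points are mine. The 0xD800-0xDFFF block is reserved for the surrogates themselves, which is why the single-unit range is "actually somewhat less" than the full 16 bits would suggest.)

def utf16_code_units(cp: int) -> list:
    """Return the UTF-16 code unit(s) for a Unicode scalar value."""
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp <= 0xFFFF:
        return [cp]                    # BMP: one 16-bit code unit
    v = cp - 0x10000                   # 20 bits, spread over two units
    high = 0xD800 + (v >> 10)          # high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)         # low (trail) surrogate
    return [high, low]

# U+1F600 needs two code units; U+65E5 (a common kanji) needs one.
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
print([hex(u) for u in utf16_code_units(0x65E5)])   # ['0x65e5']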
In Japan and China, to pick a couple of examples, UTF-16 is rather efficient (a quick byte-count sketch appears at the end of this note). There are also far worse inefficiencies than using 16 bits to store each Latin character.

In short, I wouldn't get *too* hung up on this point, especially as the complete lifecycle costs of storage continue to fall. For example, if you're designing applications and information systems for a global (or potentially global) audience, it could be a perfectly reasonable decision to standardize on UTF-16 in exchange for benefits such as reduced testing. I think this is exactly what SAP did around the time they introduced their ECC releases, for instance.

Somehow I'm reminded of the "save two characters" impulse that later caused so much angst in preparing for Y2K. :-) If there's a reasonable argument for spending 16 bits -- and sometimes there is -- by all means, spend them. This isn't 1974 or even 1994. The vast majority of the world's data are not codepoint-encoded alphanumerics anyway.

--------------------------------------------------------------------------------------------------------
Timothy Sipples
GMU VCT Architect Executive (Based in Singapore)
E-Mail: sipp...@sg.ibm.com
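P.S. The byte-count sketch mentioned above, assuming Python 3; the sample strings are my own examples chosen only to illustrate the comparison:

# Rough byte counts for the same text in UTF-8 vs. UTF-16 (no BOM).
samples = {
    "Latin":    "Hello, world",
    "Japanese": "こんにちは世界",   # each kana/kanji: 3 bytes in UTF-8, 2 in UTF-16
    "Chinese":  "你好，世界",
}
for label, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{label:9s} UTF-8: {u8:3d} bytes   UTF-16: {u16:3d} bytes")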