--- Tomas Frydrych <[EMAIL PROTECTED]> wrote:
>
> > Andrew Dunbar <[EMAIL PROTECTED]> wrote:
> > Well pretty soon we're going to need a real replacement. Dom and I
> > are both in favour of the replacement being UTF-8 but some here
> > seem to want UTF-32.
>
> UTF-8 is an encoding scheme that is intended to allow Unicode
> communication between separate processes over 8-bit channels. For
> that it is great, but that's about the only thing it is really good
> for. UTF-8 processing is cumbersome, and as such it is a completely
> unsuitable format to use for the piecetable. We need a fixed-width
> encoding for that, such as the current UCS-2, i.e., UTF-32.
Please back up these comments. A lot of people, before they are familiar with Unicode and UTF-8, seem to think this. I did too. Then I read reams and reams of newsgroups and mailing lists and FAQs. Now I know why Qt, GTK, QNX, and others use UTF-8 internally.

People seem to think that because UTF-8 encodes characters as variable-length runs of bytes, it is somehow computationally expensive to handle. Not so. You can use existing 8-bit string functions on it. It is backwards compatible with ASCII. You can scan forwards and backwards effortlessly, and you can always tell which character in a sequence a given byte belongs to (there is a short sketch of this in the P.S. below).

People think random access to these strings using the array operator will cost the earth. Guess what - very little code accesses strings as arrays, especially in a word processor. Of the code which does, very little of that needs to. Even when you do perform lots of array operations on a UTF-8 string, people have done extensive tests showing that the cost is negligible - look in the Unicode literature and you will find all this information.

People think that UCS-2, UTF-16, or UTF-32 means we can have perfect random access to strings because a character is always represented as a single word or longword. Not so. UCS-2 would, but that term is often used (by Microsoft) to refer to UTF-16, and UTF-16 uses a mechanism called "surrogates" whereby a single character may need two words to represent it. There goes your free array access.

Even UTF-32 is not safe from this, because Unicode requires "combining characters". This means that "á" may be represented as "a" followed by a non-spacing acute accent (see the second sketch below). Some people think this is also silly. These people need to go read all about Unicode before they embark on seriously multilingual software. It is possible to support Vietnamese without combining characters, but you won't be able to view the results, because no Vietnamese fonts exist that work that way - they all expect to use combining characters. Thai needs them. Hindi needs them. All Indian/Indic languages need them.

So to sum up, the two arguments not to use UTF-8 internally are:

1) Array access is too slow.
   - Not true, and it is seldom needed anyway.

2) UTF-8 means you have to handle a series of values for a single
   on-screen character.
   - *All* Unicode encodings need this anyway!

But look around the internet for better arguments, and better-written arguments.

Andrew Dunbar.

=====
http://linguaphile.sourceforge.net
http://www.abisource.com
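
P.S. To back up the claim that you can scan forwards and backwards and always find character boundaries, here is a minimal sketch in C (the function names are mine, purely illustrative). It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so any single byte tells you whether or not it starts a character.

    /* A byte of the form 10xxxxxx is never the start of a character. */
    static int is_continuation(unsigned char b)
    {
        return (b & 0xC0) == 0x80;
    }

    /* Step forward to the start of the next character in a
       NUL-terminated UTF-8 string (p must point at a character start). */
    const char *utf8_next(const char *p)
    {
        p++;
        while (is_continuation((unsigned char)*p))
            p++;
        return p;
    }

    /* Step backward to the start of the previous character
       (p must not already be at the start of the string). */
    const char *utf8_prev(const char *p)
    {
        p--;
        while (is_continuation((unsigned char)*p))
            p--;
        return p;
    }

That is the whole cost of "variable width": two tiny loops, and in the common all-ASCII case neither loop ever iterates.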
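
P.P.S. And to illustrate why fixed-width code points still don't give you "one array slot == one on-screen character", here is the "á" example spelled out (again just a sketch, using C99 uint32_t for UTF-32 code units):

    #include <stdint.h>

    /* One on-screen character, two UTF-32 code points:
       U+0061 'a' followed by U+0301 COMBINING ACUTE ACCENT. */
    static const uint32_t a_acute[] = { 0x0061, 0x0301 };

    /* sizeof(a_acute) / sizeof(a_acute[0]) == 2, yet it renders
       as a single accented letter. */

So even with UTF-32 the editing code has to cope with multi-unit characters; UTF-8 does not add a new class of problem.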
