Andrew Dunbar wrote:

> --- Tomas Frydrych <[EMAIL PROTECTED]> wrote:
>
>>> Andrew Dunbar <[EMAIL PROTECTED]> wrote:
>>>
>>> Well pretty soon we're going to need a real
>>> replacement. Dom and I are both in favour of the
>>> replacement being UTF-8 but some here seem to want
>>> UTF-32.
>>
>> UTF-8 is an encoding scheme that is intended to allow Unicode
>> communication between separate processes over 8-bit channels.
>> For that it is great, but that's about the only thing it is
>> really good for. UTF-8 processing is cumbersome, and as such it
>> is a completely unsuitable format to use for the piecetable. We
>> need a fixed-width encoding for that, such as the current UCS-2,
>> i.e., UTF-32.
>
> Please back up these comments. A lot of people, before they are
> familiar with Unicode and UTF-8, seem to think this. I did too.
> Then I read reams and reams of newsgroups and mailing lists and
> FAQs. Now I know why Qt, GTK, QNX, and others use UTF-8
> internally.
>
> People seem to think that because UTF-8 encodes characters as
> variable-length runs of bytes, this is somehow computationally
> expensive to handle. Not so. You can use existing 8-bit string
> functions on it. It is backwards compatible with ASCII. You can
> scan forwards and backwards effortlessly. You can always tell
> which character in a sequence a given byte belongs to.
>
> People think random access to these strings using the array
> operator will cost the earth. Guess what - very little code
> accesses strings as arrays - especially in a word processor. Of
> the code which does, very little of it needs to. Even when you do
> perform lots of array operations on a UTF-8 string, people have
> done extensive tests showing that the cost is negligible - look
> in the Unicode literature and you will find all this information.
>
> People think that UCS-2, UTF-16, or UTF-32 mean we can have
> perfect random access to strings because a character is always
> represented as a single word or longword. Not so.
> UCS-2 should, but this term is often (by Microsoft) used to
> refer to UTF-16. UTF-16 uses a mechanism called "surrogates"
> whereby a single character may need two words to represent it.
> There goes your free array access. Even UTF-32 is not safe from
> this, because Unicode requires "combining characters". This
> means that "á" may be represented as "a" followed by a
> non-spacing "´" acute accent. Some people think this is also
> silly. These people need to go read all about Unicode before
> they embark on seriously multilingual software. Vietnamese is
> possible to support without combining characters, but you won't
> be able to view the results because no Vietnamese fonts exist
> that work this way - they all expect to use combining
> characters. Thai needs them. Hindi needs them. All Indian/Indic
> languages need them.
>
> So to sum up, the two arguments not to use UTF-8 internally are:
>
> 1) Array access is too slow.
>
> - This is not true and it is seldom needed.
>
> 2) UTF-8 means you have to handle a series of values
>    for a single on-screen character.
>
> - *All* Unicode encodings need this anyway!
>
> But look around the internet for better arguments and
> better written arguments.
>
> Andrew Dunbar.
>
> =====
> http://linguaphile.sourceforge.net http://www.abisource.com

Hi
Excuse my laziness, but scanning through all of unicode.org isn't really how I'd like to spend my week ;) Are there any particular articles you'd recommend we read?

/Johan
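
As a rough illustration of the claims in the quoted message (this is not code from AbiWord or from anyone in the thread), the Python sketch below shows three things: finding character boundaries in a UTF-8 buffer from any byte offset, a UTF-16 character that needs a surrogate pair, and a combining sequence that takes two code points even in UTF-32. The helper names `is_continuation`, `char_start` and `next_char` are hypothetical, and the sketch assumes the buffer holds valid UTF-8.

```python
import unicodedata


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always have the form 0b10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(buf: bytes, i: int) -> int:
    """Step backwards from byte offset i to the first byte of its character."""
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


def next_char(buf: bytes, i: int) -> int:
    """Return the offset of the character after the one starting at offset i."""
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


# "a", e-acute and the euro sign take 1, 2 and 3 bytes in UTF-8.
buf = "a\u00e9\u20ac".encode("utf-8")

# Any byte can be mapped back to the character it belongs to.
euro_start = char_start(buf, 5)      # last byte of the euro sign
after_eacute = next_char(buf, 1)     # skip over the 2-byte e-acute

# Ordinary 8-bit search still works for ASCII content.
ascii_pos = buf.find(b"a")

# UTF-16: a character outside the BMP needs two 16-bit code units.
clef = "\U0001D11E"                  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2

# UTF-32: a decomposed a-acute is still two code points ("a" + U+0301).
decomposed = unicodedata.normalize("NFD", "\u00e1")
utf32_units = len(decomposed.encode("utf-32-le")) // 4

print(euro_start, after_eacute, ascii_pos, utf16_units, utf32_units)
```

The boundary helpers only ever inspect the top two bits of each byte, which is why scanning in either direction is cheap. The last two values illustrate that one on-screen character can span several storage units in UTF-16 and UTF-32 as well.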
