Re: commit: abi: UTF8String class

phearbear Sat, 20 Apr 2002 20:57:30 -0700

Andrew Dunbar wrote:

> --- Tomas Frydrych <[EMAIL PROTECTED]> wrote:
>
>>>Andrew Dunbar <[EMAIL PROTECTED]> wrote:
>>>
>>>Well pretty soon we're going to need a real
>>>replacement.  Dom and I are both in favour of the
>>>replacement being UTF-8 but some here seem to want
>>>UTF-32.
>>>
>>UTF-8 is an encoding scheme that is intended to
>>allow Unicode communication between separate
>>processes over 8-bit channels. For that it is great,
>>but that's about the only thing it is really good
>>for. UTF-8 processing is cumbersome, and as such it
>>is a completely unsuitable format to use for the
>>piecetable. We need a fixed-width encoding for that,
>>such as the current UCS-2, i.e., UTF-32.
>>
>
>Please back up these comments.  A lot of people,
>before they are familiar with Unicode and UTF-8, seem
>to think this.  I did too.  Then I read reams and
>reams of newsgroups and mailing lists and FAQs.  Now I
>know why Qt, GTK, QNX, and others use UTF-8
>internally.
>People seem to think that because UTF-8 encodes
>characters as variable-length runs of bytes, this is
>somehow computationally expensive to handle.  Not so.
>You can use existing 8-bit string functions on it.  It
>is backwards compatible with ASCII.  You can scan
>forwards and backwards effortlessly.  You can always
>tell which character in a sequence a given byte
>belongs to.
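
A minimal sketch of the scanning claim above, assuming the input is
already valid UTF-8. The function names are made up for illustration
and are not taken from the UTF8String class under discussion. It relies
on one property of the encoding: continuation bytes always look like
10xxxxxx, so a character boundary can be found from any byte offset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers; they assume s holds valid UTF-8.

// True for continuation bytes, which always have the bit pattern 10xxxxxx.
inline bool utf8_is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Byte index of the first byte of the character that contains byte i.
inline std::size_t utf8_char_start(const std::string& s, std::size_t i) {
    assert(i < s.size());
    while (i > 0 && utf8_is_continuation(static_cast<unsigned char>(s[i]))) {
        --i;
    }
    return i;
}

// Byte index where the next character starts (s.size() at the end).
inline std::size_t utf8_next(const std::string& s, std::size_t i) {
    assert(i < s.size());
    do {
        ++i;
    } while (i < s.size() &&
             utf8_is_continuation(static_cast<unsigned char>(s[i])));
    return i;
}

// Byte index where the previous character starts; i must be > 0.
inline std::size_t utf8_prev(const std::string& s, std::size_t i) {
    assert(i > 0 && i <= s.size());
    do {
        --i;
    } while (i > 0 && utf8_is_continuation(static_cast<unsigned char>(s[i])));
    return i;
}

// Number of code points: every byte that is not a continuation byte
// starts a new character.
inline std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!utf8_is_continuation(b)) {
            ++n;
        }
    }
    return n;
}
```

Each step backwards or forwards touches at most four bytes, which is
why iterating is cheap even though indexing by character number is not
constant time.
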
>People think random access to these strings using the
>array operator will cost the earth.  Guess what - very
>little code accesses strings as arrays - especially in
>a word processor.  Of the code which does, very little
>of that needs to.  Even when you do perform lots of
>array operations on a UTF-8 string, people have done
>extensive tests showing that the cost is negligible -
>look in the Unicode literature and you will find all
>this information.
>People think that UCS-2, UTF-16, or UTF-32 mean we can
>have perfect random access to strings because a
>character is always represented as a single word or
>longword.  Not so.  UCS-2 should, but this term is
>often (by Microsoft) used to refer to UTF-16.  UTF-16
>uses a mechanism called "surrogates" whereby a single
>character may need two words to represent it.  There
>goes your free array access.  Even UTF-32 is not safe
>from this, because Unicode requires "combining
>characters".  This means that "á" may be represented
>as "a" followed by a non-spacing "´" acute accent.
>Some people think this is also silly.  These people
>need to go read all about Unicode before they embark
>on seriously multilingual software.  Vietnamese is
>possible to support without combining characters but
>you won't be able to view the results because no
>Vietnamese fonts exist that work this way - they all
>expect to use combining characters.  Thai needs them.
>Hindi needs them.  All Indian/Indic languages need
>them.
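
A rough sketch of the two points above, with hypothetical function
names. The first function shows that a UTF-16 array index is not a
character index once surrogate pairs appear. The second shows the same
for UTF-32 and combining marks, but it only recognises the Combining
Diacritical Marks block (U+0300 to U+036F), which is a deliberate
simplification. Real segmentation needs the Unicode character data and
would not be this short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only.

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points in UTF-16 text; a well-formed surrogate pair is
// two 16-bit words but only one code point.
inline std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}

// Simplified assumption: treats only U+0300..U+036F as combining marks.
inline bool is_combining_diacritic(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Approximate count of on-screen characters in UTF-32 text: a base
// letter plus its combining marks counts once.
inline std::size_t approx_visible_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_diacritic(c)) {
            ++n;
        }
    }
    return n;
}
```

For example, "a" followed by U+0301 is two UTF-32 values but one
on-screen character, so s[1] is an accent on its own, not the second
character of the text.
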
>
>So to sum up, the two arguments not to use UTF-8
>internally are:
>
>1) Array access is too slow.
>
>- This is not true and it is seldom needed.
>
>2) UTF-8 means you have to handle a series of values
>   for a single on-screen character.
>
>- *All* Unicode encodings need this anyway!
>
>But look around the internet for better arguments and
>better written arguments.
>
>Andrew Dunbar.
>
>=====
>http://linguaphile.sourceforge.net http://www.abisource.com
>
>
Hi


Excuse my laziness, but scanning through all of unicode.org isn't really
what I'd like to spend my week on ;) Are there any particular articles you
recommend we read?

/Johan
