--- F J Franklin <[EMAIL PROTECTED]> wrote:
> > wrote:
> > o new UTF8String class (untested)
>
> > If this is part of the new unicodization to support full-unicode,
> > there's some stuff we need to discuss.
>
> Wasn't intended as such. phearbear says QNX wants to use UTF-8 whereas
> Abi uses UCS-2, and I decided to write the UTF8String class to
> facilitate the conversion. Strings are stored internally as UTF-8 byte
> sequences, and there is a home-made iterator for accessing the string
> sequence by sequence, and a fn. for converting the current sequence to
> UCS-4.
>
> Currently conversion to UTF-8 is only from UCS-2, but conversion from
> UCS-4 would be a trivial change. (I'm assuming that UCS-2 is the first
> 65536 codes of UCS-4 - is this correct?)
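For anyone following along, here is a minimal sketch (in Python, purely
for illustration - not the actual UTF8String code, whose names I'm only
guessing at from the description above) of the two operations described:
iterating a UTF-8 string one byte sequence at a time, and converting the
current sequence to a UCS-4 code point.

```python
def utf8_sequences(data: bytes):
    """Yield each UTF-8 byte sequence in 'data' as a bytes slice."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n = 1                  # 0xxxxxxx: single-byte (ASCII)
        elif lead >> 5 == 0b110:
            n = 2                  # 110xxxxx: two-byte sequence
        elif lead >> 4 == 0b1110:
            n = 3                  # 1110xxxx: three-byte sequence
        elif lead >> 3 == 0b11110:
            n = 4                  # 11110xxx: four-byte sequence
        else:
            raise ValueError("invalid UTF-8 lead byte: %#x" % lead)
        yield data[i:i + n]
        i += n

def sequence_to_ucs4(seq: bytes) -> int:
    """Convert one UTF-8 byte sequence to its UCS-4 code point."""
    if len(seq) == 1:
        return seq[0]
    # Mask off the length-marker bits of the lead byte, then fold in
    # the six payload bits of each continuation byte.
    code = seq[0] & (0x7F >> len(seq))
    for b in seq[1:]:
        code = (code << 6) | (b & 0x3F)
    return code
```

Note that because conversion goes through an integer code point, going
from UCS-4 rather than UCS-2 really is just a question of how wide the
input type is, as fjf says.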
Well, not exactly. There is plenty of hazy stuff in Unicode,
unfortunately, and this is the reason why I don't think it's a good idea
for us to rush into the new way of doing things. This is one of the hazy
areas and I'll attempt to explain it, but you're better off reading all
the documentation you can find at http://www.unicode.org and reading
through a few mailing list archives that deal with Unicode issues.

UCS-2 is a sixteen-bit encoding which supports the old 16-bit Unicode,
and as such is what you suggest. UCS-4 seems to be an exact synonym for
UTF-32, but you'd better check!

UTF-16 is an encoding which allows the 32-bit Unicode range to be
represented in a series of one or two 16-bit fields. When two fields are
needed, each is called a "surrogate".

UTF-32 is a 32-bit encoding where a 32-bit character code is encoded in
a single 32-bit field. Not all values are legal, however.

UTF-16 vs. UCS-2: Unicode was adopted early by Microsoft for Windows NT,
and by Java. Both chose to use UCS-2. This was back when everybody
thought 16 bits would be plenty. Unicode has since been updated to 32
bits. Windows XP and up seem to refer to their encoding simply as
"Unicode", but it behaves as either UCS-2 or UTF-16 depending on a
registry setting! (I'm not sure which behaviour Windows XP defaults to.)
I'm not sure whether Java now uses UTF-16 or not, and if so, I'm not
sure whether they still use the term UCS-2.

My rule of thumb: Any encoding starting with "UCS" is to be considered
deprecated. Use UCS encodings and UCS encoding names only when
specifically dealing with a UCS encoding - for instance, converting to
old Windows NT filenames or GUI strings. Do not *ever* say "UCS-*" when
you mean "UTF-*". People are already confused over this, and we as
developers of a multi-lingual word processor need to have this very
well understood.
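To make the UTF-16 vs. UCS-2 difference concrete, here is a small sketch
(Python, illustration only) of how UTF-16 splits a code point above
U+FFFF into a surrogate pair - the one thing UCS-2 simply cannot
represent:

```python
def to_utf16_units(code: int):
    """Return the one or two 16-bit units encoding 'code' in UTF-16."""
    if code < 0x10000:
        if 0xD800 <= code <= 0xDFFF:
            raise ValueError("lone surrogate is not a legal character")
        return [code]                  # one unit; identical to UCS-2 here
    code -= 0x10000                    # 20 payload bits remain
    high = 0xD800 | (code >> 10)       # high (lead) surrogate
    low = 0xDC00 | (code & 0x3FF)      # low (trail) surrogate
    return [high, low]
```

For the BMP (the first 65536 codes) the two encodings coincide, which is
exactly why so many people conflate them - and why the distinction
matters the moment a character outside the BMP turns up.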
(The same goes for saying ASCII when you mean ISO-8859-1 or even
ISO-8859-*.)

Please read up on this, since I'm not fully up to date because of my
months on the road and not currently owning a machine or having an
internet connection.

> As a string class it's not nearly as functional as the others, but
> it's not really intended as a replacement.

Well, pretty soon we're going to need a real replacement. Dom and I are
both in favour of the replacement being UTF-8, but some here seem to
want UTF-32.

> > We need to design the system so that a string is not built from a
> > series of UTF-8 (or UTF-32) characters directly, but a series of
> > "composed characters" which in turn are a series of UTF-8
> > characters, the first being the main character, the remainder being
> > zero-width modifiers. We need this to support proper
> > internationalization. We probably need much discussion first
> > actually.
>
> Not sure I understand this. Can you explain how to use zero-width
> modifiers?

They're also called "combining characters", such as the acute accent or
the umlaut (really a dieresis). Instead of representing "Á" as U+00C1,
it can be represented as U+0041 U+0301. Currently this half-works in
AbiWord if you have TrueType fonts (or on Windows) and if you turn off
the RemapGlyphs hack in your profile. If you think this is a dumb idea
then you haven't read enough about Unicode, so go read up (not you fjf,
but all of the Abi developers).

Not just Unicode uses such characters, by the way. The standard
Vietnamese encodings all use this feature. Vietnamese fonts which
include all combinations of letter+accent+tone mark are very rare, but
those with "combining characters" are quite common. As for southeast
Asian and Indic languages, I don't believe Unicode even bothers to
include all the myriad combinations of letter+vowel mark+funky language
feature. Combining characters are generally considered to be a good
thing, and the way forward.
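You can see both spellings of "Á" side by side with Python's standard
unicodedata module (again just an illustration of the concept, nothing
to do with AbiWord's code):

```python
import unicodedata

# "Á" as one precomposed code point vs. base letter + combining accent.
precomposed = "\u00C1"      # LATIN CAPITAL LETTER A WITH ACUTE
combining = "A\u0301"       # U+0041 followed by COMBINING ACUTE ACCENT

# The two spellings render identically but are different code point
# sequences...
assert precomposed != combining

# ...until Unicode normalization maps them to a canonical form:
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

This is exactly the "Unicode normalization" issue mentioned below: naive
byte-by-byte comparison treats the two forms as different strings.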
They will make searching, sorting, capitalization, and maybe more, much
simpler, even for Western languages. Once we understand these issues we
then have to look into "Unicode normalization"...

> Frank

Hope this helps, and I hope people other than just Frank read it. Let's
do Unicode properly and be the best word processor for Vietnamese and
Thai on any platform! (:

Andrew Dunbar.

=====
http://linguaphile.sourceforge.net
http://www.abisource.com
