--- F J Franklin <[EMAIL PROTECTED]> wrote:
> > wrote:
> > o new UTF8String class (untested)
>
> > If this is part of the new unicodization to support full-unicode,
> > there's some stuff we need to discuss.
>
> Wasn't intended as such. phearbear says QNX wants to use UTF-8 whereas
> Abi uses UCS-2, and I decided to write the UTF8String class to
> facilitate the conversion. Strings are stored internally as UTF-8 byte
> sequences, and there is a home-made iterator for accessing the string
> sequence by sequence, and a fn. for converting the current sequence to
> UCS-4.
>
> Currently conversion to UTF-8 is only from UCS-2, but conversion from
> UCS-4 would be a trivial change. (I'm assuming that UCS-2 is the first
> 65536 codes of UCS-4 - is this correct?)
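For anyone following along, here is a minimal sketch (in Python, purely
for illustration - not the actual UTF8String code, whose names I'm only
guessing at from the description above) of the two operations described:
iterating a UTF-8 string one byte sequence at a time, and converting the
current sequence to a UCS-4 code point.

```python
def utf8_sequences(data: bytes):
    """Yield each UTF-8 byte sequence in 'data' as a bytes slice."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n = 1                  # 0xxxxxxx: single-byte (ASCII)
        elif lead >> 5 == 0b110:
            n = 2                  # 110xxxxx: two-byte sequence
        elif lead >> 4 == 0b1110:
            n = 3                  # 1110xxxx: three-byte sequence
        elif lead >> 3 == 0b11110:
            n = 4                  # 11110xxx: four-byte sequence
        else:
            raise ValueError("invalid UTF-8 lead byte: %#x" % lead)
        yield data[i:i + n]
        i += n

def sequence_to_ucs4(seq: bytes) -> int:
    """Convert one UTF-8 byte sequence to its UCS-4 code point."""
    if len(seq) == 1:
        return seq[0]
    # Mask off the length-marker bits of the lead byte, then fold in
    # the six payload bits of each continuation byte.
    code = seq[0] & (0x7F >> len(seq))
    for b in seq[1:]:
        code = (code << 6) | (b & 0x3F)
    return code
```

Note that because conversion goes through an integer code point, going
from UCS-4 rather than UCS-2 really is just a question of how wide the
input type is, as fjf says.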
Well, not exactly. There is plenty of hazy stuff in Unicode,
unfortunately, and this is the reason why I don't think it's a good idea
for us to rush into the new way of doing things. This is one of the hazy
areas and I'll attempt to explain it, but you're better off reading all
the documentation you can find at http://www.unicode.org and reading
through a few mailing list archives that deal with Unicode issues.

UCS-2 is a sixteen-bit encoding which supports the old 16-bit Unicode,
and as such is what you suggest. UCS-4 seems to be an exact synonym for
UTF-32, but you'd better check!

UTF-16 is an encoding which allows the 32-bit Unicode range to be
represented in a series of one or two 16-bit fields. When two fields are
needed, each is called a "surrogate".

UTF-32 is a 32-bit encoding where a 32-bit character code is encoded in
a single 32-bit field. Not all values are legal, however.

UTF-16 vs. UCS-2: Unicode was adopted early by Microsoft for Windows NT,
and by Java. Both chose to use UCS-2. This was back when everybody
thought 16 bits would be plenty. Unicode has since been updated to 32
bits. Windows XP and up seem to refer to their encoding simply as
"Unicode", but it behaves as either UCS-2 or UTF-16 depending on a
registry setting! (I'm not sure which behaviour Windows XP defaults to.)
I'm not sure whether Java now uses UTF-16 or not, and if so, I'm not
sure whether they still use the term UCS-2.

My rule of thumb: Any encoding starting with "UCS" is to be considered
deprecated. Use UCS encodings and UCS encoding names only when
specifically dealing with a UCS encoding - for instance, converting to
old Windows NT filenames or GUI strings. Do not *ever* say "UCS-*" when
you mean "UTF-*". People are already confused over this, and we as
developers of a multi-lingual word processor need to have this very
well understood.
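To make the UTF-16 vs. UCS-2 difference concrete, here is a small sketch
(Python, illustration only) of how UTF-16 splits a code point above
U+FFFF into a surrogate pair - the one thing UCS-2 simply cannot
represent:

```python
def to_utf16_units(code: int):
    """Return the one or two 16-bit units encoding 'code' in UTF-16."""
    if code < 0x10000:
        if 0xD800 <= code <= 0xDFFF:
            raise ValueError("lone surrogate is not a legal character")
        return [code]                  # one unit; identical to UCS-2 here
    code -= 0x10000                    # 20 payload bits remain
    high = 0xD800 | (code >> 10)       # high (lead) surrogate
    low = 0xDC00 | (code & 0x3FF)      # low (trail) surrogate
    return [high, low]
```

For the BMP (the first 65536 codes) the two encodings coincide, which is
exactly why so many people conflate them - and why the distinction
matters the moment a character outside the BMP turns up.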
(The same goes for saying ASCII when you mean ISO-8859-1 or even
ISO-8859-*.)

Please read up on this, since I'm not fully up to date because of my
months on the road and not currently owning a machine or having an
internet connection.

> As a string class it's not nearly as functional as the others, but
> it's not really intended as a replacement.

Well, pretty soon we're going to need a real replacement. Dom and I are
both in favour of the replacement being UTF-8, but some here seem to
want UTF-32.

> > We need to design the system so that a string is not built from a
> > series of UTF-8 (or UTF-32) characters directly, but a series of
> > "composed characters" which in turn are a series of UTF-8
> > characters, the first being the main character, the remainder being
> > zero-width modifiers. We need this to support proper
> > internationalization. We probably need much discussion first
> > actually.
>
> Not sure I understand this. Can you explain how to use zero-width
> modifiers?

They're also called "combining characters", such as the acute accent or
the umlaut (really a dieresis). Instead of representing "Á" as U+00C1,
it can be represented as U+0041 U+0301. Currently this half-works in
AbiWord if you have TrueType fonts (or on Windows) and if you turn off
the RemapGlyphs hack in your profile. If you think this is a dumb idea
then you haven't read enough about Unicode, so go read up (not you fjf,
but all of the Abi developers).

Not just Unicode uses such characters, by the way. The standard
Vietnamese encodings all use this feature. Vietnamese fonts which
include all combinations of letter+accent+tone mark are very rare, but
those with "combining characters" are quite common. As for southeast
Asian and Indic languages, I don't believe Unicode even bothers to
include all the myriad combinations of letter+vowel mark+funky language
feature. Combining characters are generally considered to be a good
thing, and the way forward.
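You can see both spellings of "Á" side by side with Python's standard
unicodedata module (again just an illustration of the concept, nothing
to do with AbiWord's code):

```python
import unicodedata

# "Á" as one precomposed code point vs. base letter + combining accent.
precomposed = "\u00C1"      # LATIN CAPITAL LETTER A WITH ACUTE
combining = "A\u0301"       # U+0041 followed by COMBINING ACUTE ACCENT

# The two spellings render identically but are different code point
# sequences...
assert precomposed != combining

# ...until Unicode normalization maps them to a canonical form:
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

This is exactly the "Unicode normalization" issue mentioned below: naive
byte-by-byte comparison treats the two forms as different strings.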
They will make searching, sorting, capitalization, and maybe more, much
simpler, even for Western languages. Once we understand these issues we
then have to look into "Unicode normalization"...

> Frank

Hope this helps, and I hope people other than just Frank read it. Let's
do Unicode properly and be the best word processor for Vietnamese and
Thai on any platform! (:

Andrew Dunbar.

=====
http://linguaphile.sourceforge.net
http://www.abisource.com
