Re: MSDN Article, Second Draft
Jungshik Shin wrote:

>> Except in some UNIX operating systems and specialized applications
>> with specific needs,
>
> Note that ISO C 9x specifies that wchar_t be UTF-32/UCS-4 when
> __STDC_ISO_10646__ is defined.

This is of course very pedantic (I do not believe there are existing implementations that do it), but to be exact, UCS-2 and a 16-bit encoding may be used for wchar_t while __STDC_ISO_10646__ is #defined. The macro is merely required to have a value below 200112L (the date of the first version of the part of ISO/IEC 10646 -- the -2 part -- that defines characters beyond the BMP, the equivalent of TUS 3.0.1).

Antoine
Re: MSDN Article, Second Draft
Sinnathurai Srivas wrote:

> Could you include the following.
>
> 1/ Why even after about 20 years of existence, is Unicode not
> supported by any significant software and applications?
>
> 2/ What if ISO-8859-X were allowed to exist (as a standard) in
> parallel for anyone who wanted it, while Unicode matures its too
> advanced, but difficult technology?
>
> 3/ In the name of promoting Unicode, are we holding back multilingual
> computing for the next 10 years or so?

And you guys thought I couldn't spot a troll. Ha.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
RE: MSDN Article, Second Draft
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Sinnathurai Srivas
> Sent: Saturday, August 21, 2004 10:40 AM
> To: [EMAIL PROTECTED]
> Subject: Re: MSDN Article, Second Draft
>
> Could you include the following.
>
> 1/ Why even after about 20 years of existence, is Unicode not
> supported by any significant software and applications?

It is supported by most of the newer software systems, and by many applications; some are not even aware of it, thanks to the underlying operating system. Unicode 1.0 was published in 1991, so it will be 20 years old in 2011.

> 2/ What if ISO-8859-X were allowed to exist (as a standard) in
> parallel for anyone who wanted it, while Unicode matures its too
> advanced, but difficult technology?

There is no problem with this. Most systems support at least importing and exporting various codes. In Israel, the recommendation for new development is to use the equivalent of 8859-8 if you only need Hebrew and English, and to use Unicode if you need other languages too (for example Arabic or Russian), or if you need the additional Hebrew characters that are not in 8859-8.

> 3/ In the name of promoting Unicode, are we holding back multilingual
> computing for the next 10 years or so?

Please explain.

Jony

> I'm looking for a fair analysis of these points.
>
> Kind regards
> Sinnathurai
Re: MSDN Article, Second Draft
Sinnathurai Srivas wrote:

> Could you include the following.
>
> 1/ Why even after about 20 years of existence, is Unicode not
> supported by any significant software and applications?

In your eyes, don't MS Windows 2k/XP/2003, Mac OS X, Linux, Solaris, Java, Plan9, BeOS, MS Office, StarOffice, Gnome, KDE, Mozilla, MS IE, ICU, Perl, Python and zillions of other programs (including OSes, development tools, and libraries) count as significant?

> 3/ In the name of promoting Unicode, are we holding back multilingual
> computing for the next 10 years or so?

Are you yearning for the chaotic days of tens (if not hundreds) of different character encodings? Without Unicode, the multilingual features (and the 'monolingual features', for that matter) of all of the above would be far, far worse than they are now.

Jungshik
Re: MSDN Article, Second Draft
Could you include the following.

1/ Why even after about 20 years of existence, is Unicode not supported by any significant software and applications?

2/ What if ISO-8859-X were allowed to exist (as a standard) in parallel for anyone who wanted it, while Unicode matures its too advanced, but difficult technology?

3/ In the name of promoting Unicode, are we holding back multilingual computing for the next 10 years or so?

I'm looking for a fair analysis of these points.

Kind regards
Sinnathurai
Re: MSDN Article, Second Draft
John Cowan wrote:

> Jungshik Shin scripsit:
>> The 'official' name of 'ASCII' is ANSI X3.4-1968 or ISO 646 (US).
>> While dispelling myths about Unicode, I'm afraid you're spreading
>> misinformation about what came before it. The sentence that 'ANSI
>> pushed this scope ... represents 256 characters' is misleading. ANSI
>> has nothing to do with the various single-, double-, and triple-byte
>> character sets that make up single- and multibyte character encodings.
>
> Like it or not, "ANSI" has two meanings now: the American National
> Standards Institute and a generic term for an 8-bit Windows codepage.
> Similarly, "OEM" means both an original equipment manufacturer and an
> 8-bit PC-DOS codepage.

I'm well aware of that, but I don't like the second usage at all. Actually, I noticed recently that even MS(DN) has begun to move away from using 'ANSI' in the second sense, although the Win32 APIs with the 'A' suffix are here to stay.

Jungshik
RE: MSDN Article, Second Draft
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jungshik Shin
> Sent: Saturday, August 21, 2004 6:34 AM
> To: John Tisdale; [EMAIL PROTECTED]
> Subject: Re: MSDN Article, Second Draft
>
> ... numerous national and vendor character sets that are specific to a
> small subset of scripts/characters in use (or that can cover only a
> small subset of )
>
> Jungshik

Some 8-bit character sets were even user-specific. It wasn't too difficult to cut one's own character generator.

Jony
Re: MSDN Article, Second Draft
Jungshik Shin scripsit:

> As is often the case, Unicode experts are not necessarily experts on
> 'legacy' character sets and encodings. The 'official' name of 'ASCII' is
> ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode,
> I'm afraid you're spreading misinformation about what came before it.
> The sentence that 'ANSI pushed this scope ... represents 256 characters'
> is misleading. ANSI has nothing to do with various single, double,
> triple byte character sets that make up single and multibyte character
> encodings. They're devised and published by national and international
> standard organizations as well as various vendors. Perhaps, you'd better
> just get rid of the sentence 'ANSI pushed ... providing backward
> compatibility with ASCII'.

Like it or not, "ANSI" has two meanings now: the American National Standards Institute and a generic term for an 8-bit Windows codepage. Similarly, "OEM" means both an original equipment manufacturer and an 8-bit PC-DOS codepage.

--
"No, John. I want formats that are actually useful, rather than
over-featured megaliths that address all questions by piling on
ridiculous internal links in forms which are hideously over-complex."
        --Simon St. Laurent on xml-dev

John Cowan
http://www.ccil.org/~cowan
http://www.reutershealth.com
[EMAIL PROTECTED]
Re: MSDN Article, Second Draft
John Tisdale wrote:

> Unicode Fundamentals
>
> Early character sets were very limited in scope. ASCII required only 7
> bits to represent its repertoire of 128 characters. ANSI pushed this
> scope 8 bits which represented 256 characters while providing backward
> compatibility with ASCII. Countless other character sets emerged that
> represented the

As is often the case, Unicode experts are not necessarily experts on 'legacy' character sets and encodings. The 'official' name of 'ASCII' is ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode, I'm afraid you're spreading misinformation about what came before it. The sentence 'ANSI pushed this scope ... represents 256 characters' is misleading. ANSI has nothing to do with the various single-, double-, and triple-byte character sets that make up single- and multibyte character encodings. They were devised and published by national and international standards organizations as well as various vendors -- numerous national and vendor character sets that are specific to a small subset of scripts/characters in use (or that can cover only a small subset of ). Perhaps you'd better just get rid of the sentence 'ANSI pushed ... providing backward compatibility with ASCII'.

> characters needed by various languages and language groups. The
> growing complexities of managing numerous international character sets
> escalated the
>
> Two standards emerged about the same time to address this demand. The
> Unicode Consortium published the Unicode Standard and the
> International Organization for Standardization (ISO) offered the
> ISO/IEF 10646 standard.

A typo: it's ISO/IEC, not ISO/IEF. Perhaps it's not a typo -- you consistently used ISO/IEF in place of ISO/IEC ;-)

> Fortunately, these two standards bodies synchronized their character
> sets some years ago and continue to do so as new characters are added.
> Yet, although the character sets are mapped identically, the standards
> for encoding them vary in many ways (which are beyond the scope of
> this article).
I'm afraid that 'Yet ...' can give the false impression that the Unicode Consortium and ISO/IEC have real differences in their encoding standards, especially considering that the sentence begins with 'although ... identically'.

> Coded Character Sets
>
> A coded character set (sometimes called a character repertoire) is a
> mapping from a set of abstract characters to a set of nonnegative,
> noncontiguous integers (between 0 and 1,114,111, called code points).

A 'character repertoire' is different from a coded character set in that it is more like a set of abstract characters **without** numbers associated with them. (Needless to say, a coded character set is a set of character-integer pairs.)

> Character Encoding Forms
>
> The second component in Unicode is character encoding forms. Their
> purpose

I'm not sure whether 'component' is the best word to use here.

> The Unicode Standard provides three forms for encoding its repertoire
> (UTF-8, UTF-16 and UTF-32).

Note that ISO 10646:2003 also defines all three of them, exactly as Unicode does.

> You will often find references to USC-2 and USC-4. These are competing
> encoding forms offered by ISO/IEF 10646 (USC-2 is equivalent to UTF-16
> and USC-4 to UTF-32). I will not discuss the

UCS-2 IS different from UTF-16. UCS-2 can represent only a subset of the characters in Unicode/ISO 10646 (namely, those in the BMP). BTW, it's not USC but UCS. Also note that the UTF in UTF-8/UTF-16/UTF-32 stands for either 'UCS Transformation Format' (UCS stands for Universal Character Set, ISO 10646) or 'Unicode Transformation Format'.

> significant enough to limit its implementation (as at least half of
> the 32 bits will contain zeros in the majority of applications).
> Except in some UNIX operating systems and specialized applications
> with specific needs,

Note that ISO C 9x specifies that wchar_t be UTF-32/UCS-4 when __STDC_ISO_10646__ is defined. Recent versions of Python also use UTF-32 internally.
> UTF-32 is seldom implemented as an end-to-end solution (yet it does
> have its strengths in certain applications). UTF-16 is the default
> means of encoding the Unicode character repertoire (which has perhaps
> played a role in the misnomer that Unicode is a 16-bit character set).

I would not say UTF-16 is the default means of encoding. It is probably the most widely used, but that is different from being the default ... unless you're talking specifically about the Win32 APIs (you're not, in this paragraph, right?)

> UTF-8 is a variable-width encoding form based on byte-sized code units
> (ranging between 1 and 4 bytes per code unit).

The code unit of UTF-8 is an 8-bit byte, just as the code units of UTF-16 and UTF-32 are a 16-bit 'half-word' and a 32-bit 'word', respectively. A single Unicode character is represented with 1 to 4 code units (bytes), depending on which code point it is assigned in Unicode. Please see p. 73 of the Unicode Standard 4.0.

> In UTF-8, the high bits of each byte are reserved to indicate
Re: MSDN Article, Second Draft
Maybe a fourth level of abstraction is needed to complete what the MIME registry describes as "charsets": a TES (Transfer Encoding Syntax) sometimes happens at the end of the chain, and some legacy specifications of a CES mix in what should have been left to a separate TES.

For example, the specification of SCSU (Simple Compression Scheme for Unicode) defines it as a way to convert a stream of code points directly to a stream of bytes, without going through the intermediate level of abstraction of "code units" (or, in this case, the code units are simply the encoded bytes). This makes SCSU a legal CEF (as UTF-32, UTF-16 and UTF-8 are) for converting a stream of encoded characters into a stream of 8-bit code units, and a legal CES (as UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8 and CESU-8 are, along with UTF-16, UTF-32, UTF-8 and CESU-8 with a leading BOM) that takes the generated byte order into account. But the SCSU specification speaks of "optional extensions", which are probably badly named because they would be better described as TES (DLE-escaping for NUL and DLE, run-length compression, or COBS encoding), exactly like other well-known TES (Base64, Quoted-Printable) widely used in MIME contexts.

I think there still exist other legacy charsets in the MIME registry that mix these levels of abstraction, where a clear separation between the CES and TES levels would have helped their interoperability. One cause of this discrepancy is that it has long been easier to create a new charset and have it registered in the long MIME registry than to define a clear TES separately (the TES registry in MIME is not very long, and support for multiple TES in applications has often been weak and not easily extensible, developers preferring to first develop the support needed to correctly handle the many possible CES, each identified simply by its MIME "charset" identifier).
The other related "problem" of TES is that many document structures (including XML) only offer a place to specify the "charset" (i.e. the result of a combination of a CCS, CEF and CES), but no place to specify the TES, which is left, apparently, to the transport protocol, ignoring the case of local storage where reliable identification of the TES is nearly impossible. This means that local stores cannot easily benefit from the advantages of a TES specification: for example, when creating a reference to a text document, it is impossible to specify in the link that the document has been COBS-encoded or Base64-encoded, or even compressed in deflate or gzipped form, unless the local document is stored in an envelope format, such as a RFC2118 message with headers, and the hyperlink renderer has support for decoding this envelope format transparently.

For now, a hyperlink can specify the MIME type of the document with an attribute specifying the "charset" (i.e. the CCS/CEF/CES triplet), but there is no reliable and documented attribute to specify its TES (unless the document is transported via email or HTTP, and the source transforms it on the fly to the desired TES -- a CPU-intensive job for servers that could be avoided if documents could be stored or cached directly by the server in their TES-encoded form; this requires support in the server's storage for keeping that out-of-band information). Solutions do exist, but they are not universal and interoperable across distinct software working with the same physical document store: some filesystems offer such support with out-of-band metadata, and some servers use private conventions with multiple file extensions and private server configuration files. If the document's TES encoding and decoding could be handled directly by the client, without depending on the underlying transport or storage technology, it would be easier.
TES encoding is really out of the scope of Unicode, but its support in the various applications that use encoded text documents should be enhanced. This includes support for it in the XML and HTML document syntax, notably within source hyperlinks.

As a final note: multiple TES encoding stages may be chained in any transport or storage, and changed on the fly across nodes in a transport network, without affecting the charset used for the decoded document. But in many applications, including HTTP, only one TES can be specified (otherwise it would break other features such as document content signature and certification). I know of no working implementation of any transport protocol that transparently allows specifying multiple chained TES encodings (most often these steps are possible only in distinct layers of the transport architecture, where they can be made transparent to the applications handling encoded documents in the upper layers). This means that TES encoding/decoding affects the performance (and reliability...) of each relaying node in a transport network (such as proxies), a caveat avoided by including the TES within a MIME charset, so that no TES encoding (or decoding) is needed along the way.