Re: MSDN Article, Second Draft

2004-08-23 Thread Antoine Leca
Jungshik Shin wrote:
>> Except in some UNIX operating systems and specialized applications
>> with specific needs,
>
>Note that ISO C 9x specifies that wchar_t be UTF-32/UCS-4 when
> __STDC_ISO_10646__ is defined.

This is of course very pedantic (I do not believe there are existing
implementations that do it), but to be exact, UCS-2 and a 16-bit encoding may
be used for wchar_t while __STDC_ISO_10646__ is #defined. The macro is just
required to expand to a value below 200112L (the date of the first version of
the part of ISO/IEC 10646, here part 2, that defines characters beyond the
BMP, the equivalent of TUS 3.0.1).


Antoine




Re: MSDN Article, Second Draft

2004-08-21 Thread Doug Ewell
Sinnathurai Srivas  wrote:

> Could you include the following.
>
> 1/
> Why, even after about 20 years of existence, is Unicode not
> supported by any significant software and applications?
>
> 2/
> What if ISO-8859-X were allowed to exist (as a standard) in parallel
> for anyone who wanted it, while Unicode matures its too-advanced
> but difficult technology?
>
> 3/
> In the name of promoting Unicode, are we holding back multilingual
> computing for the next 10 years or so?

And you guys thought I couldn't spot a troll.  Ha.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





RE: MSDN Article, Second Draft

2004-08-21 Thread Jony Rosenne


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Sinnathurai Srivas
> Sent: Saturday, August 21, 2004 10:40 AM
> To: [EMAIL PROTECTED]
> Subject: Re: MSDN Article, Second Draft
> 
> 
> Could you include the following.
> 
> 1/
> Why, even after about 20 years of existence, is Unicode not
> supported by any significant software and applications?

It is supported by most newer software systems, and by many
applications - some are not even aware of it, thanks to the underlying
operating system.

Unicode 1.0 was published in 1991. It will be 20 years in 2011.

> 
> 2/
> What if ISO-8859-X were allowed to exist (as a standard) in parallel
> for anyone who wanted it, while Unicode matures its too-advanced
> but difficult technology?

There is no problem with this. Most systems support at least importing and
exporting various legacy encodings.

In Israel, the recommendation for new development is to use the equivalent
of 8859-8 if you only need Hebrew and English, and to use Unicode if you
need other languages too, for example Arabic or Russian, or if you need the
additional Hebrew characters that are not in 8859-8.

> 
> 3/
> In the name of promoting Unicode, are we holding back
> multilingual computing for the next 10 years or so?

Please explain.

Jony

> 
> I'm looking for a fair analysis of these points.
> 
> Kind regards
> Sinnathurai
> 
> 
> 
> 
> 




Re: MSDN Article, Second Draft

2004-08-21 Thread Jungshik Shin
Sinnathurai Srivas wrote:
Could you include the following.
1/
Why, even after about 20 years of existence, is Unicode not supported by
any significant software and applications?
In your eyes, don't MS Windows 2k/XP/2003, Mac OS X, Linux, Solaris, Java,
Plan9, BeOS, MS Office, StarOffice, Gnome, KDE, Mozilla, MS IE, ICU,
Perl, Python
and zillions of other programs (including OSes, development tools and
libraries) count as significant?


3/
In the name of promoting Unicode, are we holding back multilingual computing
for the next 10 years or so?

Are you yearning for the chaotic days of tens (if not hundreds) of
different character encodings? Without Unicode, the multilingual (and
'monolingual') features of all the above
would be far, far worse than they are now.

Jungshik



Re: MSDN Article, Second Draft

2004-08-21 Thread Sinnathurai Srivas
Could you include the following.

1/
Why, even after about 20 years of existence, is Unicode not supported by
any significant software and applications?

2/
What if ISO-8859-X were allowed to exist (as a standard) in parallel for
anyone who wanted it, while Unicode matures its too-advanced but
difficult technology?

3/
In the name of promoting Unicode, are we holding back multilingual
computing for the next 10 years or so?

I'm looking for a fair analysis of these points.

Kind regards
Sinnathurai





Re: MSDN Article, Second Draft

2004-08-21 Thread Jungshik Shin
John Cowan wrote:
Jungshik Shin scripsit:

 The 'official' name of 'ASCII' is
ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode, 
I'm afraid you're spreading misinformation about what came before it.
The sentence that 'ANSI pushed this scope ... represents 256 characters' 
is misleading. ANSI has nothing to do with various single, double, 
triple byte character sets that make up single and multibyte character 

Like it or not, "ANSI" has two meanings now: the American National
Standards Institute and a generic term for an 8-bit Windows codepage.
Similarly, "OEM" means both an original equipment manufacturer and an
8-bit PC-DOS codepage.
I'm well aware of that, but I don't like the second usage at all.
Actually, I noticed recently that even MS(DN) has begun to move away from
using 'ANSI' in the second sense, although Win32 APIs with the 'A'
suffix are here to stay.

Jungshik


RE: MSDN Article, Second Draft

2004-08-20 Thread Jony Rosenne


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Jungshik Shin
> Sent: Saturday, August 21, 2004 6:34 AM
> To: John Tisdale; [EMAIL PROTECTED]
> Subject: Re: MSDN Article, Second Draft
> 
> 

...

>numerous national and vendor character sets that are specific to a 
> small subset of scripts/characters in use (or that can cover only a 
> small subset of )

Some 8-bit character sets were even user-specific. It wasn't too difficult
to cut one's own character generator.

...

Jony

> 
>Jungshik
> 
> 
> 




Re: MSDN Article, Second Draft

2004-08-20 Thread John Cowan
Jungshik Shin scripsit:

> As is often the case, Unicode experts are not necessarily experts on 
> 'legacy' character sets and encodings. The 'official' name of 'ASCII' is 
> ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode, 
> I'm afraid you're spreading misinformation about what came before it.
> The sentence that 'ANSI pushed this scope ... represents 256 characters' 
> is misleading. ANSI has nothing to do with various single, double, 
> triple byte character sets that make up single and multibyte character 
> encodings. They're devised and published by national and international 
> standard organizations as well as various vendors. Perhaps, you'd better 
> just get rid of the sentence 'ANSI pushed ... providing backward 
> compatibility with ASCII'.

Like it or not, "ANSI" has two meanings now: the American National
Standards Institute and a generic term for an 8-bit Windows codepage.
Similarly, "OEM" means both an original equipment manufacturer and an
8-bit PC-DOS codepage.

-- 
"No, John.  I want formats that are actually   John Cowan
useful, rather than over-featured megaliths that   http://www.ccil.org/~cowan
address all questions by piling on ridiculous  http://www.reutershealth.com
internal links in forms which are hideously[EMAIL PROTECTED]
over-complex." --Simon St. Laurent on xml-dev



Re: MSDN Article, Second Draft

2004-08-20 Thread Jungshik Shin
John Tisdale wrote:
Unicode Fundamentals

Early character sets were very limited in scope. ASCII required only 7 bits
to represent its repertoire of 128 characters. ANSI pushed this scope 8 bits
which represented 256 characters while providing backward compatibility with
ASCII. Countless other character sets emerged that represented the
As is often the case, Unicode experts are not necessarily experts on 
'legacy' character sets and encodings. The 'official' name of 'ASCII' is 
ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode, 
I'm afraid you're spreading misinformation about what came before it.
The sentence that 'ANSI pushed this scope ... represents 256 characters' 
is misleading. ANSI has nothing to do with the various single-, double-, 
and triple-byte character sets that make up single- and multibyte character 
encodings. They were devised and published by national and international 
standards organizations as well as various vendors. Perhaps you'd better 
just get rid of the sentence 'ANSI pushed ... providing backward 
compatibility with ASCII'.


characters needed by various languages and language groups. The growing
complexities of managing numerous international character sets escalated the
  numerous national and vendor character sets that are specific to a 
small subset of scripts/characters in use (or that can cover only a 
small subset of )


Two standards emerged about the same time to address this demand. The
Unicode Consortium published the Unicode Standard and the International
Organization for Standardization (ISO) offered the ISO/IEF 10646 standard.
A typo: it's ISO/IEC, not ISO/IEF. Then again, perhaps it's not a typo - you 
consistently used ISO/IEF in place of ISO/IEC ;-)

Fortunately, these two standards bodies synchronized their character sets
some years ago and continue to do so as new characters are added.
Yet, although the character sets are mapped identically, the standards for
encoding them vary in many ways (which are beyond the scope of this
article). 

I'm afraid that 'Yet ...' can give the false impression that the Unicode 
Consortium and ISO/IEC have some differences in their encoding standards, 
especially considering that the sentence begins with 'although ... identically'.


Coded Character Sets

A coded character set (sometimes called a character repertoire) is a mapping
from a set of abstract characters to a set of nonnegative, noncontiguous
integers (between 0 and 1,114,111, called code points). 
 A 'character repertoire' is different from a coded character set: it is 
a set of abstract characters **without** numbers associated with them. 
(Needless to say, a 'coded character set' is a set of character-integer 
pairs.)
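To make the distinction concrete, here is a small Python sketch (illustrative only, not part of the original draft); `ord()`/`chr()` expose exactly the character-to-code-point pairing a coded character set defines:

```python
# A coded character set pairs abstract characters with integer code points.
pairs = {"A": 0x41, "é": 0xE9, "한": 0xD55C, "𝄞": 0x1D11E}

for ch, cp in pairs.items():
    assert ord(ch) == cp   # character -> code point
    assert chr(cp) == ch   # code point -> character

# The repertoire alone is just the set of characters, with the numbers
# stripped off:
repertoire = set(pairs)
print(sorted(hex(ord(c)) for c in repertoire))
```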


Character Encoding Forms
The second component in Unicode is character encoding forms. Their purpose
I'm not sure whether 'component' is the best word to use here.

The Unicode Standard provides three forms for encoding its repertoire
(UTF-8, UTF-16 and UTF-32). 
Note that ISO 10646:2003 also defines all three of them exactly as 
Unicode does.

> You will often find references to USC-2 and
USC-4. These are competing encoding forms offered by ISO/IEF 10646 (USC-2 is
equivalent to UTF-16 and USC-4 to UTF-32). I will not discuss the
UCS-2 IS different from UTF-16. UCS-2 can only represent a subset of the 
characters in Unicode/ISO 10646 (namely, those in the BMP). BTW, it's not 
USC but UCS. Also note that UTF in UTF-16/UTF-32/UTF-8 stands for either 
'UCS Transformation Format' (UCS stands for Universal Character Set, ISO 
10646) or 'Unicode Transformation Format'.
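The UCS-2/UTF-16 difference is easy to demonstrate; a short Python sketch (illustrative, not from the draft):

```python
# U+0041 is in the BMP: one 16-bit code unit, representable in both
# UCS-2 and UTF-16.
assert "A".encode("utf-16-le") == b"\x41\x00"

# U+1D11E (MUSICAL SYMBOL G CLEF) lies beyond the BMP: UTF-16 needs a
# surrogate pair (two 16-bit code units); UCS-2 simply has no code for it.
clef = "\U0001D11E"
units = clef.encode("utf-16-le")
assert len(units) == 4  # two 16-bit code units

high = int.from_bytes(units[0:2], "little")
low = int.from_bytes(units[2:4], "little")
assert 0xD800 <= high <= 0xDBFF   # high (lead) surrogate
assert 0xDC00 <= low <= 0xDFFF    # low (trail) surrogate
```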


significant enough to limit its implementation (as at least half of the 32
bits will contain zeros in the majority of applications). Except in some
UNIX operating systems and specialized applications with specific needs,
  Note that ISO C 9x specifies that wchar_t be UTF-32/UCS-4 when 
__STDC_ISO_10646__ is defined. Recent versions of Python also use 
UTF-32 internally.
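The Python side of this note can be checked from the interpreter itself (a sketch; note that this reflects the 2004-era "narrow" vs "wide" build distinction - since Python 3.3, the internal representation is chosen per string and `sys.maxunicode` is always 0x10FFFF):

```python
import sys

# sys.maxunicode tells whether this interpreter addresses the full
# Unicode code space (0x10FFFF, a "wide" UCS-4/UTF-32 build) or only
# the BMP (0xFFFF on old "narrow" builds).
if sys.maxunicode == 0x10FFFF:
    print("wide build: full UCS-4/UTF-32 code space")
else:
    print("narrow build: BMP only, maxunicode =", hex(sys.maxunicode))
```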

UTF-32 is seldom implemented as an end-to-end solution (yet it does have its
strengths in certain applications).
UTF-16 is the default means of encoding the Unicode character repertoire
(which has perhaps played a role in the misnomer that Unicode is a 16-bit
character set). 
  I would not say UTF-16 is the default means of encoding. It's 
probably the most widely used, but that's different from being the 
default ... unless you're talking specifically about Win32 APIs (you're 
not in this paragraph, right?)


UTF-8 is a variable-width encoding form based on byte-sized code units
(ranging between 1 and 4 bytes per code unit). 
  The code unit of UTF-8 is an 8-bit byte, just as the code units of 
UTF-16 and UTF-32 are a 16-bit 'half-word' and a 32-bit 'word', 
respectively. A single Unicode character is represented with 1 to 4 code 
units (bytes) depending on which code point it is assigned in Unicode. 
Please see p. 73 of the Unicode Standard 4.0.
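A quick Python sketch of the 1-to-4-bytes-per-character behaviour (illustrative, not part of the draft):

```python
# UTF-8's code unit is one byte; a character takes 1 to 4 of them
# depending on its code point.
samples = {
    "A": 1,   # U+0041, ASCII range
    "é": 2,   # U+00E9
    "한": 3,  # U+D55C, rest of the BMP
    "𝄞": 4,   # U+1D11E, beyond the BMP
}
for ch, nbytes in samples.items():
    assert len(ch.encode("utf-8")) == nbytes

# The high bits of the lead byte announce the sequence length:
lead = "한".encode("utf-8")[0]
assert lead >> 4 == 0b1110   # 1110xxxx marks a 3-byte sequence
```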

In UTF-8, the high bits of each byte are reserved to indicate

Re: MSDN Article, Second Draft

2004-08-19 Thread Philippe Verdy
Maybe a fourth level of abstraction is needed to complete what the MIME
registry describes as "charsets": a TES (Transfer Encoding Syntax) sometimes
comes at the end, and some legacy specifications of a CES mix in what should
have been left to a separate TES.

For example, the specification of SCSU (Simple Compression Scheme for
Unicode) defines it as a way to convert a stream of code points directly to
a stream of bytes, without going through the level of abstraction of
intermediate "code units" (or in this case, code units are simply the
encoded bytes).

This makes SCSU a legal CEF (as are UTF-32, UTF-16 and UTF-8), converting a
stream of encoded characters into a stream of (8-bit) code units, and a
legal CES (as are UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8 and
CESU-8, plus UTF-16, UTF-32, UTF-8 and CESU-8 with a leading BOM), taking
into account the generated byte order.

But the SCSU specification speaks of "optional extensions" which are
probably badly named, because they would be better described as TESes
(DLE-escaping for NUL and DLE, run-length compression, or COBS encoding),
exactly like other well-known TESes (Base64, Quoted-Printable) widely used in
MIME contexts.
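The layering being described can be sketched in a few lines of Python (illustrative only): the charset machinery produces bytes, and a TES such as Base64 then re-encodes those bytes for transport, independent of which charset produced them.

```python
import base64

# CCS/CEF/CES together turn characters into bytes ...
text = "Ünïcode"
ces_bytes = text.encode("utf-8")          # CES output: raw bytes

# ... and a TES re-encodes those bytes into a transfer-safe form.
tes_form = base64.b64encode(ces_bytes)    # e.g. Base64, as used in MIME

# The TES is reversible and completely charset-agnostic:
assert base64.b64decode(tes_form) == ces_bytes
assert base64.b64decode(tes_form).decode("utf-8") == text
```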

I think there still exist other legacy charsets in the MIME registry that
mix these levels of abstraction, where a clear separation between the CES
and TES levels would have helped their interoperability. One cause of this
discrepancy is that it has long been easier to create a new charset and have
it registered in the long MIME registry than to define a clear TES
separately (the TES registry in MIME is not very long, and support for
multiple TESes in applications has often been weak and not easily
extensible, developers preferring to first develop the support needed to
handle correctly the many possible CESes, identified simply by their MIME
"charset" identifier).

The other related "problem" with a TES is that many document structures
(including XML) only offer a place to specify the "charset" (i.e. the result
of a combination of a CCS, CEF and CES), but no place to specify the TES,
which is left, apparently, to the transport protocol, ignoring the case of
local storage where identification of the TES is nearly impossible to make
reliably... This means that local stores cannot easily benefit from the
advantages of a TES specification (for example, when creating a reference to
a text document, it's impossible to specify in the link that the document
has been COBS-encoded or Base64-encoded, or even compressed in deflate or
gzipped form, unless the local document is stored in an envelope format,
such as a RFC2118 message with headers, and the hyperlink renderer supports
decoding this envelope format transparently).

For now, a hyperlink can specify the MIME type of the document with an
attribute specifying the "charset", i.e. the CCS+CEF+CES triplet, but no
reliable and documented attribute to specify its TES (unless the document is
transported via email or HTTP, and the source transforms it on the fly to
the desired TES, which is a CPU-intensive job for servers that could be
avoided if documents could be stored or cached directly by the server in
their TES-encoded form; this requires support in the server's storage for
keeping this out-of-band information).

Solutions do exist, but they are not universal or interoperable across
distinct software working with the same physical document store: some
filesystems offer such support with out-of-band metadata; some servers use
private conventions with multiple file extensions and private server
configuration files...

If the document's TES encoding and decoding could be handled directly by the
client, without depending on the underlying transport or storage
technology, it would be easier.

TES encoding is really out of the scope of Unicode, but its support in
various applications using encoded text documents should be enhanced. This
includes support for it in the XML and HTML document syntax, notably within
source hyperlinks.

As a final note: multiple TES encoding stages may be chained in any
transport or storage, and changed on the fly across nodes in a transport
network, without affecting the charset used for the decoded document. But in
many applications, including HTTP, only one TES can be specified (otherwise
it will break other features such as document content signature and
certification). I know of no working implementation of any transport
protocol that transparently allows specifying these multiple TES encodings
(most often these steps are possible only in distinct layers of the
transport architecture, where they can be made transparent to the
applications handling encoded documents in the upper layers). This means
that TES encoding/decoding affects the performance (and reliability...) of
each relaying node in a transport network (such as proxies), a caveat
avoided by including the TES within a MIME charset, so that no TES encoding (o