Re: VISCII (was: Re: [BULK] - Re: MCW encoding of Hebrew)

Kenneth Whistler Tue, 25 May 2004 13:59:57 -0700

John Cowan asked:

> Doug Ewell scripsit:
> 
> > > So is [VIQR] a 7-bit encoding, or a scheme layered on top of ASCII?
> > 
> > It's a scheme layered on top of ASCII.
> > > And what is KOI-7?
> > 
> > A true 7-bit encoding for Russian, in which Cyrillic letters (small and
> > capital respectively) were encoded in the ranges where ASCII has Latin
> > letters (capital and small respectively).
> 
> Ah.  And on what principle do you distinguish them?


VIQR uses (for example) a sequence of two ASCII characters 'd' + 'd'
to represent, conventionally, the Vietnamese barred-d, i.e.,
U+0111 LATIN SMALL LETTER D WITH STROKE. However, that is a
convention for using a sequence of two ASCII characters --
not a direct encoding of the character.

It is correct (and appropriate) to display VIQR with an ASCII
font, in conformance with the ASCII standard. People then learn
to interpret the various sequences of letters or letters plus
ASCII punctuation and symbols as representing "real" Vietnamese
orthography.
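
To make the "convention layered on ASCII" point concrete, here is
a rough Python sketch of how such sequences could be interpreted.
The function name and the mark table are illustrative assumptions
covering only a subset of the VIQR conventions; escaping rules and
the ambiguity of ordinary punctuation after a vowel are ignored.

```python
import unicodedata

# Assumed subset of VIQR conventions: ASCII punctuation that, when it
# directly follows a vowel, stands for a Vietnamese diacritic.
VIQR_MARKS = {
    "(": "\u0306",  # breve
    "^": "\u0302",  # circumflex
    "+": "\u031b",  # horn
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "?": "\u0309",  # hook above
    "~": "\u0303",  # tilde
    ".": "\u0323",  # dot below
}
VOWELS = set("aeiouyAEIOUY")


def viqr_to_unicode(text):
    """Interpret an ASCII string using the (partial) VIQR conventions."""
    out = []
    i = 0
    after_vowel = False
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("dd", "DD"):
            # Two ASCII letters conventionally stand for barred-d.
            out.append("\u0111" if pair == "dd" else "\u0110")
            i += 2
            after_vowel = False
        elif after_vowel and text[i] in VIQR_MARKS:
            # Keep after_vowel set so several marks can stack on one vowel.
            out.append(VIQR_MARKS[text[i]])
            i += 1
        else:
            out.append(text[i])
            after_vowel = text[i] in VOWELS
            i += 1
    # Compose base letters and combining marks into precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))


raw = "Vie^.t"
ascii_bytes = raw.encode("ascii")   # the stored data is ordinary ASCII
decoded = viqr_to_unicode(raw)      # the reader's (or a tool's) interpretation
```

The stored data never stops being valid ASCII; the Vietnamese
reading exists only in the interpretation step.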

KOI-7, on the other hand, is an encoded character set. The
*definition* of the code points is as representing the
Cyrillic letters. 0x40 encodes CYRILLIC SMALL LETTER YU. It
is not AT SIGN masquerading as YU. It is correct (and
appropriate) to display KOI-7 with a KOI-7 font, in
conformance with the KOI-7 standard; it is *not* correct to
display it with an ASCII font.

The fact that KOI-7 was designed the way it was to make it
feasible to do Cyrillic on devices that could only handle
ASCII data is beside the point -- it was simply a clever
way to get around the then 7-bit limitations of devices.
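
By contrast, a minimal Python sketch of the KOI-7 case: the same
byte value is a different character by definition, depending on
which coded character set is in force. Python has no KOI-7 codec,
so the helper below (a hypothetical name) assumes that the KOI-7
Cyrillic letters in 0x40..0x7E sit at the KOI8-R positions with
the high bit cleared, and borrows the koi8_r codec accordingly.

```python
def koi7_letter(byte):
    """Return the Cyrillic letter for one byte, under the assumed KOI-7 layout."""
    if not 0x40 <= byte <= 0x7E:
        raise ValueError("outside the assumed KOI-7 letter range")
    # Assumption: setting the high bit gives the matching KOI8-R code point.
    return bytes([byte | 0x80]).decode("koi8_r")


data = bytes([0x40])
as_ascii = data.decode("ascii")   # AT SIGN under the ASCII definition
as_koi7 = koi7_letter(data[0])    # CYRILLIC SMALL LETTER YU under KOI-7
```

Here there is no second interpretation step on top of ASCII; the
byte 0x40 simply is YU once the data is declared to be KOI-7.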

> The IETF clearly
> treats them both as charsets, within its definitions.

The IETF definition of "charset" is underdetermined for
distinguishing these kinds of cases. Any specification that
allows you to map unambiguously from a sequence of bytes
to a sequence of abstract characters is, potentially, considered
a "charset" in the IETF sense, right?

As such, it cannot readily distinguish between true coded
character sets and conventional orthographies built on
top of ASCII, for example.

--Ken
