Re: charsets in debian/control

Thaddeus H. Black, Sun, 05 Dec 2004 14:16:28 -0600

Peter Samuelson writes,

> We seem to be moving to a de facto standard of UTF-8 for non-ASCII
> characters in debian/control files.  This is not specified in Policy
> [1], but for hopefully obvious reasons, consistency is a Good Thing,
> and UTF-8 seems to be the best solution for this sort of thing.


Would Peter permit me a mild dissent?  I prefer Latin-1.  Reason: I can
recognize and distinguish Latin-1 characters, even when I do not always
understand the words they spell.  Recognizing and distinguishing the
characters is important to me.  And not just to me.  Imagine the dismay
of a Korean user trying to read Arabic script in a control file.

Well, the Korean user can speak for himself.  Speaking for myself, ASCII
is a little too limited.  There is a proper balance to strike, and to me
Latin-1 though imperfect is about right.

Latin-1 is wrong if you speak Polish, of course, and even if you don't
speak Polish, Latin-1's lack of a euro sign is slightly annoying; but,
well, I admit that I do not really mind precisely where the line is
drawn, so long as the general simple Latin concept of writing is
preserved and the number of distinct characters represented is kept
within reasonable bounds.  Regarding only Latin, Unicode recognizes over
eight hundred Latin characters: far too many for me.  This is not
considering Cyrillic or Greek; nor even beginning to think of the
numerous very different writing systems of a wider non-Western
world---worthy writing systems which I cannot even transcribe much less
read---beautiful writing systems in which the basic Western
left-to-right, character-based, diacritically marked semantics are not
preserved.  For the Debian Project, madness lies that way.  If Latin-1
is established and used, if not universally loved, then probably we
should limit our usage to it.

I do not deny that Latin-1 represents all the languages I can read, and
that this fact may color my view.  Nevertheless to me a source written
in Chinese is effectively non-free.  It might as well be a compiled
binary blob.

Actually, UTF-8 encoding as such is fine.  It uses a few extra 0xC2 and
0xC3 bytes for the non-ASCII Latin-1 characters (see utf-8(7)), but this does not
matter much.  The full UTF-8 domain has numerous subtle semantics which
I should like to be able to avoid, however.  UTF-8 is for Unicode, which
is to allow the representation of the languages of the world in their
own scripts.  While highly useful in its own domain, this has little to
do with Debian control files, where we probably do not want the
languages of the world represented in any event.
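
The two-byte cost of the upper Latin-1 range can be checked directly.
What follows is only a minimal Python sketch, not part of any Debian
tooling; the function name is made up for illustration.  It encodes
each code point from U+0080 through U+00FF as UTF-8 and collects the
first byte of every result.

```python
def utf8_lead_bytes_for_latin1():
    """Collect the first byte of the UTF-8 encoding of U+0080..U+00FF.

    Hypothetical helper for illustration only.
    """
    leads = set()
    for code_point in range(0x80, 0x100):
        encoded = chr(code_point).encode("utf-8")
        # Every non-ASCII Latin-1 character takes exactly two bytes.
        assert len(encoded) == 2
        leads.add(encoded[0])
    return leads


lead_bytes = sorted(utf8_lead_bytes_for_latin1())
print([hex(b) for b in lead_bytes])  # ['0xc2', '0xc3']
```

ASCII characters (U+0000 through U+007F) are left as single bytes, so
only the accented and special characters grow.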

I would tend to recommend that untranslated Debian work, especially
control files, be limited to Latin-1.  If the Japanese maintainers
uncomplainingly transliterate their names to Latin-1 for our benefit,
then probably the rest of us should do likewise.  Whether the Latin-1 is
C2/C3-encoded as UTF-8, however, is a matter of indifference to me.
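
A restriction of this kind could be checked mechanically.  The sketch
below is an assumption about how one might do it, not an existing
Debian check: it takes the bytes of a UTF-8 file and reports every
character that falls outside the Latin-1 repertoire (code points above
U+00FF).  The function name and sample field values are hypothetical.

```python
def outside_latin1(data: bytes):
    """Return (index, character) pairs for characters above U+00FF.

    Hypothetical helper: assumes `data` is valid UTF-8 and raises
    UnicodeDecodeError otherwise.
    """
    text = data.decode("utf-8")
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFF]


# Latin-1 names pass; a Polish L-with-stroke (U+0141) is reported.
print(outside_latin1("Maintainer: José Müller".encode("utf-8")))
print(outside_latin1("Maintainer: Łukasz".encode("utf-8")))
```

Note that this test admits the whole range U+0080 through U+00FF,
including the control codes in U+0080 through U+009F; a stricter check
might exclude those.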

-- 
Thaddeus H. Black
508 Nellie's Cave Road
Blacksburg, Virginia 24060, USA
+1 540 961 0920, [EMAIL PROTECTED]
