Peter Samuelson writes,

> We seem to be moving to a de facto standard of UTF-8 for non-ASCII
> characters in debian/control files.  This is not specified in Policy
> [1], but for hopefully obvious reasons, consistency is a Good Thing,
> and UTF-8 seems to be the best solution for this sort of thing.

Would Peter permit me a mild dissent?  I prefer Latin-1.

Reason: I can recognize and distinguish Latin-1 characters, even when I do not always understand the words they spell.  Recognizing and distinguishing the characters is important to me.  And not just to me.  Imagine the dismay of a Korean user trying to read Arabic script in a control file.

Well, the Korean user can speak for himself.  Speaking for myself, ASCII is a little too limited.  There is a proper balance to strike, and to me Latin-1, though imperfect, is about right.  Latin-1 is wrong if you speak Polish, of course, and even if you don't speak Polish, Latin-1's lack of a euro sign is slightly annoying; but, well, I admit that I do not really mind precisely where the line is drawn, so long as the general simple Latin concept of writing is preserved and the number of distinct characters represented is kept within reasonable bounds.

Regarding only Latin, Unicode recognizes over eight hundred Latin characters: far too many for me.  This is not considering Cyrillic or Greek; nor even beginning to think of the numerous very different writing systems of a wider non-Western world---worthy writing systems which I cannot even transcribe, much less read---beautiful writing systems in which the basic Western left-to-right, character-based, diacritically marked semantics are not preserved.  For the Debian Project, madness lies that way.  If Latin-1 is established and used, if not universally loved, then probably we should limit our usage to it.

I do not deny that Latin-1 represents all the languages I can read, and that this fact may color my view.  Nevertheless, to me a source written in Chinese is effectively non-free.  It might as well be a compiled binary blob.

Actually, UTF-8 encoding as such is fine.  It spends an extra 0xC2 or 0xC3 lead byte on each Latin-1 character above ASCII (see utf-8(7)), but this does not matter much.  The full UTF-8 domain has numerous subtle semantics which I should like to be able to avoid, however.
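
For anyone who wants to check the byte-level cost, here is a minimal sketch in Python.  It is only an illustration, not part of any Debian tool, and the function names are my own invention.  The first function collects the lead bytes that UTF-8 uses for the upper half of Latin-1 (U+0080 through U+00FF), which come out as 0xC2 and 0xC3.  The second tests whether a string stays within the Latin-1 repertoire, whatever encoding it is eventually stored in.

```python
def utf8_lead_bytes_for_latin1():
    """Collect the first byte of the UTF-8 encoding of each of U+0080..U+00FF."""
    leads = set()
    for code_point in range(0x80, 0x100):
        encoded = chr(code_point).encode("utf-8")
        # Every character in this range takes exactly two bytes in UTF-8.
        assert len(encoded) == 2
        leads.add(encoded[0])
    return leads


def fits_latin1(text):
    """Return True if every character of text exists in Latin-1 (ISO 8859-1)."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print(sorted(hex(b) for b in utf8_lead_bytes_for_latin1()))
    print(fits_latin1("Müller"))   # u-umlaut is in Latin-1
    print(fits_latin1("Łukasz"))   # L-with-stroke is not in Latin-1
```

A check like fits_latin1() looks only at which characters are used, so it behaves the same whether the file on disk is stored as raw Latin-1 bytes or as UTF-8.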

UTF-8 is for Unicode, which exists to allow the representation of the languages of the world in their own scripts.  While highly useful in its own domain, this has little to do with Debian control files, where we probably do not want the languages of the world represented in any event.

I would tend to recommend that untranslated Debian work, especially control files, be limited to Latin-1.  If the Japanese maintainers uncomplainingly transliterate their names to Latin-1 for our benefit, then probably the rest of us should do likewise.  Whether the Latin-1 is stored as two-byte UTF-8 sequences, however, is a matter of indifference to me.

--
Thaddeus H. Black
508 Nellie's Cave Road
Blacksburg, Virginia 24060, USA
+1 540 961 0920, [EMAIL PROTECTED]