Re: charsets in debian/control
Hello. Paul Hampson: The email address isn't important, since that has to be a subset of ASCII anyway. Are the Unicode-encoded domain names supported in (modern) browsers only? I can surf to http://.pl/ (with, e.g., Firefox) - can I send mail to [EMAIL PROTECTED], or should I always use the [EMAIL PROTECTED] equivalent, as the Unicode in domain names is restricted to WWW only? Cheers, -- Shot -- There's a difference between random people with stripy jumpers, and a respected scientist with a reputation. -- Steve Kitson, ucam.chat http://shot.pl/hovercraft/ === signature.asc Description: Digital signature
Re: charsets in debian/control
On Dec 11, Shot (Piotr Szotkowski) [EMAIL PROTECTED] wrote: I can surf to http://?.pl/ (with, e.g., Firefox) - can I send mail to [EMAIL PROTECTED], or should I always use the [EMAIL PROTECTED] equivalent, as the Unicode in domain names is restricted to WWW only? It depends on your MUA. With mutt you can send mail to internationalized domain names without needing to type the ASCII encoding. -- ciao, | Marco | [9705 svftIaGWGM8aU] signature.asc Description: Digital signature
Re: charsets in debian/control
On Sat, 11 Dec 2004 16:08:12 +0100, Shot (Piotr Szotkowski) wrote: Hello. Paul Hampson: The email address isn't important, since that has to be a subset of ASCII anyway. Are the Unicode-encoded domain names supported in (modern) browsers only? I can surf to http://.pl/ (with, e.g., Firefox) - can I send mail to [EMAIL PROTECTED], or should I always use the [EMAIL PROTECTED] equivalent, as the Unicode in domain names is restricted to WWW only? Interesting question. Quick check. Not restricted. Of course you have to use ACE (the ASCII Compatible Encoding defined in RFC3490) in transit (SMTP commands and message headers) but MUAs may (and at least some indeed do) accept/display the decoded form. aptitude search '~D^libidn' shows many apps at least linked with the IDN library. -- Micha Politowski Talking has been known to lead to communication if practised carelessly.
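As an illustration of the ACE round trip described above, here is a minimal Python sketch. It relies on Python's built-in "idna" codec, which implements the RFC 3490 conversions; it is not the libidn API mentioned in the message. The domain bücher.example and the helper names are made-up placeholders, not anything from this thread.

```python
def to_ace(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII Compatible Encoding."""
    # The "idna" codec applies nameprep and Punycode label by label.
    return domain.encode("idna").decode("ascii")


def from_ace(domain: str) -> str:
    """Decode an ACE domain name back to Unicode, e.g. for display in a MUA."""
    return domain.encode("ascii").decode("idna")


def ace_address(local_part: str, domain: str) -> str:
    """Build the form of an address that would travel in SMTP commands.

    Only the domain is converted. The local part is passed through
    unchanged and must already be ASCII.
    """
    local_part.encode("ascii")  # raises UnicodeEncodeError if not ASCII
    return f"{local_part}@{to_ace(domain)}"


if __name__ == "__main__":
    print(to_ace("bücher.example"))
    print(from_ace("xn--bcher-kva.example"))
    print(ace_address("user", "bücher.example"))
```

A mail client that accepts the decoded form would, under this sketch, call something like to_ace() just before handing the address to the transport.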
Re: charsets in debian/control
On Sat, Dec 11, 2004 at 04:08:12PM +0100, Shot (Piotr Szotkowski) wrote: Hello. Paul Hampson: The email address isn't important, since that has to be a subset of ASCII anyway. Are the Unicode-encoded domain names supported in (modern) browsers only? I can surf to http://.pl/ (with, e.g., Firefox) - can I send mail to [EMAIL PROTECTED], or should I always use the [EMAIL PROTECTED] equivalent, as the Unicode in domain names is restricted to WWW only?

Good point. Others have pointed out that you can. And the flip side is: can I post to [EMAIL PROTECTED]? RFC2821 says:

Local-part = Dot-string / Quoted-string
Quoted-string = DQUOTE *qcontent DQUOTE

While the above definition for Local-part is relatively permissive, for maximum interoperability, a host that expects to receive mail SHOULD avoid defining mailboxes where the Local-part requires (or uses) the Quoted-string form or where the Local-part is case-sensitive. For any purposes that require generating or comparing Local-parts (e.g., to specific mailbox names), all quoted forms MUST be treated as equivalent and the sending system SHOULD transmit the form that uses the minimum quoting possible. Systems MUST NOT define mailboxes in such a way as to require the use in SMTP of non-ASCII characters (octets with the high order bit set to one) or ASCII control characters (decimal value 0-31 and 127). These characters MUST NOT be used in MAIL or RCPT commands or other commands that require mailbox names.
==

RFC2821 doesn't give more detail than that about Quoted-string, so I presume we would have to use something like the ACE encoding used for domain names. A quick Google search didn't turn up anything concrete, so I have _no_ idea what would look like as an email box on my mail server. I certainly think RFC2047 would be a bad idea: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= (I have no idea what that says, I grabbed it from the RFC. It's base64 or something quite like it.)

So the short answer is that the email address in SMTP has to be a subset of US-ASCII, but domain names can be handled by libidn, and local-parts are still in need of a standard.

-- --- Paul TBBle Hampson, MCSE 7th year CompSci/Asian Studies student, ANU The Boss, Bubblesworth Pty Ltd (ABN: 51 095 284 361) [EMAIL PROTECTED] No survivors? Then where do the stories come from I wonder? -- Capt. Jack Sparrow, Pirates of the Caribbean This email is licensed to the recipient for non-commercial use, duplication and distribution. --- signature.asc Description: Digital signature
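For reference, the RFC 2047 encoded-word quoted in the message can be decoded with Python's standard email.header module. This is only a sketch; the helper name is hypothetical, and treating undeclared bytes as ASCII is an assumption made here.

```python
from email.header import decode_header


def decode_encoded_word(value: str) -> str:
    """Decode a header value that may contain RFC 2047 encoded-words."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded-words come back as bytes plus the declared charset.
            # Bytes with no declared charset are treated as ASCII here,
            # which is an assumption of this sketch.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    print(decode_encoded_word("=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?="))
```

The "B" between the question marks selects base64, which matches the guess in the message; the decoded text is the fragment "If you can read this yo".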
Re: charsets in debian/control
On Tue, Dec 07, 2004 at 05:56:54PM +, Thaddeus H. Black wrote: But yes, non-ASCII Latin-1 chars should not be given special status over the national chars found in other languages spoken by project members. Debian should be using either ASCII, or Unicode; standardizing on Latin-1 makes no sense in a global project. True. Look, Steve: mild abuse aside, I agree with you in every particular. Nevertheless, I would respectfully suggest that your criticism underscores my point, which regards the monstrous increase in complexity which the full Unicode standard represents. Yet you had concluded this means we should use Latin-1 as an encoding for the files. All arguments that justify the use of Latin-1 characters in the control file are equally applicable to any of a number of other national character sets used by one or more developers. Consider. Is it a bug if Readline cannot echo full bidirectional input? Er, yes, sure it is, independently of what happens in debian/control. If Dselect does not appreciate all the non-spacing characters? IFF dselect has a reason to display such characters, yes. This may well be the case regardless of whether debian/control ever supports non-ASCII characters; Debian may start supporting localized Packages files via some external mechanism, or it may provide a localized UI that requires these characters. If Less does not regard Tibetan subjoined letters? (This is my Tibetan straw man.) Yes, this is also a bug. Not one that's likely to be noticed for a while, but a bug nevertheless. But your example again overstates the complexity of the task: the main responsibility of less is to figure out how many characters to display on a line, and let the *terminal* render the glyphs. This is code that needs to be implemented only once, and most of the work is already done centrally for *all* apps by glibc which keeps track of the display width of each character. 
Undoubtedly one might observe that the Tibetan problem were not really a problem with Less but rather with some underlying library, but this misses the point---or rather again it underscores the point. Unicode solves what for many of us was not a problem by creating an entirely new class of problems. For example, it requires us to be particular about how we tag our e-mail attachments... Um, no. Being part of a *global Internet* causes this problem for you. The non-ASCII characters in your email were undefined gibberish according to your headers; only naive (or helpful, YMMV) mail readers would render them at all, and only naive mail readers commanded by users using a Western European locale would have rendered them as intended. Actually, perhaps even that is being too generous, as there are *different* native 8-bit encodings used on each of Unix, Windows, and MacOS; the Unix and Windows encodings differ on relatively few codepoints, but the Mac encoding is widely different. And you think it's ok to inflict this same mess on anyone not using a Latin-1 locale while trying to read a debian/control file? Am I arguing to jettison Unicode? No; to the partial extent that I had been arguing it earlier in the thread, you, Peter, Daniel and Matthew have changed my mind. However, the typical roster of skills one masters in contributing broadly to Debian development is already awesome: C, C++, CPP, Make, Perl, Python, Autoconf, CVS, Shell, Glibc, System calls, /proc, IPC, sockets, Sed, Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline, Ncurses, TeX, Postscript, Groff, XML, assembly, Flex, Bison, ORB, Lisp, Dpkg, PAM, Xlibs, Tk, GTK, SysVInit, Debconf, ELF, etc.---not to mention the use of the English language at a sophisticated technical level. UTF-8 is neat, but I do not really like Unicode (you may have noticed this). Seeking essential simplicity, I would prefer to keep the full hairy overgrown Unicode standard from the typical Debian roster of development skills. 
Wouldn't you? 1) Sorry, modern software is a complex creature. This is because we demand complex things of it -- including handling all the languages that we speak. 2) Most DDs do not master all of the above skills. *I* don't have a mastery of all of the above skills; contributing broadly to Debian usually means mastering some of these skills, and knowing where to find answers for the rest. 3) Mastering Unicode, for the purposes of almost anyone not working directly on glibc or implementing a terminal, is roughly equivalent to making sure your application implements proper string handling for CJK. If you do it right, the differences between UTF-8 and ISO-2022 are normally minimal; if you do it wrong, you get bug reports from Japanese users. However, for files for which no encoding is specified, there is no right way to handle non-ASCII data, which is why debian/control is an issue. 4) As suggested above, for 98% of all applications on the system, the encoding used for debian/control is *entirely
Re: charsets in debian/control
It is one thing spiritedly to argue a point against friends and allies. It is another to be obstinate. I do not wish the latter, and I admit that I am both outnumbered and outreasoned today. Please permit me without malice to conform my position, which now might be stated as follows. Unicode is a reasonable solution to a difficult yet important problem. Broadly accepted even among Debian Developers from the Latin-1 countries, Unicode is also recognized outside Debian around a wider world. Unicode is recommended for general Debian application. For non-localized purposes in which a restricted, byte-based character set is wanted, plain seven-bit ASCII is normally the logical choice. As for Latin-1, although it served some needs in an earlier day, it must today be regarded as a local, incompatible encoding, not recommended for general international use. I trust that you will inform me if the conformed position yet lacks in any significant way! Besides expressing my own revised view, the statement also means to summarize the subthread's key points. Since I happen to have the attention of interested people at the moment, I should say that I could use some help in conforming debram's [7800 Non-English Natural Language] division sensibly to the Unicode consensus. I lack the right knowledge to do it myself. At present, only the Latin-1 languages are sensibly differentiated there. The aid of a Russian (for group 7890) and a Japanese (for group 7880) might be particularly suitable, for instance. (If you don't know what this is about, it regards debtags [1].) Turning to another matter, the responses to my impromptu roster of Debian development skills indicate that the roster has been taken in slightly a different manner than I had meant it. ... the typical roster of skills one masters in contributing broadly to Debian development is ... 
awesome: C, C++, CPP, Make, Perl, Python, Autoconf, CVS, Shell, Glibc, System calls, /proc, IPC, sockets, Sed, Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline, Ncurses, TeX, Postscript, Groff, XML, assembly, Flex, Bison, ORB, Lisp, Dpkg, PAM, Xlibs, Tk, GTK, SysVInit, Debconf, ELF, etc.---not to mention the use of the English language at a sophisticated technical level. Although the roster may be interesting, it was meant neither as a canonical proposal nor as a challenge. In fact it was just what I had happened to think of informally at the moment. For the record, I happen to have a working familiarity with nineteen of the items on my own roster, plus a limited familiarity with seven more. Were the roster a challenge, it would be a foolish one, because Steve Langasek would beat me in a Debian development contest and I know it. As for the other fifteen roster items, as Steve said, contributing broadly to Debian usually means mastering some of these skills, and knowing where to find answers for the rest. -- Thaddeus H. Black 508 Nellie's Cave Road Blacksburg, Virginia 24060, USA +1 540 961 0920, [EMAIL PROTECTED] 1. http://debtags.alioth.debian.org pgpWJ19KUmZqo.pgp Description: PGP signature
Re: charsets in debian/control
[Roger Leigh] I've been using Debian with UTF-8 only locales for over 12 months now. I now consider it fine for general use, with respect to terminal and application support. Unlike a couple of years ago, most things work perfectly. Some apps like 'screen' do not just configure themselves for UTF-8 support based on LC_CTYPE, but have to be manually configured. Presumably your goal would include fixing these apps. signature.asc Description: Digital signature
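A rough sketch of what "configuring itself based on LC_CTYPE" could look like, using only the locale environment variables. The function names are hypothetical, and the check works on the locale name alone, which is an assumption; a real program would more likely call setlocale() and query the codeset (nl_langinfo(CODESET) in C).

```python
def effective_ctype(env: dict) -> str:
    """Return the locale name that governs LC_CTYPE.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values count as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


def wants_utf8(env: dict) -> bool:
    """Guess from the locale name whether the locale's charmap is UTF-8."""
    locale_name = effective_ctype(env)
    # Locale names look like language_TERRITORY.codeset@modifier.
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    import os
    print(wants_utf8(dict(os.environ)))
```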
Re: charsets in debian/control
On Tuesday 07 December 2004 00.19, Roger Leigh wrote: I think going to UTF-8 as the default locale charmap for all locales is a feasible goal for etch, as is recoding everything to UTF-8 (where it makes sense). Yep. My biggest problem right now is 'lpr sometextfile' to a postscript printer (I use cups). I *believe* the problem is not necessarily with cups itself but with a2ps or whatever is used to generate the postscript output. cheers -- vbi -- Hail Eris! pgp0lyS3by2Vw.pgp Description: PGP signature
Re: charsets in debian/control
* Roger Leigh ([EMAIL PROTECTED]) [041207 00:40]: I think going to UTF-8 as the default locale charmap for all locales is a feasible goal for etch, as is recoding everything to UTF-8 (where it makes sense). "Feasible goal" and "etch" are the magic words, I think: I agree on that, but I don't want to claim that we are already there today. Cheers, Andi -- http://home.arcor.de/andreas-barth/ PGP 1024/89FB5CE5 DC F1 85 6D A6 45 9C 0F 3B BE F1 D0 C5 D1 D9 0C
Re: charsets in debian/control
I look at the screen, and there is Roger Leigh writing to me:
- No UTF-8 console keymaps
- Some broken libraries e.g. GTK+ 1.2 [obsolete]
- I can't paste UTF-8 into emacs (perhaps a problem in my .emacs)
- mc making a mess of its frames
Maciek -- M.Sc. Maciej Dems [EMAIL PROTECTED] - C o m p u t e r P h y s i c s L a b o r a t o r y Institute of Physics,Technical University of Lodz ul. Wolczanska 219, 93-005 Lodz, Poland, +48426313649
Re: charsets in debian/control
07.12.2004 13:33 +0100 Maciej Dems (-): I look at the screen, and there is Roger Leigh writing to me:
- No UTF-8 console keymaps
- Some broken libraries e.g. GTK+ 1.2 [obsolete]
- I can't paste UTF-8 into emacs (perhaps a problem in my .emacs)
- mc making a mess of its frames
Add dselect and aptitude here. -- Eugeniy Meshcheryakov Kyiv National Taras Shevchenko University Information and Computing Centre http://icc.univ.kiev.ua signature.asc Description: Digital signature
Re: charsets in debian/control
On Tuesday 07 December 2004 12:44 am, Peter Samuelson wrote: Defining the character set as utf-8 means that any non-unicode capable application is going to have issues, yes. Postulate an app that is ignorant of character sets - we'll call it aptitude. Fixing it to make it accept utf-8 and spit out the correct encoding for its LC_CTYPE is no harder than fixing it to make it accept iso-8859-1 and spit out the correct encoding for its LC_CTYPE. And if the app already deals with charset conversions but assumes iso-8859-1 input, then it's trivial to fix it to assume utf-8 input. This is not true. iso-8859-1 is an 8-bit charset, while Unicode is a 32-bit [0] charset. Storing and manipulating iso-8859-1 strings requires no changes to internal datatypes (only conversions for input and output); storing and manipulating Unicode means you have to switch to a completely different set of string-handling functions for all internal operations. In C++ you might be able to partly finesse this by creating a replacement string class, but if our program (call it aptitude) is already using a complex replacement string class for some tasks, and this class assumes that characters are 8 bits wide, this might be a slightly non-trivial task, especially compared to handling iso-8859-1. Hypothetically speaking. :-) On the other hand, once the program is using Unicode internally, taking iso-8859-1 as input and producing it as output should be no problem. Daniel [0] According to the libc manual, only 16 bits have been assigned, but GNU systems use 32-bit encoding internally if the libc transcoding functions are used. -- /--- Daniel Burrows [EMAIL PROTECTED] --\ | swapon /dev/ram | \--- News without the $$ -- National Public Radio -- http://www.npr.org ---/ pgpuGzR6Woq1o.pgp Description: PGP signature
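The input/output conversion step that both posters treat as the easy part might look roughly like this in Python. This is only a sketch: the function name is hypothetical, and replacing unrepresentable characters with "?" is an assumed policy. It says nothing about the internal string handling that the reply is concerned with.

```python
def recode_for_terminal(raw: bytes, terminal_charset: str) -> bytes:
    """Decode UTF-8 input and re-encode it for the terminal's charset.

    Characters the terminal charset cannot represent become "?".
    """
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    return text.encode(terminal_charset, errors="replace")


if __name__ == "__main__":
    utf8_input = "café in Łódź".encode("utf-8")
    print(recode_for_terminal(utf8_input, "iso-8859-1"))
    print(recode_for_terminal(utf8_input, "iso-8859-2"))
```

With a Latin-1 terminal the Polish letters outside Latin-1 degrade to "?", while a Latin-2 terminal can show all of them.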
Re: charsets in debian/control
On Tuesday 07 December 2004 10:17 am, Daniel Burrows wrote: complex replacement string class Admittedly, complex might (hypothetically) be a bit of an exaggeration. :P Daniel -- /--- Daniel Burrows [EMAIL PROTECTED] --\ | You are in a maze of twisty little signatures, all alike. | \ Evil Overlord, Inc: http://www.eviloverlord.com --/ pgpdSwtA0vgWt.pgp Description: PGP signature
Re: charsets in debian/control
On Tue, Dec 07, 2004 at 10:17:17AM -0500, Daniel Burrows wrote: On Tuesday 07 December 2004 12:44 am, Peter Samuelson wrote: And if the app already deals with charset conversions but assumes iso-8859-1 input, then it's trivial to fix it to assume utf-8 input. This is not true. iso-8859-1 is an 8-bit charset, while Unicode is a 32-bit [0] charset. Storing and manipulating iso-8859-1 strings requires no changes to internal datatypes (only conversions for input and output); storing and manipulating Unicode means you have to switch to a completely different set of string-handling functions for all internal operations. No, you do not have to do this. You can keep working with char, the changes when switching to UTF-8 will mostly have to deal with the fact that one Unicode character is represented by more than one char. This means that you need to use a different strlen function, take care only to chop strings of char at character boundaries, ensure that input strings are actually valid UTF-8, etc. Cheers, Richard -- __ _ |_) /| Richard Atterer | GnuPG key: | \/¯| http://atterer.net | 0x888354F7 ¯ '` ¯
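The three adjustments listed above (a character-counting strlen, cutting only at character boundaries, and validating input) can be sketched on raw byte strings as follows. Python's bytes type stands in for C's char array here, and the function names are made up for illustration.

```python
def utf8_length(data: bytes) -> int:
    """Count code points: every byte that is not a 10xxxxxx continuation."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def utf8_truncate(data: bytes, max_bytes: int) -> bytes:
    """Cut to at most max_bytes (non-negative) without splitting a sequence."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Back up while the first dropped byte is a continuation byte,
    # which would mean the cut falls inside a multi-byte character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def is_valid_utf8(data: bytes) -> bool:
    """Check that the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    sample = "naïve".encode("utf-8")
    print(len(sample), utf8_length(sample))
    print(utf8_truncate(sample, 3))
    print(is_valid_utf8(b"caf\xe9"))
```

Note that cutting at a code point boundary can still separate a base letter from a following combining mark; this sketch does not attempt to handle that.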
Re: charsets in debian/control
Daniel Burrows [EMAIL PROTECTED] wrote: iso-8859-1 is an 8-bit charset, while Unicode is a 32-bit [0] charset. Storing and manipulating iso-8859-1 strings requires no changes to internal datatypes (only conversions for input and output); storing and manipulating Unicode means you have to switch to a completely different set of string-handling functions for all internal operations. utf-8 is an 8-bit encoding of unicode, using variable length characters. Traditional string manipulation routines work fine, except in the case where you need to know the number of characters rather than the number of bytes. This is typically not a large number of areas of code. [0] According to the libc manual, only 16 bits have been assigned, but GNU systems use 32-bit encoding internally if the libc transcoding functions are used. The libc manual is out of date. We've been using more than 16 bits for a while. -- Matthew Garrett | [EMAIL PROTECTED]
Re: charsets in debian/control
On Tuesday 07 December 2004 10:40 am, Richard Atterer wrote: No, you do not have to do this. You can keep working with char, the changes when switching to UTF-8 will mostly have to deal with the fact that one Unicode character is represented by more than one char. This means that you need to use a different strlen function, take care only to chop strings of char at character boundaries, ensure that input strings are actually valid UTF-8, etc.

This might work for programs that relatively blindly manipulate character strings and can pass them off to the terminal for processing. In fact, aptitude does a *lot* of processing and formatting of strings internally. That means, for instance: splitting strings into words and paragraphs, truncating strings, finding out how wide strings are. More importantly, it also makes significant (and increasing) use of strings annotated with the terminal attributes of each character (think colors, bold/reverse video, etc). Needless to say, it performs all of the above operations on those strings as well. All of these are impacted by extended charsets: for instance, you need to use a different function to find whitespace, and combining characters with their attributes requires the use of a structure where an integer previously sufficed. That's not to mention finding the length of a string, which is necessary to perform most types of layout.

The changes that are necessary are at least:
- At a minimum, the class used for formatted strings will have to be re-targeted to support either formatted wide strings or formatted utf8 strings.
- If wide characters are not used internally, it is also necessary to audit every occurrence of s.size() and check whether the length-in-memory or the length-in-characters of the string is being queried.
- If neither wide characters nor a utf8-specialized basic_string are used, it is necessary to audit every string constructor (which might cut a substring) and make sure that it doesn't play havoc with utf8 codings.
- Every use of isspace() and friends will have to be replaced with Unicode-aware equivalents.

And that's just the problems I can think of off the top of my head. It's also necessary to use a completely different set of terminal i/o routines, but this is pretty much expected. None of these problems are insurmountable, of course, and I know pretty much how to solve most of them. However, it's also true that none of them exist *at all* when using iso-8859-1, which is why I object to the comment that it's no harder to handle utf8 than iso-8859-1. (in fact, if your terminal speaks iso-8859-1, aptitude will handle it just fine without any changes) Daniel -- /--- Daniel Burrows [EMAIL PROTECTED] --\ | Hi, I'm a .signature virus! | | Copy me into your .signature to help me spread! | \ The Turtle Moves! -- http://www.lspace.org ---/ pgpJ6zHEp3AWv.pgp Description: PGP signature
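To make the layout concerns above concrete, here is a small Python sketch of width-aware measuring, truncation and word splitting. It is not aptitude's code. The width rule (0 for combining marks, 2 for East Asian wide and fullwidth characters, 1 otherwise) is a simplification assumed for illustration, roughly the job that wcwidth() does in C, and it ignores control and other zero-width characters.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate number of terminal columns for one code point."""
    if unicodedata.combining(ch):
        return 0  # combining marks stack on the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth forms take two cells
    return 1


def display_width(text: str) -> int:
    """Columns needed to show text, as opposed to its length in memory."""
    return sum(char_width(ch) for ch in text)


def truncate_to_width(text: str, columns: int) -> str:
    """Keep the longest prefix that fits in the given number of columns."""
    used = 0
    kept = []
    for ch in text:
        width = char_width(ch)
        if used + width > columns:
            break
        kept.append(ch)
        used += width
    return "".join(kept)


def split_words(text: str) -> list:
    """Split on Unicode whitespace, not just the ASCII space characters."""
    # str.split() with no argument treats characters such as
    # U+3000 IDEOGRAPHIC SPACE as separators.
    return text.split()


if __name__ == "__main__":
    print(display_width("e\u0301"), len("e\u0301"))
    print(truncate_to_width("日本語", 5))
    print(split_words("日本語\u3000テキスト"))
```

The gap between len() and display_width() is the same distinction as the s.size() audit in the list above: one answers "how much memory", the other "how many columns".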
Re: charsets in debian/control
Steve Langasek writes, ... most of the letters you listed here are specific to the IPA, which would have no use at all in a control file as they're not part of the writing system of any natural language. Ok. Encodings and charsets are distinct concepts. Just because the file is specified in UTF-8 *encoding* does not mean we suddenly have to start coping with the entire Unicode character set. Right. Why, what a lovely straw man you have there. No comment. But yes, non-ASCII Latin-1 chars should not be given special status over the national chars found in other languages spoken by project members. Debian should be using either ASCII, or Unicode; standardizing on Latin-1 makes no sense in a global project. True. Look, Steve: mild abuse aside, I agree with you in every particular. Nevertheless, I would respectfully suggest that your criticism underscores my point, which regards the monstrous increase in complexity which the full Unicode standard represents. Consider. Is it a bug if Readline cannot echo full bidirectional input? If Dselect does not appreciate all the non-spacing characters? If Less does not regard Tibetan subjoined letters? (This is my Tibetan straw man.) Undoubtedly one might observe that the Tibetan problem were not really a problem with Less but rather with some underlying library, but this misses the point---or rather again it underscores the point. Unicode solves what for many of us was not a problem by creating an entirely new class of problems. For example, it requires us to be particular about how we tag our e-mail attachments... ... to properly declare the character set on the non-ASCII mails you send. We can perhaps be pardoned for feeling a little grumpy about this. Am I arguing to jettison Unicode? No; to the partial extent that I had been arguing it earlier in the thread, you, Peter, Daniel and Matthew have changed my mind. 
However, the typical roster of skills one masters in contributing broadly to Debian development is already awesome: C, C++, CPP, Make, Perl, Python, Autoconf, CVS, Shell, Glibc, System calls, /proc, IPC, sockets, Sed, Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline, Ncurses, TeX, Postscript, Groff, XML, assembly, Flex, Bison, ORB, Lisp, Dpkg, PAM, Xlibs, Tk, GTK, SysVInit, Debconf, ELF, etc.---not to mention the use of the English language at a sophisticated technical level. UTF-8 is neat, but I do not really like Unicode (you may have noticed this). Seeking essential simplicity, I would prefer to keep the full hairy overgrown Unicode standard from the typical Debian roster of development skills. Wouldn't you? -- Thaddeus H. Black 508 Nellie's Cave Road Blacksburg, Virginia 24060, USA +1 540 961 0920, [EMAIL PROTECTED] pgpoTit3xtAci.pgp Description: PGP signature
Re: charsets in debian/control
On Dec 07, Thaddeus H. Black [EMAIL PROTECTED] wrote: UTF-8 is neat, but I do not really like Unicode (you may Actually you do not even understand it, because this sentence is meaningless. -- ciao, | Marco | [9639 coubl1Ib61SmA] signature.asc Description: Digital signature
Re: charsets in debian/control
[Thaddeus H. Black] UTF-8 is neat, but I do not really like Unicode (you may [Marco d'Itri] Actually you do not even understand it, because this sentence is meaningless. Perhaps he is aware of the difference between Unicode and ISO-10646? UTF-8 is an encoding of ISO-10646.
RE: charsets in debian/control
Thaddeus H. Black wrote: However, the typical roster of skills one masters in contributing broadly to Debian development is already awesome: C, C++, CPP, Make, Perl, Python, Autoconf, CVS, Shell, Glibc, System calls, /proc, IPC, sockets, Sed, Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline, Ncurses, TeX, Postscript, Groff, XML, assembly, Flex, Bison, ORB, Lisp, Dpkg, PAM, Xlibs, Tk, GTK, SysVInit, Debconf, ELF, etc.---not to mention the use of the English language at a sophisticated technical level. Pardon me, but I only know 18 of the 40 items you mentioned, but I don't have a problem writing software for Debian or Linux in general. (Some) developers having to learn (parts of) Unicode is not a _general_ problem, not the least because many already know it. It might be a problem for _you_in_particular_, because you do not know it and don't want to learn it. But that isn't a very good argument against applying a perhaps somewhat complex technology to Debian that's well suited for the job. Especially since many tools that today can't handle multibyte encodings (UTF-8/Unicode in particular) yet, will _have_to_ support it at some time in the future anyway. BTW, the understanding of Unicode isn't required for most tools, mostly the understanding of UTF-8 is sufficient, and UTF-8 is trivial. UTF-8 is neat, but I do not really like Unicode (you may have noticed this). You might like Bytext[1] better then. SCNR ;-) Seriously, I get the impression you don't like Unicode because _you_ don't need it. Seeking essential simplicity, I would prefer to keep the full hairy overgrown Unicode standard from the typical Debian roster of development skills. Wouldn't you? No, I wouldn't. References: 1. http://www.bytext.org
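To give a sense of how small the UTF-8 encoding itself is, here is a hand-written encoder for a single code point in Python. It is a sketch for illustration only; the function name is hypothetical, and real code would simply use the language's built-in encoder.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 by hand."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # ASCII is encoded as itself in a single byte.
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    for cp in (0x41, 0xE1, 0x0285, 0x20AC, 0xFF21):
        print(f"U+{cp:04X}", encode_utf8(cp).hex())
```

The lead byte announces the sequence length and every following byte starts with the bits 10, which is what lets byte-oriented code find character boundaries.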
Re: charsets in debian/control
On Sunday 05 December 2004 20.11, Goswin von Brederlow wrote: Any parser that accepts 8bit non-ascii chars will accept UTF-8 then. What remains is just making the UTF-8 chars visually correct then. And make sure that, where character strings are modified, the multibyte sequences are counted right and handled correctly. This should not be a big problem, but things like display code etc. must now be aware that character count == byte count no longer holds. -- vbi -- Immer ist der Mann ein junger Mann, der einem jungen Weibe wohl gefällt. -- Johann Wolfgang von Goethe (Nausikaa) pgpmkpF8H0upU.pgp Description: PGP signature
Re: charsets in debian/control
Daniel Burrows [EMAIL PROTECTED] writes: On Sunday 05 December 2004 03:32 pm, Jose Carlos Garcia Sogo wrote: Would Peter permit me a mild dissent? I prefer Latin-1. Reason: I can recognize and distinguish Latin-1 characters, even when I do not always understand the words they spell. Recognizing and distinguishing the characters is important to me. And not just to me. Imagine the dismay of a Korean user trying to read Arabic script in a control file. But the only field in UTF8 should be Maintainer, and that field should have (IMHO) also a roman transliterate for the name, if you don't use a latin charset (Greek, Arabic, Japanese, Chinese...) Well, when aptitude gets UTF8 support, it'll decode all the control fields that are mainly meant for human consumption: that means at least Description in addition to the Maintainer field, and maybe also Section. I think the only field in UTF-8 in the main (english) Packages file should be the maintainer field. There might be some discussion about allowing the packages name in the description to be native too but I wouldn't like that. Now, for translated Packages files, like a chinese one, only the description should change. I don't see any reason to limit ourselves in the long term by sticking to Latin1 (or ASCII) just because none of us can read all of the languages that are available in the extended UTF8 namespace. If we want people to stick to certain subsets of UTF8, that should be determined in Policy, not the packaging software. The software has to be able to work with translated Packages file. It would be quite unacceptable for aptitude to show gibberish in the description for a chinese user with a translated Packages file. So there realy should be no limit there. But limiting each Packages file to the subset of characters recognisable in that language sounds like a good idea. Chinese user probably don't want japanese in their Packages file and vice versa. 
Seeing that english is the common language in Debian I would also say that an english description is a must. If you want a practical concern (aside from, say, a general suspicion of building policy into software tools), consider these cases: - Someone wants to translate the Description fields of all packages in Debian into Chinese or Arabic. What will they do if the package tools only support Latin-1? - Someone wants to use the Debian packaging tools to create a new distribution for use in China. Again, what will they do if the package tools only support Latin-1? Daniel You are absolutely right, the tools should cope with everything with the possible exception of warning/rejecting policy violations on upload. MfG Goswin
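The split suggested above, tools that cope with everything plus a separate policy check at upload time, could be sketched like this in Python. The set of fields allowed to carry non-ASCII text is an assumption drawn from this discussion, not actual Debian Policy, and the parser is a deliberately simplified stand-in for the real control-file format.

```python
# Fields allowed to contain non-ASCII text. This is an assumed policy
# for illustration only, not what Debian Policy specifies.
ALLOW_NON_ASCII = {"Maintainer", "Description"}


def parse_stanza(raw: bytes) -> dict:
    """Parse one control-file paragraph, decoding it as UTF-8."""
    fields = {}
    current = None
    for line in raw.decode("utf-8").split("\n"):
        if not line.strip():
            continue  # ignore blank lines in this single-paragraph sketch
        if line[0] in " \t":
            # A line starting with whitespace continues the previous field.
            if current is None:
                raise ValueError("continuation line before any field")
            fields[current] += "\n" + line[1:]
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"malformed line: {line!r}")
            current = name
            fields[current] = value.strip()
    return fields


def non_ascii_violations(fields: dict) -> list:
    """List fields with non-ASCII text outside the allowed set."""
    return sorted(name for name, value in fields.items()
                  if name not in ALLOW_NON_ASCII and not value.isascii())


if __name__ == "__main__":
    stanza = ("Package: hello\n"
              "Maintainer: José García <jose@example.org>\n"
              "Description: example package\n"
              " extended text\n").encode("utf-8")
    parsed = parse_stanza(stanza)
    print(parsed["Maintainer"], non_ascii_violations(parsed))
```

Keeping the check in a separate function means the parser itself never rejects valid UTF-8, and the policy can change without touching the parser.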
Re: charsets in debian/control
I would not disagree with Peter or Daniel. They are right in my view. However, consider the following Unicode characters:

025A LATIN SMALL LETTER SCHWA WITH HOOK
025E LATIN SMALL LETTER CLOSED REVERSED OPEN E
0261 LATIN SMALL LETTER SCRIPT G
0264 LATIN SMALL LETTER RAMS HORN
0267 LATIN SMALL LETTER HENG WITH HOOK
027A LATIN SMALL LETTER TURNED R WITH LONG LEG
027F LATIN SMALL LETTER REVERSED R WITH FISHHOOK
0285 LATIN SMALL LETTER SQUAT REVERSED ESH
0295 LATIN LETTER PHARYNGEAL VOICED FRICATIVE
02A2 LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE
FF21 FULLWIDTH LATIN CAPITAL LETTER A

We are not speaking of a stricken Polish L, a double-accented Magyar O, or a euro sign. We are speaking of... well, to tell the truth I have no idea what these letters are. Have you? More to the point, should you and I learn to recognize such letters? Should we expect basic Latin terminal fonts to cover them? Is it reasonable to marginalize the á's and ü's of Latin-1 by lumping them with the squat reversed esh?

Now, the squat reversed esh as such does not bother me. If you show me a picture of it and tell me what language it is for and what sound it makes, then I will know it. What is important to me is to preserve the simple Roman conception of the general-use alphabet in a reasonable way---not for communication in a particular language, but rather for clear, compact terminal representation and for general international use. Inherent in the concept are the relative fewness of the available characters and the predictable way they are arrayed across a page from left to right. In my view, a terminal which cannot correctly display the á is somewhat broken, and a user who does not recognize the á probably should learn. I would not say the same with respect to the squat reversed esh. However, this is just my view. -- Thaddeus H. Black 508 Nellie's Cave Road Blacksburg, Virginia 24060, USA +1 540 961 0920, [EMAIL PROTECTED] pgpC3wA9A3ASF.pgp Description: PGP signature
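For anyone who wants to look up characters like those listed above, or check which ones a Latin-1 terminal could show, a short Python sketch using the standard unicodedata module follows. The helper names are made up for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'XXXX NAME' for one character, in code-chart style."""
    return f"{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def fits_latin1(ch: str) -> bool:
    """True if the character can be encoded in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # Two letters from the list, then the other characters the post mentions.
    for ch in "\u0285\uFF21áüłő€":
        print(describe(ch), "| Latin-1:", fits_latin1(ch))
```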
Re: charsets in debian/control
Thaddeus H. Black wrote: 025A LATIN SMALL LETTER SCHWA WITH HOOK 025E LATIN SMALL LETTER CLOSED REVERSED OPEN E 0261 LATIN SMALL LETTER SCRIPT G 0264 LATIN SMALL LETTER RAMS HORN 0267 LATIN SMALL LETTER HENG WITH HOOK 027A LATIN SMALL LETTER TURNED R WITH LONG LEG 027F LATIN SMALL LETTER REVERSED R WITH FISHHOOK 0285 LATIN SMALL LETTER SQUAT REVERSED ESH 0295 LATIN LETTER PHARYNGEAL VOICED FRICATIVE 02A2 LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE FF21 FULLWIDTH LATIN CAPITAL LETTER A I have no idea what these letters are. The Recording Artist Formerly Known as Prince :-) Bruce
Re: charsets in debian/control
Thaddeus H. Black [EMAIL PROTECTED] wrote: We are not speaking of a stricken Polish L, a double-accented Magyar O, or a euro sign. We are speaking of... well, to tell the truth I have no idea what these letters are. Have you? More to the point, should you and I learn to recognize such letters? Should we expect basic Latin terminal fonts to cover them? Is it reasonable to marginalize the á's and ü's of Latin-1 by lumping them with the squat reversed esh? Why is it important that you recognise them? I can't see any reasonable argument against UTF-8 that doesn't also remove anything other than ascii. In my view, a terminal which cannot correctly display the á is somewhat broken, and a user who does not recognize the á probably should learn. I would not say the same with respect to the squat reversed esh. However, this is just my view. Defining the character set as utf-8 means that any non-unicode capable application is going to have issues, yes. But so does defining the character set as anything other than ascii - people using a non-8859-1 terminal encoding won't be able to read any of the non-ascii characters in the file. The only two character sets that make any sense whatsoever in the Unix world are ascii and UTF-8. I'd be happy with either, but I've got a fairly anglo-centric viewpoint. I can see a strong argument for maintainers actually being allowed to spell their name properly, even if pragmatism suggests that we want a latinised version available as well. -- Matthew Garrett | [EMAIL PROTECTED]
Re: charsets in debian/control
Andreas Barth [EMAIL PROTECTED] writes: Though I agree on your last statement (and please, remember, I'm from Germany where non-ASCII-characters are also in common use), I still consider that UTF-8-not-ASCII has not finally reached ok, but it's on the way to it (and I consider this a good thing). I've been using Debian with UTF-8-only locales for over 12 months now. I now consider it fine for general use, with respect to terminal and application support. Unlike a couple of years ago, most things work perfectly. The only things I've currently found lacking are:

- No UTF-8 console keymaps
- Some broken libraries, e.g. GTK+ 1.2 [obsolete]
- I can't paste UTF-8 into emacs (perhaps a problem in my .emacs)

I think going to UTF-8 as the default locale charmap for all locales is a feasible goal for etch, as is recoding everything to UTF-8 (where it makes sense). Regards, Roger

-- Roger Leigh Printing on GNU/Linux? http://gimp-print.sourceforge.net/ Debian GNU/Linux http://www.debian.org/ GPG Public Key: 0x25BFB848. Please sign and encrypt your mail.
Re: charsets in debian/control
On Mon, Dec 06, 2004 at 06:58:10PM +, Thaddeus H. Black wrote: I would not disagree with Peter or Daniel. They are right in my view. However, consider the following Unicode characters: 025A LATIN SMALL LETTER SCHWA WITH HOOK 025E LATIN SMALL LETTER CLOSED REVERSED OPEN E 0261 LATIN SMALL LETTER SCRIPT G 0264 LATIN SMALL LETTER RAMS HORN 0267 LATIN SMALL LETTER HENG WITH HOOK 027A LATIN SMALL LETTER TURNED R WITH LONG LEG 027F LATIN SMALL LETTER REVERSED R WITH FISHHOOK 0285 LATIN SMALL LETTER SQUAT REVERSED ESH 0295 LATIN LETTER PHARYNGEAL VOICED FRICATIVE 02A2 LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE FF21 FULLWIDTH LATIN CAPITAL LETTER A We are not speaking of a stricken Polish L, a double-accented Magyar O, or a euro sign. Indeed we're not; most of the letters you listed here are specific to the IPA, which would have no use at all in a control file as they're not part of the writing system of any natural language. Encodings and charsets are distinct concepts. Just because the file is specified in UTF-8 *encoding* does not mean we suddenly have to start coping with the entire Unicode character set. OTOH, the Unicode charset is also the only one we have that is a superset of iso8859-1, iso8859-2, and iso8859-15, so if you want to be able to *use* the , the , and the in the same file together with and , the only sensible way to do so is to specify a UTF-8 encoding. We are speaking of... well, to tell the truth I have no idea what these letters are. Have you? More to the point, should you and I learn to recognize such letters? Should we expect basic Latin terminal fonts to cover them? Is it reasonable to marginalize the ?'s and ?'s of Latin-1 by lumping them with the squat reversed esh? Why, what a lovely straw man you have there. But yes, non-ASCII Latin-1 chars should not be given special status over the national chars found in other languages spoken by project members. 
Debian should be using either ASCII, or Unicode; standardizing on Latin-1 makes no sense in a global project. In my view, a terminal which cannot correctly display the ? is somewhat broken, and a user who does not recognize the ? probably should learn. I would not say the same with respect to the squat reversed esh. However, this is just my view. Mmm-hmm. Content-Type: text/plain; charset=unknown-8bit Your opinion about which charset to use for Debian files would carry more weight with me if you had enough experience with such things to properly declare the character set on the non-ASCII mails you send. -- Steve Langasek postmodern programmer
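Steve's point that Unicode is the one charset covering iso8859-1, iso8859-2 and iso8859-15 together can be checked with a small Python sketch. The three sample characters (a with acute, l with stroke, euro sign) and the helper names are chosen here purely for illustration; they are not taken from the thread.

```python
# Sketch: pick one character from each 8-bit charset, then ask which
# encodings can represent all three in a single piece of text.

SAMPLES = {
    "iso8859-1": b"\xe1",   # a with acute
    "iso8859-2": b"\xb3",   # l with stroke
    "iso8859-15": b"\xa4",  # euro sign (a currency sign in iso8859-1)
}


def decode_samples(samples):
    """Decode each sample byte with its own charset and return the characters."""
    return {charset: raw.decode(charset) for charset, raw in samples.items()}


def charsets_covering(text, charsets):
    """Return the charsets from `charsets` that can encode all of `text`."""
    covering = []
    for charset in charsets:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        covering.append(charset)
    return covering


chars = decode_samples(SAMPLES)
mixed = "".join(chars.values())
# Print code points rather than glyphs so the output is terminal-independent.
print([f"U+{ord(c):04X}" for c in mixed])
print(charsets_covering(mixed, list(SAMPLES) + ["utf-8"]))
```

With these samples, none of the three 8-bit charsets can hold the mixed string on its own; only the Unicode encoding can.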
Re: charsets in debian/control
On Mon, Dec 06, 2004 at 06:53:42PM -0800, Steve Langasek [EMAIL PROTECTED] wrote: But yes, non-ASCII Latin-1 chars should not be given special status over the national chars found in other languages spoken by project members. Debian should be using either ASCII, or Unicode; standardizing on Latin-1 makes no sense in a global project. Actually Latin-9 would be better, it doesn't contain the useless ¤. Mike
Re: charsets in debian/control
On Tue, Dec 07, 2004 at 12:04:56PM +0900, Mike Hommey wrote: On Mon, Dec 06, 2004 at 06:53:42PM -0800, Steve Langasek [EMAIL PROTECTED] wrote: But yes, non-ASCII Latin-1 chars should not be given special status over the national chars found in other languages spoken by project members. Debian should be using either ASCII, or Unicode; standardizing on Latin-1 makes no sense in a global project. Actually Latin-9 would be better, it doesn't contain the useless ¤. Standardizing on any 8-bit character set makes no sense in a global project. :P -- Steve Langasek postmodern programmer
Re: charsets in debian/control
[Matthew Garrett] Defining the character set as utf-8 means that any non-unicode capable application is going to have issues, yes. Postulate an app that is ignorant of character sets - we'll call it aptitude. Fixing it to make it accept utf-8 and spit out the correct encoding for its LC_CTYPE is no harder than fixing it to make it accept iso-8859-1 and spit out the correct encoding for its LC_CTYPE. And if the app already deals with charset conversions but assumes iso-8859-1 input, then it's trivial to fix it to assume utf-8 input. Peter
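The fix Peter describes (read the field as UTF-8, write it out in whatever charset the user's locale asks for) might look roughly like the following Python sketch. It is not aptitude's code: the function name, the sample Maintainer line and the choice to substitute '?' for unrepresentable characters are all assumptions made for illustration.

```python
import locale


def recode_for_terminal(raw, target_encoding=None):
    """Decode UTF-8 control-file bytes and re-encode them for the terminal.

    Characters the target charset cannot represent are replaced with '?'.
    """
    if target_encoding is None:
        # Charset implied by the user's locale settings (LC_CTYPE and friends).
        target_encoding = locale.getpreferredencoding(False)
    text = raw.decode("utf-8")
    return text.encode(target_encoding, errors="replace")


# Hypothetical field value, stored as UTF-8.
field = b"Maintainer: Jos\xc3\xa9 Example <jose@example.org>"
print(recode_for_terminal(field, "iso8859-1"))
print(recode_for_terminal(field, "ascii"))
```

Changing the assumed input charset means changing the single `decode` argument, which is the sense in which switching an existing iso-8859-1 assumption to utf-8 is a small change.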
charsets in debian/control
We seem to be moving to a de facto standard of UTF-8 for non-ASCII characters in debian/control files. This is not specified in Policy [1], but for hopefully obvious reasons, consistency is a Good Thing, and UTF-8 seems to be the best solution for this sort of thing. In my sid control files, I see 841 lines with non-ASCII characters, mostly (761 lines) in Maintainer and Uploaders fields: perl -ne 'print if m/[\x80-\xff]/' /var/lib/apt/lists/* | wc -l Of these, 747 lines are UTF-8 and 94 lines are not.[2] I hate to suggest a mass bug filing (33 source packages), since it's a mere de facto standard. And I'm certainly not in the mood to campaign for a Policy amendment. But it would be a Good Thing to aim for consistency here. Current UI tools (dpkg, dselect, apt-cache, aptitude) seem to know nothing about character sets, and just pass characters verbatim to the terminal, but one can easily imagine a tool that would convert to a user's local character set when possible. I suggest that the affected source packages[3] be run through the command 'iconv -f ORIGINAL_CHARSET -t utf-8' as soon as convenient. Would people support a mass bug at minor severity? Peter [1] Note that UTF-8 *is* recommended for debian/changelog. http://www.debian.org/doc/debian-policy/ap-pkg-sourcepkg.html#s-pkg-dpkgchangelog [2] It is easy to tell if text is UTF-8 or not; I use the exit status of iconv -f utf-8 -t utf-8. This gives very few false positives, because UTF-8 has a very strict format. 
[3]
abcm2ps                     freecraft                  maint-guide
ap-utils                    gl-117                     movixmaker-2
appunti-informatica-libera  glade-perl                 mozilla-locale-hu
ayuda                       gnustep-icons              myspell-sv
boa                         gridlock                   ntfsdoc
boa-constructor             gtkdiskfree                pdftohtml
bombermaze                  gtodo                      pdp
bonsai                      iris                       pyca
cadubi                      itcl3                      pyro
cantus                      kernel-patch-2.4.26-s390   pythoncad
coq-doc                     kernel-patch-2.4.27-s390   rat
crafted                     krb4                       strategoxt
darkstat                    lg-issue46                 sympa
ddclient                    libcgi-validate-perl       syslog-ng
doc-linux-html-pt           libconfig-general-perl     tuxeyes
doc-linux-text-pt           libexporter-lite-perl      unac
drpython                    libtext-unaccent-perl      wmblob
elmo                        libuniversal-exports-perl  wmnetmon
fcmp                        linux-ntfs                 wordtrans
fortunes-fr                 linux-tutorial-es          wprint
fortunes-it
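The counting and classification described in this message (perl to find lines with non-ASCII bytes, then the exit status of iconv -f utf-8 -t utf-8) can be sketched in Python. This is only an illustration, not the script behind the numbers quoted: the helper names and the sample Maintainer lines are made up, and Python's strict UTF-8 decoder stands in for iconv.

```python
import re

# Any byte with the high bit set, as in the perl one-liner m/[\x80-\xff]/.
NON_ASCII = re.compile(rb"[\x80-\xff]")


def is_utf8(raw):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def classify_lines(lines):
    """Count non-ASCII lines that are valid UTF-8 and those that are not."""
    utf8 = other = 0
    for line in lines:
        if not NON_ASCII.search(line):
            continue  # pure ASCII, not interesting here
        if is_utf8(line):
            utf8 += 1
        else:
            other += 1
    return utf8, other


sample = [
    b"Package: example\n",
    b"Maintainer: Jos\xc3\xa9 Example <jose@example.org>\n",  # UTF-8
    b"Maintainer: Jos\xe9 Example <jose@example.org>\n",      # Latin-1
]
print(classify_lines(sample))
```

To try it on real data, open each file under /var/lib/apt/lists/ in binary mode and pass the file object as `lines`.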
Re: charsets in debian/control
[Peter Samuelson] I suggest that the affected source packages[3] be run through the command 'iconv -f ORIGINAL_CHARSET -t utf-8' as soon as convenient. Ehhh, I see I have already ruined my credibility by pasting the wrong source package list. The real list is much shorter. Apologies, Peter

ap-utils                    glade-perl                 maint-guide
appunti-informatica-libera  iris                       myspell-sv
ayuda                       itcl3                      pdp
cadubi                      kernel-patch-2.4.26-s390   pyca
cantus                      kernel-patch-2.4.27-s390   rat
crafted                     krb4                       strategoxt
doc-linux-html-pt           lg-issue46                 sympa
doc-linux-text-pt           libcgi-validate-perl       syslog-ng
elmo                        libexporter-lite-perl      wmnetmon
fcmp                        libuniversal-exports-perl  wordtrans
fortunes-it                 linux-tutorial-es          wprint
Re: charsets in debian/control
[Peter Samuelson] We seem to be moving to a de facto standard of UTF-8 for non-ASCII characters in debian/control files. This is not specified in Policy [1], but for hopefully obvious reasons, consistency is a Good Thing, and UTF-8 seems to be the best solution for this sort of thing. Some will argue that only ASCII is acceptable in debian/control files. I am not one of these. I agree that we should standardise on UTF-8 for both the changelog and the control file (and the copyright file, for the upstream author and package author names). We need to be able to correctly represent the names of people, and it cannot be done using ASCII only. Good to see that most packages already use UTF-8. I hope the packages using other charsets can be converted to UTF-8 as soon as possible.
Re: charsets in debian/control
* Petter Reinholdtsen ([EMAIL PROTECTED]) [041205 11:30]: [Peter Samuelson] We seem to be moving to a de facto standard of UTF-8 for non-ASCII characters in debian/control files. This is not specified in Policy [1], but for hopefully obvious reasons, consistency is a Good Thing, and UTF-8 seems to be the best solution for this sort of thing. Some will argue that only ASCII is acceptable in debian/control files. I am not one of these. I agree that we should standardise on UTF-8 for both the changelog and the control file (and the copyright file, for the upstream author and package author names). We need to be able to correctly represent the names of people, and it cannot be done using ASCII only. Good to see that most packages already use UTF-8. I hope the packages using other charsets can be converted to UTF-8 as soon as possible. There are different ways to view that, and there is a policy bug about that very topic. I think most of us agree that non-UTF-8 characters are not a good idea (please note that UTF-8 is a superset of ASCII). For some places (like package names), I think most of us even agree that only ASCII characters should be used. Also, there is the proposal that in other fields (i.e. names), a translation should (also) be used if the characters are not in some basic classes (more or less: ASCII plus ASCII-similar letters). So, I personally consider non-UTF-8 characters a bug, and UTF-8-not-ASCII on the way from bug to allowed. Cheers, Andi -- http://home.arcor.de/andreas-barth/ PGP 1024/89FB5CE5 DC F1 85 6D A6 45 9C 0F 3B BE F1 D0 C5 D1 D9 0C
Re: charsets in debian/control
On Sunday 5 December 2004 at 11:43 +0100, Andreas Barth wrote: I think most of us agree that non-UTF-8-characters are not a good idea (please note the UTF-8-characters is a superset of ASCII). For some places (like package names), I think most of us even agree that only ASCII-characters should be used. Also, there is the proposal that in other fields (i.e. names), an translation should (also) be used if the characters are not in some basic classes (more or less: ASCII plus ASCII-similar letters). So, I personally consider non-UTF-8-characters an bug, and UTF-8-not-ASCII on the way from bug to allowed. Many of us have names that can't be written using ASCII. Furthermore, the Debian tools need consistency between the developer name in the changelog and the Maintainer/Uploaders fields in the control file. The only way for these developers to have a policy-compliant changelog without having their uploads considered as NMUs is to encode the control file in UTF-8. -- Josselin Mouette [EMAIL PROTECTED] Debian GNU/Linux -- The power of freedom
Re: charsets in debian/control
* Josselin Mouette ([EMAIL PROTECTED]) [041205 13:05]: Le dimanche 05 décembre 2004 à 11:43 +0100, Andreas Barth a écrit : I think most of us agree that non-UTF-8-characters are not a good idea (please note the UTF-8-characters is a superset of ASCII). For some places (like package names), I think most of us even agree that only ASCII-characters should be used. Also, there is the proposal that in other fields (i.e. names), an translation should (also) be used if the characters are not in some basic classes (more or less: ASCII plus ASCII-similar letters). So, I personally consider non-UTF-8-characters an bug, and UTF-8-not-ASCII on the way from bug to allowed. Many of us have names that can't be written using ASCII. Furthermore, the Debian tools need consistency between the developer name in the changelog and the Maintainer/Uploaders fields in the control file. The only way for these developers to have a policy-compliant changelog without having their uploads considered as NMUs is to encode the control file in UTF-8. Though I agree on your last statement (and please, remember, I'm from germany where non-ASCII-characters are also in common use), I still consider that UTF-8-not-ASCII has not finally reached ok, but it's on the way to it (and I consider this a good thing). Cheers, Andi -- http://home.arcor.de/andreas-barth/ PGP 1024/89FB5CE5 DC F1 85 6D A6 45 9C 0F 3B BE F1 D0 C5 D1 D9 0C
Re: charsets in debian/control
On Sun, Dec 05, 2004 at 01:01:16PM +0100, Josselin Mouette wrote: Many of us have names that can't be written using ASCII. Well, they usually can be transliterated, can't they? Transliterating is somewhat of a kludge (and I think in most cases UTF-8 is a much better solution); OTOH I'd rapidly get confused in the list of Japanese maintainers if their names weren't transliterated. /* Steinar */ -- Homepage: http://www.sesse.net/
Re: charsets in debian/control
On Dec 05, Peter Samuelson [EMAIL PROTECTED] wrote: Would people support a mass bug at minor severity? Make it normal. -- ciao, Marco
Re: charsets in debian/control
On Dec 05, Steinar H. Gunderson [EMAIL PROTECTED] wrote: Transliterating is somewhat of a kludge (and I think in most cases UTF-8 is a much better solution); OTOH I'd rapidly get confused in the list of Japanese maintainers if their names weren't transliterated. This is a different issue: in an international environment, people who write their name in a non-Latin script should also add a romanized version. -- ciao, Marco
Re: charsets in debian/control
[Steinar H. Gunderson] Transliterating is somewhat of a kludge (and I think in most cases UTF-8 is a much better solution); OTOH I'd rapidly get confused in the list of Japanese maintainers if their names weren't transliterated. I think it's a valid choice for a maintainer who natively speaks a language that does not use the Roman alphabet, whether to present one's name in the preferred form, or a Roman transliteration which will be easier for most developers to identify. It's an asymmetric situation, in that people interacting with Debian development *already* have to know a modicum of English - and thus, non-ASCII variations on the Roman alphabet should not confound most of us in the way other writing systems might. In either case, at least the email address will be a clue, and a point of contact. Peter
Re: charsets in debian/control
[Marco d'Itri] Would people support a mass bug at minor severity? Make it normal. Given that Policy recommends debian/changelog to be utf-8, coupled with the observation (which I had not thought of) that various tools may require a maintainer's name in debian/control and debian/changelog to be the same - I'd agree. I'll wait for more feedback before doing it, though. One thing I don't wish for is a public flogging for filing an unjustified mass bug. Peter
Re: charsets in debian/control
[Peter Samuelson] I suggest that the affected source packages[3] be run through the command 'iconv -f ORIGINAL_CHARSET -t utf-8' as soon as convenient. No, as you noticed this list is short and can be processed in a more elegant manner. For example, the sympa description uses a no-break space where a normal space would suffice, so telling the maintainer to convert to UTF-8 is not a good idea. I filed several bug reports months ago for packages having non-ASCII characters in their description; 3 are closed (#245592, #245594, #245596) and 2 are still open: itcl3 (#242690) and krb4 (#242694). IMO such bug reports are better than mass bug filing because the 3 closed bug reports did not switch to UTF-8 but converted to ASCII instead. This requires manual processing because the instructions are not identical for all bug reports. Denis
Re: charsets in debian/control
Josselin Mouette [EMAIL PROTECTED] writes: On Sunday 5 December 2004 at 11:43 +0100, Andreas Barth wrote: I think most of us agree that non-UTF-8-characters are not a good idea (please note the UTF-8-characters is a superset of ASCII). For some places (like package names), I think most of us even agree that only ASCII-characters should be used. Also, there is the proposal that in other fields (i.e. names), an translation should (also) be used if the characters are not in some basic classes (more or less: ASCII plus ASCII-similar letters). So, I personally consider non-UTF-8-characters an bug, and UTF-8-not-ASCII on the way from bug to allowed. Many of us have names that can't be written using ASCII. Furthermore, the Debian tools need consistency between the developer name in the changelog and the Maintainer/Uploaders fields in the control file. The only way for these developers to have a policy-compliant changelog without having their uploads considered as NMUs is to encode the control file in UTF-8. -- Josselin Mouette [EMAIL PROTECTED] Debian GNU/Linux -- The power of freedom Which means all control file, changelog file, changes file, Packages and Sources file parsing programs have to be truly converted to UTF-8. dpkg, apt, aptitude, dselect, apt-proxy, apt-cacher(?), debmirror, debpartial-mirror, DAK, cdebootstrap, ... I guess most just work by luck with the mixture we have now. We already had cdebootstrap crashes because of it (its parser was a bit stricter than the rest). On that note, how likely is it to hit a UTF-8 character encoding that contains a '\n'? Any non UTF-8 aware parser would assume a new line has started and get parse errors. MfG Goswin
Re: charsets in debian/control
On Sun, Dec 05, 2004 at 06:40:52PM +0100, Goswin von Brederlow wrote: On that note, how likely is it to hit a UTF-8 character encoding that contains a '\n'? Any non UTF-8 aware parser would assume a new line has started and get parse errors. 0% likely, guaranteed. UTF-8 is *designed* to be upwards compatible with plain ASCII. Every valid ASCII character has the same meaning in UTF-8. Every UTF-8 byte sequence for a non-ASCII character will not contain *any* ASCII characters. This is achieved by making sure that everything above plain ASCII has the high bit set, not just for the first byte, but for all of them. -- Bart.
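Bart's guarantee can be checked by brute force. The sketch below (function names invented for this example) encodes every non-ASCII code point as UTF-8 and confirms that each resulting byte has the high bit set, so none of them can be mistaken for '\n' or any other ASCII character.

```python
def utf8_bytes_all_high(char):
    """True if every byte of the UTF-8 encoding of `char` is >= 0x80."""
    return all(byte >= 0x80 for byte in char.encode("utf-8"))


def check_range(start, stop):
    """Check every encodable non-ASCII code point in [start, stop)."""
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if not utf8_bytes_all_high(chr(cp)):
            return False
    return True


# Every code point above ASCII, up to the end of the Unicode range.
print(check_range(0x80, 0x110000))
# A newline byte never appears inside the encoding of non-ASCII text.
print(b"\n" in "\u0142\u00e1\u20ac".encode("utf-8"))
```

The first check takes a moment because it walks the whole Unicode range; a byte-oriented parser that splits on 0x0A is therefore safe on UTF-8 input.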
Re: charsets in debian/control
Bart Schuller [EMAIL PROTECTED] writes: On Sun, Dec 05, 2004 at 06:40:52PM +0100, Goswin von Brederlow wrote: On that note, how likely is it to hit a UTF-8 character encoding that contains a '\n'? Any non UTF-8 aware parser would assume a new line has started and get parse errors. 0% likely, guaranteed. UTF-8 is *designed* to be upwards compatible with plain ASCII. Every valid ASCII character has the same meaning in UTF-8. Every UTF-8 byte sequence for a non-ASCII character will not contain *any* ASCII characters. This is achieved by making sure that everything above plain ASCII has the high bit set, not just for the first byte, but for all of them. Ok, so no problems there. Any parser that accepts 8-bit non-ASCII chars will accept UTF-8 then. What remains is just making the UTF-8 chars display correctly. MfG Goswin
Re: charsets in debian/control
On Sun, Dec 05, 2004 at 06:40:52PM +0100, Goswin von Brederlow wrote: On that note, how likely is it to hit a UTF-8 character encoding that contains a '\n'? Any non UTF-8 aware parser would assume a new line has started and get parse errors. That's no problem. The only problem you have with UTF-8 is that a UTF-8 reader will see illegal byte sequences in a traditionally encoded (latin1) file. Greetings Bernd -- [EMAIL PROTECTED] http://www.eckes.org/ When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl!
Re: charsets in debian/control
Peter Samuelson writes, We seem to be moving to a de facto standard of UTF-8 for non-ASCII characters in debian/control files. This is not specified in Policy [1], but for hopefully obvious reasons, consistency is a Good Thing, and UTF-8 seems to be the best solution for this sort of thing. Would Peter permit me a mild dissent? I prefer Latin-1. Reason: I can recognize and distinguish Latin-1 characters, even when I do not always understand the words they spell. Recognizing and distinguishing the characters is important to me. And not just to me. Imagine the dismay of a Korean user trying to read Arabic script in a control file. Well, the Korean user can speak for himself. Speaking for myself, ASCII is a little too limited. There is a proper balance to strike, and to me Latin-1 though imperfect is about right. Latin-1 is wrong if you speak Polish, of course, and even if you don't speak Polish, Latin-1's lack of a euro sign is slightly annoying; but, well, I admit that I do not really mind precisely where the line is drawn, so long as the general simple Latin concept of writing is preserved and the number of distinct characters represented is kept within reasonable bounds. Regarding only Latin, Unicode recognizes over eight hundred Latin characters: far too many for me. This is not considering Cyrillic or Greek; nor even beginning to think of the numerous very different writing systems of a wider non-Western world---worthy writing systems which I cannot even transcribe much less read---beautiful writing systems in which the basic Western left-to-right, character-based, diacritically marked semantics are not preserved. For the Debian Project, madness lies that way. If Latin-1 is established and used if not universally loved, then probably we should limit our usage to it. I do not deny that Latin-1 represents all the languages I can read, and that this fact may color my view. Nevertheless to me a source written in Chinese is effectively non-free. 
It might as well be a compiled binary blob. Actually, UTF-8 encoding as such is fine. It uses extra 0xC2 and 0xC3 lead bytes for the non-ASCII Latin-1 characters (see utf-8(7)), but this does not matter much. The full UTF-8 domain has numerous subtle semantics which I should like to be able to avoid, however. UTF-8 is for Unicode, which is to allow the representation of the languages of the world in their own scripts. While highly useful in its own domain, this has little to do with Debian control files, where we probably do not want the languages of the world represented in any event. I would tend to recommend that untranslated Debian work, especially control files, be limited to Latin-1. If the Japanese maintainers uncomplainingly transliterate their names to Latin-1 for our benefit, then probably the rest of us should do likewise. Whether the Latin-1 is C2/C3-encoded as UTF-8, however, is a matter of indifference to me. -- Thaddeus H. Black 508 Nellie's Cave Road Blacksburg, Virginia 24060, USA +1 540 961 0920, [EMAIL PROTECTED]
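The byte cost of carrying Latin-1 text as UTF-8 is easy to inspect. The sketch below (helper name invented here) encodes every code point from U+0080 to U+00FF and collects the first byte of each result. As far as I can tell, each such character takes exactly two bytes, and the lead bytes that actually occur are 0xC2 and 0xC3; 0xC0 and 0xC1 never appear in valid UTF-8.

```python
def utf8_lead_bytes_for_latin1():
    """Return the set of first bytes used when U+0080..U+00FF are encoded as UTF-8."""
    leads = set()
    for cp in range(0x80, 0x100):
        encoded = chr(cp).encode("utf-8")
        assert len(encoded) == 2  # each of these takes exactly two bytes
        leads.add(encoded[0])
    return leads


print(sorted(hex(b) for b in utf8_lead_bytes_for_latin1()))
# The Latin-1 byte for a-acute, re-encoded as UTF-8.
print(b"\xe1".decode("iso8859-1").encode("utf-8"))
```

So a Latin-1 file grows by one byte per non-ASCII character when recoded, and ASCII text is unchanged.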
Re: charsets in debian/control
On Sun, 05-12-2004 at 20:16 +, Thaddeus H. Black wrote: Peter Samuelson writes, We seem to be moving to a de facto standard of UTF-8 for non-ASCII characters in debian/control files. This is not specified in Policy [1], but for hopefully obvious reasons, consistency is a Good Thing, and UTF-8 seems to be the best solution for this sort of thing. Would Peter permit me a mild dissent? I prefer Latin-1. Reason: I can recognize and distinguish Latin-1 characters, even when I do not always understand the words they spell. Recognizing and distinguishing the characters is important to me. And not just to me. Imagine the dismay of a Korean user trying to read Arabic script in a control file. But the only field in UTF8 should be Maintainer, and that field should (IMHO) also have a Roman transliteration of the name if you don't use a Latin script (Greek, Arabic, Japanese, Chinese...). So I don't really see your point here. That you can write your name in your native alphabet doesn't mean that from now on people could write their rules files in Chinese, or whatever. Cheers, -- Jose Carlos Garcia Sogo [EMAIL PROTECTED]
Re: charsets in debian/control
On Sunday 05 December 2004 03:32 pm, Jose Carlos Garcia Sogo wrote: Would Peter permit me a mild dissent? I prefer Latin-1. Reason: I can recognize and distinguish Latin-1 characters, even when I do not always understand the words they spell. Recognizing and distinguishing the characters is important to me. And not just to me. Imagine the dismay of a Korean user trying to read Arabic script in a control file. But the only field in UTF8 should be Maintainer, and that field should have (IMHO) also a roman transliterate for the name, if you don't use a latin charset (Greek, Arabic, Japanese, Chinese...) Well, when aptitude gets UTF8 support, it'll decode all the control fields that are mainly meant for human consumption: that means at least Description in addition to the Maintainer field, and maybe also Section. I don't see any reason to limit ourselves in the long term by sticking to Latin1 (or ASCII) just because none of us can read all of the languages that are available in the extended UTF8 namespace. If we want people to stick to certain subsets of UTF8, that should be determined in Policy, not the packaging software. If you want a practical concern (aside from, say, a general suspicion of building policy into software tools), consider these cases: - Someone wants to translate the Description fields of all packages in Debian into Chinese or Arabic. What will they do if the package tools only support Latin-1? - Someone wants to use the Debian packaging tools to create a new distribution for use in China. Again, what will they do if the package tools only support Latin-1? Daniel -- /--- Daniel Burrows [EMAIL PROTECTED] --\ |We've got nothing to fear but the stuff that we're| | afraid of! -- Fluble | \--- Be like the kid in the movie! Play chess! -- http://www.uschess.org --/ pgpPb1jhTqyTk.pgp Description: PGP signature
Re: charsets in debian/control
On Sun, Dec 05, 2004 at 04:42:24PM -0500, Daniel Burrows wrote: On Sunday 05 December 2004 03:32 pm, Jose Carlos Garcia Sogo wrote: Would Peter permit me a mild dissent? I prefer Latin-1. Reason: I can recognize and distinguish Latin-1 characters, even when I do not always understand the words they spell. Recognizing and distinguishing the characters is important to me. And not just to me. Imagine the dismay of a Korean user trying to read Arabic script in a control file. But the only field in UTF8 should be Maintainer, and that field should have (IMHO) also a roman transliterate for the name, if you don't use a latin charset (Greek, Arabic, Japanese, Chinese...) Well, when aptitude gets UTF8 support, it'll decode all the control fields that are mainly meant for human consumption: that means at least Description in addition to the Maintainer field, and maybe also Section. I don't see any reason to limit ourselves in the long term by sticking to Latin1 (or ASCII) just because none of us can read all of the languages that are available in the extended UTF8 namespace. If we want people to stick to certain subsets of UTF8, that should be determined in Policy, not the packaging software. If you want a practical concern (aside from, say, a general suspicion of building policy into software tools), consider these cases: - Someone wants to translate the Description fields of all packages in Debian into Chinese or Arabic. What will they do if the package tools only support Latin-1? - Someone wants to use the Debian packaging tools to create a new distribution for use in China. Again, what will they do if the package tools only support Latin-1? Isn't there a proposal around for Description#en: English text Description#ja: Japanese text etc? I can see that this would have to be split somehow to avoid the Packages file suddenly filling CD1 on its own, but... 
-- Paul TBBle Hampson, [EMAIL PROTECTED] 7th year CompSci/Asian Studies student, ANU Shorter .sig for a more eco-friendly paperless office.
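The per-language Description idea above can be sketched as follows. This is only an illustration, assuming the "Description#<lang>" field naming quoted in the message; the stanza contents, the helper names parse_fields and pick_description, and the fallback-to-English rule are assumptions, not part of any existing Debian tool.

```python
from typing import Dict, Optional

# Hypothetical stanza using the "Description#<lang>" naming from the proposal.
STANZA = (
    "Package: example\n"
    "Description#en: English text\n"
    "Description#ja: 日本語のテキスト\n"
)


def parse_fields(stanza: str) -> Dict[str, str]:
    """Collect single-line 'Name: value' fields; continuation lines are skipped."""
    fields: Dict[str, str] = {}
    for line in stanza.splitlines():
        if not line or line[0].isspace():
            continue  # blank line or continuation line, ignored in this sketch
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def pick_description(fields: Dict[str, str], lang: str,
                     fallback: str = "en") -> Optional[str]:
    """Return the description for lang, else the fallback language, else plain Description."""
    for code in (lang, fallback):
        text = fields.get(f"Description#{code}")
        if text is not None:
            return text
    return fields.get("Description")


fields = parse_fields(STANZA)
japanese_text = pick_description(fields, "ja")
german_text = pick_description(fields, "de")  # no German entry, falls back
```

A front end would call pick_description with the user's locale; how the translations are split out of the main Packages file is left open, as in the message.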
Re: charsets in debian/control
On Mon, Dec 06, 2004 at 09:54:36AM +1100, Paul Hampson [EMAIL PROTECTED] wrote:

Isn't there a proposal around for

Description#en: English text
Description#ja: Japanese text

And you'd advocate to write the English text in latin1 and the japanese text in euc-jp ? Let's make it clear: 1 text file, 1 encoding.

Mike
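A small sketch of why "1 text file, 1 encoding" matters. The sample field values and the helper decodes_cleanly are made up for illustration; the point is only that a file mixing Latin-1 and EUC-JP bytes has no single decoding that recovers both texts, while a file written entirely in UTF-8 does.

```python
# Hypothetical field values; "ï" is non-ASCII so the Latin-1 bytes differ from UTF-8.
ENGLISH = "Description#en: A naïve example\n"
JAPANESE = "Description#ja: 日本語\n"

# One file, two encodings: the situation being warned against.
mixed = ENGLISH.encode("latin-1") + JAPANESE.encode("euc_jp")

# One file, one encoding.
uniform = (ENGLISH + JAPANESE).encode("utf-8")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Report whether data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


mixed_is_utf8 = decodes_cleanly(mixed, "utf-8")
uniform_is_utf8 = decodes_cleanly(uniform, "utf-8")

# Latin-1 accepts any byte, so decoding "works" but the Japanese text is lost.
mixed_as_latin1 = mixed.decode("latin-1")
```

A reader of the mixed file would need out-of-band knowledge of which field uses which encoding, which is exactly what a single file-wide encoding avoids.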
Re: charsets in debian/control
On Monday 06 December 2004 at 09:26 +0900, Mike Hommey wrote:
On Mon, Dec 06, 2004 at 09:54:36AM +1100, Paul Hampson [EMAIL PROTECTED] wrote:

Isn't there a proposal around for

Description#en: English text
Description#ja: Japanese text

And you'd advocate to write the English text in latin1 and the japanese text in euc-jp ? Let's make it clear: 1 text file, 1 encoding.

Well, it's already the case for the generated localized debconf template files. Not that I believe it is a good thing...

--
 .''`.           Josselin Mouette        /\./\
: :' :           [EMAIL PROTECTED]
`. `'                        [EMAIL PROTECTED]
  `-  Debian GNU/Linux -- The power of freedom
Re: charsets in debian/control
[Thaddeus H. Black]

Would Peter permit me a mild dissent? I prefer Latin-1.

Dissents are fine. (: The reason to go with UTF-8 is for consistency. Tools that wish to render text onto the screen ought to be able to depend on knowing the encoding that text is in. See below for why I (and many others) think UTF-8 is the right choice for an encoding to standardize on.

I do not deny that Latin-1 represents all the languages I can read, and that this fact may color my view. Nevertheless to me a source written in Chinese is effectively non-free. It might as well be a compiled binary blob.

Consider packages intended for speakers of other languages: for example, an Urdu dictionary. The Description field would traditionally describe the package both in English and in Urdu (which uses the Arabic alphabet), and I think that's perfectly fine: the target audience can read its description more easily, and the rest of us can read the English.

Now extrapolate to cases involving arbitrary languages, and this is possible only if the Description field uses an encoding of Unicode. (Well, one could invent an extra header to specify the character set, but that seems pointless in the extreme.) UTF-8 is by far the best encoding of Unicode for our purposes, since it was designed to be compatible with tools that parse ASCII. Other Unicode encodings have null bytes and other ASCII values embedded in non-ASCII characters.

You can argue, and I would agree, that the Maintainer and Uploaders fields (the only fields other than Description where we are likely to see non-ASCII text) ought to be written in roman letters. People involved with Debian development are required to know a certain amount of English in any case, so the roman alphabet is a common denominator. And, unlike the Description field, it's awkward to try and have both native glyphs and a roman transliteration.
However, I see no reason to tell Eastern Europeans that they cannot write their names natively; interpreting Eastern European diacritics is no harder for people who don't speak those languages than interpreting Western European diacritics for people who don't speak those. Peter
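The claim in this message about UTF-8 being compatible with tools that parse ASCII can be checked with a short sketch. The field value and the helper names are hypothetical; UTF-16 (little-endian) stands in for the "other Unicode encodings", which is an assumption about what was meant.

```python
from typing import Optional, Tuple

# Hypothetical field with one non-ASCII character in the name.
FIELD = "Maintainer: Łukasz Example"

utf8 = FIELD.encode("utf-8")
utf16 = FIELD.encode("utf-16-le")


def ascii_bytes_in_non_ascii_chars(text: str, encoding: str) -> int:
    """Count bytes below 0x80 that occur inside the encoding of non-ASCII characters."""
    count = 0
    for ch in text:
        if ord(ch) >= 0x80:
            count += sum(1 for b in ch.encode(encoding) if b < 0x80)
    return count


def split_field(raw: bytes) -> Optional[Tuple[bytes, bytes]]:
    """Split on the ASCII separator ': ' the way a byte-oriented ASCII parser would."""
    name, sep, value = raw.partition(b": ")
    return (name, value) if sep else None


utf8_split = split_field(utf8)
utf16_split = split_field(utf16)
utf8_has_nul = b"\x00" in utf8
utf16_has_nul = b"\x00" in utf16
```

In UTF-8 every byte of a non-ASCII character is 0x80 or above, so an ASCII-only parser never mistakes part of such a character for a separator or a terminator; with UTF-16 the same parser cannot even find the ": " separator.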
RE: charsets in debian/control
Thaddeus H. Black wrote:

I do not deny that Latin-1 represents all the languages I can read, and that this fact may color my view. Nevertheless to me a source written in Chinese is effectively non-free. It might as well be a compiled binary blob.

So Emacs is effectively non-free, because I don't speak Lisp. Heh, good one! ;-)
Re: charsets in debian/control
On Sun, Dec 05, 2004 at 09:32:00PM +0100, Jose Carlos Garcia Sogo wrote:

But the only field in UTF8 should be Maintainer, and that field should have (IMHO) also a roman transliterate for the name, if you don't use a latin charset (Greek, Arabic, Japanese, Chinese...)

The transliterated field should be called 'Maintainer'. If you want some other freaky encoding, unsupported by the older tools, put it in a new field. Using the old field just breaks stuff for no reason.

--
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ |
 `. `'                          |
   `-             --            |
Re: charsets in debian/control
On Mon, Dec 06, 2004 at 09:26:57AM +0900, Mike Hommey wrote:
On Mon, Dec 06, 2004 at 09:54:36AM +1100, Paul Hampson [EMAIL PROTECTED] wrote:

Isn't there a proposal around for

Description#en: English text
Description#ja: Japanese text

And you'd advocate to write the English text in latin1 and the japanese text in euc-jp ? Let's make it clear: 1 text file, 1 encoding.

No, I'm advocating the whole thing be UTF-8, to support the above sensibly. Sorry I wasn't clearer. -_-;

(On the other hand, I'd _love_ to see rules files written in other languages, as long as the programs being called have manpages in English. ^_^)

-- Paul TBBle Hampson, [EMAIL PROTECTED] 7th year CompSci/Asian Studies student, ANU Shorter .sig for a more eco-friendly paperless office.
Re: charsets in debian/control
On Mon, Dec 06, 2004 at 01:40:27AM +, Andrew Suffield wrote:
On Sun, Dec 05, 2004 at 09:32:00PM +0100, Jose Carlos Garcia Sogo wrote:

But the only field in UTF8 should be Maintainer, and that field should have (IMHO) also a roman transliterate for the name, if you don't use a latin charset (Greek, Arabic, Japanese, Chinese...)

The transliterated field should be called 'Maintainer'. If you want some other freaky encoding, unsupported by the older tools, put it in a new field. Using the old field just breaks stuff for no reason.

I like this idea... As I recall, people can throw new header fields in willy-nilly, so people can as of _now_ put...

Maintainer-Name: utf-8 encoded maintainer name

for example. The email address isn't important, since that has to be a subset of ASCII anyway.

Apart from documenting correct orthography though, I'm not sure if this has any importance in terms of the tools... Changelogs would presumably continue to match the Maintainer: field, and so they would include the ASCII transliteration of the name.

-- Paul TBBle Hampson, [EMAIL PROTECTED] 7th year CompSci/Asian Studies student, ANU Shorter .sig for a more eco-friendly paperless office.
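How a tool might treat the extra field can be sketched like this. It assumes the "Maintainer-Name" field suggested in the message; the sample stanza, the example.org address, and the helper names are hypothetical, and no existing packaging tool is claimed to behave this way.

```python
from typing import Dict

# Hypothetical control stanza: ASCII transliteration in Maintainer,
# native spelling in the proposed Maintainer-Name field.
CONTROL = (
    "Source: example\n"
    "Maintainer: Lukasz Example <lukasz@example.org>\n"
    "Maintainer-Name: Łukasz Example\n"
).encode("utf-8")


def parse_stanza(raw: bytes) -> Dict[str, str]:
    """Decode the stanza as UTF-8 and collect single-line fields."""
    fields: Dict[str, str] = {}
    for line in raw.decode("utf-8").splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def legacy_maintainer(fields: Dict[str, str]) -> str:
    """What an older tool sees: only the ASCII Maintainer field."""
    return fields["Maintainer"]


def display_maintainer(fields: Dict[str, str]) -> str:
    """Prefer the native name when present, keeping the address from Maintainer."""
    ascii_value = fields["Maintainer"]
    native = fields.get("Maintainer-Name")
    if native is None:
        return ascii_value
    _, bracket, address = ascii_value.partition("<")
    return f"{native} <{address}" if bracket else native


fields = parse_stanza(CONTROL)
legacy = legacy_maintainer(fields)
display = display_maintainer(fields)
```

Older tools keep working because Maintainer stays plain ASCII, and changelog matching against Maintainer is unaffected; only tools that know about the new field show the native spelling.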