Re: UTF-8 in jessie
Hi,

Quoting Adam Borowski (2013-08-12 02:51:52)
> On Mon, May 06, 2013 at 02:49:57PM +0200, Andreas Beckmann wrote:
> > now might be the right time to start a discussion about release goals for jessie.
>
> I would like to propose full UTF-8 support. I don't mean here full support for all of Unicode's finer points, merely complete eradication of mojibake. That is, ensuring that /m.o/ matches möo, or that ä sorts as equal to a + combining ¨ is out of scope of this proposal.
>
> I propose the following sub-goals:
>
> 1. all programs should, in their default configuration, accept UTF-8 input and pass it through uncorrupted. Having to manually specify encoding is acceptable only in a programmatic interface, GUI/std{in,out,err}/command line/plain files should work with nothing but LC_CTYPE.

As an addendum to this release goal proposal, it is maybe also worth mentioning working multibyte character support in coreutils as a possible goal.

From http://bugs.debian.org/139861 :

$ echo -e "日\n本\nで\nは" | sort -u | wc -l
3
$ echo -e "日\n本\nで\nは" | sort | wc -l
4

Having head/tail work on characters instead of bytes would be sweet as well.

While upstream doesn't seem to support this, it seems that Fedora has a patch for coreutils:

http://pkgs.fedoraproject.org/cgit/coreutils.git/tree/coreutils-i18n.patch?id=6e10f376996b64f538259091a524df2249b653fb;id2=HEAD

or also:

http://trac.cross-lfs.org/browser/patches/coreutils-6.12-unicode-1.patch?rev=577dd2d59133e10bd32c58844293e93af0e6f162

cheers, josch

-- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20131014105058.7934.26083@hoothoot
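The head/tail point can be illustrated with a small Python sketch. It is not how coreutils or the Fedora patch work; `head_bytes` and `head_chars` are hypothetical helpers, made up here only to contrast byte-based and character-based truncation of UTF-8 text.

```python
def head_bytes(text: str, n: int) -> bytes:
    """Byte-based truncation: keep the first n bytes of the UTF-8 encoding."""
    return text.encode("utf-8")[:n]


def head_chars(text: str, n: int) -> str:
    """Character-based truncation: keep the first n code points."""
    return text[:n]


sample = "日本では"

# Each of these characters occupies 3 bytes in UTF-8, so cutting after
# 4 bytes leaves a dangling fragment of the second character.
cut = head_bytes(sample, 4)
print(cut.decode("utf-8", errors="replace"))  # first character plus U+FFFD

# Cutting by code points never splits a character.
print(head_chars(sample, 2))
```

A character-aware head would have to do the equivalent of `head_chars`, which requires knowing the encoding (here taken to be UTF-8, as LC_CTYPE would say).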
Re: UTF-8 in jessie
* Adam Borowski kilob...@angband.pl, 2013-08-12, 02:51:
> Detecting non-UTF files is easy:
> * false positives are impossible
> * false negatives are extremely unlikely: combinations of letters that would happen to match a valid utf character don't happen naturally, and even if they did, every single combination in the file tested would need to match valid utf.

Not quite. While 7-bit encodings other than ASCII are all endangered species, some of them can still be seen in the wild, and they excellently disguise themselves as UTF-8. (We had to add special code to detect ISO-2022 encodings to Lintian not that long ago.)

Anyway, if you want to help UTF-8-ize the world, you could start by providing patches for these bugs:

http://lintian.debian.org/tags/debian-changelog-file-uses-obsolete-national-encoding.html
http://lintian.debian.org/tags/debian-copyright-file-uses-obsolete-national-encoding.html
http://lintian.debian.org/tags/doc-base-file-uses-obsolete-national-encoding.html

--
Jakub Wilk

Archive: http://lists.debian.org/20130917161026.ga6...@jwilk.net
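A minimal Python sketch of why a 7-bit encoding passes a UTF-8 validity check. This is not Lintian's actual detection code; the escape-sequence pattern below is an assumption that covers only the common ISO-2022 designations starting with ESC $ or ESC (, and both function names are hypothetical.

```python
import re

# Assumption: ISO-2022 character-set designations begin with ESC followed
# by '$' or '('. Real detectors may look at more sequences than this.
ISO2022_ESCAPE = re.compile(rb"\x1b[$(]")


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_iso2022_escape(data: bytes) -> bool:
    """True if the bytes contain an ISO-2022 style designation escape."""
    return ISO2022_ESCAPE.search(data) is not None


legacy = "日本語".encode("iso2022_jp")  # 7-bit bytes only
modern = "日本語".encode("utf-8")

# The ISO-2022-JP bytes are all below 0x80, so they are also valid UTF-8.
print(is_valid_utf8(legacy), has_iso2022_escape(legacy))
print(is_valid_utf8(modern), has_iso2022_escape(modern))
```

So a check for "validates as UTF-8" needs an extra test for escape sequences if ISO-2022 files are to be caught.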
Re: UTF-8 in jessie
Adam Borowski writes (Re: UTF-8 in jessie):
> Let's take a look at some sheets.

Last time I looked at this I found a copy of the actual ASCII standards document from 1968 or so and it did mention this usage.

> > I don't think that better UTF-8 support should involve needlessly converting 7-bit ASCII text files which use ` ' as matched quotes, into UTF-8 text files which use non-ISO-646 codepoints.
>
> These code points are defined to be exactly the same in both ASCII and Unicode. Only fonts may differ. And like Han unification issues, this is out of scope here.

Do you intend that text files containing uses of ` ' as matched single quotes should be changed to use non-7-bit BMP matched single quotes? It seems that you don't. In which case I'm afraid you will have to make this explicit somehow in your proposal.

Otherwise zealous people will go around complaining about funny-looking quotes and changing a whole bunch of text files to no longer be 7-bit. See GCC's error messages, for a case in point.

Ian.

Archive: http://lists.debian.org/21023.14041.902602.551...@chiark.greenend.org.uk
Re: UTF-8 in jessie
Quoting Ian Jackson (2013-08-29 13:56:09)
> Adam Borowski writes (Re: UTF-8 in jessie):
> > Let's take a look at some sheets.
>
> Last time I looked at this I found a copy of the actual ASCII standards document from 1968 or so and it did mention this usage.
>
> > > I don't think that better UTF-8 support should involve needlessly converting 7-bit ASCII text files which use ` ' as matched quotes, into UTF-8 text files which use non-ISO-646 codepoints.
> >
> > These code points are defined to be exactly the same in both ASCII and Unicode. Only fonts may differ. And like Han unification issues, this is out of scope here.
>
> Do you intend that text files containing uses of ` ' as matched single quotes should be changed to use non-7-bit BMP matched single quotes? It seems that you don't. In which case I'm afraid you will have to make this explicit somehow in your proposal.
>
> Otherwise zealous people will go around complaining about funny-looking quotes and changing a whole bunch of text files to no longer be 7-bit. See GCC's error messages, for a case in point.

I believe the underlying issue is the one summarized here: https://en.wikipedia.org/wiki/Typewriter_apostrophe#ASCII_encoding

If that is correct, then the issue here is not whether ASCII ` equals UTF-8 ' (or some similar recoding), but instead that _authors_ from an era of looking at output representing ' as ` grew a habit of typing back into documents that other character.

How about we simply mention explicitly that `arcane quoting' - even if arguably related to UTF-8 encoding, should be classified not as release-critical bugs but as spelling errors?

- Jonas

--
* Jonas Smedegaard - idealist Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/

[x] quote me freely [ ] ask before reusing [ ] keep private
Re: UTF-8 in jessie
Jonas Smedegaard writes (Re: UTF-8 in jessie):
> I believe the underlying issue is the one summarized here: https://en.wikipedia.org/wiki/Typewriter_apostrophe#ASCII_encoding

Yes.

> How about we simply mention explicitly that `arcane quoting' - even if arguably related to UTF-8 encoding, should be classified not as release-critical bugs but as spelling errors.

I don't think it is a bug. What I'm trying to forestall is a campaign to change documents which use ` '.

Ian.

Archive: http://lists.debian.org/21023.28874.354634.158...@chiark.greenend.org.uk
Re: UTF-8 in jessie
Quoting Ian Jackson (2013-08-29 18:03:22)
> Jonas Smedegaard writes (Re: UTF-8 in jessie):
> > I believe the underlying issue is the one summarized here: https://en.wikipedia.org/wiki/Typewriter_apostrophe#ASCII_encoding
>
> Yes.
>
> > How about we simply mention explicitly that `arcane quoting' - even if arguably related to UTF-8 encoding, should be classified not as release-critical bugs but as spelling errors.
>
> I don't think it is a bug. What I'm trying to forestall is a campaign to change documents which use ` '.

My aim was the same, but I see how "severity minor" still implies "IT'S A BUG!!".

How about this, then:

Although arguably related to UTF-8 encoding, `arcane quoting' is not part of this release-goal.

- Jonas

--
* Jonas Smedegaard - idealist Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/

[x] quote me freely [ ] ask before reusing [ ] keep private
Re: UTF-8 in jessie
Jonas Smedegaard writes (Re: UTF-8 in jessie):
> Quoting Ian Jackson (2013-08-29 18:03:22)
> > Jonas Smedegaard writes (Re: UTF-8 in jessie):
> > > I believe the underlying issue is the one summarized here: https://en.wikipedia.org/wiki/Typewriter_apostrophe#ASCII_encoding
...
> My aim was the same, but I see how "severity minor" still implies "IT'S A BUG!!".
>
> How about this, then:
>
> Although arguably related to UTF-8 encoding, `arcane quoting' is not part of this release-goal.

I'm happy with "... is not part of this release goal". I'm not sure that everyone will know what `arcane quoting' means. How about

  Changing any 7-bit characters (for example, changing the way we deal with `'-quoting) is not part of this release-goal.

Ian.

Archive: http://lists.debian.org/21023.31276.110369.358...@chiark.greenend.org.uk
Re: UTF-8 in jessie
Ian Jackson ijack...@chiark.greenend.org.uk writes:
> Jonas Smedegaard writes (Re: UTF-8 in jessie):
> > How about we simply mention explicitly that `arcane quoting' - even if arguably related to UTF-8 encoding, should be classified not as release-critical bugs but as spelling errors.
>
> I don't think it is a bug. What I'm trying to forestall is a campaign to change documents which use ` '.

Well, personally I think that's ugly, and if it bothers me sufficiently I might file a minor bug about it, but I certainly agree that it's not a suitable subject for a mass bug filing. And surely that's obvious?

I wouldn't have even thought of the relationship of backtick in ASCII to this proposal if you hadn't mentioned it, and it seems unlikely to me that we're going to have many people who think this is what it means.

--
Russ Allbery (r...@debian.org) http://www.eyrie.org/~eagle/

Archive: http://lists.debian.org/87r4dcflqa@windlord.stanford.edu
Re: UTF-8 in jessie
Adam Borowski writes (UTF-8 in jessie):
> I would like to propose full UTF-8 support. I don't mean here full support for all of Unicode's finer points, merely complete eradication of mojibake. That is, ensuring that /m.o/ matches möo, or that ä sorts as equal to a + combining ¨ is out of scope of this proposal.

I agree with everything you propose except that I have one reservation regarding this:

> 4. all text files should be encoded in UTF-8

I agree with this except that I think it should be permitted that a text file uses ASCII codepoints.

You may say "but UTF-8 is a superset of ASCII". Well, no, it isn't. UTF-8 is a superset of ISO-646 but ISO-646 is not identical to ASCII. In particular the descriptions of the codepoints ` ' in ISO-646 effectively forbid them from being used as matching single quotes, despite that being specified as allowed in ASCII.

I don't think that better UTF-8 support should involve needlessly converting 7-bit ASCII text files which use ` ' as matched quotes, into UTF-8 text files which use non-ISO-646 codepoints.

(In fact I would like to see Markus Kuhn's decision about ` ' reversed - our default character set should be ASCII for 0..127 plus UTF for the rest. That's not an argument I expect to win but at the very least we shouldn't have to worsify things for ASCII users.)

Thanks, Ian.

Archive: http://lists.debian.org/21022.5425.511942.342...@chiark.greenend.org.uk
Re: UTF-8 in jessie
On 12 August 2013 01:51, Adam Borowski kilob...@angband.pl wrote:
> 3. all file names must be valid UTF-8

Case in point, errors from the Ubuntu UDD package importer:

Packages containing non-UTF-8, non-ASCII filenames. This is a problem. It is unclear how to sensibly map these into Bazaar.

anon-proxy aspell-is aspell-pt aspell-ro cvsnt dacco egroupware ewiki firebird2 fortunes-pl fslint glest-data gmoo ii-esu ispell-fo jpilot kdeedu liblingua-de-ascii-perl magyarispell mtink ooohg openverse phpgedview phpgroupware projectl qcad qdvdauthor tatan tuxpaint uae xblast-tnt xblast-tnt-levels

Not sure how up to date this list is (it could be a historic package version that has non-UTF8/non-ASCII filenames).

Test file bugs?

Regards, Dmitrijs.

Archive: http://lists.debian.org/canbhlui_ch-z7y+ga4f2ui81if2gjog-cxccmfzuzumpzjf...@mail.gmail.com
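To see whether a list like the one above is still current, one could scan an unpacked package tree for offending names. The following Python sketch is only an assumption of how such a scan might look; it is neither the UDD importer's nor Lintian's implementation, and `is_valid_utf8_name` and `non_utf8_names` are hypothetical names.

```python
import os


def is_valid_utf8_name(name: bytes) -> bool:
    """True if a raw file name (as bytes) is valid UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_names(root: bytes):
    """Yield paths (as bytes) under root whose last component is not UTF-8.

    Passing bytes to os.walk makes it return raw byte names, so nothing is
    silently re-encoded before the check.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8_name(name):
                yield os.path.join(dirpath, name)


# A Latin-1 encoded name fails the check, the UTF-8 form passes.
print(is_valid_utf8_name("café".encode("latin-1")))
print(is_valid_utf8_name("café".encode("utf-8")))
```

Usage would be along the lines of `list(non_utf8_names(b"/path/to/unpacked/package"))`, with the path being whatever tree is under test.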
Re: UTF-8 in jessie
On Wed, Aug 28, 2013 at 04:20:17PM +0100, Ian Jackson wrote:
> Adam Borowski writes (UTF-8 in jessie):
> > I would like to propose full UTF-8 support. I don't mean here full support for all of Unicode's finer points, merely complete eradication of mojibake. That is, ensuring that /m.o/ matches möo, or that ä sorts as equal to a + combining ¨ is out of scope of this proposal.
>
> I agree with everything you propose except that I have one reservation regarding this:
>
> > 4. all text files should be encoded in UTF-8
>
> I agree with this except that I think it should be permitted that a text file uses ASCII codepoints.
>
> You may say "but UTF-8 is a superset of ASCII". Well, no, it isn't.

Uhm, how?

> UTF-8 is a superset of ISO-646 but ISO-646 is not identical to ASCII. In particular the descriptions of the codepoints ` ' in ISO-646 effectively forbid them from being used as matching single quotes, despite that being specified as allowed in ASCII.

Let's take a look at some sheets.

Feb 1972: https://en.wikipedia.org/wiki/File:ASCII_Code_Chart-Quick_ref_card.jpg
1967/68: http://www.samhallas.co.uk/repository/telegraph/teletype_33_specs.pdf

` and ' don't look like anything resembling matching quotes to me. Usually ' is vertical or slightly slanted, ` tends to be at 45 degrees or quite close to horizontal.

> I don't think that better UTF-8 support should involve needlessly converting 7-bit ASCII text files which use ` ' as matched quotes, into UTF-8 text files which use non-ISO-646 codepoints.

These code points are defined to be exactly the same in both ASCII and Unicode. Only fonts may differ. And like Han unification issues, this is out of scope here.

--
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

Archive: http://lists.debian.org/20130828180712.ga8...@angband.pl
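The claim that these code points are the same in ASCII and Unicode can be checked at the byte level. A small Python sketch, where `is_seven_bit` is a hypothetical helper and the sample strings are made up for illustration:

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte has the high bit clear."""
    return all(byte < 0x80 for byte in data)


ascii_quoted = "`quoted'"            # U+0060 and U+0027
curly_quoted = "\u2018quoted\u2019"  # U+2018 and U+2019

# A file using ` ' is byte-for-byte identical whether it is labelled
# ASCII or UTF-8, so declaring it UTF-8 requires no conversion at all.
print(ascii_quoted.encode("ascii") == ascii_quoted.encode("utf-8"))

# Only a deliberate switch to the curly quotes leaves the 7-bit range:
# each of them takes three bytes in UTF-8.
print(is_seven_bit(curly_quoted.encode("utf-8")))
print(len(curly_quoted.encode("utf-8")))
```

This only shows the encoding side; how the glyphs look remains a font matter, as stated above.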
Re: UTF-8 in jessie
On Mon, 12 Aug 2013 02:51:52 +0200, Adam Borowski wrote:
> 4a. perl and pod
>
> Considering perl to be text raises one more issue: pod. By perl's design, pod without a specified encoding is considered to be ISO-8859-1, even if the file contains "use utf8;". This is surprising, and many authors use UTF-8 like everywhere else, leading to obvious results (man gdm3 for one example).
>
> Thus, there should be a tool (preferably the one mentioned above) that checks perl files for pod with undeclared encoding, and raises alarm if the file contains any bytes with high bit set. If a conversion encoding is specified, such a declaration could be added automatically.

This tool exists, and it's called perl 5.18 :)

(Ok, more like Pod::Simple or pod2man or whatever that now errors out on non-ASCII-chars in pod without =encoding.)

The results can be seen at http://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=perl-5.18-transition;users=debian-p...@lists.debian.org (half of the POD errors bugs).

Cheers, gregor

--
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer - http://www.debian.org/
 `. `'  Member of VIBE!AT SPI, fellow of the Free Software Foundation Europe
   `-   NP: Tom Waits: Come On Up To The House
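For illustration, the check described in the quoted proposal could look roughly like the Python sketch below. It is a crude heuristic under stated assumptions, not what Pod::Simple or pod2man do: it treats any line starting with `=` plus a letter as a POD directive, and it looks for high-bit bytes anywhere in the file rather than only inside POD blocks. `pod_needs_encoding` is a hypothetical name.

```python
import re

# Assumption: a POD directive is a line that starts with '=' and a letter.
POD_DIRECTIVE = re.compile(rb"^=[A-Za-z]", re.MULTILINE)
POD_ENCODING = re.compile(rb"^=encoding\s+\S+", re.MULTILINE)


def pod_needs_encoding(source: bytes) -> bool:
    """True if the file has POD, high-bit bytes, and no =encoding line."""
    has_pod = POD_DIRECTIVE.search(source) is not None
    has_high_bit = any(byte >= 0x80 for byte in source)
    declares_encoding = POD_ENCODING.search(source) is not None
    return has_pod and has_high_bit and not declares_encoding


undeclared = "=head1 NAME\n\ndémon\n\n=cut\n".encode("utf-8")
declared = b"=encoding utf8\n\n" + undeclared

print(pod_needs_encoding(undeclared))  # flagged: non-ASCII, no =encoding
print(pod_needs_encoding(declared))    # not flagged
```

Adding the declaration automatically, as the proposal suggests, would additionally need to know which encoding the author meant.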
Re: UTF-8 in jessie
Quoting Charles Plessy (ple...@debian.org):
> About display by GUIs, I think that we should have a system to install all the fonts necessary to display languages that we support at the installation.

Such as tasksel and its language tasks? :-)

In short, we already have that. However, we need people to maintain that, namely to decide what fonts should be installed when a given language is chosen at install time.

This is usually asked of new translators contributing to D-I and, therefore, most languages that are not Latin-something based will trigger the installation of a font package that is suitable for them.

I also try to move the maintenance of such font packages under the (large) umbrella of the pkg-fonts maintenance team, as the maintenance of font packages is usually very loose.
Re: UTF-8 in jessie
On Tue, Aug 13, 2013 at 08:12:24AM +0200, Christian PERRIER wrote:
> Quoting Charles Plessy (ple...@debian.org):
> > About display by GUIs, I think that we should have a system to install all the fonts necessary to display languages that we support at the installation.
>
> Such as tasksel and its language tasks? :-)
>
> In short, we already have that. However, we need people to maintain that, namely to decide what fonts should be installed when a given language is chosen at install time.

Hi Christian,

what I am proposing is a task that installs all languages. I did a bit of research earlier, and it is not as simple as installing all the existing tasks, as the result on my computer was that some browsers started to display Japanese texts with simplified Chinese glyphs.

http://bugs.debian.org/702050

Unfortunately, I did not get an answer. Feedback is very welcome.

Cheers,

--
Charles Plessy
Tsurumi, Kanagawa, Japan

Archive: http://lists.debian.org/20130813063123.ge6...@falafel.plessy.net
Re: UTF-8 in jessie
Quoting Charles Plessy (ple...@debian.org):
> Hi Christian,
>
> what I am proposing is a task that installs all languages. I did a bit of research earlier, and it is not as simple as installing all the existing tasks, as the result on my computer was that some browsers started to display Japanese texts with simplified Chinese glyphs.
>
> http://bugs.debian.org/702050
>
> Unfortunately, I did not get an answer. Feedback is very welcome.

It's quite likely the result of broken (or wrong) fontconfig files in some of the installed fonts. For instance, fonts-arphic-u{kai|ming} currently spit out warnings from their fontconfig files.
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 5:56 PM, Thorsten Glaser t...@mirbsd.de wrote:
> Florian Lohoff f at zz.de writes:
> > 5. All programs consuming UTF8 Text must understand a BOM.
>
> The kernel doesn’t, start there:
>
> tglase@tglase:~$ mksh -c 'print '\''\ufeff#!/bin/sh\necho foo'\' >x; chmod +x x; ./x
> ./x: line 1: #!/bin/sh: No such file or directory
> foo
>
> That’s running GNU bash, with bash as /bin/sh for testing, which deviates from my normal setup of running mksh… because I fixed mksh to support this (and the MirBSD kernel, too).

There was the utf8script package http://packages.qa.debian.org/u/utf8script.html but it was O/RM some time ago. Time to resurrect?

> I disagree with requiring ASCII for $PATH though…
>
> bye, //mirabilos

Archive: http://lists.debian.org/CAE2SPAa6jMkh=q2h0c0bbpnfz8k-teuptwok1e32regi9hs...@mail.gmail.com
Re: UTF-8 in jessie
Vincent Lefevre vincent at vinc17.net writes:
> If scripts intend to use LC_ALL=C.UTF-8 to force everything to the standard locale with UTF-8 support, then the glibc should be modified to regard C.UTF-8 like C w.r.t. $LANGUAGE. I mean:

Ouch! Scripts do, and this *is* how C.UTF-8 was intended: to behave like C/POSIX except for the encoding. Both should have output in English. Please reportbug that.

Thanks, //mirabilos

Archive: http://lists.debian.org/loom.20130813t122454-...@post.gmane.org
Re: UTF-8 in jessie
On 2013-08-13 10:25:31 +0000, Thorsten Glaser wrote:
> Vincent Lefevre vincent at vinc17.net writes:
> > If scripts intend to use LC_ALL=C.UTF-8 to force everything to the standard locale with UTF-8 support, then the glibc should be modified to regard C.UTF-8 like C w.r.t. $LANGUAGE. I mean:
>
> Ouch! Scripts do, and this *is* how C.UTF-8 was intended: to behave like C/POSIX except for the encoding. Both should have output in English. Please reportbug that.

OK: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719590

--
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Archive: http://lists.debian.org/20130813120507.ga18...@xvii.vinc17.org
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 05:58:20PM +0200, Adam Borowski wrote:
> On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
> > 5. All programs consuming UTF8 Text must understand a BOM.
>
> I'm afraid I don't agree here: BOMs are nasty stuff that serve no purpose once you standardize on UTF8. They might help with exchange with a minority of Windows programs, at a cost on our side.
>
> Windows hardly does plain text: most of that is MSVC/etc sources, but then, the C/C++ standards explicitly forbid junk in places other than comments. Most other languages expect a hashbang on Unix, which makes BOMs impossible.

I agree that BOMs are nasty and should not be generated by our standard tools.

I have been bitten by BOMs more than once and had a hard time looking for the fault until looking at the plain ascii file with a hex editor.

AFAIK tools like vim understand and hide the fact that there is a BOM and rewrite them. Other tools give interesting results stumbling on a BOM. So it's inconsistent, which makes it hard to find.

Flo

--
Florian Lohoff f...@zz.de
Re: UTF-8 in jessie (debhelper and BOM)
On Tue, Aug 13, 2013 at 01:44:03PM +0900, Osamu Aoki wrote:
> But I do not understand goal #5. Why MUST? Do you have rationale?
>
> On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
> > On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
> > > I propose the following sub-goals:
> ...
> > > 4. all text files should be encoded in UTF-8
>
> Yes. But it will be nice to have some support by dh_installdocs :-)
>
^^
> > 5. All programs consuming UTF8 Text must understand a BOM.
>
> I agree as SHOULD but should we state MUST?

Please note the number of '>' markers. It is not part of my proposal, and the discussion in this thread, me included, seems to be pretty hostile to adding BOMs.

--
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

Archive: http://lists.debian.org/20130813224409.gc3...@angband.pl
Re: UTF-8 in jessie
On 2013-08-12 04:18, Charles Plessy wrote:
> On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
> > I would like to propose full UTF-8 support. I don't mean here full support for all of Unicode's finer points, merely complete eradication of mojibake.
>
> Hi Adam,

Hi,

> this is a great goal. Here are two comments.
>
> There is a related issue opened on the Policy (http://bugs.debian.org/701081), where we propose the following:
>
> - Require UTF-8 for the names of all files and directories installed by binary packages.

For the record, there is a Lintian tag for this now[1], which suggests only a handful of packages violate this.

> - Recommend ASCII when possible.
>
> - Require ASCII for files in /bin, /sbin, /usr/bin, /usr/sbin and /usr/games.

Requiring ASCII for files in $PATH should be trivial to implement as a separate tag. I suppose the ASCII requirement could also be implemented as a pedantic check or so. Regardless, patches welcome. :)

> About display by GUIs, I think that we should have a system to install all the fonts necessary to display languages that we support at the installation.
>
> Have a nice Debconf !

~Niels

[1] http://lintian.debian.org/tags/file-name-is-not-valid-UTF-8.html

Archive: http://lists.debian.org/520895a6.1090...@thykier.net
Re: UTF-8 in jessie
On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> Detecting non-UTF files is easy:
> * false positives are impossible
> * false negatives are extremely unlikely: combinations of letters that would happen to match a valid utf character don't happen naturally, and even if they did, every single combination in the file tested would need to match valid utf.

Not that unlikely, and it is rather annoying that Firefox (and therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620. IMHO, in case of ambiguity, UTF-8 should always be preferred by default (applications could have options to change the preferences). Bug reports:

https://bugzilla.mozilla.org/show_bug.cgi?id=760050
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719481

> On the other hand, detecting text files is hard.

Deciding whether a file is a text file may be hard even for a human. What about text files with ANSI control sequences?

> The best tool so far, file, makes so many errors it's useless for this purpose.

Yes.

> One could use location: like, declaring stuff in /etc/ and /usr/share/doc/ to be text unless proven otherwise, but that's an incomplete hack. Only hashbangs can be considered reliable, but scripts are not where most documentation goes.
>
> Also, should HTML be considered text or not? Updating http-equiv is not rocket surgery, detecting HTML with fancy extensions can be.

I think better questions could be: why do you want to regard a file as text? For what purpose(s)? For the "all shipped text files in UTF-8" rule only?

What about examples whose purpose is to have a file in a charset different from UTF-8?

> 4a. perl and pod
>
> Considering perl to be text raises one more issue: pod. By perl's design, pod without a specified encoding is considered to be ISO-8859-1, even if the file contains "use utf8;". This is surprising, and many authors use UTF-8 like everywhere else, leading to obvious results (man gdm3 for one example).
>
> Thus, there should be a tool (preferably the one mentioned above) that checks perl files for pod with undeclared encoding, and raises alarm if the file contains any bytes with high bit set. If a conversion encoding is specified, such a declaration could be added automatically.

Yes, undeclared encoding when not ASCII should be regarded as a bug.

--
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Archive: http://lists.debian.org/20130812105035.ga28...@xvii.vinc17.org
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 12:50:35PM +0200, Vincent Lefevre wrote:
> On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> > Detecting non-UTF files is easy:
> > * false positives are impossible
> > * false negatives are extremely unlikely: combinations of letters that would happen to match a valid utf character don't happen naturally, and even if they did, every single combination in the file tested would need to match valid utf.
>
> Not that unlikely, and it is rather annoying that Firefox (and therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620. IMHO, in case of ambiguity, UTF-8 should always be preferred by default (applications could have options to change the preferences).

That's the opposite of what I'm talking about: it is hard to reliably detect ancient encodings, because they tend to assign a character to every possible bit stream. On the other hand, only certain combinations of bytes with the 8th bit set are valid UTF-8, and thus it is possible to detect UTF-8 with good accuracy. It is obviously trivial to fool such detection deliberately, but such combinations don't happen in real languages, and thus if something validates as UTF-8, it is safe to assume it indeed is.

> > On the other hand, detecting text files is hard.
>
> Deciding whether a file is a text file may be hard even for a human. What about text files with ANSI control sequences?

Same as, say, a Word97 document: not text for my purposes. It might be just coloured plain text, but there is no generic way to handle that. Binary formats go more into subgoal 1 of my proposal: arbitrary Unicode input that matches your syntax should be accepted, and go out uncorrupted (not the same as unmodified).

> > One could use location: like, declaring stuff in /etc/ and /usr/share/doc/ to be text unless proven otherwise, but that's an incomplete hack. Only hashbangs can be considered reliable, but scripts are not where most documentation goes.
> >
> > Also, should HTML be considered text or not? Updating http-equiv is not rocket surgery, detecting HTML with fancy extensions can be.
>
> I think better questions could be: why do you want to regard a file as text? For what purpose(s)? For the "all shipped text files in UTF-8" rule only?

A shipped config file will have some settings the user may edit and comments he may read. Being able to see what's going on is a prerequisite here.

A perl/python/etc script is something our kind of folks often edit and/or read.

A plain text file ships no encoding information, thus it can be neither rendered nor edited comfortably if the encoding is different from the system one.

HTML can include http-equiv which takes care of rendering, but editing is still a problem. And if you edit it, or, say, fill in some fields from a database, you risk data loss. If everything is UTF-8 end-to-end, this risk goes away. (I do care about plain text more, though.)

> What about examples whose purpose is to have a file in a charset different from UTF-8?

Well, we don't convert those :) I don't expect a package with a test suite that includes charset stuff to make such an error by itself, but if there's a need, we could add a syntax for exclusions. For example, writing "verbatim" in the charset field.

> > 4a. perl and pod
> >
> > Considering perl to be text raises one more issue: pod. By perl's design, pod without a specified encoding is considered to be ISO-8859-1, even if the file contains "use utf8;". This is surprising, and many authors use UTF-8 like everywhere else, leading to obvious results (man gdm3 for one example).
> >
> > Thus, there should be a tool (preferably the one mentioned above) that checks perl files for pod with undeclared encoding, and raises alarm if the file contains any bytes with high bit set. If a conversion encoding is specified, such a declaration could be added automatically.
>
> Yes, undeclared encoding when not ASCII should be regarded as a bug.

And if it's declared but not UTF-8, I'd convert it at package build time.

--
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

Archive: http://lists.debian.org/20130812131659.ga21...@angband.pl
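The asymmetry described in this message (legacy 8-bit encodings accept nearly any byte stream, UTF-8 does not) can be shown in a few lines of Python. A sketch only: `looks_like_utf8` is a hypothetical name, and the Polish sample string is just an illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes form valid UTF-8 (pure ASCII counts as valid)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "zażółć gęślą jaźń"
as_utf8 = text.encode("utf-8")
as_latin2 = text.encode("iso8859_2")

# Valid UTF-8 validates; the ISO-8859-2 bytes do not, because their
# high-bit bytes do not form well-formed multi-byte sequences.
print(looks_like_utf8(as_utf8))
print(looks_like_utf8(as_latin2))

# The reverse test is useless: ISO-8859-1 assigns a character to every
# byte value, so decoding never fails and proves nothing about the
# real encoding.
print(as_utf8.decode("latin-1") != text)
```

The check says nothing about deliberately crafted input or about 7-bit encodings, which validate trivially.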
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 09:58:30AM +0200, Niels Thykier wrote:
> For the record, there is a Lintian tag for this now[1], which suggests only a handful of packages violates this.
>
> > - Recommend ASCII when possible.
> >
> > - Require ASCII for files in /bin, /sbin, /usr/bin, /usr/sbin and /usr/games.
>
> Requiring ASCII for files in $PATH should be trivial to implement as a separate tag. I suppose the ASCII requirement could also be implemented as a pedantic check or so. Regardless, patches welcome. :)

I disagree here: I'd want to remove any need for that recommendation instead. You might have a point about files in $PATH, though.

> > About display by GUIs, I think that we should have a system to install all the fonts necessary to display languages that we support at the installation.

Could be good, yeah. At least something basic for every valid Unicode character.

On the other hand, for me at least CJK doesn't functionally differ from mojibake. Which can lead to problems: at the debconf 11 CW party, the best stuff came in a bottle marked only in Japanese, and thus I'll have to find which one it was the hard way today :p

Jokes aside, enough of Unicode consists of line drawing, symbols and images like U+1F4A9 PILE OF POO[2] that are readable by everyone with appropriate fonts that we might as well just go for 100% coverage by default. Disk space is cheap even on weakest of today's phones, complex packaging with moving parts has serious maintenance cost.

[1] http://lintian.debian.org/tags/file-name-is-not-valid-UTF-8.html
[2] apt-get install ttf-ancient-fonts
Yeah, aptly named.

--
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

Archive: http://lists.debian.org/20130812135503.ga24...@angband.pl
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
> I propose the following sub-goals:
> 1. all programs should, in their default configuration, accept UTF-8 input and pass it through uncorrupted. Having to manually specify encoding is acceptable only in a programmatic interface, GUI/std{in,out,err}/ command line/plain files should work with nothing but LC_CTYPE.
> 2. all GUI/curses/etc programs should be able to display UTF-8 output where appropriate
> 3. all file names must be valid UTF-8
> 4. all text files should be encoded in UTF-8

5. All programs consuming UTF8 Text must understand a BOM.

Flo
-- 
Florian Lohoff f...@zz.de
Re: UTF-8 in jessie
Florian Lohoff <f at zz.de> writes:
> 5. All programs consuming UTF8 Text must understand a BOM.

The kernel doesn’t, start there:

tglase@tglase:~$ mksh -c 'print '\''\ufeff#!/bin/sh\necho foo'\' >x; chmod +x x; ./x
./x: line 1: #!/bin/sh: No such file or directory
foo

That’s running GNU bash, with bash as /bin/sh for testing, which deviates from my normal setup of running mksh… because I fixed mksh to support this (and the MirBSD kernel, too).

I disagree with requiring ASCII for $PATH though…

bye,
//mirabilos
Archive: http://lists.debian.org/loom.20130812t175549-...@post.gmane.org
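As a rough illustration of the failure in the transcript above: the script loader recognises a script by the two bytes `#!` at offset 0, and a UTF-8 BOM pushes them to offset 3. The sketch below only inspects bytes; the helper name `starts_with_hashbang` is made up for this example and does not correspond to any kernel or shell function.

```python
import codecs


def starts_with_hashbang(data: bytes) -> bool:
    """Return True if the content begins with the two bytes '#!'."""
    return data[:2] == b"#!"


script = "#!/bin/sh\necho foo\n"
plain = script.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain  # EF BB BF prepended, as \ufeff would produce

print(starts_with_hashbang(plain))     # True
print(starts_with_hashbang(with_bom))  # False
print(with_bom[:5])                    # b'\xef\xbb\xbf#!'
```

With the BOM in place the file is no longer seen as a hashbang script, so the calling shell falls back to interpreting it itself and tries to run the first line as a command, which matches the error message shown.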
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
> 5. All programs consuming UTF8 Text must understand a BOM.

I'm afraid I don't agree here: BOMs are nasty stuff that serve no purpose once you standardize on UTF8. They might help with exchange with a minority of Windows programs, at a cost on our side. Windows hardly does plain text: most of that is MSVC/etc sources, but then, the C/C++ standards explicitly forbid junk in places other than comments. Most other languages expect a hashbang on Unix, which makes BOMs impossible.

Other reasons:
* concatenating files adds a misplaced BOM
* taking stuff from the middle loses them
* tools like grep, patch, etc pick and insert lots of individual lines
* tools that don't care about encodings would need to learn about them
* files that appear the same will have a different hash due to presence or absence of an invisible character that can appear/disappear with no explicit request on the user's part
* with UTF-8, we're 95% there. For BOMs, there's almost no support.

So I'm strongly against producing BOMs. As for accepting them, there's little that can break so it would be mostly ok... but certainly not as a must clause.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ
Archive: http://lists.debian.org/20130812155820.ga31...@angband.pl
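Two of the reasons listed in the message above (concatenation leaves a misplaced BOM, and identical-looking files hash differently) can be shown with plain byte strings. This is a small sketch, not a statement about any particular tool; the file contents are invented for the example.

```python
import codecs
import hashlib

BOM = codecs.BOM_UTF8  # b'\xef\xbb\xbf'

a = BOM + "first\n".encode("utf-8")
b = BOM + "second\n".encode("utf-8")

# Equivalent of `cat a b`: the second file's BOM ends up mid-stream.
joined = a + b
stray = joined.find(BOM, len(BOM))  # offset of a BOM that is not at position 0

# Same visible text, with and without BOM: different bytes, different hash.
without_bom = "first\n".encode("utf-8")
same_text = a.decode("utf-8-sig") == without_bom.decode("utf-8")
same_hash = hashlib.sha256(a).digest() == hashlib.sha256(without_bom).digest()

print(stray, same_text, same_hash)  # 9 True False
```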
Re: UTF-8 in jessie
On 2013-08-12 15:16:59 +0200, Adam Borowski wrote:
> On Mon, Aug 12, 2013 at 12:50:35PM +0200, Vincent Lefevre wrote:
> > On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> > > Detecting non-UTF files is easy:
> > > * false positives are impossible
> > > * false negatives are extremely unlikely: combinations of letters that would happen to match a valid utf character don't happen naturally, and even if they did, every single combination in the file tested would need to match valid utf.
> >
> > Not that unlikely, and it is rather annoying that Firefox (and therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620. IMHO, in case of ambiguity, UTF-8 should always be preferred by default (applications could have options to change the preferences).
>
> That's the opposite of what I'm talking about: it is hard to reliably detect ancient encodings, because they tend to assign a character to every possible bit stream. On the other hand, only certain combinations of bytes with the 8th bit set are valid UTF-8, and thus it is possible to detect UTF-8 with good accuracy. It is obviously trivial to fool such detection deliberately, but such combinations don't happen in real languages, and thus if something validates as UTF-8, it is safe to assume it indeed is.

I don't know the exact cause that makes Firefox recognize some file as TIS-620 instead of UTF-8, but it is fooled, and not deliberately.

> > > On the other hand, detecting text files is hard.
> >
> > Deciding whether a file is a text file may be hard even for a human. What about text files with ANSI control sequences?
>
> Same as, say, a Word97 document: not text for my purposes. It might be just coloured plain text, but there is no generic way to handle that.

I think I've already seen such files as distributed text files (documentation), or perhaps there were just backspace characters to get bold (x\bx) and underline (x\b_). The less utility can handle them.

> > I think better questions could be: why do you want to regard a file as text? For what purpose(s)?
>
> For the all shipped text files in UTF-8 rule only? A shipped config file will have some settings the user may edit and comments he may read. Being able to see what's going on is a prerequisite here.

However some config files may be byte-oriented (like procmailrc, AFAIK).

> HTML can include http-equiv which take care of rendering, but editing is still a problem. And if you edit it, or, say, fill in some fields from a database, you risk data loss. If everything is UTF-8 end-to-end, this risk goes away. (I do care about plain text more, though.)

You may still have NFC/NFD problems (this is also true for filenames).

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
Archive: http://lists.debian.org/20130812172807.ga2...@ioooi.vinc17.net
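The detection argument quoted in this message (only certain high-bit byte sequences are valid UTF-8) amounts to a strict decode. The following is a minimal sketch of that check, with `is_valid_utf8` being a name chosen for this example. It also shows the caveat raised earlier in the thread: 7-bit encodings such as ISO-2022-JP pass the check, because every one of their bytes is also valid ASCII.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "Lefèvre".encode("utf-8")       # è becomes C3 A8
latin1_bytes = "Lefèvre".encode("latin-1")   # è becomes E8, followed by 'v'
iso2022_bytes = "日本".encode("iso2022_jp")   # escape sequences, all bytes < 0x80

print(is_valid_utf8(utf8_bytes))     # True
print(is_valid_utf8(latin1_bytes))   # False: E8 needs two continuation bytes
print(is_valid_utf8(iso2022_bytes))  # True, although the file is not UTF-8
```

A check like this says nothing about whether a file is text at all, nor about NFC versus NFD: both normalization forms are valid UTF-8.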
Re: UTF-8 in jessie
On 12 August 2013 01:51, Adam Borowski <kilob...@angband.pl> wrote:
> On Mon, May 06, 2013 at 02:49:57PM +0200, Andreas Beckmann wrote:
> I propose the following sub-goals:
> 1. all programs should, in their default configuration, accept UTF-8 input and pass it through uncorrupted. Having to manually specify encoding is acceptable only in a programmatic interface, GUI/std{in,out,err}/ command line/plain files should work with nothing but LC_CTYPE.
> 2. all GUI/curses/etc programs should be able to display UTF-8 output where appropriate
> 3. all file names must be valid UTF-8
> 4. all text files should be encoded in UTF-8

What about locales though?

* C.utf8 locale should be always available
* C.utf8 locale should be the default/fallback locale
* utf8 locale variants should be default / available / preferred (where appropriate)

(this is a rough idea; adjust the above as appropriate and feasible at this point in time)

Regards,
Dmitrijs.
Archive: http://lists.debian.org/canbhluhz9rezyipz1ze5zrptb5zccrspmdf9aoaygqyjvk1...@mail.gmail.com
Re: UTF-8 in jessie
On 2013-08-12 17:58:20 +0200, Adam Borowski wrote:
> On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
> > 5. All programs consuming UTF8 Text must understand a BOM.
>
> I'm afraid I don't agree here: BOMs are nasty stuff that serve no purpose once you standardize on UTF8. They might help with exchange with a minority of Windows programs, at a cost at our side. Windows hardly does plain text: most of that is MSVC/etc sources, but then, the C/C++ standards explicitly forbid junk in places other than comments. Most other languages expect a hashbang on Unix, which makes BOMs impossible.

I think that BOM has more drawbacks than advantages. It could be useful only if there were an API to handle it correctly and transparently, and if the current APIs (open(), fopen(), etc.) were no longer used. Basically this means that one would need a new OS. This would also mean that a BOM could be seen as some kind of metadata used by the new API, and having the charset in the metadata would actually make BOM completely useless.

> Other reasons:
> * concatenating files adds a misplaced BOM
> * taking stuff from the middle loses them
> * tools like grep, patch, etc pick and insert lots of individual lines
> * tools that don't care about encodings would need to learn about them
> * files that appear the same will have a different hash due to presence or absence of an invisible character that can appear/disappear with no explicit request on the user's part
> * with UTF-8, we're 95% there. For BOMs, there's almost no support.

This would also affect regexp, e.g. ^foo on the first line of a file.

> So I'm strongly against producing BOMs. As for accepting them, there's little that can break so it would be mostly ok... but certainly not as a must clause.

Agreed.

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
Archive: http://lists.debian.org/20130812214212.ga22...@xvii.vinc17.org
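The regexp remark in this message is easy to reproduce. The sketch below uses Python's standard `re` module and its `utf-8-sig` codec (which drops one leading BOM if present); it is only an illustration of the effect, not a claim about how grep or any other tool behaves.

```python
import re

raw = b"\xef\xbb\xbffoo bar\n"  # a first line that starts with a UTF-8 BOM

naive = raw.decode("utf-8")         # BOM is kept as U+FEFF before 'f'
stripped = raw.decode("utf-8-sig")  # leading BOM is removed

print(bool(re.match(r"^foo", naive)))     # False
print(bool(re.match(r"^foo", stripped)))  # True
```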
Re: UTF-8 in jessie
On 2013-08-12 20:14:30 +0100, Dmitrijs Ledkovs wrote:
> What about locales though?
>
> * C.utf8 locale should be always available
> * C.utf8 locale should be the default/fallback locale
> * utf8 locale variants should be default / available / preferred (where appropriate)

If scripts intend to use LC_ALL=C.UTF-8 to force everything to the standard locale with UTF-8 support, then the glibc should be modified to regard C.UTF-8 like C w.r.t. $LANGUAGE. I mean:

xvii% LANGUAGE=fr_FR LC_ALL=C.UTF-8 cp
cp: opérande de fichier manquant
Saisissez « cp --help » pour plus d'informations.
xvii% LANGUAGE=fr_FR LC_ALL=C cp
cp: missing file operand
Try 'cp --help' for more information.

Both should have output in English.

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
Archive: http://lists.debian.org/20130812215017.gb22...@xvii.vinc17.org
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 03:55:03PM +0200, Adam Borowski wrote:
> On Mon, Aug 12, 2013 at 09:58:30AM +0200, Niels Thykier wrote:
> > For the record, there is a Lintian tag for this now[1], which suggests only a handful of packages violates this.
> >
> > > - Recommend ASCII when possible.
> > > - Require ASCII for files in /bin, /sbin, /usr/bin, /usr/sbin and /usr/games.
> >
> > Requiring ASCII for files in $PATH should be trivial to implement as a separate tag. I suppose the ASCII requirement could also be implemented as a pedantic check or so. Regardless, patches welcome. :)
>
> I disagree here: I'd want to remove any need for that recommendation instead. You might have a point about files in $PATH, though.

On Mon, Aug 12, 2013 at 03:56:48PM +, Thorsten Glaser wrote:
> I disagree with requiring ASCII for $PATH though…

Hi Adam, Thorsten, and everybody,

To my knowledge, in Unstable there is currently no filename in the PATH that is not encoded in plain ASCII. The rationale for codifying this practice into a requirement is to ensure that on multi-user systems, the administrator and the users will not encounter commands that they cannot display or cannot type.

For file names outside the PATH, the recommendation to use ASCII when possible should not be interpreted in an overly restrictive way: there are also good reasons for using UTF-8 characters that are not in ASCII.

See http://bugs.debian.org/701081 for further discussion.

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan
Archive: http://lists.debian.org/20130812235702.gb9...@falafel.plessy.net
Re: UTF-8 in jessie (debhelper and BOM)
Hi,

UTF-8 is indeed a good goal in principle. (I agree, but I am struggling to update package documentation, since Japanese is known to be tough (JIS 2022/EUCJP/SHIFT-JIS/... are used). A mix of EUC and SHIFT-JIS can easily be confused with LATIN-1.)

But I do not understand goal #5. Why MUST? Do you have a rationale?

On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
> On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
> > I propose the following sub-goals:
> ...
> > 4. all text files should be encoded in UTF-8

Yes. But it would be nice to have some support from dh_installdocs :-)

> 5. All programs consuming UTF8 Text must understand a BOM.

I agree with this as a SHOULD, but should we state MUST? After all, a BOM has no value in UTF-8 except to upset some programs. See the Wikipedia page: http://en.wikipedia.org/wiki/Byte_order_mark

| The Unicode Standard permits the BOM in UTF-8, but does not require
| or recommend its use. Byte order has no meaning in UTF-8 ...

(A pointer to the Unicode document is listed there.)

If it is only for the first byte, it is relatively easy. But there is text data with bogus BOMs in the content. Should programs understand them to be safe, too?

FYI: I recently had a problem with PO files containing lots of BOMs inside a text file, which broke running XeTeX. Please note the TeX family of programs has more elaborate character support than Unicode only UTF-8. I would rather have XeTeX ...)

To me, a program to filter such BOMs would be nice. But we should not shoot a good UTF-8 program for stupid BOM-containing UTF-8 data.

Osamu
Archive: http://lists.debian.org/20130813044403.GB19557@goofy.localdomain
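The BOM filter wished for in this message could be as small as the sketch below. It assumes the input is valid UTF-8 and that every U+FEFF in it is unwanted, which is an assumption: in the middle of text that code point formally means ZERO WIDTH NO-BREAK SPACE, so a blanket removal may drop characters somebody intended. The function name `strip_boms` and the sample PO-like data are made up for illustration.

```python
BOM = "\ufeff"


def strip_boms(text: str) -> str:
    """Remove every U+FEFF, whether at the start or inside the text."""
    return text.replace(BOM, "")


# Sample data with one leading and one embedded BOM.
raw = '\ufeffmsgid "hello"\n\ufeffmsgstr "bonjour"\n'.encode("utf-8")

cleaned = strip_boms(raw.decode("utf-8"))
print(cleaned.encode("utf-8"))
```

Decoding with the `utf-8-sig` codec instead would only remove a single BOM at the very start, which covers the common case but not the embedded ones described here.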
UTF-8 in jessie
On Mon, May 06, 2013 at 02:49:57PM +0200, Andreas Beckmann wrote:
> now might be the right time to start a discussion about release goals for jessie.

I would like to propose full UTF-8 support. I don't mean here full support for all of Unicode's finer points, merely complete eradication of mojibake. That is, ensuring that /m.o/ matches möo, or that ä sorts as equal to a + combining ¨ is out of scope of this proposal.

I propose the following sub-goals:
1. all programs should, in their default configuration, accept UTF-8 input and pass it through uncorrupted. Having to manually specify encoding is acceptable only in a programmatic interface, GUI/std{in,out,err}/ command line/plain files should work with nothing but LC_CTYPE.
2. all GUI/curses/etc programs should be able to display UTF-8 output where appropriate
3. all file names must be valid UTF-8
4. all text files should be encoded in UTF-8

This proposal doesn't call for eradication of non-UTF8 locales, even though I think that's long overdue. Josselin Mouette proposed that in #603914, and I agree, but that's material for another flamewar.

Let's discuss the above points in depth:

1. properly passing UTF-8

Text entered by a user should never get mangled. These days, we can assume mixed charsets are a thing of the past, thus there's no need of special handling. So are, mostly, programs that don't support it -- but due to historic reasons, some are not configured to do so. Thus, let's mandate that no per-program steps are needed.

An example: let's say we have an SQL table foo(a varchar(250)). Let's run somesqlclient -e "insert into foo values('$x'); select a from foo" (-e being whatever stands for "execute this statement").

sqlite3: ok
p[ostgre]sql: ok
mysql: doesn't work!

But... the schema was declared as UTF-8, my locale is en_US.UTF-8, why doesn't it work? Turns out mysql requires you to call it with an extra argument, --default-character-set=utf8. There's no binary ABI to maintain, compat with some historic behaviour makes no sense. I can accept having to specify the charset in, say, a DBI line, as that's what the API wants, but on the command line... that's just wrong. Am I supposed to wrap everything with iconv, and suffer data loss on the way? Setting LANG/LC_foo should be enough.

Another case, perhaps more controversial, is apache. Just take a look at how many of Debian random project pages have mangled encodings somewhere. By a 0th approximation, well over one third (more for text/plain, such as logs). And that's with users whose skills are way above average.

These days, producing text that's not in UTF-8 can take quite a bit of effort, especially with modern GUI tools which don't even really pay lip service to supporting ancient charsets anymore. Thus, if someone serves some text in such a charset, he takes pains to even edit it. One argument is that because AddDefaultCharset overrides http-equiv, such old files would be mangled. I'd say, as they already take effort to maintain, let's let them rot in hell, as they are a rare case that stands in the way of a nearly ubiquitous one working properly. Such an admin can always configure his server to use an ancient encoding if he wishes to do so.

(The other argument, our own files shipped in /doc/, is dead since apache 2.2.22-4, and is a major part of part 4 of this proposal.)

2. GUI/curses display

With gtk, qt, and probably more, the issue is mostly moot. Other toolkits might require some work, but typically it's a matter of encoding (part 1 of this proposal): characters have different horizontal widths so you use outside functions for functionality like line wrapping already.

Not so much in curses. Here, you have some characters take two spaces (CJK), some take zero (zero width spaces), some take zero but must not be detached from the previous character (combining). The line wrapping algorithm is actually quite simple, but needs to be implemented for every curses program that displays arbitrary strings. Ouch.

[I got quite some experience fixing curses/etc programs this way, so I pledge priority help here. gtk/qt/fooxwidgets, not so much.]

3. all file names must be UTF-8

This is quite straightforward. They are already uninstallable on filesystems that operate in characters rather than bytes. Might be a good idea to forbid nasty stuff like newlines, tabs, etc too.

I propose to apply this restriction to source packages as well. If Contents-* files are to be believed, the only violation is a binary package, zero source ones, so there'd be no extra work now, and at most a repack if an upstream regresses. The benefit is less clear than for binaries, but it's trivial and would prevent unexpected breakages.

4. all shipped text files in UTF-8

We don't want mojibake in provided documentation, config files, etc. With the amount of hackers nearby, even perl/shell/python/etc scripts in /*/bin. In short, all text files.

This could be done by a
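For the curses point (2) in the proposal above, here is a rough sketch of width-aware wrapping: wide characters count as two cells, combining marks and zero-width characters count as zero and are never split from the preceding character. It is written under stated assumptions: it approximates cell width with Python's `unicodedata` tables, whereas C curses programs would normally call wcwidth(), and the function names `char_width` and `wrap` are invented for this example. It breaks at any character, not at word boundaries.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate number of terminal cells a character occupies."""
    # Combining marks, enclosing marks and format characters (e.g. U+200B).
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def wrap(text: str, columns: int) -> list:
    """Split text into lines that each fit in `columns` cells."""
    lines, current, used = [], "", 0
    for ch in text:
        w = char_width(ch)
        # Zero-width characters never start a new line, so they stay
        # attached to the character they follow.
        if w and used + w > columns:
            lines.append(current)
            current, used = "", 0
        current += ch
        used += w
    if current:
        lines.append(current)
    return lines


print(wrap("日本では", 4))
print(wrap("e\u0301e\u0301e\u0301", 2))
```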
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
> [...]
> On the other hand, detecting text files is hard. The best tool so far, file, makes so many errors it's useless for this purpose. One could use location: like, declaring stuff in /etc/ and /usr/share/doc/ to be text unless proven otherwise, but that's an incomplete hack. Only hashbangs can be considered reliable, but scripts are not where most documentation goes.

Just a note: hashbangs can't really be considered reliable either -- consider tarball-in-sh/other-script files (waf is a good example). Then there's stuff like gambas-compiled executables which also ship with valid hashbangs, and #!/usr/bin/haserl stuff which can contain lua bytecode after the hashbang line.

The only requirement for valid hashbangs, afaict, is that the first two bytes are #!, and everything up to the \x20 or \n is resolvable to a valid filename.

> [...]

-- 
Kind regards,
Loong Jin
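To make the point in this message concrete, the sketch below extracts the interpreter from a hashbang line using the rule stated above (first two bytes `#!`, then everything up to a space or newline) and shows a payload that has a valid hashbang yet is not text. The function name `shebang_interpreter` and the sample bytes are invented; the sample only imitates the "script followed by binary data" layout and is not taken from waf. As a small liberty, whitespace directly after `#!` is skipped.

```python
from typing import Optional


def shebang_interpreter(data: bytes) -> Optional[str]:
    """Return the interpreter path named on a hashbang line, or None."""
    if not data.startswith(b"#!"):
        return None
    first_line = data[2:].split(b"\n", 1)[0].strip()
    if not first_line:
        return None
    # Everything up to the first whitespace is the interpreter path.
    interpreter = first_line.split(None, 1)[0]
    return interpreter.decode("utf-8", errors="replace")


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A script header followed by an embedded binary blob.
payload = b"#!/usr/bin/env python\nprint('unpacking')\n#==>\n" + bytes([0x00, 0xFF, 0xFE])

print(shebang_interpreter(payload))  # /usr/bin/env
print(is_valid_utf8(payload))        # False
```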
Re: UTF-8 in jessie
On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
> I would like to propose full UTF-8 support. I don't mean here full support for all of Unicode's finer points, merely complete eradication of mojibake.

Hi Adam,

this is a great goal. Here are two comments.

There is a related issue opened on the Policy (http://bugs.debian.org/701081), where we propose the following:

- Require UTF-8 for the names of all files and directories installed by binary packages.
- Recommend ASCII when possible.
- Require ASCII for files in /bin, /sbin, /usr/bin, /usr/sbin and /usr/games.

About display by GUIs, I think that we should have a system to install all the fonts necessary to display languages that we support at the installation.

Have a nice Debconf !

-- 
Charles
Archive: http://lists.debian.org/20130812021825.gc6...@falafel.plessy.net
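The two hard requirements in the Policy proposal above (valid UTF-8 everywhere, ASCII only in the listed directories) can be expressed as a simple check on a path given as bytes. This is a sketch for illustration only: it is not how Lintian implements its tags, and the function name `check_name` and the message strings are made up.

```python
import posixpath

# Directories taken from the proposal quoted above.
ASCII_ONLY_DIRS = (b"/bin", b"/sbin", b"/usr/bin", b"/usr/sbin", b"/usr/games")


def check_name(path: bytes) -> list:
    """Return a list of problems found in an installed file's path."""
    problems = []
    try:
        path.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    directory = posixpath.dirname(path)
    if directory in ASCII_ONLY_DIRS and not all(b < 0x80 for b in path):
        problems.append("non-ASCII name in a PATH directory")
    return problems


print(check_name(b"/usr/bin/ls"))
print(check_name("/usr/bin/möo".encode("utf-8")))
print(check_name(b"/usr/share/doc/caf\xe9"))            # Latin-1 encoded name
print(check_name("/usr/share/doc/café".encode("utf-8")))
```

The "recommend ASCII when possible" item is deliberately not checked, since the thread treats it as a recommendation rather than a rule.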