Re: OT: Python (was: Make Unicode bugs release critical?)
On 2011-02-14 16:43:11 +, Ian Jackson wrote:
> When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
> characters to stdout should use UTF-8. That's what LC_CTYPE means.

So, cat, grep, etc. are all broken. :)

--
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)

--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20110216000107.gl15...@prunille.vinc17.org
Re: OT: Python (was: Make Unicode bugs release critical?)
On Wed, Feb 16, 2011 at 01:01:07AM +0100, Vincent Lefevre wrote:
> On 2011-02-14 16:43:11 +, Ian Jackson wrote:
> > When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
> > characters to stdout should use UTF-8. That's what LC_CTYPE means.
>
> So, cat, grep, etc. are all broken. :)

How come? cat will, for any valid UTF-8 character on input, print a valid UTF-8 character on output. For any valid ISO-8859-1 character on input, it will print a valid ISO-8859-1 character on output.

grep, on the other hand, has to actually understand the encoding -- and it does. Try this:

$ echo ą|LC_CTYPE=C grep --color=always .

Will be mangled.

$ echo ą|LC_CTYPE=en_US.utf-8 grep --color=always .

Will be handled correctly.

--
1KB // Microsoft corollary to Hanlon's razor:
// Never attribute to stupidity what can be
// adequately explained by malice.
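The difference between a byte-transparent tool and one that has to decode its input can be sketched in Python 3. This is only an illustration of the idea, not how cat or grep are implemented; the two helper names are made up for the example. A "." pattern applied to raw bytes splits ą into two pieces, much like grep under LC_CTYPE=C, while the same pattern applied to decoded text sees a single character.

```python
import re


def count_dot_matches_bytes(data: bytes) -> int:
    """Match '.' against raw bytes, roughly what a C-locale matcher sees."""
    return len(re.findall(rb".", data))


def count_dot_matches_text(data: bytes, encoding: str = "utf-8") -> int:
    """Decode first, then match '.', roughly what a UTF-8-aware matcher sees."""
    return len(re.findall(r".", data.decode(encoding)))


raw = "\u0105".encode("utf-8")       # ą is the two bytes 0xC4 0x85 in UTF-8
print(count_dot_matches_bytes(raw))  # 2
print(count_dot_matches_text(raw))   # 1
```

A tool like cat never needs the second function: it copies bytes and leaves their interpretation to whoever reads them.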
Re: OT: Python (was: Make Unicode bugs release critical?)
On 2011-02-16 01:34:51 +0100, Adam Borowski wrote:
> On Wed, Feb 16, 2011 at 01:01:07AM +0100, Vincent Lefevre wrote:
> > On 2011-02-14 16:43:11 +, Ian Jackson wrote:
> > > When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
> > > characters to stdout should use UTF-8. That's what LC_CTYPE means.
> >
> > So, cat, grep, etc. are all broken. :)
>
> How come? cat will, for any valid UTF-8 character on input, print a
> valid UTF-8 character on output. For any valid ISO-8859-1 character on
> input, it will print a valid ISO-8859-1 character on output.

I was just commenting on what Ian said. If there is a valid reason why cat may not produce UTF-8 in UTF-8 locales, the same reason applies to perl or any other software.

--
Vincent Lefèvre vinc...@vinc17.net
Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)
On Friday, 11 February 2011 at 19:33 +0100, Axel Beckert wrote:
> Kicking out good and unique software, only because of missing or
> incomplete UTF-8 support, will surely lower Debian's quality more than
> missing or broken UTF-8 support in very few packages. And it would make
> those users (and devs) angry who need that software independently of
> working UTF-8 support or not.

Kicking out software with incomplete UTF-8 support sounds unfair.

Kicking out software that doesn’t work at all in UTF-8 locales and requires the user to set a broken locale, OTOH, sounds like a sanitary emergency.

--
 .''`.
: :' :     “You would need to ask a lawyer if you don't know
`. `'       that a handshake of course makes a valid contract.”
  `-        --  Jörg Schilling
Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)
Josselin Mouette writes (Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)):
> Kicking out software that doesn’t work at all in UTF-8 locales and
> requires the user to set a broken locale, OTOH, sounds like a sanitary
> emergency.

Excellent, I look forward to the removal of python. I always hated that language anyway.

$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
<unicode pound sign>
$

But

$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
$

Ian.
Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)
* Ian Jackson ijack...@chiark.greenend.org.uk, 2011-02-14, 12:42:
> > Kicking out software that doesn’t work at all in UTF-8 locales and
> > requires the user to set a broken locale, OTOH, sounds like a sanitary
> > emergency.
>
> Excellent, I look forward to the removal of python. I always hated that
> language anyway.
>
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> <unicode pound sign>
> $
>
> But
>
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
> $

This is the expected behaviour. Incidentally, it has nothing to do with UTF-8. You'll get the same result if you use a locale with a legacy encoding.

--
Jakub Wilk
OT: Python (was: Make Unicode bugs release critical?)
Hi,

let's start a python rant. I love to hate this language. :-)

On Mon, 14 Feb 2011 at 14:14, Jakub Wilk wrote:
> > $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> > <unicode pound sign>
> [...]
> > $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
>
> This is the expected behaviour. Incidentally, it has nothing to do with
> UTF-8. You'll get the same result if you use a locale with a legacy
> encoding.

I see. It is funny to see Python lovers blame others for the bugs in the language.

~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat

Both give the same result, a '£' sign, as expected.

> * Ian Jackson ijack...@chiark.greenend.org.uk, 2011-02-14, 12:42:
> > Excellent, I look forward to the removal of python. I always hated that
> > language anyway.

I hate them more. :-)

Regards
   Klaus
--
Klaus Ethgen                            http://www.ethgen.ch/
pub 2048R/D1A4EDE5 2000-02-26 Klaus Ethgen kl...@ethgen.de
Fingerprint: D7 67 71 C4 99 A6 D4 FE EA 40 30 57 3C 88 26 2B
Re: OT: Python (was: Make Unicode bugs release critical?)
On 2011-02-14, Klaus Ethgen kl...@ethgen.de wrote:
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
>
> Both gives the same result, a '£' sign as expected.

And what's the value in that demonstration? Yes, you can treat UTF-8 like a bytestream. And the thread was about the problems that can arise from this.

Kind regards
Philipp Kern
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, 2011-02-14 at 14:37 +0100, Klaus Ethgen wrote:
> lets start a python rant. I love to hate this language. :-)

Let's not.

Let's not rant about any languages, or tools, or desktop environments. Let's be constructive on Debian mailing lists, shall we? We have plenty of side-channels for rants, sarcasm, snide remarks, passive-aggressiveness, and other forms of anti-social behavior, let's use those instead.
Re: OT: Python (was: Make Unicode bugs release critical?)
* Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat

Let me try...

$ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
stdin: line 1, char 1, byte offset 1: invalid UTF-8 code

But I don't blame Perl for that. It's documented behavior, so I can either live with that or use another language.

--
Jakub Wilk
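The isutf8 complaint is what one would expect if the Perl one-liner wrote the single byte 0xA3 (the Latin-1 form of the pound sign) rather than the two-byte UTF-8 sequence. A rough Python 3 stand-in for such a check, with a helper name invented for this sketch and no claim to match isutf8's actual implementation:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


pound = "\u00a3"
print(is_valid_utf8(pound.encode("latin-1")))  # False: lone byte 0xA3
print(is_valid_utf8(pound.encode("utf-8")))    # True: bytes 0xC2 0xA3
```

A terminal that happens to display the lone byte as £ (or passes it through cat unchanged) does not make it valid UTF-8, which is why the piped test is the more telling one.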
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, 14 Feb 2011 at 15:15, Lars Wirzenius wrote:
> On ma, 2011-02-14 at 14:37 +0100, Klaus Ethgen wrote:
> > lets start a python rant. I love to hate this language. :-)
>
> Let's not.

Up to here, that is a matter of personal preference.

> Let's not rant about any languages, or tools, or desktop environments.
> Let's be constructive on Debian mailing lists, shall we?

You are right. I just couldn't resist when someone was trying to blame everyone other than the one that has the bug.

Regards
   Klaus
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, Feb 14, 2011 at 02:02:11PM +, Philipp Kern wrote:
> On 2011-02-14, Klaus Ethgen kl...@ethgen.de wrote:
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
> >
> > Both gives the same result, a '£' sign as expected.
>
> And what's the value in that demonstration? Yes, you can treat UTF8
> like a bytestream. And the thread was about the problems that can arise
> of this.

Er, and tell me where exactly it makes sense to allow one encoding but not another for a bytestream?

It appears that Python has a nasty bug where it ignores the encoding if isatty(stdout) returns 0. So let's go fix or report that rather than argue about it.
Re: OT: Python (was: Make Unicode bugs release critical?)
Jakub Wilk writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> * Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
>
> Let me try...
>
> $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
> stdin: line 1, char 1, byte offset 1: invalid UTF-8 code

WTF. OK, Perl's out too.

We'll have to write everything in dash :-).

Ian.
Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)
On Monday, 14 February 2011 at 12:42 +, Ian Jackson wrote:
> Josselin Mouette writes (Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)):
> > Kicking out software that doesn?t work at all in UTF-8 locales and
> > requires the user to set a broken locale, OTOH, sounds like a sanitary
> > emergency.
>
> Excellent, I look forward to the removal of python. I always hated that
> language anyway.

From your reply I look more forward to the removal of vm, since it broke the Unicode in my original email.

> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
> $

You must specify the encoding of your data in your bitstreams. I agree this is inconvenient (and one of the things I dislike in Python), but it is:
 1. completely independent of the locale (UTF8 or not)
 2. easy to work with once you understand how encodings in Python work
 3. much better in Python 3.
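As a sketch of what "specify the encoding" can look like in practice, here is a Python 3 version using the standard io.TextIOWrapper. The helper name write_text is made up for this example, and an in-memory io.BytesIO stands in for a pipe so the result can be inspected; this is one possible approach, not the fix the thread settled on.

```python
import io


def write_text(raw, text: str, encoding: str = "utf-8") -> None:
    """Write text to a binary stream using an explicitly chosen encoding."""
    wrapper = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    wrapper.write(text)
    wrapper.flush()
    wrapper.detach()  # leave the underlying binary stream open for the caller


pipe = io.BytesIO()               # stand-in for a non-tty stdout
write_text(pipe, "\u00a3\n")
print(pipe.getvalue())            # b'\xc2\xa3\n'
```

With the encoding fixed by the program, the output no longer depends on whether the destination is a terminal.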
Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)
On Mon, 14 Feb 2011, Josselin Mouette wrote:
> You must specify the encoding of your data in your bitstreams. I agree
> this is inconvenient (and one of the things I dislike in Python), but it
> is:
>  1. completely independent of the locale (UTF8 or not)
>  2. easy to work with once you understand how encodings in Python work
>  3. much better in Python 3.

As long as python 3 is compiled to use UCS-4 as the internal representation, you mean. Are our packages set to use UCS-4?

--
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)
On Mon, Feb 14, 2011 at 02:01:08PM -0200, Henrique de Moraes Holschuh wrote:
> As long as python 3 is compiled to use UCS-4 as the internal
> representation, you mean. Are our packages set to use UCS-4?

At least for python 3.1, yes:

common_configure_args = \
	--prefix=/usr \
	--enable-ipv6 \
	--with-dbmliborder=bdb \
	--with-wide-unicode \
	--with-computed-gotos \
	--with-system-expat \

The --with-wide-unicode option enables UCS-4. With very few exceptions, I believe all the recent Debian python packages have been compiled this way.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187
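Whether a given interpreter is a wide (UCS-4) build can be checked at run time through sys.maxunicode. The sketch below is a generic check with a made-up helper name, not something taken from the Debian packaging. Note that since Python 3.3 the narrow/wide distinction no longer exists and the check is always true there; it only varies on the 2.x and 3.1/3.2 interpreters discussed in this thread.

```python
import sys


def is_wide_build() -> bool:
    """True if the interpreter can hold code points above U+FFFF in one unit."""
    return sys.maxunicode > 0xFFFF


astral = "\U0001F600"     # a code point outside the Basic Multilingual Plane
print(is_wide_build())    # True on wide builds and on any Python >= 3.3
print(len(astral))        # 1 on wide builds, 2 (a surrogate pair) on narrow ones
```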
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, 14 Feb 2011 at 16:24, Ian Jackson wrote:
> Jakub Wilk writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> > * Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> > > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
> >
> > Let me try...
> >
> > $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
> > stdin: line 1, char 1, byte offset 1: invalid UTF-8 code
>
> WTF. OK, Perl's out too.

No, it is not. 00a3 is just not a utf-8 character, it is unicode. To get a correct utf-8 character you need to print \x{c2a3} and then isutf8 is happy.

> We'll have to write everything in dash :-).

lisp. :-)

But now we are getting completely off topic.

Regards
   Klaus
Re: OT: Python (was: Make Unicode bugs release critical?)
Klaus Ethgen writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> No, it is not. 00a3 is just not a utf-8 character, it is unicode. To get
> a correct utf-8 character you need to print \x{c2a3} and then isutf8 is
> happy.

When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode characters to stdout should use UTF-8. That's what LC_CTYPE means.

Ian.
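Ian's point is that the program prints the character U+00A3 and the locale decides which bytes that becomes. A deliberately simplified Python 3 sketch of that rule follows; codeset_of is a hypothetical helper that only parses the locale name, whereas a real program would normally ask the C library (for example via locale.nl_langinfo(locale.CODESET) after calling setlocale).

```python
import codecs


def codeset_of(locale_name: str, default: str = "ascii") -> str:
    """Extract the codeset from a name like 'en_GB.utf-8'; plain 'C' gets the default."""
    _, dot, rest = locale_name.partition(".")
    if not dot:
        return default
    codeset = rest.split("@", 1)[0]      # drop a modifier such as '@euro'
    return codecs.lookup(codeset).name   # normalise spelling, e.g. 'UTF-8' -> 'utf-8'


pound = "\u00a3"                         # the character, independent of any encoding
print(pound.encode(codeset_of("en_GB.utf-8")))       # b'\xc2\xa3'
print(pound.encode(codeset_of("en_GB.ISO-8859-1")))  # b'\xa3'
```

Under this reading the program never writes c2a3 or a3 by hand; it names the character and lets LC_CTYPE pick the byte sequence.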
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, 14 Feb 2011 16:43:11 + Ian Jackson ijack...@chiark.greenend.org.uk wrote:
> Klaus Ethgen writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> > No, it is not. 00a3 is just not a utf-8 character, it is unicode. To get
> > a correct utf-8 character you need to print \x{c2a3} and then isutf8 is
> > happy.
>
> When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
> characters to stdout should use UTF-8. That's what LC_CTYPE means.

By the way,

$ LC_CTYPE=en_GB.utf-8 echo 'puts \x00a3\n'|tclsh|isutf8
$
$ LC_CTYPE=en_GB.utf-8 echo 'puts \x00a3\n'|tclsh|xxd -p
c2a30a0a
$

But RMS told the world not to use Tcl.
Re: Make Unicode bugs release critical?
On 02/14/2011 10:39 AM, Ian Jackson wrote:
[snip]
> The fact that naive Python programs work (honouring LC_CTYPE as they
> should) unless you pipe their output to something is clearly a bug. The
> fact that it's a specification bug doesn't mean it's not a bug.

It doesn't seem to work for me.

$ python -V
Python 2.6.6

$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\uc2a3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc2a3' in position 0: ordinal not in range(128)

$ perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 51 registered patches, see perl -V for more detail)

$ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = en_GB.utf-8,
	LANG = en_US.UTF-8
    are supported and installed on your system.
perl: warning: Falling back to the standard locale (C).
£

$ locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

--
"The normal condition of mankind is tyranny and misery."
Milton Friedman
Re: Make Unicode bugs release critical?
On Mon, Feb 14, 2011 at 03:57:44PM -0600, Ron Johnson wrote:
> It doesn't seem to work for me.
[...]
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
[...]
> $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> perl: warning: Setting locale failed.
> perl: warning: Please check that your locale settings:
> 	LANGUAGE = (unset),
> 	LC_ALL = (unset),
> 	LC_CTYPE = en_GB.utf-8,
> 	LANG = en_US.UTF-8
>     are supported and installed on your system.
> perl: warning: Falling back to the standard locale (C).
[...]

You probably don't have an en_GB.utf-8 locale (maybe you have localepurge installed?). I bet en_US.utf-8 will net you different results.

--
{ IRL(Jeremy_Stanley); WWW(http://fungi.yuggoth.org/); PGP(43495829);
WHOIS(STANL3-ARIN); SMTP(fu...@yuggoth.org); FINGER(fu...@yuggoth.org);
MUD(kin...@katarsis.mudpy.org:6669); IRC(fu...@irc.yuggoth.org#ccl);
ICQ(114362511); YAHOO(crawlingchaoslabs); AIM(dreadazathoth); }
Re: Make Unicode bugs release critical?
On 02/14/2011 04:26 PM, The Fungi wrote:
> On Mon, Feb 14, 2011 at 03:57:44PM -0600, Ron Johnson wrote:
> > It doesn't seem to work for me.
> [...]
> > $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
> [...]
> > $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > perl: warning: Setting locale failed.
> > perl: warning: Please check that your locale settings:
> > 	LANGUAGE = (unset),
> > 	LC_ALL = (unset),
> > 	LC_CTYPE = en_GB.utf-8,
> > 	LANG = en_US.UTF-8
> >     are supported and installed on your system.
> > perl: warning: Falling back to the standard locale (C).
> [...]
>
> You probably don't have an en_GB.utf-8 locale (maybe you have
> localepurge installed?). I bet en_US.utf-8 will net you different
> results.

That's it...

$ LC_CTYPE=en_US.utf-8 python -c 'print u"\u00a3"'
£

$ LC_CTYPE=en_US.utf-8 perl -e 'print "\x{00a3}\n";'
£

No localepurge, but when initially building the system, I only installed one or two locales.
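A failure like the one above can be caught early by probing whether a locale is actually installed before relying on it. A minimal Python 3 sketch with a hypothetical helper name; it temporarily switches LC_CTYPE and restores the previous setting, so it is not suitable for use while other threads depend on the process locale.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if setlocale() accepts the given locale name for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)   # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)
    return True


for candidate in ("en_GB.utf-8", "en_US.utf-8", "C"):
    print(candidate, locale_available(candidate))
```

The results depend entirely on which locales were generated on the machine, which is exactly the difference between the two runs reported in this exchange.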
Re: Make Unicode bugs release critical?
On Mon, Feb 14, 2011 at 06:10:37PM -0600, Ron Johnson wrote:
> On 02/14/2011 04:26 PM, The Fungi wrote:
> > You probably don't have an en_GB.utf-8 locale (maybe you have
> > localepurge installed?). I bet en_US.utf-8 will net you different
> > results.
>
> That's it... No localepurge, but when initially building the system, I
> only installed one or two locales.

No one would expect a USian to use a GB locale.

The problem is, there is currently no way to request UTF-8 encoding without specifying a language. It's a remnant of ancient locales, where ISO-8859-1 didn't make sense for pl_PL nor ISO-8859-2 for fr_FR. Also, the iconv() functions are really inconvenient to use; it'd be much easier to use the regular wide char functions predictably.

In other words: can I has C.UTF-8 guaranteed?
Make Unicode bugs release critical?
On Fri, 2011-02-11 at 10:05 +0100, Vincent Fourmond wrote:
> On 11/02/11 09:52, Josselin Mouette wrote:
> > On Friday, 11 February 2011 at 09:47 +0100, Adam Borowski wrote:
> > > I'd say there should be no place in Debian in 2011 for software that
> > > can't do UTF-8, especially if near-identical forks exist.
> >
> > That would make a nice addition to the policy, wouldn’t it?
>
> So long as it is not a MUST, else I have a feeling we'll find many many
> packages RC... That aside, I agree with this idea.

A release goal or release requirement might be another way of achieving this.

However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times.

The first Unicode standard was published in 1991. That's twenty years ago. Any software that processes text at all and is incapable of dealing with UTF-8 should be considered with extreme suspicion. Making all such bugs be release critical (which includes the notion that release managers may ignore the bug in particular cases) sounds like a good way to get things under control.

--
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 09:37:54AM +, Lars Wirzenius wrote:
> However, I'm curious: is there a lot of software that is broken with
> Unicode, particularly with the UTF-8 encoding? I can't remember anything
> much in recent times.

Mostly it is just the old stuff like
 - eterm, aterm
 - elvis
 - X tools from the basic package (xman, xmessage, xmore, ...)
 - TeX without additional packages

--
Miroslav Kure
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote:
> > However, I'm curious: is there a lot of software that is broken with
> > Unicode, particularly with the UTF-8 encoding? I can't remember anything
> > much in recent times.
>
> Mostly it is just the old stuff like
>  - eterm, aterm
>  - elvis
>  - X tools from the basic package (xman, xmessage, xmore, ...)
>  - TeX without additional packages

 - tr(1)

--
WBR, wRAR
Re: Make Unicode bugs release critical?
Hi there!

On Fri, 11 Feb 2011 11:14:42 +0100, Miroslav Kure wrote:
> On Fri, Feb 11, 2011 at 09:37:54AM +, Lars Wirzenius wrote:
> > However, I'm curious: is there a lot of software that is broken with
> > Unicode, particularly with the UTF-8 encoding? I can't remember anything
> > much in recent times.
>
> Mostly it is just the old stuff like
>  - eterm, aterm
>  - elvis
>  - X tools from the basic package (xman, xmessage, xmore, ...)
>  - TeX without additional packages

Plus a2ps (see #180236), something a lot of people use before KSPs, even if there are various alternatives, some of them already in Debian:

  Message-ID: 4d3f1bf6.1060...@sanctuary.nslug.ns.ca
  URL: http://lists.debian.org/msgid-search/%3c4D3F1BF6.1060604%40sanctuary.nslug.ns.ca%3e

Everything should be at http://wiki.debian.org/UTF8BrokenApps.

Thx, bye,
Gismo / Luca
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote:
> On Fri, Feb 11, 2011 at 09:37:54AM +, Lars Wirzenius wrote:
> > However, I'm curious: is there a lot of software that is broken with
> > Unicode, particularly with the UTF-8 encoding? I can't remember anything
> > much in recent times.
>
> Mostly it is just the old stuff like
>  - eterm, aterm
>  - elvis
>  - X tools from the basic package (xman, xmessage, xmore, ...)
>  - TeX without additional packages

XeTeX and XeLaTeX allow native UTF-8 input. Should be made the default, IMO, given how obsolete and broken the standard TeX encodings are. Being able to write in actual text rather than a lot of illegible incantations was a major revelation, and it's a bit sad it was in that situation in the first place. It also sorts out the awful font support, so you can use standard freetype-registered fonts, again without the pain. Result: a document you can actually read in the editor!

IMO all those broken terminal emulators, editors and tools should be put in the bin. There are plenty of non-broken replacements, so why keep them around to bitrot even further? It's not like it's going to cause massive inconvenience--they are long obsolete.

Regards,
Roger

--
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
Re: Make Unicode bugs release critical?
Hi,

On Fri, 11 Feb 2011 at 10:37, Lars Wirzenius wrote:
> The first Unicode standard was published in 1991. That's twenty years
> ago. Any software that processes text at all and is incapable of dealing
> with UTF-8 should be considered with extreme suspicion. Making all such
> bugs be release critical (which includes the notion that release
> managers may ignore the bug in particular cases) sounds like a good way
> to get things under control.

I think you are mixing things together. First, there is Unicode. There are several encodings of Unicode (UTF-16, UTF-32, ...), and UTF-8 is not Unicode itself; it is just one encoding of it, and in my eyes the most problematic one, as it has undefined states and is variable length.

However, UTF-8 was created to allow using Unicode in non-Unicode environments. To me that was always a pointless plan. The most annoying results are the unreadable UTF-8 characters produced by buggy software that cannot handle encodings correctly (and there is a lot of it around) and by ignorant users who use UTF-8 in environments that are not specified for multibyte charsets (IRC).

While there are places where UTF-8 makes perfect sense and is the best solution, it is not the best solution for everything that ignorant users (me too ;-) need. So requiring software to be UTF-8 capable is somewhat inconsistent. Software has to be capable of handling every encoding it is specified for.

Regards
   Klaus
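The "variable length" property mentioned above is easy to see by encoding a few characters and counting bytes. A small Python 3 sketch (the helper name is made up for the example) comparing UTF-8 with UTF-16 and UTF-32; the little-endian variants are used only to keep the byte-order mark out of the counts.

```python
def encoded_lengths(text: str, encoding: str) -> list:
    """Number of bytes each character of text occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]


sample = "A\u00a3\u20ac\U0001F600"           # A, pound sign, euro sign, an emoji
print(encoded_lengths(sample, "utf-8"))      # [1, 2, 3, 4]
print(encoded_lengths(sample, "utf-16-le"))  # [2, 2, 2, 4]
print(encoded_lengths(sample, "utf-32-le"))  # [4, 4, 4, 4]
```

Of the three, only UTF-32 is fixed width; UTF-16 is also variable length once characters outside the Basic Multilingual Plane are involved.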
Re: Make Unicode bugs release critical?
Andrey Rahmatullin wrote: On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote: However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times. Mostly it is just the old stuff like
- eterm, aterm
- elvis
- X tools from the basic package (xman, xmessage, xmore, ...)
- TeX without additional packages
- tr(1)

grep, sed, awk, bash, ...

Torsten
Archive: http://lists.debian.org/4d552988.1020...@debian.org
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 01:20:24PM +0100, Torsten Werner wrote: However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times. Mostly it is just the old stuff like
- eterm, aterm
- elvis
- X tools from the basic package (xman, xmessage, xmore, ...)
- TeX without additional packages
- tr(1)

grep, sed, awk, bash, ...

http://bugs.debian.org/495677

-- 
WBR, wRAR
Re: Make Unicode bugs release critical?
On Fri, 2011-02-11 at 13:20 +0100, Torsten Werner wrote: grep, sed, awk, bash, ...

grep, sed, and awk, at least, seem to work acceptably for me with UTF-8. The support can be improved, I'm sure.

-- 
Blog/wiki/website hosting with ikiwiki (free for free software): http://www.branchable.com/
Archive: http://lists.debian.org/1297428633.3105.56.ca...@havelock.lan
Re: Make Unicode bugs release critical?
On Fri, 11 Feb 2011, Roger Leigh wrote: XeTeX and XeLaTeX allow native UTF-8 input. Should be made the default, IMO, given how obsolete and broken the standard TeX encodings are. Being able to write in actual text rather than

Please don't write rubbish if you don't know what you are talking about!!!

You have apparently no idea between input and font encoding. LaTeX can easily use utf8 with the appropriate inputenc, as well as dozens of other encodings. Not all of the world is using UTF8. UTF-8 is still tailored to western roman script, thus very unpopular in Japan for example.

sorts out the awful font support, so you can use standard freetype-registered fonts, again without the pain. Result: a document you can actually read in the editor!

Argg, PLEASE STOP THAT RUBBISH

I never use xetex, I write a lot in German (umlauts), Japanese, Italian, ... TeX is different, don't try to throw away working solutions of 20 years because of your ignorance.

ARrggg. I love people blabbering like drunkards.

IMO all those broken terminal emulators, editors and tools should be put in the bin. There are plenty of non-broken replacements, so why keep them around to bitrot even further? It's not like it's

So what is the replacement for tex? Yeah, I know, it is *luatex* but we are FAR from being stable and usable.

XeTeX is nice for certain things, but not for all. Have you tried to set Tibetan text with XeTeX? The last time I tried it was a mess. And with Khmer (the language and script of Cambodia) it is even worse.

Only because you are only using ASCII characters please don't make the rest of the world laugh at you.

Best wishes
Norbert (mumbling throw away in the bin*, *standard freetype*, ...)

Norbert Preining preining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live Debian Developer
DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
HAGNABY (n.) Someone who looked a lot more attractive in the disco than they do in your bed the next morning. --- Douglas Adams, The Meaning of Liff
Archive: http://lists.debian.org/20110211124629.ga1...@gamma.logic.tuwien.ac.at
Re: Make Unicode bugs release critical?
On 02/11/11 14:20, Torsten Werner wrote: grep, sed, awk, bash, ...

?

$ echo αβγ | sed 's/./a/'
aβγ

Regards,
Φαίδων :-)
Archive: http://lists.debian.org/4d553373.5060...@debian.org
Re: Make Unicode bugs release critical?
On 2011-02-11 21:46:29 +0900, Norbert Preining wrote: On Fri, 11 Feb 2011, Roger Leigh wrote: XeTeX and XeLaTeX allow native UTF-8 input. Should be made the default, IMO, given how obsolete and broken the standard TeX encodings are. Being able to write in actual text rather than

Please don't write rubbish if you don't know what you are talking about!!! You have apparently no idea between input and font encoding. LaTeX can easily use utf8 with the appropriate inputenc,

Which one??? FYI, utf8 is very incomplete and utf8x is broken (bug 601365).

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/ 100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/ Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)
Archive: http://lists.debian.org/20110211131843.gh15...@prunille.vinc17.org
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 09:46:29PM +0900, Norbert Preining wrote: On Fri, 11 Feb 2011, Roger Leigh wrote: XeTeX and XeLaTeX allow native UTF-8 input. Should be made the default, IMO, given how obsolete and broken the standard TeX encodings are. Being able to write in actual text rather than

Please don't write rubbish if you don't know what you are talking about!!!

Um, no need to be rude. Please keep your reply to technical points; if I've said something incorrect, by all means correct me, but insults is a step too far. I haven't said anything that could justify it, other than the fact that you disagree with my /opinion/.

You have apparently no idea between input and font encoding.

I only mentioned UTF-8 with regard to input, so you are assuming too much.

LaTeX can easily use utf8 with the appropriate inputenc, as well as dozens of other encodings. Not all of the world is using UTF8. UTF-8 is still tailored to western roman script, thus very unpopular in Japan for example.

The inputenc hack only gets you so far. I tried to go this way, and ran into all sorts of issues with UTF-8 in macro definitions getting scrambled and other sources of pain. With XeLaTeX I had no such troubles. So IME inputenc was not a suitable solution for serious UTF-8 work.

sorts out the awful font support, so you can use standard freetype-registered fonts, again without the pain. Result: a document you can actually read in the editor!

Argg, PLEASE STOP THAT RUBBISH

What you are calling rubbish is not in any way false. It's given me the ability to have nice legible UTF-8-encoded documents, with excellent font support. There may be other ways. There may be better ways. But it's not wrong.

[snip rant]

IMO all those broken terminal emulators, editors and tools should be put in the bin. There are plenty of non-broken replacements, so why keep them around to bitrot even further? It's not like it's

So what is the replacement for tex? Yeah, I know, it is *luatex* but we are FAR from being stable and usable.

Well I thought the jury was still out on which was the better solution. I really couldn't care less which wins; I'm using the solution which works right now, and I'll happily adopt whatever is better down the line.

XeTeX is nice for certain things, but not for all. Have you tried to set Tibetan text with XeTeX? The last time I tried it was a mess. And with Khmer (the language and script of Cambodia) it is even worse. Only because you are only using ASCII characters please don't make the rest of the world laugh at you.

You are again making unwarranted assumptions. I might not be using it for difficult-to-set languages, but I'm certainly not using ASCII characters only, or I wouldn't be needing UTF-8 input.

Regards,
Roger
-- .''`. Roger Leigh : :' : Debian GNU/Linux http://people.debian.org/~rleigh/ `. `' Printing on GNU/Linux? http://gutenprint.sourceforge.net/ `-GPG Public Key: 0x25BFB848 Please GPG sign your mail.
Re: Make Unicode bugs release critical?
On 2011-02-11 15:33:49 +0500, Andrey Rahmatullin wrote: On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote: However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times. Mostly it is just the old stuff like - eterm, aterm - elvis - X tools from the basic package (xman, xmessage, xmore, ...) - TeX without additional packages - tr(1) less has problems with new Unicode characters (bug 597918). -- Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/ 100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/ Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon) -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20110211133024.gi15...@prunille.vinc17.org
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 12:59:46PM +0100, Klaus Ethgen wrote: Am Fr den 11. Feb 2011 um 10:37 schrieb Lars Wirzenius: The first Unicode standard was published in 1991. That's twenty years ago. Any software that processes text at all and is incapable of dealing with UTF-8 should be considered with extreme suspicion. Making all such bugs be release critical (which includes the notion that release managers may ignore the bug in particular cases) sounds like a good way to get things under control. I think you are mixing stuff together. First there is unicode. There are several definitions for unicode (unicode-16, unicode-32, ...) but UTF-8 is not unicode it is just one implementation of unicode and in my eyes the most problematic as it has undefined states and is variable length. There is just one definition of Unicode, any new versions merely add extra characters, collating rules, etc. There are several ways to represent Unicode as a stream of bytes. Only one of them is fit for external storage, and that's UTF-8 since it doesn't break the assumptions that are true for text files: 1. no null bytes 2. basic newlines, etc are always newlines, never a part of a bigger character (not true for some ancient multibyte encodings) 3. not affected by endianness or any other internal detail Also, _all_ Unicode encodings are of variable length. However, UTF-8 was created to allow using unicode in non-unicode environments. For me that was always a pointless plan and the unreadable UTF-8 characters all around buggy software that cannot handle encodings correct (and there are many around) and ignorant users who are using UTF-8 in environments that are not specified for multibyte charsets (IRC) is the most annoying one. UTF-8 was never meant as merely a tool to allow using unicode in non-unicode environments. UTF-32 is useful only as an internal representation if you do care about a string of code points. 
Since a single character can consist of multiple such code points, it doesn't give you much unless you have to pass every code point through a function like wcwidth() -- ie, you are implementing something low-level which cares about properties of characters and their parts. You should never place UTF-32 into external storage that is not private to your program or can possibly be moved. UTF-16 is never, ever useful. It is a sad trap for win32 and Java developers, due to a bad engineering decision suggested, as I was told, by delegates from Microsoft and Sun, who wanted to conserve disk space and memory by storing separately code points and a language tag -- ie, exactly the thing Unicode was supposed to get us rid of. Even on day one, it was known that you can't fit all characters into 16 bits, and the decision to put all rare characters into a private area that needs out of band information was pretty ridiculous. The end result is, you have an encoding with all downsides of UTF-8 but none of the advantages. Since neither UTF-16 nor UTF-32 can be considered text, the decision all UNIX systems made was to use UTF-8 in the libc's API in all Unicode locales. Otherwise, you'd need separate APIs like FooBarA()/FooBarW() on Windows, which cause no end of problems. So specifying to be UTF-8 capable is somewhat inconsequent. Software has to be capable to handle every encoding as long as they are specified for that encodings. No, there is only one encoding left, as long as you don't have to talk to Windows. We can start purging away all the support for ancient charsets in places that do not need to handle foreign data. Debian has used UTF-8 as default for 5 releases already, and if you try to use an ancient locale, do not expect good results since no one bothers fixing bugs there. And maintaining unused code costs time and causes a risk of bugs, so good riddance! 
-- 1KB // Microsoft corollary to Hanlon's razor: // Never attribute to stupidity what can be // adequately explained by malice. -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20110211133612.ga2...@angband.pl
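The three properties claimed for UTF-8 in the message above (no null bytes, a newline byte is always a newline, no dependence on byte order) can be checked directly. The following is a minimal Python sketch; the sample string is arbitrary, and U+010A was picked for illustration only because its UTF-16 form happens to contain the byte 0x0A.

```python
# Arbitrary sample: "ą", a space, "Ċ" (U+010A) and one newline.
text = "\u0105 \u010a\n"

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")
utf16be = text.encode("utf-16-be")

# 1. No null bytes in UTF-8 (unless the text itself contains U+0000);
#    UTF-16 pads every ASCII-range character with a zero byte.
utf8_has_nul = b"\x00" in utf8
utf16_has_nul = b"\x00" in utf16le

# 2. In UTF-8 the byte 0x0A appears only for a real newline.
#    In UTF-16-LE, U+010A is stored as the bytes 0A 01, so a byte-oriented
#    tool sees a second "newline" that is not one.
utf8_newline_bytes = utf8.count(b"\n")
utf16_newline_bytes = utf16le.count(b"\n")

# 3. UTF-16 output depends on byte order; UTF-8 has no such variants.
utf16_orders_agree = utf16le == utf16be

# "All Unicode encodings are of variable length": even in UTF-32 one
# user-visible character may take several code points (e + combining acute).
utf32_len_of_e_acute = len("e\u0301".encode("utf-32-le"))

print(utf8_has_nul, utf16_has_nul, utf8_newline_bytes, utf16_newline_bytes)
```

This only demonstrates the byte-level properties; it says nothing about which encoding a given program should use internally.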
Re: Make Unicode bugs release critical?
On Fri, 11 Feb 2011, Roger Leigh wrote: Um, no need to be rude.

Well, you started with throw TeX into the bin! (cum grano salis) The only possible answer to that is mine. Or shutting up and ignoring that kind of rants from your side.

insults is a step too far. I haven't said anything that could justify it, other than the fact that you disagree with my /opinion/.

Very simple: replacing *tex* with *xetex* will break existing documents. And that is a no-go. That is TeX world. You are talking about WinWord world.

You have apparently no idea between input and font encoding. I only mentioned UTF-8 with regard to input, so you are assuming too much.

You mentioned *fontconfig* which is font encoding, and has nothing whatsoever to do with inputenc. I don't assume too much.

The inputenc hack only gets you so far. I tried to go this way, and

Agreed. Improvements are welcome, please help and fix the shortcomings.

sorts out the awful font support, so you can use standard freetype-registered fonts, again without the pain. Result: a document you can actually read in the editor! Argg, PLEASE STOP THAT RUBBISH What you are calling rubbish is not in any way false. It's given

It *IS* wrong. You are stating that using freetype-registered fonts makes a document readable by the editor. Sorry this is ridiculous.
- different fonts might register themselves under different names to fontconfig
- fonts might not be available here or there and might not be embedded in the pdf

DEK wrote his own font loading mechanism because he wanted to be sure that documents *can* be typeset also on any other machine, and that works. If you use xetex that might work, or might not work, or might work but you are suddenly missing some characters (there is for example a version of the palatino fonts with cyrillic characters, and a version without cyrillic characters, some systems have these *enriched* fonts and don't embed them properly. Then, suddenly, on the target system, characters disappear. Is THIS the way you want to typeset documents?)

I repeat: RUBBISH.

Well I thought the jury was still out on which was the better solution.

Most people I know in the TeX community are seeing the real future with luatex.

Best wishes
Norbert

Norbert Preining preining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live Debian Developer
DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
MARYTAVY (n.) A person to whom, under dire injunctions of silence, you tell a secret which you wish to be far more widely known. --- Douglas Adams, The Meaning of Liff
Archive: http://lists.debian.org/20110211134338.gh1...@gamma.logic.tuwien.ac.at
Re: Make Unicode bugs release critical?
On 11.02.2011 14:02, Faidon Liambotis wrote:

$ echo αβγ | sed 's/./a/'
aβγ

Okay. But...

$ echo αβγ | busybox sed 's/./a/'
a�βγ

:)
Archive: http://lists.debian.org/4d553bf6.9020...@debian.org
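The difference between the two sed outputs is what one would expect if one implementation matches `.` against a whole character and the other against a single byte. This is a minimal Python sketch of that distinction, not a description of how either sed is implemented internally.

```python
import re

text = "\u03b1\u03b2\u03b3"  # "αβγ"

# Character-wise matching: '.' consumes one code point, so the whole
# two-byte alpha is replaced.
char_wise = re.sub(".", "a", text, count=1)

# Byte-wise matching: '.' consumes only the first byte (0xCE) of alpha,
# leaving its continuation byte (0xB1) orphaned in the output.
raw = re.sub(rb".", b"a", text.encode("utf-8"), count=1)

# A UTF-8 terminal shows the orphaned byte as U+FFFD, the replacement character.
byte_wise = raw.decode("utf-8", errors="replace")

print(char_wise)   # aβγ
print(raw)         # b'a\xb1\xce\xb2\xce\xb3'
print(byte_wise)   # a�βγ
```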
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 02:30:24PM +0100, Vincent Lefevre wrote: On 2011-02-11 15:33:49 +0500, Andrey Rahmatullin wrote: On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote: However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times. less has problems with new Unicode characters (bug 597918). Unicode 6.0 came out in october 2010, well after Squeeze's freeze, so you can't expect support for new characters already. There are in no fonts shipped with squeeze, so not recognizing the characters as valid is not a big problem. Less shouldn't maintain a private copy of character properties if all that data is already present in libc -- but guess what, wcwidth(0x1F4A9) and iswprint() don't know them too. So oh well, Squeeze won't display such vital characters as kitten[1], ghost, japanese ogre or pile of shit. Gotta invest in a crystal ball that will tell us what new characters will be. [1]. To see my examples, you can grab: http://angband.pl/debian/pool/main/t/ttf-ancient-fonts/ttf-ancient-fonts_2.52-1.0kb1_all.deb (newer than the version in unstable, Gürkan Sengün's version is 404-compliant, let's poke him so we have _one_ Unicode 6.0 font in Debian). -- 1KB // Microsoft corollary to Hanlon's razor: // Never attribute to stupidity what can be // adequately explained by malice. -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20110211140202.gb2...@angband.pl
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 10:43:38PM +0900, Norbert Preining wrote: On Fri, 11 Feb 2011, Roger Leigh wrote: Um, no need to be rude. Well, you started with throw TeX into the bin! (cum grano salis) The only possible answer to that is mine. Or shutting up and ignoring that kind of rants from your side.

Please read what I said carefully, rather than imagined slights. I did not at any point state that TeX should be thrown in the bin; that was with regard to broken terminal emulators, editors and tools. I fully believe we should remove obsolete tools which have superior replacements. I did not include TeX in that category.

You have apparently no idea between input and font encoding. I only mentioned UTF-8 with regard to input, so you are assuming too much. You mentioned *fontconfig* which is font encoding, and has nothing whatsoever to do with inputenc. I don't assume too much.

No, I mentioned fontconfig because XeTeX allows use of system fonts via fontconfig. That was completely separate from UTF-8 input.

sorts out the awful font support, so you can use standard freetype-registered fonts, again without the pain. Result: a document you can actually read in the editor! Argg, PLEASE STOP THAT RUBBISH What you are calling rubbish is not in any way false. It's given It *IS* wrong. You are stating that using freetype-registered fonts makes a document readable by the editor. Sorry this is ridiculous.
- different fonts might register themselves under different names to fontconfig
- fonts might not be available here or there and might not be embedded in the pdf
[...] I repeat: RUBBISH.

I didn't state any of those things. Please calm down, and please read what I actually wrote, rather than what you thought I wrote.

Regards,
Roger
-- .''`. Roger Leigh : :' : Debian GNU/Linux http://people.debian.org/~rleigh/ `. `' Printing on GNU/Linux? http://gutenprint.sourceforge.net/ `-GPG Public Key: 0x25BFB848 Please GPG sign your mail.
Re: Make Unicode bugs release critical?
On Fr, 11 Feb 2011, Roger Leigh wrote: read what I actually wrote, rather than what you thought I wrote. So *what* is your proposal, instead of discussing uselessly and wasting bytes? Is it: ln -sf tex xetex Best wishes Norbert Norbert Preiningpreining@{jaist.ac.jp, logic.at, debian.org} JAIST, Japan TeX Live Debian Developer DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094 LITTLE URSWICK (n.) The member of any class who most inclines a teacher towards the view that capital punishment should be introduced in schools. --- Douglas Adams, The Meaning of Liff -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20110211141749.gl1...@gamma.logic.tuwien.ac.at
Re: Make Unicode bugs release critical?
On 2011-02-11 15:02:02 +0100, Adam Borowski wrote: On Fri, Feb 11, 2011 at 02:30:24PM +0100, Vincent Lefevre wrote: On 2011-02-11 15:33:49 +0500, Andrey Rahmatullin wrote: On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote: However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times. less has problems with new Unicode characters (bug 597918). Unicode 6.0 came out in october 2010,

The character mentioned in my bug report (U+1E9F LATIN SMALL LETTER DELTA) appeared in Unicode 5.1.0 (March 2008).

well after Squeeze's freeze, so you can't expect support for new characters already.

Well, March 2008 was more than 1 year before Squeeze's freeze.

There are in no fonts shipped with squeeze, so not recognizing the characters as valid is not a big problem.

Fonts containing the character in question are shipped with Squeeze: the character appears correctly in xterm.

Less shouldn't maintain a private copy of character properties if all that data is already present in libc

I agree.

-- but guess what, wcwidth(0x1F4A9) and iswprint() don't know them too.

No problems with U+1E9F:

Property alnum : yes
Property alpha : yes
Property cntrl : no
Property digit : no
Property graph : yes
Property lower : yes
Property print : yes
Property punct : no
Property space : no
Property upper : no
Property xdigit: no
wcwidth = 1

So, if less were using libc, it wouldn't have any problem with this character.

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/ 100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/ Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)
Archive: http://lists.debian.org/20110211143511.gj15...@prunille.vinc17.org
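The point about asking a shared character database rather than keeping a private table can be illustrated outside C as well. The sketch below uses Python's bundled `unicodedata` module as a stand-in for the libc `isw*()` functions quoted above; it is an analogy, not the libc interface, and the answers depend on the Unicode version shipped with the interpreter rather than on the system locale data.

```python
import unicodedata

ch = "\u1e9f"  # the character from bug 597918

# Name and general category come from the interpreter's Unicode database.
name = unicodedata.name(ch)
category = unicodedata.category(ch)

# Rough counterparts of the iswalnum()/iswalpha()/... properties listed above.
props = {
    "alnum": ch.isalnum(),
    "alpha": ch.isalpha(),
    "digit": ch.isdigit(),
    "lower": ch.islower(),
    "print": ch.isprintable(),
    "space": ch.isspace(),
    "upper": ch.isupper(),
}

print(name, category)
for prop, value in props.items():
    print(f"Property {prop:<6}: {'yes' if value else 'no'}")
```

A program that delegates these questions to a shared database picks up new characters whenever that database is updated, which is the argument made in the message above.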
Re: Make Unicode bugs release critical?
Lars Wirzenius wrote: However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times.

We chose an 80% quickfix to get where we are, and so now we have the other 80% to go. It's been whittled away at for the past 10 years or so, but still a lot left.

And, that's utf8 support, only. It's probably a pipe dream to expect other unicode encodings to work half as well, and surely other encodings fare even worse overall. If anything, utf8 probably makes the overall situation worse for other encodings, since we expect it to just work, and give up on handling the other complexity.

The first Unicode standard was published in 1991. That's twenty years ago. Any software that processes text at all and is incapable of dealing with UTF-8 should be considered with extreme suspicion.

Most languages still make it easy to get wrong, in my experience. It can be as simple as software written trusting language documentation that says strings are processed in unicode and doesn't point out all the exceptions that can let non-unicode data in.

For example, this simple haskell program processes a file's content utf-8 cleanly, but prints its name like foö.

    import System.Environment
    main = do
        args <- getArgs
        let file = head args
        putStrLn $ "file is: " ++ file
        putStr =<< readFile file

This program has an entirely different failure mode; type in foö (touch it first), and it will complain that fo� doesn't exist.

    main = getLine >>= readFile >>= putStr

Neither of these failure modes is obvious from any documentation I've seen. Both of these programs are something a typical developer would expect to work. (Both also have unexpected failure modes when LANG=C.)

Probably every thousand lines of perl has a unicode encoding bug of some sort. Based on data from my own code. Any perl code that uses an XS module probably has an encoding bug.

I assume that python had some problems with its unicode support too, since they saw fit to radically change it in python 3. And it sounds like the python 3 changes will break unicode in many programs ported over to it, unless file opens etc are audited and fixed. Stackoverflow has 1600 matches for python unicode questions.

The best case is probably a language that has a restricted enough interface that most of these problems are avoided. (But, stackoverflow still has 500 javascript unicode questions.)

Making all such bugs be release critical (which includes the notion that release managers may ignore the bug in particular cases) sounds like a good way to get things under control.

It would probably be a large load on the RMs. It's easy to pick some random program that works great with unicode and find an edge case. The RMs would probably prefer to not have git getting RC bugs filed just because it sometimes exposes filenames written like fo\303\266. :)

-- 
see shy jo, who deals with at least 1 unicode bug a week on average. 4 this week
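The `fo\303\266` spelling mentioned at the end of the message is simply the UTF-8 bytes of "foö" written as C-style octal escapes. The sketch below reproduces that notation; `octal_quote` is a hypothetical helper written for this illustration and is a simplification, not git's actual quoting routine.

```python
def octal_quote(name):
    """Render a filename with every non-printable-ASCII byte as a \\ooo escape.

    Hypothetical helper for illustration; real tools also handle quotes,
    backslashes and control characters with their own rules.
    """
    out = []
    for byte in name.encode("utf-8"):
        if 0x20 <= byte < 0x7F and byte not in (0x22, 0x5C):
            out.append(chr(byte))          # plain printable ASCII, kept as-is
        else:
            out.append("\\%03o" % byte)    # anything else as three octal digits
    return "".join(out)


quoted = octal_quote("fo\u00f6")   # "foö": ö is the two bytes 0xC3 0xB6
print(quoted)                      # fo\303\266
```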
Re: Make Unicode bugs release critical?
Excerpts from Joey Hess's message of Fri Feb 11 13:39:08 -0200 2011: (...) It can be as simple as software written trusting language documentation that says strings are processed in unicode and doesn't point out all the exceptions that can let non-unicode data in. For example, this simple haskell program processes a file's content utf-8 cleanly, but prints its name like foö.

    import System.Environment
    main = do
        args <- getArgs
        let file = head args
        putStrLn $ "file is: " ++ file
        putStr =<< readFile file

This program has an entirely different failure mode; type in foö (touch it first), and it will complain that fo� doesn't exist.

    main = getLine >>= readFile >>= putStr

Neither of these failure modes is obvious from any documentation I've seen. Both of these programs are something a typical developer would expect to work. (Both also have unexpected failure modes when LANG=C.)

http://hackage.haskell.org/trac/ghc/ticket/3307

Greetings. (...)
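One plausible reading of the two Haskell failure modes, assumed here rather than confirmed anywhere in the thread, is that the filename path and the text I/O path disagree about whether a `Char` is a byte or a code point. The Python sketch below only reproduces the resulting symptoms; it does not model GHC itself.

```python
name = "fo\u00f6"                    # "foö" as the user typed it
on_disk = name.encode("utf-8")       # b"fo\xc3\xb6", the real filename bytes

# Assumed failure 1: argument bytes are taken one byte per character
# (effectively Latin-1) and then written out again as UTF-8 text.
misread = on_disk.decode("latin-1")
printed = misread.encode("utf-8").decode("utf-8")   # what the terminal shows

# Assumed failure 2: the name is decoded correctly from stdin, but each
# character is truncated to a single byte when passed to the file API.
truncated = name.encode("latin-1")                  # b"fo\xf6", no such file
shown = truncated.decode("utf-8", errors="replace")

print(printed)   # foÃ¶
print(shown)     # fo�
```

Either way the program is internally consistent about "strings are Unicode"; the bug sits at the boundary where bytes enter or leave.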
Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)
Hi, Adam Borowski wrote: Speaking of rxvt... shouldn't this clusterϫϫck become the only rxvt in Debian? Both rxvt and rxvt-beta, completely dead upstream for 10 and 8 years respectively, besides having terrible support for terminal codes lack even such a tiny detail as UTF-8 support. I'd say there should be no place in Debian in 2011 for software that can't do UTF-8, especially if near-identical forks exist. I'd replace especially with only in that sentence. Kicking out good and unique software, only because of missing or incomplete UTF-8 support, will surely lower Debian's quality more than missing or broken UTF-8 support in very few packages. And it would make those users (and devs) angry who need that software independently of working UTF-8 support or not. Regards, Axel -- ,''`. | Axel Beckert a...@debian.org, http://people.debian.org/~abe/ : :' : | Debian Developer, ftp.ch.debian.org Admin `. `' | 1024D: F067 EA27 26B9 C3FC 1486 202E C09E 1D89 9593 0EDE `-| 4096R: 2517 B724 C5F6 CA99 5329 6E61 2FF9 CD59 6126 16B5 -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20110211183343.gp12...@sym.noone.org
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 09:37:54AM +, Lars Wirzenius wrote: However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times.

ispell, aspell. I think hunspell got fixed recently.

Kurt
Archive: http://lists.debian.org/20110211203240.ga30...@roeckx.be
Re: Make Unicode bugs release critical?
On Fri, 11 Feb 2011, Lars Wirzenius wrote: However, I'm curious: is there a lot of software that is broken with Unicode, particularly with the UTF-8 encoding? I can't remember anything much in recent times. 1. Stuff that cannot do one of UTF-8, UTF-16 or UCS-4. 2. Anything that cannot deal with Supplementary planes. This includes the use of UCS-2 instead of UTF-16, as it cannot represent the Supplementary planes. python 3 when not compiled to use UCS-4 memory hog mode is an example, I am told. We likely want to restrain ourselves to declaring (1) to be release critical for Wheezy. -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot Henrique Holschuh -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20110211221653.gb18...@khazad-dum.debian.net
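The reason UCS-2 cannot represent the supplementary planes is that it has only 16-bit units and no surrogate mechanism; UTF-16 splits such a code point into two units. A minimal Python sketch of the arithmetic follows, using U+1F4A9 (the code point that appears earlier in the thread) as the example; `surrogate_pair` is a name chosen for this illustration.

```python
def surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F4A9)

# Cross-check against the interpreter's own UTF-16 encoder.
encoded = chr(0x1F4A9).encode("utf-16-be")

print(hex(high), hex(low))   # 0xd83d 0xdca9
print(encoded.hex())         # d83ddca9
```

Software that treats each 16-bit unit as a complete character will see two meaningless halves here, which is the class of bug described in point 2 above.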
Re: Make Unicode bugs release critical?
On 02/11/2011 07:36 AM, Adam Borowski wrote: [snip] UTF-16 is never, ever useful. It is a sad trap for win32 and Java developers, due to a bad engineering decision suggested, as I was told, by [snip] No, there is only one encoding left, as long as you don't have to talk to Windows. Never useful except for 90% of the market? (I wonder how SAMBA deals with it...) -- The normal condition of mankind is tyranny and misery. Milton Friedman -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/4d55c263.90...@cox.net
Re: Make Unicode bugs release critical?
[Ron Johnson] Never useful except for 90% of the market? (I wonder how SAMBA deals with it...) I don't think you really want to know. There's a 'unicode' flag in much of the CIFS protocol that means filenames and such are in UTF-16 (I think UTF-16LE) instead of some-random-configured-code-page. Samba's been using that flag for about 10 years. You configure it to say what encoding your filenames are supposed to be on the server, and it expresses them in UTF-16 on the wire. Samba also supports non-Unicode-aware clients like Windows 3.11 - or at least it used to support these - you'd tell Samba what client code page to translate your filenames into on the wire. Fun stuff. Samba doesn't really deal with file _contents_, which is a much more interesting problem than filenames. It just serves contents as-is, like most file service protocols other than FTP. -- Peter Samuelson | org-tld!p12n!peter | http://p12n.org/ -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20110212003312.gb10...@p12n.org
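The translation described above, from the filename encoding configured on the server to UTF-16 on the wire, amounts to a decode followed by an encode. The sketch below is an illustration only: `to_wire` is a hypothetical function, not Samba code, and the little-endian byte order follows the "I think UTF-16LE" in the message rather than any specification checked here.

```python
def to_wire(name_on_disk, server_charset):
    """Convert raw filename bytes in the server's charset to UTF-16LE.

    Hypothetical helper; error handling for undecodable names is omitted.
    """
    return name_on_disk.decode(server_charset).encode("utf-16-le")


# The same visible name "foö", stored under two different server charsets,
# ends up as identical bytes on the wire.
from_utf8_server = to_wire(b"fo\xc3\xb6", "utf-8")
from_latin1_server = to_wire(b"fo\xf6", "iso-8859-1")

print(from_utf8_server)
print(from_utf8_server == from_latin1_server)
```

If the configured charset does not match what is really on disk, the decode step either fails or silently produces the wrong characters, which is why that setting matters.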
Re: Make Unicode bugs release critical?
On Fri, Feb 11, 2011 at 08:16:54PM -0200, Henrique de Moraes Holschuh wrote:
> On Fri, 11 Feb 2011, Lars Wirzenius wrote:
> > However, I'm curious: is there a lot of software that is broken
> > with Unicode, particularly with the UTF-8 encoding? I can't
> > remember anything much in recent times.
>
> 2. Anything that cannot deal with Supplementary planes. This
>    includes the use of UCS-2 instead of UTF-16, as it cannot
>    represent the Supplementary planes. python 3 when not compiled to
>    use UCS-4 memory hog mode is an example, I am told.

Using UCS-2 is hardly better than using ISO-8859-1 or any other ancient
charset.

Using either UTF-16 or UCS-4 can be a memory hog, that's why you should
pick UTF-8 for regular use. Except for some rare cases (CJK with no
formatting or markup), it uses less memory and can be passed as-is to
POSIX file functions.

Picking a random subset of Unicode is like putting day-of-the-year in
one byte variable since this way you support 70% of uses and it
conserves memory...

-- 
1KB // Microsoft corollary to Hanlon's razor:
    // Never attribute to stupidity what can be
    // adequately explained by malice.

Archive: http://lists.debian.org/20110212020220.ga26...@angband.pl
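The memory claim above is easy to check by encoding a few strings and counting bytes. This is a minimal sketch: the helper `encoded_sizes` and the sample strings are invented for the example, and UTF-32 stands in for UCS-4, which has the same four-bytes-per-character layout. Text that is mostly ASCII (markup, file paths) is smallest in UTF-8, while pure CJK text with no markup is the case where UTF-16 comes out smaller.

```python
# Sketch only: helper name and sample strings are made up for illustration.

def encoded_sizes(text):
    """Return the encoded size in bytes of text in three Unicode encodings."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "ASCII markup": "<p>hello</p>",
    "CJK, no markup": "日本語のテキスト",
    "CJK inside markup": "<p>日本語のテキスト</p>",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

The third sample shows how quickly the CJK advantage of UTF-16 shrinks once any ASCII markup surrounds the text.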
Re: Make Unicode bugs release critical?
On Sat, 12 Feb 2011, Adam Borowski wrote:
> On Fri, Feb 11, 2011 at 08:16:54PM -0200, Henrique de Moraes Holschuh wrote:
> > 2. Anything that cannot deal with Supplementary planes. This
> >    includes the use of UCS-2 instead of UTF-16, as it cannot
> >    represent the Supplementary planes. python 3 when not compiled
> >    to use UCS-4 memory hog mode is an example, I am told.
>
> Using UCS-2 is hardly better than using ISO-8859-1 or any other
> ancient charset.
>
> Using either UTF-16 or UCS-4 can be a memory hog, that's why you
> should pick UTF-8 for regular use. Except for some rare cases (CJK
> with no

Python 3 uses UCS-2 (or UCS-4) for the internal representation. Likely
they wanted to have something that made it easy to address each
character in a Unicode string in O(1).

That might actually give better performance, given how much people like
to do string slicing and splicing in Python. The O(N) often required by
UTF-8 and UTF-16 might well be more painful than the much larger data
cache footprint of UCS-4... but that is a damn big *maybe*, and very
unlikely to be consistent across very different architectures.

Well, not like I care. I don't even have Python 3 installed, and I will
only do so the day something I need decides to pull it in as a
dependency.

> Picking a random subset of Unicode is like putting day-of-the-year in one

UCS-2 is deprecated as all heck. As far as I could research through
Google, it has not been a valid Unicode representation since Unicode 2.0
(i.e. 1996). So it wouldn't even count as a random subset of Unicode.

-- 
One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie. -- The Silicon Valley Tarot
Henrique Holschuh

Archive: http://lists.debian.org/20110212035533.ga32...@khazad-dum.debian.net
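The O(1) versus O(N) trade-off mentioned above can be sketched by indexing into raw encoded bytes. The two function names are hypothetical and this is not how any Python interpreter is implemented; it only contrasts a fixed-width layout (UTF-32, equivalent to UCS-4) with a variable-width one (UTF-8). For what it is worth, CPython 3.3 and later no longer has separate UCS-2 and UCS-4 builds: since PEP 393 each string is stored with 1, 2 or 4 bytes per character depending on its contents, which keeps indexing O(1).

```python
# Sketch only: function names are made up; this illustrates the indexing
# cost of each layout, not any interpreter's real string implementation.

def nth_codepoint_utf32(data: bytes, n: int) -> str:
    """Fixed width: character n lives at byte offset 4*n, so lookup is O(1)."""
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


def nth_codepoint_utf8(data: bytes, n: int) -> str:
    """Variable width: scan from the start counting lead bytes, so O(N)."""
    if n < 0:
        raise IndexError(n)
    seen = -1      # index of the most recently started character
    start = None   # byte offset where character n begins
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:       # not a continuation byte: a character starts
            seen += 1
            if seen == n:
                start = i
            elif seen == n + 1:    # next character begins, so n ends here
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")  # character n was the last one


text = "aą\U0001D11Ez"  # 1-, 2- and 4-byte UTF-8 sequences mixed together
print(nth_codepoint_utf8(text.encode("utf-8"), 2) == text[2])
print(nth_codepoint_utf32(text.encode("utf-32-le"), 2) == text[2])
```

Whether the scan or the larger cache footprint costs more in practice is exactly the "big maybe" from the message; the sketch only shows where the O(N) comes from.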