
Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-15 Thread Vincent Lefevre

On 2011-02-14 16:43:11 +0000, Ian Jackson wrote:
> When LC_CTYPE=en_GB.utf-8, programs which attempt to print Unicode
> characters to stdout should use UTF-8.  That's what LC_CTYPE means.

So, cat, grep, etc. are all broken. :)

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20110216000107.gl15...@prunille.vinc17.org

Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-15 Thread Adam Borowski

On Wed, Feb 16, 2011 at 01:01:07AM +0100, Vincent Lefevre wrote:
> On 2011-02-14 16:43:11 +0000, Ian Jackson wrote:
> > When LC_CTYPE=en_GB.utf-8, programs which attempt to print Unicode
> > characters to stdout should use UTF-8.  That's what LC_CTYPE means.
>
> So, cat, grep, etc. are all broken. :)

How come?

cat will, for any valid UTF-8 character on input, print a valid UTF-8
character on output.  For any valid ISO-8859-1 character on input, it will
print a valid ISO-8859-1 character on output.  

grep, on the other hand, has to actually understand the encoding, and it
does.  Try this:

$ echo ą | LC_CTYPE=C grep --color=always .
The output will be mangled.

$ echo ą | LC_CTYPE=en_US.utf-8 grep --color=always .
The output will be handled correctly.
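
As a rough illustration of that difference, here is a Python 3 sketch. It
is not how cat or grep are implemented, and the function names are made up
for this example. A byte-transparent filter never needs to know the
encoding, while anything that matches "one character" has to decode first.

```python
# Sketch: a byte-transparent filter (cat-like) versus one that must decode
# its input before it can talk about "characters" (grep-like).

def copy_bytes(data: bytes) -> bytes:
    """cat-like: bytes pass through untouched, whatever their encoding."""
    return data

def split_characters(data: bytes, encoding: str) -> list:
    """grep-like: decode first, so "one character" depends on the encoding."""
    return list(data.decode(encoding, errors="replace"))

utf8_input = "ą".encode("utf-8")          # two bytes: C4 85
latin2_input = "ą".encode("iso-8859-2")   # one byte: B1

# Byte copying is correct for either encoding.
print(copy_bytes(utf8_input) == utf8_input)
print(copy_bytes(latin2_input) == latin2_input)

# Decoding with the right encoding yields one character; with ASCII
# (roughly what LC_CTYPE=C implies) the two bytes are treated separately.
print(len(split_characters(utf8_input, "utf-8")))
print(len(split_characters(utf8_input, "ascii")))
```

With ASCII as the assumed encoding, the two bytes of ą are handled one at a
time, which is the kind of mangling the first grep command shows.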

-- 
1KB // Microsoft corollary to Hanlon's razor:
//  Never attribute to stupidity what can be
//  adequately explained by malice.



Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-15 Thread Vincent Lefevre

On 2011-02-16 01:34:51 +0100, Adam Borowski wrote:
> On Wed, Feb 16, 2011 at 01:01:07AM +0100, Vincent Lefevre wrote:
> > On 2011-02-14 16:43:11 +0000, Ian Jackson wrote:
> > > When LC_CTYPE=en_GB.utf-8, programs which attempt to print Unicode
> > > characters to stdout should use UTF-8.  That's what LC_CTYPE means.
> >
> > So, cat, grep, etc. are all broken. :)
>
> How come?
>
> cat will, for any valid UTF-8 character on input, print a valid UTF-8
> character on output.  For any valid ISO-8859-1 character on input, it will
> print a valid ISO-8859-1 character on output.

I was just commenting on what Ian said. If there is a valid reason why
cat may not produce UTF-8 in UTF-8 locales, the same reason applies to
perl or any other software.

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)



Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)

2011-02-14 Thread Josselin Mouette

On Friday, 11 February 2011 at 19:33 +0100, Axel Beckert wrote:
> Kicking out good and unique software, only because of missing or
> incomplete UTF-8 support, will surely lower Debian's quality more than
> missing or broken UTF-8 support in very few packages. And it would
> anger those users (and devs) who need that software regardless of
> whether its UTF-8 support works.

Kicking out software with incomplete UTF-8 support sounds unfair.

Kicking out software that doesn’t work at all in UTF-8 locales and
requires the user to set a broken locale, OTOH, sounds like a sanitary
emergency.

-- 
 .''`.
: :' : “You would need to ask a lawyer if you don't know
`. `'   that a handshake of course makes a valid contract.”
  `---  J???rg Schilling



Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)

2011-02-14 Thread Ian Jackson

Josselin Mouette writes (Re: Make Unicode bugs release critical? (was: Re:
RFA: all my packages)):
> Kicking out software that doesn?t work at all in UTF-8 locales and
> requires the user to set a broken locale, OTOH, sounds like a sanitary
> emergency.

Excellent, I look forward to the removal of python.  I always hated
that language anyway.

$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
unicode pound sign
$

But

$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
position 0: ordinal not in range(128)
$

Ian.



Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)

2011-02-14 Thread Jakub Wilk


* Ian Jackson ijack...@chiark.greenend.org.uk, 2011-02-14, 12:42:
> > Kicking out software that doesn?t work at all in UTF-8 locales and
> > requires the user to set a broken locale, OTOH, sounds like a sanitary
> > emergency.
>
> Excellent, I look forward to the removal of python.  I always hated
> that language anyway.
>
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> unicode pound sign
> $
>
> But
>
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
> position 0: ordinal not in range(128)
> $


This is the expected behaviour. Incidentally, it has nothing to do with 
UTF-8. You'll get the same result if you use a locale with a legacy 
encoding.
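
For context, the mechanism as I understand it: Python 2 takes
sys.stdout.encoding from the locale only when stdout is a terminal. When
stdout is a pipe, that attribute is None and print falls back to the
default 'ascii' codec, whatever the locale says. The snippet below is
Python 3 syntax and only reproduces the encoding step; try_encode is a
hypothetical helper written for this illustration.

```python
# Sketch: the same character encoded with the codec Python 2 falls back to
# (ascii) and with two codecs a locale might actually ask for.

def try_encode(text, encoding):
    """Return the encoded bytes, or a short failure note."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return "failed: " + exc.reason

pound = "\u00a3"

print(try_encode(pound, "ascii"))       # the fallback used when piped
print(try_encode(pound, "utf-8"))       # what a UTF-8 locale implies
print(try_encode(pound, "iso-8859-1"))  # what a legacy Latin-1 locale implies
```

The failure comes from the ascii fallback alone, so it shows up under a
legacy-encoding locale just as it does under a UTF-8 one.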


--
Jakub Wilk



OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Klaus Ethgen


Hi,

let's start a Python rant. I love to hate this language. :-)

On Mon, 14 Feb 2011 at 14:14, Jakub Wilk wrote:
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> unicode pound sign
[...]
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
> position 0: ordinal not in range(128)
>
> This is the expected behaviour. Incidentally, it has nothing to do
> with UTF-8. You'll get the same result if you use a locale with a
> legacy encoding.

I see. It is funny to see Python lovers blame others for the bugs in
the language.

~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat

Both give the same result, a '£' sign, as expected.

> * Ian Jackson ijack...@chiark.greenend.org.uk, 2011-02-14, 12:42:
> > Excellent, I look forward to the removal of python.  I always
> > hated that language anyway.

I hate them more. :-)

Regards
   Klaus
-- 
Klaus Ethgen                            http://www.ethgen.ch/
pub  2048R/D1A4EDE5 2000-02-26 Klaus Ethgen kl...@ethgen.de
Fingerprint: D7 67 71 C4 99 A6 D4 FE  EA 40 30 57 3C 88 26 2B

Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Philipp Kern

On 2011-02-14, Klaus Ethgen kl...@ethgen.de wrote:
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
> Both give the same result, a '£' sign, as expected.

And what's the value in that demonstration?  Yes, you can treat UTF-8 like a
byte stream.  And the thread was about the problems that can arise from this.

Kind regards
Philipp Kern



Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Lars Wirzenius

On Mon, 2011-02-14 at 14:37 +0100, Klaus Ethgen wrote:
> let's start a Python rant. I love to hate this language. :-)

Let's not.

Let's not rant about any languages, or tools, or desktop environments.
Let's be constructive on Debian mailing lists, shall we?

We have plenty of side-channels for rants, sarcasm, snide remarks,
passive-aggressiveness, and other forms of anti-social behavior, let's
use those instead.




Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Jakub Wilk


* Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat

Let me try...

$ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
stdin: line 1, char 1, byte offset 1: invalid UTF-8 code


But I don't blame Perl for that. It's documented behavior, so I can 
either live with that or use another language.
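
To make the isutf8 complaint concrete: as far as I know, Perl without an
output layer such as :utf8 writes characters below 256 as single bytes, so
the pipe receives the lone byte 0xA3. The Python 3 sketch below checks
validity in a similar spirit to isutf8; the helper name is my own and this
is not the isutf8 implementation.

```python
# Sketch: why a lone 0xA3 byte is rejected while C2 A3 is accepted.

def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_utf8(b"\xa3\n"))      # Latin-1 pound sign: a stray continuation byte
print(is_utf8(b"\xc2\xa3\n"))  # UTF-8 encoding of U+00A3
```

A terminal set to Latin-1 would display the first byte sequence as '£',
which may be why the unpiped command looks fine on some setups.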


--
Jakub Wilk



Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Klaus Ethgen


On Mon, 14 Feb 2011 at 15:15, Lars Wirzenius wrote:
> On Mon, 2011-02-14 at 14:37 +0100, Klaus Ethgen wrote:
> > let's start a Python rant. I love to hate this language. :-)
>
> Let's not.

Up to here, that is a matter of personal preference.

> Let's not rant about any languages, or tools, or desktop environments.
> Let's be constructive on Debian mailing lists, shall we?

You are right. I just couldn't resist when someone was trying to blame
everything except the one thing that has the bug.

Regards
   Klaus
-- 
Klaus Ethgen                            http://www.ethgen.ch/
pub  2048R/D1A4EDE5 2000-02-26 Klaus Ethgen kl...@ethgen.de
Fingerprint: D7 67 71 C4 99 A6 D4 FE  EA 40 30 57 3C 88 26 2B

Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Adam Borowski

On Mon, Feb 14, 2011 at 02:02:11PM +0000, Philipp Kern wrote:
> On 2011-02-14, Klaus Ethgen kl...@ethgen.de wrote:
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
> > Both give the same result, a '£' sign, as expected.
>
> And what's the value in that demonstration?  Yes, you can treat UTF-8 like a
> byte stream.  And the thread was about the problems that can arise from this.

Er, and tell me where exactly it makes sense to allow one encoding but not
another for a bytestream?

It appears that Python has a nasty bug where it ignores the locale's
encoding if isatty(stdout) returns 0.  So let's fix or report that rather
than argue about it.
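
Until that is sorted out, a program can choose its output encoding from the
locale itself instead of relying on the tty check. Below is a minimal
Python 3 sketch of that idea. wrap_output is a hypothetical helper, and the
demonstration writes to an in-memory buffer with an explicit encoding so
the result does not depend on the locale of whoever runs it. In a real
script one would pass sys.stdout.buffer and leave the encoding unset.

```python
# Sketch: wrap a binary stream with an encoding taken from the locale,
# whether or not the stream is a terminal.
import io
import locale

def wrap_output(binary_stream, encoding=None):
    """Return a text stream that encodes with the locale's encoding."""
    if encoding is None:
        # Ask the locale directly; no isatty() check is involved.
        encoding = locale.getpreferredencoding(False)
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            newline="\n", write_through=True)

buffer = io.BytesIO()               # stands in for a pipe
out = wrap_output(buffer, "utf-8")  # explicit encoding only for the demo
out.write("\u00a3\n")
out.flush()
print(buffer.getvalue())
```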

-- 
1KB // Microsoft corollary to Hanlon's razor:
//  Never attribute to stupidity what can be
//  adequately explained by malice.



Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Ian Jackson

Jakub Wilk writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> * Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
>
> Let me try...
>
> $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
> stdin: line 1, char 1, byte offset 1: invalid UTF-8 code

WTF.  OK, Perl's out too.

We'll have to write everything in dash :-).

Ian.



Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)

2011-02-14 Thread Josselin Mouette

On Monday, 14 February 2011 at 12:42 +0000, Ian Jackson wrote:
> Josselin Mouette writes (Re: Make Unicode bugs release critical? (was: Re:
> RFA: all my packages)):
> > Kicking out software that doesn?t work at all in UTF-8 locales and
> > requires the user to set a broken locale, OTOH, sounds like a sanitary
> > emergency.
>
> Excellent, I look forward to the removal of python.  I always hated
> that language anyway.

Judging from your reply, I look forward even more to the removal of vm,
since it broke the Unicode in my original email.

> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
> position 0: ordinal not in range(128)
> $


You must specify the encoding of your data in your bitstreams. I agree
this is inconvenient (and one of the things I dislike in Python), but it
is:
     1. completely independent of the locale (UTF-8 or not)
     2. easy to work with once you understand how encodings in Python
        work
     3. much better in Python 3.
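
What "specify the encoding" can look like in Python 3 is sketched below.
This is only an illustration: the helper names and the temporary file are
made up for the example. The point is that the encoding is named at the
point where text becomes bytes, rather than taken from a default.

```python
# Sketch: naming the encoding explicitly when text is written out.
import os
import tempfile

def write_text(path, text, encoding):
    """Write text using an explicitly chosen encoding."""
    with open(path, "w", encoding=encoding, newline="\n") as handle:
        handle.write(text)

def read_raw(path):
    """Read the file back as raw bytes to see what was stored."""
    with open(path, "rb") as handle:
        return handle.read()

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "pound.txt")
    write_text(path, "\u00a3\n", "utf-8")
    print(read_raw(path))
    write_text(path, "\u00a3\n", "iso-8859-1")
    print(read_raw(path))
```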

-- 
 .''`.
: :' : “You would need to ask a lawyer if you don't know
`. `'   that a handshake of course makes a valid contract.”
  `---  J???rg Schilling



Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)

2011-02-14 Thread Henrique de Moraes Holschuh

On Mon, 14 Feb 2011, Josselin Mouette wrote:
> You must specify the encoding of your data in your bitstreams. I agree
> this is inconvenient (and one of the things I dislike in Python), but it
> is:
>      1. completely independent of the locale (UTF-8 or not)
>      2. easy to work with once you understand how encodings in Python
>         work
>      3. much better in Python 3.

As long as python 3 is compiled to use UCS-4 as the internal
representation, you mean.  Are our packages set to use UCS-4?

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)

2011-02-14 Thread brian m. carlson

On Mon, Feb 14, 2011 at 02:01:08PM -0200, Henrique de Moraes Holschuh wrote:
> As long as python 3 is compiled to use UCS-4 as the internal
> representation, you mean.  Are our packages set to use UCS-4?

At least for python 3.1, yes:

common_configure_args = \
    --prefix=/usr \
    --enable-ipv6 \
    --with-dbmliborder=bdb \
    --with-wide-unicode \
    --with-computed-gotos \
    --with-system-expat \

The --with-wide-unicode option enables UCS-4.  With very few exceptions, I
believe all the recent Debian python packages have been compiled this
way.
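
One way to check a given interpreter is to look at sys.maxunicode, which
is 0xFFFF on narrow (UCS-2) builds and 0x10FFFF on wide (UCS-4) builds.
The helper below is just a sketch with a made-up name. Note that from
Python 3.3 onwards strings use a flexible internal representation and
sys.maxunicode is always 0x10FFFF, so the distinction only matters for
older interpreters such as the 3.1 packages discussed here.

```python
# Sketch: tell a narrow build from a wide build via sys.maxunicode.
import sys

def unicode_build_kind(maxunicode=sys.maxunicode):
    """Classify an interpreter by the highest code point a str can hold."""
    return "wide (UCS-4)" if maxunicode > 0xFFFF else "narrow (UCS-2)"

print(hex(sys.maxunicode), unicode_build_kind())
```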

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187



Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Klaus Ethgen


On Mon, 14 Feb 2011 at 16:24, Ian Jackson wrote:
> Jakub Wilk writes (Re: OT: Python (was: Make Unicode bugs release
> critical?)):
> > * Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> > > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
> >
> > Let me try...
> >
> > $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
> > stdin: line 1, char 1, byte offset 1: invalid UTF-8 code
>
> WTF.  OK, Perl's out too.

No, it is not. 00a3 is just not a UTF-8 character, it is Unicode. To get
a correct UTF-8 character you need to print \x{c2a3} and then isutf8 is
happy.

> We'll have to write everything in dash :-).

lisp. :-)

But now we are getting completely off topic.

Regards
   Klaus
-- 
Klaus Ethgen                            http://www.ethgen.ch/
pub  2048R/D1A4EDE5 2000-02-26 Klaus Ethgen kl...@ethgen.de
Fingerprint: D7 67 71 C4 99 A6 D4 FE  EA 40 30 57 3C 88 26 2B

Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Ian Jackson

Klaus Ethgen writes (Re: OT: Python (was: Make Unicode bugs release
critical?)):
> No, it is not. 00a3 is just not a UTF-8 character, it is Unicode. To get
> a correct UTF-8 character you need to print \x{c2a3} and then isutf8 is
> happy.

When LC_CTYPE=en_GB.utf-8, programs which attempt to print Unicode
characters to stdout should use UTF-8.  That's what LC_CTYPE means.
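
The distinction between a code point and its encoded bytes can be sketched
in a few lines of Python 3 (utf8_hex is a made-up helper name). U+00A3 is
the pound sign as a code point, and C2 A3 is what it looks like after UTF-8
encoding. The number 0xC2A3 read as a code point is a different character
altogether.

```python
# Sketch: code point in, UTF-8 bytes (as hex) out.

def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()

print(utf8_hex(0x00A3))  # the pound sign
print(utf8_hex(0xC2A3))  # a different code point, not "the UTF-8 pound sign"
```

So a program that is handed U+00A3 and told the locale is UTF-8 is expected
to do this encoding step itself and emit C2 A3.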

Ian.



Re: OT: Python (was: Make Unicode bugs release critical?)

2011-02-14 Thread Konstantin Khomoutov

On Mon, 14 Feb 2011 16:43:11 +0000
Ian Jackson ijack...@chiark.greenend.org.uk wrote:

> Klaus Ethgen writes (Re: OT: Python (was: Make Unicode bugs release
> critical?)):
> > No, it is not. 00a3 is just not a UTF-8 character, it is Unicode.
> > To get a correct UTF-8 character you need to print \x{c2a3} and
> > then isutf8 is happy.
>
> When LC_CTYPE=en_GB.utf-8, programs which attempt to print Unicode
> characters to stdout should use UTF-8.  That's what LC_CTYPE means.

By the way,

$ LC_CTYPE=en_GB.utf-8 echo 'puts \x00a3\n'|tclsh|isutf8
$
$ LC_CTYPE=en_GB.utf-8 echo 'puts \x00a3\n'|tclsh|xxd -p
c2a30a0a
$

But RMS told the world not to use Tcl.



Re: Make Unicode bugs release critical?

2011-02-14 Thread Ron Johnson


On 02/14/2011 10:39 AM, Ian Jackson wrote:
[snip]
> The fact that naive Python programs work (honouring LC_CTYPE as they
> should) unless you pipe their output to something is clearly a bug.
> The fact that it's a specification bug doesn't mean it's not a bug.



It doesn't seem to work for me.

$ python -V
Python 2.6.6

$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
position 0: ordinal not in range(128)


$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\uc2a3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc2a3'
in position 0: ordinal not in range(128)


$ perl -v

This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 51 registered patches, see perl -V for more detail)

$ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "en_GB.utf-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale (C).
£

$ locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=


--
The normal condition of mankind is tyranny and misery.
Milton Friedman



Re: Make Unicode bugs release critical?

2011-02-14 Thread The Fungi

On Mon, Feb 14, 2011 at 03:57:44PM -0600, Ron Johnson wrote:
> It doesn't seem to work for me.
[...]
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
> position 0: ordinal not in range(128)
[...]
> $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> perl: warning: Setting locale failed.
> perl: warning: Please check that your locale settings:
>     LANGUAGE = (unset),
>     LC_ALL = (unset),
>     LC_CTYPE = "en_GB.utf-8",
>     LANG = "en_US.UTF-8"
>     are supported and installed on your system.
> perl: warning: Falling back to the standard locale (C).
[...]

You probably don't have an en_GB.utf-8 locale (maybe you have
localepurge installed?). I bet en_US.utf-8 will net you different
results.
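
A quick way to test that guess without reading perl's warnings is to ask
the C library whether it accepts the locale name. The Python 3 sketch below
does that; locale_is_available is a hypothetical helper, and which names
succeed depends entirely on the locales generated on the machine.

```python
# Sketch: probe whether a locale name is usable on this system.
import locale

def locale_is_available(name: str) -> bool:
    """Return True if setlocale() accepts the name for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        # Restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, saved)

for candidate in ("C", "en_GB.utf-8", "en_US.utf-8"):
    print(candidate, locale_is_available(candidate))
```

`locale -a` lists the same information from the shell.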
-- 
{ IRL(Jeremy_Stanley); WWW(http://fungi.yuggoth.org/); PGP(43495829);
WHOIS(STANL3-ARIN); SMTP(fu...@yuggoth.org); FINGER(fu...@yuggoth.org);
MUD(kin...@katarsis.mudpy.org:6669); IRC(fu...@irc.yuggoth.org#ccl);
ICQ(114362511); YAHOO(crawlingchaoslabs); AIM(dreadazathoth); }



Re: Make Unicode bugs release critical?

2011-02-14 Thread Ron Johnson


On 02/14/2011 04:26 PM, The Fungi wrote:
> On Mon, Feb 14, 2011 at 03:57:44PM -0600, Ron Johnson wrote:
> > It doesn't seem to work for me.
> [...]
> > $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
> > position 0: ordinal not in range(128)
> [...]
> > $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > perl: warning: Setting locale failed.
> > perl: warning: Please check that your locale settings:
> >     LANGUAGE = (unset),
> >     LC_ALL = (unset),
> >     LC_CTYPE = "en_GB.utf-8",
> >     LANG = "en_US.UTF-8"
> >     are supported and installed on your system.
> > perl: warning: Falling back to the standard locale (C).
> [...]
>
> You probably don't have an en_GB.utf-8 locale (maybe you have
> localepurge installed?). I bet en_US.utf-8 will net you different
> results.

That's it...

$ LC_CTYPE=en_US.utf-8 python -c 'print u"\u00a3"'
£

$ LC_CTYPE=en_US.utf-8 perl -e 'print "\x{00a3}\n";'
£

No localepurge, but when initially building the system, I only
installed one or two locales.


--
The normal condition of mankind is tyranny and misery.
Milton Friedman



Re: Make Unicode bugs release critical?

2011-02-14 Thread Adam Borowski

On Mon, Feb 14, 2011 at 06:10:37PM -0600, Ron Johnson wrote:
> On 02/14/2011 04:26 PM, The Fungi wrote:
> > You probably don't have an en_GB.utf-8 locale (maybe you have
> > localepurge installed?). I bet en_US.utf-8 will net you different
> > results.
>
> That's it...
>
> No localepurge, but when initially building the system, I only
> installed one or two locales.

No one would expect an USian to use a GB locale.

The problem is, there is currently no way to request UTF-8 encoding without
specifying language.  It's a remnant of ancient locales where ISO-8859-1
didn't make sense for pl_PL nor ISO-8859-2 for fr_FR.

Also, iconv() functions are really inconvenient to use, it'd be much easier
to use regular wide char functions predictably.

In other words: can I has C.UTF-8 guaranteed?

-- 
1KB // Microsoft corollary to Hanlon's razor:
//  Never attribute to stupidity what can be
//  adequately explained by malice.



Make Unicode bugs release critical?

2011-02-11 Thread Lars Wirzenius

On Fri, 2011-02-11 at 10:05 +0100, Vincent Fourmond wrote:
> On 11/02/11 09:52, Josselin Mouette wrote:
> > On Friday, 11 February 2011 at 09:47 +0100, Adam Borowski wrote:
> > > I'd say there should be no place in Debian in 2011 for software that can't
> > > do UTF-8, especially if near-identical forks exist.
> >
> > That would make a nice addition to the policy, wouldn’t it?
>
>   So long as it is not a MUST, else I have a feeling we'll find many
> many packages RC...
>
>   That aside, I agree with this idea.

A release goal or release requirement might be another way of achieving
this.

However, I'm curious: is there a lot of software that is broken with
Unicode, particularly with the UTF-8 encoding? I can't remember anything
much in recent times.

The first Unicode standard was published in 1991. That's twenty years
ago. Any software that processes text at all and is incapable of dealing
with UTF-8 should be considered with extreme suspicion. Making all such
bugs be release critical (which includes the notion that release
managers may ignore the bug in particular cases) sounds like a good way
to get things under control.

-- 
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/



Re: Make Unicode bugs release critical?

2011-02-11 Thread Miroslav Kure

On Fri, Feb 11, 2011 at 09:37:54AM +0000, Lars Wirzenius wrote:
>
> However, I'm curious: is there a lot of software that is broken with
> Unicode, particularly with the UTF-8 encoding? I can't remember anything
> much in recent times.

Mostly it is just the old stuff like
 - eterm, aterm
 - elvis
 - X tools from the basic package (xman, xmessage, xmore, ...)
 - TeX without additional packages

-- 
Miroslav Kure



Re: Make Unicode bugs release critical?

2011-02-11 Thread Andrey Rahmatullin

On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote:
> > However, I'm curious: is there a lot of software that is broken with
> > Unicode, particularly with the UTF-8 encoding? I can't remember anything
> > much in recent times.
> Mostly it is just the old stuff like
>  - eterm, aterm
>  - elvis
>  - X tools from the basic package (xman, xmessage, xmore, ...)
>  - TeX without additional packages

 - tr(1)

-- 
WBR, wRAR



Re: Make Unicode bugs release critical?

2011-02-11 Thread Luca Capello

Hi there!

On Fri, 11 Feb 2011 11:14:42 +0100, Miroslav Kure wrote:
> On Fri, Feb 11, 2011 at 09:37:54AM +0000, Lars Wirzenius wrote:
> >
> > However, I'm curious: is there a lot of software that is broken with
> > Unicode, particularly with the UTF-8 encoding? I can't remember anything
> > much in recent times.
>
> Mostly it is just the old stuff like
>  - eterm, aterm
>  - elvis
>  - X tools from the basic package (xman, xmessage, xmore, ...)
>  - TeX without additional packages

Plus a2ps (see #180236), something a lot of people use before KSPs, even
if there are various alternatives, some of them already in Debian:

  Message-ID: 4d3f1bf6.1060...@sanctuary.nslug.ns.ca
  URL: http://lists.debian.org/msgid-search/%3c4D3F1BF6.1060604%40sanctuary.nslug.ns.ca%3e

Everything should be at http://wiki.debian.org/UTF8BrokenApps.

Thx, bye,
Gismo / Luca



Re: Make Unicode bugs release critical?

2011-02-11 Thread Roger Leigh

On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote:
> On Fri, Feb 11, 2011 at 09:37:54AM +0000, Lars Wirzenius wrote:
> >
> > However, I'm curious: is there a lot of software that is broken with
> > Unicode, particularly with the UTF-8 encoding? I can't remember anything
> > much in recent times.
>
> Mostly it is just the old stuff like
>  - eterm, aterm
>  - elvis
>  - X tools from the basic package (xman, xmessage, xmore, ...)
>  - TeX without additional packages

XeTeX and XeLaTeX allow native UTF-8 input.  Should be made the
default, IMO, given how obsolete and broken the standard TeX
encodings are.  Being able to write in actual text rather than
a lot of illegible incantations was a major revelation, and it's
a bit sad it was in that situation in the first place.  It also
sorts out the awful font support, so you can use standard
freetype-registered fonts, again without the pain.  Result: a
document you can actually read in the editor!

IMO all those broken terminal emulators, editors and tools should
be put in the bin.  There are plenty of non-broken replacements, so
why keep them around to bitrot even further?  It's not like it's
going to cause massive inconvenience--they are long obsolete.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.



Re: Make Unicode bugs release critical?

2011-02-11 Thread Klaus Ethgen

Hi,

On Fri, 11 Feb 2011 at 10:37, Lars Wirzenius wrote:
 The first Unicode standard was published in 1991. That's twenty years
 ago. Any software that processes text at all and is incapable of dealing
 with UTF-8 should be considered with extreme suspicion. Making all such
 bugs be release critical (which includes the notion that release
 managers may ignore the bug in particular cases) sounds like a good way
 to get things under control.

I think you are mixing things together. First there is Unicode. There are
several definitions of Unicode (unicode-16, unicode-32, ...), but UTF-8
is not Unicode; it is just one implementation of Unicode, and in my eyes
the most problematic one, as it has undefined states and is variable length.

However, UTF-8 was created to allow using Unicode in non-Unicode
environments. For me that was always a pointless plan, and the unreadable
UTF-8 characters all around, from buggy software that cannot handle
encodings correctly (and there is a lot of it around) and from ignorant
users who use UTF-8 in environments that are not specified for multibyte
charsets (IRC), are the most annoying part.

While there are places where UTF-8 makes perfect sense and is the best
solution, it is not the best solution for every problem that ignorant users
(me too ;-) have.

So requiring software to be UTF-8 capable is somewhat inconsistent. Software
has to be capable of handling every encoding, as long as it is specified
for those encodings.

Regards
   Klaus
-- 
Klaus Ethgen http://www.ethgen.ch/
pub  2048R/D1A4EDE5 2000-02-26 Klaus Ethgen kl...@ethgen.de
Fingerprint: D7 67 71 C4 99 A6 D4 FE  EA 40 30 57 3C 88 26 2B


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/2011025946.ga4...@ikki.ethgen.ch

Re: Make Unicode bugs release critical?

2011-02-11 Thread Torsten Werner

On -10.01.-28163 20:59, Andrey Rahmatullin wrote:
 On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote:
 However, I'm curious: is there a lot of software that is broken with
 Unicode, particularly with the UTF-8 encoding? I can't remember anything
 much in recent times.
 Mostly it is just the old stuff like
  - eterm, aterm
  - elvis
  - X tools from the basic package (xman, xmessage, xmore, ...)
  - TeX without additional packages
 - tr(1)

grep, sed, awk, bash, ...

Torsten


Archive: http://lists.debian.org/4d552988.1020...@debian.org

Re: Make Unicode bugs release critical?

2011-02-11 Thread Andrey Rahmatullin

On Fri, Feb 11, 2011 at 01:20:24PM +0100, Torsten Werner wrote:
  However, I'm curious: is there a lot of software that is broken with
  Unicode, particularly with the UTF-8 encoding? I can't remember anything
  much in recent times.
  Mostly it is just the old stuff like
   - eterm, aterm
   - elvis
   - X tools from the basic package (xman, xmessage, xmore, ...)
   - TeX without additional packages
  - tr(1)
 grep, sed, awk, bash, ...
http://bugs.debian.org/495677

-- 
WBR, wRAR



Re: Make Unicode bugs release critical?

2011-02-11 Thread Lars Wirzenius

On Fri, 2011-02-11 at 13:20 +0100, Torsten Werner wrote:
 grep, sed, awk, bash, ...

grep, sed, and awk, at least, seem to work acceptably for me with UTF-8.
The support can be improved, I'm sure.

-- 
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/


Archive: http://lists.debian.org/1297428633.3105.56.ca...@havelock.lan

Re: Make Unicode bugs release critical?

2011-02-11 Thread Norbert Preining

On Fri, 11 Feb 2011, Roger Leigh wrote:
 XeTeX and XeLaTeX allow native UTF-8 input.  Should be made the
 default, IMO, given how obsolete and broken the standard TeX
 encodings are.  Being able to write in actual text rather than

Please don't write rubbish if you don't know what you are talking about!!!

You have apparently no idea between input and font encoding.

LaTeX can easily use utf8 with the appropriate inputenc, as well
as dozens of other encodings. Not all of the world is using UTF-8.
UTF-8 is still tailored to western roman script, thus very unpopular
in Japan for example.

 sorts out the awful font support, so you can use standard
 freetype-registered fonts, again without the pain.  Result: a
 document you can actually read in the editor!

Argg, PLEASE STOP THAT RUBBISH

I never use xetex, I write a lot in German (umlauts), Japanese,
Italian, ...

TeX is different, don't try to throw away working solutions of 20 years
because of your ignorance.

ARrggg. I love people blabbering like drunkards.

 IMO all those broken terminal emulators, editors and tools should
 be put in the bin.  There are plenty of non-broken replacements, so
 why keep them around to bitrot even further?  It's not like it's

So what is the replacement for tex?
Yeah I know, it is *luatex* but we are FAR from being stable and
usable.

XeTeX is nice for certain things, but not for all. Have you tried to
set Tibetan text with XeTeX? The last time I tried it was a mess.
And with Khmer (the language and script of Cambodia) it is even worse.
Just because you are only using ASCII characters, please don't make the
rest of the world laugh at you.

Best wishes

Norbert
(mumbling throw away in the bin*, *standard freetype*, ...)


Norbert Preiningpreining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live  Debian Developer
DSA: 0x09C5B094   fp: 14DF 2E6C 0307 BE6D AD76  A9C0 D2BF 4AA3 09C5 B094

HAGNABY (n.)
Someone who looked a lot more attractive in the disco than they do in
your bed the next morning.
--- Douglas Adams, The Meaning of Liff


Archive: http://lists.debian.org/20110211124629.ga1...@gamma.logic.tuwien.ac.at

Re: Make Unicode bugs release critical?

2011-02-11 Thread Faidon Liambotis


On 02/11/11 14:20, Torsten Werner wrote:


grep, sed, awk, bash, ...


?

$ echo αβγ | sed 's/./a/'
aβγ

Regards,
Φαίδων :-)


Archive: http://lists.debian.org/4d553373.5060...@debian.org

Re: Make Unicode bugs release critical?

2011-02-11 Thread Vincent Lefevre

On 2011-02-11 21:46:29 +0900, Norbert Preining wrote:
 On Fri, 11 Feb 2011, Roger Leigh wrote:
  XeTeX and XeLaTeX allow native UTF-8 input.  Should be made the
  default, IMO, given how obsolete and broken the standard TeX
  encodings are.  Being able to write in actual text rather than
 
 Please don't write rubbish if you don't know what you are talking about!!!
 
 You have apparently no idea between input and font encoding.
 
 LaTeX can easily use utf8 with the appropriate inputenc,

Which one???

FYI, utf8 is very incomplete and utf8x is broken (bug 601365).

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)


Archive: http://lists.debian.org/20110211131843.gh15...@prunille.vinc17.org

Re: Make Unicode bugs release critical?

2011-02-11 Thread Roger Leigh

On Fri, Feb 11, 2011 at 09:46:29PM +0900, Norbert Preining wrote:
 On Fri, 11 Feb 2011, Roger Leigh wrote:
  XeTeX and XeLaTeX allow native UTF-8 input.  Should be made the
  default, IMO, given how obsolete and broken the standard TeX
  encodings are.  Being able to write in actual text rather than
 
 Please don't write rubbish if you don't know what you are talking about!!!

Um, no need to be rude.  Please keep your reply to technical points;
if I've said something incorrect, by all means correct me, but
insults are a step too far.  I haven't said anything that could justify
them, other than the fact that you disagree with my /opinion/.

 You have apparently no idea between input and font encoding.

I only mentioned UTF-8 with regard to input, so you are assuming
too much.

 LaTeX can easily use utf8 with the appropriate inputenc, as well
 as dozens of other encodings. Not all of the world is using UTF-8.
 UTF-8 is still tailored to western roman script, thus very unpopular
 in Japan for example.

The inputenc hack only gets you so far.  I tried to go this way, and
ran into all sorts of issues with UTF-8 in macro definitions getting
scrambled and other sources of pain.  With XeLaTeX I had no such
troubles.  So IME inputenc was not a suitable solution for serious
UTF-8 work.

  sorts out the awful font support, so you can use standard
  freetype-registered fonts, again without the pain.  Result: a
  document you can actually read in the editor!
 
 Argg, PLEASE STOP THAT RUBBISH

What you are calling rubbish is not in any way false.  It's given
me the ability to have nice legible UTF-8-encoded documents, with
excellent font support.  There may be other ways.  There may be
better ways.  But it's not wrong.

[snip rant]

  IMO all those broken terminal emulators, editors and tools should
  be put in the bin.  There are plenty of non-broken replacements, so
  why keep them around to bitrot even further?  It's not like it's
 
 So what is the replacement for tex?
 Yeah I know, it is *luatex* but we are FAR from being stable and
 usable.

Well I thought the jury was still out on which was the better solution.
I really couldn't care less which wins; I'm using the solution which
works right now, and I'll happily adopt whatever is better down the
line.

 XeTeX is nice for certain things, but not for all. Have you tried to
 set Tibetan text with XeTeX? The last time I tried it was a mess.
 And with Khmer (the language and script of Cambodia) it is even worse.
 Just because you are only using ASCII characters, please don't make the
 rest of the world laugh at you.

You are again making unwarranted assumptions.  I might not be using it
for difficult-to-set languages, but I'm certainly not using ASCII
characters only, or I wouldn't be needing UTF-8 input.  


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.



Re: Make Unicode bugs release critical?

2011-02-11 Thread Vincent Lefevre

On 2011-02-11 15:33:49 +0500, Andrey Rahmatullin wrote:
 On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote:
   However, I'm curious: is there a lot of software that is broken with
   Unicode, particularly with the UTF-8 encoding? I can't remember anything
   much in recent times.
  Mostly it is just the old stuff like
   - eterm, aterm
   - elvis
   - X tools from the basic package (xman, xmessage, xmore, ...)
   - TeX without additional packages
 - tr(1)

less has problems with new Unicode characters (bug 597918).

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)


Archive: http://lists.debian.org/20110211133024.gi15...@prunille.vinc17.org

Re: Make Unicode bugs release critical?

2011-02-11 Thread Adam Borowski

On Fri, Feb 11, 2011 at 12:59:46PM +0100, Klaus Ethgen wrote:
 On Fri, 11 Feb 2011 at 10:37, Lars Wirzenius wrote:
  The first Unicode standard was published in 1991. That's twenty years
  ago. Any software that processes text at all and is incapable of dealing
  with UTF-8 should be considered with extreme suspicion. Making all such
  bugs be release critical (which includes the notion that release
  managers may ignore the bug in particular cases) sounds like a good way
  to get things under control.
 
 I think you are mixing things together. First there is Unicode. There are
 several definitions of Unicode (unicode-16, unicode-32, ...), but UTF-8
 is not Unicode; it is just one implementation of Unicode, and in my eyes
 the most problematic one, as it has undefined states and is variable length.

There is just one definition of Unicode, any new versions merely add extra
characters, collating rules, etc.

There are several ways to represent Unicode as a stream of bytes.  Only one
of them is fit for external storage, and that's UTF-8 since it doesn't break
the assumptions that are true for text files:
1. no null bytes
2. basic newlines, etc are always newlines, never a part of a bigger
   character (not true for some ancient multibyte encodings)
3. not affected by endianness or any other internal detail

Also, _all_ Unicode encodings are of variable length.
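
The three properties listed above can be checked directly. What follows is only a minimal sketch in Python; the sample strings are arbitrary choices for illustration and do not come from the thread. It compares UTF-8 with UTF-16 and UTF-32 on each point.

```python
# Sample text: 'a', U+0105 (a with ogonek) and a newline. Illustrative only.
text = "a\u0105\n"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
utf32_le = text.encode("utf-32-le")

# 1. No NUL bytes: UTF-8 emits 0x00 only for U+0000 itself.
utf8_has_nul = b"\x00" in utf8
utf16_has_nul = b"\x00" in utf16_le
utf32_has_nul = b"\x00" in utf32_le

# 2. The byte 0x0A is always a newline in UTF-8. U+010A contains the byte
#    0x0A in its UTF-16 form, but not in its UTF-8 form.
tricky = "\u010a"
newline_byte_in_utf8 = b"\n" in tricky.encode("utf-8")
newline_byte_in_utf16 = b"\n" in tricky.encode("utf-16-le")

# 3. Endianness: UTF-16 has two byte orders, UTF-8 has exactly one.
byte_order_matters = utf16_le != utf16_be

print("NUL bytes (utf-8, utf-16, utf-32):",
      utf8_has_nul, utf16_has_nul, utf32_has_nul)
print("0x0A inside U+010A (utf-8, utf-16):",
      newline_byte_in_utf8, newline_byte_in_utf16)
print("UTF-16 byte order changes the bytes:", byte_order_matters)
```

A line-oriented tool that splits on the byte 0x0A would therefore cut U+010A in half in a UTF-16 file, while the same tool stays safe on UTF-8 input.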

 However, UTF-8 was created to allow using Unicode in non-Unicode
 environments. For me that was always a pointless plan, and the unreadable
 UTF-8 characters all around, from buggy software that cannot handle
 encodings correctly (and there is a lot of it around) and from ignorant
 users who use UTF-8 in environments that are not specified for multibyte
 charsets (IRC), are the most annoying part.

UTF-8 was never meant as merely a tool to allow using unicode in
non-unicode environments.

UTF-32 is useful only as an internal representation if you do care about a
string of code points.  Since a single character can consist of multiple
such code points, it doesn't give you much unless you have to pass every
code point through a function like wcwidth() -- ie, you are implementing
something low-level which cares about properties of characters and their
parts.  You should never place UTF-32 into external storage that is not
private to your program or can possibly be moved.
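
To illustrate the point that one character can span several code points even in UTF-32, here is a small Python sketch. The letter chosen is an assumption made for the example; any base letter followed by a combining mark behaves the same way.

```python
import unicodedata

# 'o' followed by U+0308 COMBINING DIAERESIS: one visible character,
# two code points.
decomposed = "o\u0308"

# NFC normalisation folds the pair into the single code point U+00F6.
composed = unicodedata.normalize("NFC", decomposed)

code_points_decomposed = len(decomposed)             # counts code points
code_points_composed = len(composed)
utf32_bytes = len(decomposed.encode("utf-32-le"))    # 4 bytes per code point
is_combining = unicodedata.combining("\u0308") != 0  # non-zero combining class

print(code_points_decomposed, code_points_composed, utf32_bytes, is_combining)
```

So a fixed four bytes per code point still does not give a fixed size per user-visible character, which is why UTF-32 mostly helps code that has to inspect individual code points.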

UTF-16 is never, ever useful.  It is a sad trap for win32 and Java
developers, due to a bad engineering decision suggested, as I was told, by
delegates from Microsoft and Sun, who wanted to conserve disk space and
memory by storing separately code points and a language tag -- ie, exactly
the thing Unicode was supposed to rid us of.  Even on day one, it was
known that you can't fit all characters into 16 bits, and the decision to
put all rare characters into a private area that needs out of band
information was pretty ridiculous.  The end result is, you have an encoding
with all downsides of UTF-8 but none of the advantages.

Since neither UTF-16 nor UTF-32 can be considered text, the decision all
UNIX systems made was to use UTF-8 in the libc's API in all Unicode locales. 
Otherwise, you'd need separate APIs like FooBarA()/FooBarW() on Windows,
which cause no end of problems.

 So requiring software to be UTF-8 capable is somewhat inconsistent. Software
 has to be capable of handling every encoding, as long as it is specified
 for those encodings.

No, there is only one encoding left, as long as you don't have to talk to
Windows.  We can start purging away all the support for ancient charsets in
places that do not need to handle foreign data.  Debian has used UTF-8 as
default for 5 releases already, and if you try to use an ancient locale, do
not expect good results since no one bothers fixing bugs there.  And
maintaining unused code costs time and causes a risk of bugs, so good
riddance!

-- 
1KB // Microsoft corollary to Hanlon's razor:
//  Never attribute to stupidity what can be
//  adequately explained by malice.


Archive: http://lists.debian.org/20110211133612.ga2...@angband.pl

Re: Make Unicode bugs release critical?

2011-02-11 Thread Norbert Preining

On Fri, 11 Feb 2011, Roger Leigh wrote:
 Um, no need to be rude. 

Well, you started with throw TeX into the bin! (cum grano salis)
The only possible answer to that is mine. Or shutting up and ignoring
that kind of rants from your side.

 insults is a step too far.  I haven't said anything that could justify
 it, other than the fact that you disagree with my /opinion/.

Very simple: replacing *tex* with *xetex* will break existing
documents. And that is a no-go. That is TeX world. 
You are talking about WinWord world.

  You have apparently no idea between input and font encoding.
 
 I only mentioned UTF-8 with regard to input, so you are assuming
 too much.

You mentioned *fontconfig* which is font encoding, and has nothing
whatsoever to do with inputenc. I don't assume too much.

 The inputenc hack only gets you so far.  I tried to go this way, and

Agreed. Improvements are welcome, please help and fix the 
shortcomings.

   sorts out the awful font support, so you can use standard
   freetype-registered fonts, again without the pain.  Result: a
   document you can actually read in the editor!
  
  Argg, PLEASE STOP THAT RUBBISH
 
 What you are calling rubbish is not in any way false.  It's given

It *IS* wrong.
You are stating that using freetype-registered fonts makes a document
readable by the editor. Sorry, this is ridiculous.
- different fonts might register themselves under different names 
  to fontconfig
- fonts might not be available here or there and might not be embedded
  in the pdf

DEK wrote his own font loading mechanism because he wanted to be sure
that documents *can* be typeset also on any other machine, and that
works.
If you use xetex that might work, or might not work, or might work
but you are suddenly missing some characters
(there is for example a version of the palatino fonts with cyrillic
characters, and a version without cyrillic characters, some systems
have these *enriched* fonts and don't embed them properly. Then,
suddenly, on the target system, characters disappear. Is THIS 
the way you want to typeset documents?)

I repeat: RUBBISH.

 Well I thought the jury was still out on which was the better solution.

Most people I know in the TeX community are seeing the real future with
luatex.

Best wishes

Norbert

Norbert Preiningpreining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live  Debian Developer
DSA: 0x09C5B094   fp: 14DF 2E6C 0307 BE6D AD76  A9C0 D2BF 4AA3 09C5 B094

MARYTAVY (n.)
A person to whom, under dire injunctions of silence, you tell a secret
which you wish to be fare more widely known.
--- Douglas Adams, The Meaning of Liff


Archive: http://lists.debian.org/20110211134338.gh1...@gamma.logic.tuwien.ac.at

Re: Make Unicode bugs release critical?

2011-02-11 Thread Torsten Werner

On 11.02.2011 14:02, Faidon Liambotis wrote:
 $ echo αβγ | sed 's/./a/'
 aβγ

Okay. But...

$ echo αβγ | busybox sed 's/./a/'
a�βγ

:)
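
The difference between the two sed outputs is consistent with "." matching one character in one tool and one byte in the other. That reading is an assumption about the cause, not something the thread states. A short Python sketch of the two behaviours, using the re module as a stand-in for sed:

```python
import re

text = "\u03b1\u03b2\u03b3"   # the Greek letters alpha, beta, gamma
data = text.encode("utf-8")   # six bytes, two per letter

# Character-oriented: "." matches the whole first letter.
char_result = re.sub(r".", "a", text, count=1)

# Byte-oriented: "." matches only the first byte of alpha (0xCE) and
# leaves its continuation byte (0xB1) stranded.
byte_result = re.sub(rb".", b"a", data, count=1)

# A UTF-8 terminal shows the stranded byte as U+FFFD.
byte_result_shown = byte_result.decode("utf-8", errors="replace")

print(char_result)
print(byte_result_shown)
```

The first line printed matches the GNU sed output quoted above, and the second matches the busybox output, replacement character included.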


Archive: http://lists.debian.org/4d553bf6.9020...@debian.org

Re: Make Unicode bugs release critical?

2011-02-11 Thread Adam Borowski

On Fri, Feb 11, 2011 at 02:30:24PM +0100, Vincent Lefevre wrote:
 On 2011-02-11 15:33:49 +0500, Andrey Rahmatullin wrote:
  On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote:
    However, I'm curious: is there a lot of software that is broken with
    Unicode, particularly with the UTF-8 encoding? I can't remember anything
    much in recent times.
 
 less has problems with new Unicode characters (bug 597918).

Unicode 6.0 came out in October 2010, well after Squeeze's freeze, so you
can't expect support for new characters already.  They are in no fonts
shipped with Squeeze, so not recognizing the characters as valid is not a
big problem.

Less shouldn't maintain a private copy of character properties if all that
data is already present in libc -- but guess what, wcwidth(0x1F4A9) and
iswprint() don't know them either.

So oh well, Squeeze won't display such vital characters as  kitten[1],
 ghost,  japanese ogre or  pile of shit.  Gotta invest in a
crystal ball that will tell us what new characters will be.


[1]. To see my examples, you can grab:
http://angband.pl/debian/pool/main/t/ttf-ancient-fonts/ttf-ancient-fonts_2.52-1.0kb1_all.deb

(newer than the version in unstable, Gürkan Sengün's version is
404-compliant, let's poke him so we have _one_ Unicode 6.0 font in Debian).

-- 
1KB // Microsoft corollary to Hanlon's razor:
//  Never attribute to stupidity what can be
//  adequately explained by malice.


Archive: http://lists.debian.org/20110211140202.gb2...@angband.pl

Re: Make Unicode bugs release critical?

2011-02-11 Thread Roger Leigh

On Fri, Feb 11, 2011 at 10:43:38PM +0900, Norbert Preining wrote:
 On Fri, 11 Feb 2011, Roger Leigh wrote:
  Um, no need to be rude. 
 
 Well, you started with throw TeX into the bin! (cum grano salis)
 The only possible answer to that is mine. Or shutting up and ignoring
 that kind of rants from your side.

Please read what I said carefully, rather than imagined slights.
I did not at any point state that TeX should be thrown in the bin;
that was with regard to broken terminal emulators, editors and
tools.  I fully believe we should remove obsolete tools which have
superior replacements.  I did not include TeX in that category.

   You have apparently no idea between input and font encoding.
  
  I only mentioned UTF-8 with regard to input, so you are assuming
  too much.
 
 You mentioned *fontconfig* which is font encoding, and has nothing
 whatsoever to do with inputenc. I don't assume too much.

No, I mentioned fontconfig because XeTeX allows use of system fonts
via fontconfig.  That was completely separate from UTF-8 input.

    sorts out the awful font support, so you can use standard
    freetype-registered fonts, again without the pain.  Result: a
    document you can actually read in the editor!
   
   Argg, PLEASE STOP THAT RUBBISH
  
  What you are calling rubbish is not in any way false.  It's given
 
 It *IS* wrong.
 You are stating that using freetype-registered fonts makes a document
 readable by the editor. Sorry, this is ridiculous.
 - different fonts might register themselves under different names 
   to fontconfig
 - fonts might not be available here or there and might not be embedded
   in the pdf

[...]

 I repeat: RUBBISH.

I didn't state any of those things.  Please calm down, and please
read what I actually wrote, rather than what you thought I wrote.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.



Re: Make Unicode bugs release critical?

2011-02-11 Thread Norbert Preining

On Fri, 11 Feb 2011, Roger Leigh wrote:
 read what I actually wrote, rather than what you thought I wrote.

So *what* is your proposal, instead of discussing uselessly and wasting
bytes?

Is it:
ln -sf tex xetex

Best wishes

Norbert

Norbert Preiningpreining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live  Debian Developer
DSA: 0x09C5B094   fp: 14DF 2E6C 0307 BE6D AD76  A9C0 D2BF 4AA3 09C5 B094

LITTLE URSWICK (n.)
The member of any class who most inclines a teacher towards the view
that capital punishment should be introduced in schools.
--- Douglas Adams, The Meaning of Liff


Archive: http://lists.debian.org/20110211141749.gl1...@gamma.logic.tuwien.ac.at

Re: Make Unicode bugs release critical?

2011-02-11 Thread Vincent Lefevre

On 2011-02-11 15:02:02 +0100, Adam Borowski wrote:
 On Fri, Feb 11, 2011 at 02:30:24PM +0100, Vincent Lefevre wrote:
  On 2011-02-11 15:33:49 +0500, Andrey Rahmatullin wrote:
   On Fri, Feb 11, 2011 at 11:14:42AM +0100, Miroslav Kure wrote:
     However, I'm curious: is there a lot of software that is broken with
     Unicode, particularly with the UTF-8 encoding? I can't remember anything
     much in recent times.
  
  less has problems with new Unicode characters (bug 597918).
 
 Unicode 6.0 came out in October 2010,

The character mentioned in my bug report (U+1E9F LATIN SMALL LETTER DELTA)
appeared in Unicode 5.1.0 (March 2008).

 well after Squeeze's freeze, so you can't expect support for new
 characters already.

Well, March 2008 was more than 1 year before Squeeze's freeze.

 They are in no fonts shipped with Squeeze, so not recognizing the
 characters as valid is not a big problem.

Fonts containing the character in question are shipped with Squeeze:
the character appears correctly in xterm.

 Less shouldn't maintain a private copy of character properties if
 all that data is already present in libc

I agree.

 -- but guess what, wcwidth(0x1F4A9) and iswprint() don't know them
 either.

No problems with U+1E9F:

Property alnum : yes
Property alpha : yes
Property cntrl : no
Property digit : no
Property graph : yes
Property lower : yes
Property print : yes
Property punct : no
Property space : no
Property upper : no
Property xdigit: no
wcwidth = 1

So, if less were using libc, it wouldn't have any problem with
this character.
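
For comparison, the same kind of property listing can be produced from Python. This is only an analogue: Python's unicodedata module ships its own copy of the Unicode tables and does not consult libc, so it says nothing about what iswprint() or wcwidth() return on a given system.

```python
import unicodedata

ch = "\u1e9f"  # the character from the bug report

# Properties taken from Python's bundled Unicode database, not from libc.
properties = {
    "name": unicodedata.name(ch),
    "category": unicodedata.category(ch),
    "alnum": ch.isalnum(),
    "alpha": ch.isalpha(),
    "digit": ch.isdigit(),
    "lower": ch.islower(),
    "upper": ch.isupper(),
    "space": ch.isspace(),
    "print": ch.isprintable(),
}

for key, value in properties.items():
    print(f"Property {key:<8}: {value}")
```

A table that predates Unicode 5.1 would raise ValueError in unicodedata.name() for this code point, which is the same class of problem as a private, outdated table inside a pager.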

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)


Archive: http://lists.debian.org/20110211143511.gj15...@prunille.vinc17.org

Re: Make Unicode bugs release critical?

2011-02-11 Thread Joey Hess

Lars Wirzenius wrote:
 However, I'm curious: is there a lot of software that is broken with
 Unicode, particularly with the UTF-8 encoding? I can't remember anything
 much in recent times.

We chose an 80% quickfix to get where we are, and so now we have the
other 80% to go. It's been whittled away at for the past 10 years or so,
but still a lot left.

And, that's utf8 support, only. It's probably a pipe dream to expect
other unicode encodings to work half as well, and surely other encodings
fare even worse overall. If anything, utf8 probably makes the overall
situation worse for other encodings, since we expect it to just work,
and give up on handling the other complexity.

 The first Unicode standard was published in 1991. That's twenty years
 ago. Any software that processes text at all and is incapable of dealing
 with UTF-8 should be considered with extreme suspicion.

Most languages still make it easy to get wrong, in my experience.

It can be as simple as software written trusting language documentation
that says strings are processed in unicode and doesn't point out all
the exceptions that can let non-unicode data in. For example, this
simple haskell program processes a file's content utf-8 cleanly, but
prints its name like foÃ¶.

import System.Environment
main = do
    args <- getArgs
    let file = head args
    putStrLn $ "file is: " ++ file
    putStr =<< readFile file

This program has an entirely different failure mode; type in
foö (touch it first), and it will complain that fo� doesn't exist.

main = getLine >>= readFile >>= putStr

Neither of these failure modes is obvious from any documentation I've seen.
Both of these programs are something a typical developer would expect to
work. (Both also have unexpected failure modes when LANG=C.)
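
Both symptoms can be reproduced outside Haskell if one assumes that bytes and characters were treated as interchangeable 8-bit values at the filename boundary. That mechanism is an assumption made for this sketch; the message above reports only the symptoms. In Python:

```python
name = "fo\u00f6"             # "fo" plus o-with-diaeresis, as typed
raw = name.encode("utf-8")    # the bytes actually stored in the filename

# Failure mode 1 (assumed): every byte of the argument becomes one
# character, and the result is then printed as UTF-8.
misread = raw.decode("latin-1")

# Failure mode 2 (assumed): the decoded string is cut down to one byte per
# character before the file is opened, so the lookup uses the wrong bytes.
truncated = name.encode("latin-1")
shown = truncated.decode("utf-8", errors="replace")

print(misread)    # two junk characters where the single letter should be
print(truncated)  # differs from the bytes in `raw`, so the file is not found
print(shown)      # what a UTF-8 terminal displays for the wrong name
```

Under these assumptions the first print reproduces the mangled name from the first program, and the last one reproduces the replacement character in the second program's error message.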

Probably every thousand lines of perl has a unicode encoding bug of some
sort. Based on data from my own code. Any perl code that uses an XS module
probably has an encoding bug.

I assume that python had some problems with its unicode support too,
since they saw fit to radically change it in python 3. And it sounds
like the python 3 changes will break unicode in many programs ported
over to it, unless file opens etc are audited and fixed. Stackoverflow
has 1600 matches for python unicode questions.

The best case is probably a language that has a restricted enough
interface that most of these problems are avoided. 
(But, stackoverflow still has 500 javascript unicode questions.)

 Making all such
 bugs be release critical (which includes the notion that release
 managers may ignore the bug in particular cases) sounds like a good way
 to get things under control.

It would probably be a large load on the RMs. It's easy to pick some
random program that works great with unicode and find an edge case. The RMs
would probably prefer to not have git getting RC bugs filed just because
it sometimes exposes filenames written like fo\303\266. :)
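
The escaped form fo\303\266 is just the UTF-8 bytes of the name written in octal. A hypothetical helper (the name unquote_octal is invented for this sketch, and it handles only three-digit octal escapes, not any tool's full quoting rules) recovers the readable name:

```python
import re


def unquote_octal(quoted: str) -> str:
    """Turn three-digit octal escapes back into bytes, then decode as UTF-8."""
    # Replace each \NNN with the character whose code equals that byte value.
    raw = re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), quoted)
    # Every character is now below U+0100, so latin-1 maps it back to a byte.
    return raw.encode("latin-1").decode("utf-8")


readable = unquote_octal(r"fo\303\266")
print(readable)
```

Names that already contain non-ASCII characters alongside the escapes would need more care than this sketch takes.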

-- 
see shy jo, who deals with at least 1 unicode bug a week on average. 4 this week



Re: Make Unicode bugs release critical?

2011-02-11 Thread Marco Túlio Gontijo e Silva

Excerpts from Joey Hess's message of Fri Feb 11 13:39:08 -0200 2011:
(...)
 It can be as simple as software written trusting language documentation
 that says strings are processed in unicode and doesn't point out all
 the exceptions that can let non-unicode data in. For example, this
 simple haskell program processes a file's content utf-8 cleanly, but
 prints its name like foÃ¶.
 
 import System.Environment
 main = do
     args <- getArgs
     let file = head args
     putStrLn $ "file is: " ++ file
     putStr =<< readFile file
 
 This program has an entirely different failure mode; type in
 foö (touch it first), and it will complain that fo� doesn't exist.
 
 main = getLine >>= readFile >>= putStr
 
 Neither of these failure modes is obvious from any documentation I've seen.
 Both of these programs are something a typical developer would expect to
 work. (Both also have unexpected failure modes when LANG=C.)

http://hackage.haskell.org/trac/ghc/ticket/3307

Greetings.
(...)



Re: Make Unicode bugs release critical? (was: Re: RFA: all my packages)

2011-02-11 Thread Axel Beckert

Hi,

Adam Borowski wrote:
 Speaking of rxvt... shouldn't this clusterϫϫck become the only rxvt in
 Debian?  Both rxvt and rxvt-beta, completely dead upstream for 10 and 8
 years respectively, besides having terrible support for terminal codes lack
 even such a tiny detail as UTF-8 support.
 
 I'd say there should be no place in Debian in 2011 for software that
 can't do UTF-8, especially if near-identical forks exist.

I'd replace "especially" with "only" in that sentence. 

Kicking out good and unique software, only because of missing or
incomplete UTF-8 support, will surely lower Debian's quality more than
missing or broken UTF-8 support in very few packages. And it would
make those users (and devs) angry who need that software independently
of working UTF-8 support or not.

Regards, Axel
-- 
 ,''`.  |  Axel Beckert a...@debian.org, http://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
  `-|  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20110211183343.gp12...@sym.noone.org

Re: Make Unicode bugs release critical?

2011-02-11 Thread Kurt Roeckx

On Fri, Feb 11, 2011 at 09:37:54AM +, Lars Wirzenius wrote:
 
 However, I'm curious: is there a lot of software that is broken with
 Unicode, particularly with the UTF-8 encoding? I can't remember anything
 much in recent times.

ispell, aspell.  I think hunspell got fixed recently.


Kurt


-- 
Archive: http://lists.debian.org/20110211203240.ga30...@roeckx.be

Re: Make Unicode bugs release critical?

2011-02-11 Thread Henrique de Moraes Holschuh

On Fri, 11 Feb 2011, Lars Wirzenius wrote:
 However, I'm curious: is there a lot of software that is broken with
 Unicode, particularly with the UTF-8 encoding? I can't remember anything
 much in recent times.

1. Stuff that cannot do one of UTF-8, UTF-16 or UCS-4.

2. Anything that cannot deal with Supplementary planes.

   This includes the use of UCS-2 instead of UTF-16, as it cannot represent
   the Supplementary planes.  python 3 when not compiled to use UCS-4 memory
   hog mode is an example, I am told.

We likely want to restrain ourselves to declaring (1) to be release
critical for Wheezy.
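
To make point (2) concrete, a small Python sketch of why UCS-2 falls short: a supplementary-plane character needs two 16-bit code units (a surrogate pair) in UTF-16, and UCS-2 has no such mechanism. The helper names are invented for this illustration.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def fits_in_ucs2(text: str) -> bool:
    """True if every character lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)
```

For U+1D11E (MUSICAL SYMBOL G CLEF), utf16_code_units returns the pair 0xD834, 0xDD1E, and fits_in_ucs2 returns False.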

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


-- 
Archive: http://lists.debian.org/20110211221653.gb18...@khazad-dum.debian.net

Re: Make Unicode bugs release critical?

2011-02-11 Thread Ron Johnson


On 02/11/2011 07:36 AM, Adam Borowski wrote:
 [snip]
 
 UTF-16 is never, ever useful.  It is a sad trap for win32 and Java
 developers, due to a bad engineering decision suggested, as I was told, by
 
 [snip]
 
 No, there is only one encoding left, as long as you don't have to talk to
 Windows.

Never useful except for 90% of the market?  (I wonder how SAMBA
deals with it...)


--
The normal condition of mankind is tyranny and misery.
Milton Friedman


--
Archive: http://lists.debian.org/4d55c263.90...@cox.net

Re: Make Unicode bugs release critical?

2011-02-11 Thread Peter Samuelson


[Ron Johnson]
 Never useful except for 90% of the market?  (I wonder how SAMBA deals
 with it...)

I don't think you really want to know.  There's a 'unicode' flag in
much of the CIFS protocol that means filenames and such are in UTF-16
(I think UTF-16LE) instead of some-random-configured-code-page.
Samba's been using that flag for about 10 years.  You configure it to
say what encoding your filenames are supposed to be on the server, and
it expresses them in UTF-16 on the wire.

Samba also supports non-Unicode-aware clients like Windows 3.11 - or at
least it used to support these - you'd tell Samba what client code page
to translate your filenames into on the wire.  Fun stuff.

Samba doesn't really deal with file _contents_, which is a much more
interesting problem than filenames.  It just serves contents as-is,
like most file service protocols other than FTP.
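
The filename translation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Samba's actual code; the function names and the server_charset parameter are hypothetical stand-ins for the configured server encoding.

```python
def filename_for_wire(on_disk: bytes, server_charset: str) -> bytes:
    """Decode a name from the configured server charset and emit UTF-16LE."""
    return on_disk.decode(server_charset).encode("utf-16-le")


def filename_from_wire(on_wire: bytes, server_charset: str) -> bytes:
    """Reverse direction: UTF-16LE from the client back to the server charset."""
    return on_wire.decode("utf-16-le").encode(server_charset)
```

The same name stored as ISO-8859-1 on one server and as UTF-8 on another ends up as identical bytes on the wire, which is the point of configuring the server-side encoding.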
-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/


-- 
Archive: http://lists.debian.org/20110212003312.gb10...@p12n.org

Re: Make Unicode bugs release critical?

2011-02-11 Thread Adam Borowski

On Fri, Feb 11, 2011 at 08:16:54PM -0200, Henrique de Moraes Holschuh wrote:
 On Fri, 11 Feb 2011, Lars Wirzenius wrote:
  However, I'm curious: is there a lot of software that is broken with
  Unicode, particularly with the UTF-8 encoding? I can't remember anything
  much in recent times.
 
 2. Anything that cannot deal with Supplementary planes.
 
    This includes the use of UCS-2 instead of UTF-16, as it cannot represent
    the Supplementary planes.  python 3 when not compiled to use UCS-4 memory
    hog mode is an example, I am told.

Using UCS-2 is hardly better than using ISO-8859-1 or any other ancient
charset.  Using either UTF-16 or UCS-4 can be a memory hog, which is why you
should pick UTF-8 for regular use.  Except for some rare cases (CJK with no
formatting or markup), it uses less memory and can be passed as-is to POSIX
file functions.
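
A quick way to check the memory claim, as a Python sketch: compare the encoded size of the same text in the three encodings. The function name is made up, and "utf-32-le" is used here as a stand-in for UCS-4.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in each encoding (no BOM)."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UCS-4", "utf-32-le"))
    return {label: len(text.encode(codec)) for label, codec in codecs}
```

ASCII text such as "hello" takes 5, 10 and 20 bytes respectively, while three CJK characters from the BMP take 9, 6 and 12 bytes, which is the rare case where UTF-16 comes out smaller.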

Picking a random subset of Unicode is like putting day-of-the-year in one
byte variable since this way you support 70% of uses and it conserves
memory...

-- 
1KB // Microsoft corollary to Hanlon's razor:
//  Never attribute to stupidity what can be
//  adequately explained by malice.


-- 
Archive: http://lists.debian.org/20110212020220.ga26...@angband.pl

Re: Make Unicode bugs release critical?

2011-02-11 Thread Henrique de Moraes Holschuh

On Sat, 12 Feb 2011, Adam Borowski wrote:
 On Fri, Feb 11, 2011 at 08:16:54PM -0200, Henrique de Moraes Holschuh wrote:
  2. Anything that cannot deal with Supplementary planes.
  
 This includes the use of UCS-2 instead of UTF-16, as it cannot represent
 the Supplementary planes.  python 3 when not compiled to use UCS-4 memory
 hog mode is an example, I am told.
 
 Using UCS-2 is hardly better than using ISO-8859-1 or any other ancient
 charset.  Using either UTF-16 or UCS-4 can be a memory hog, which is why you
 should pick UTF-8 for regular use.  Except for some rare cases (CJK with no

Python 3 uses UCS-2 (or UCS-4) for the internal representation.  Likely
they wanted to have something that made it easy to address each
character in a Unicode string in O(1).

That might actually give better performance given how much people like
to do string slicing and splicing in Python.  The O(N) often required by
UTF-8 and UTF-16 might well be more painful than the much larger data
cache footprint of UCS-4... but that is a damn big *maybe*, and very
unlikely to be consistent across very different architectures.
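
To illustrate the O(1) versus O(N) point, a hedged Python sketch: with a fixed-width encoding the byte offset of character n is a multiplication, while with UTF-8 you have to walk the bytes and count lead bytes. The function names are invented, and this says nothing about how Python itself implements strings.

```python
def ucs4_byte_offset(char_index: int) -> int:
    """Fixed width: character n always starts at byte 4 * n."""
    return 4 * char_index


def utf8_byte_offset(data: bytes, char_index: int) -> int:
    """Variable width: scan from the start, counting non-continuation bytes."""
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a character starts here
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("character index out of range")
```

The scan assumes well-formed UTF-8 input; UTF-16 needs a similar walk as soon as surrogate pairs may be present.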

Well, not like I care.  I don't even have Python 3 installed, and I will
only do so the day something I need decides to pull it as a dependency.

 Picking a random subset of Unicode is like putting day-of-the-year in one

UCS-2 is deprecated as all heck.  As far as I could research through
Google, it is not a valid Unicode representation since Unicode 2.0 (i.e.
1996).  So it wouldn't even count as a random subset of Unicode.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


-- 
Archive: http://lists.debian.org/20110212035533.ga32...@khazad-dum.debian.net
