Re: Make Unicode bugs release critical?

Joey Hess Fri, 11 Feb 2011 07:39:31 -0800

Lars Wirzenius wrote:
> However, I'm curious: is there a lot of software that is broken with
> Unicode, particularly with the UTF-8 encoding? I can't remember anything
> much in recent times.


We chose an 80% quickfix to get where we are, and so now we have the
other 80% to go. It's been whittled away at for the past 10 years or so,
but there's still a lot left.

And, that's utf8 support, only. It's probably a pipe dream to expect
other unicode encodings to work half as well, and surely other encodings
fare even worse overall. If anything, utf8 probably makes the overall
situation worse for other encodings, since we expect it to "just work",
and give up on handling the other complexity.

> The first Unicode standard was published in 1991. That's twenty years
> ago. Any software that processes text at all and is incapable of dealing
> with UTF-8 should be considered with extreme suspicion.

Most languages still make it easy to get wrong, in my experience.

It can be as simple as software written trusting language documentation
that says "strings are processed in unicode" and doesn't point out all
the exceptions that can let non-unicode data in. For example, this
simple Haskell program processes a file's content utf-8 cleanly, but
prints its name like "foÃ¶".

import System.Environment
main = do
        args <- getArgs
        let file = head args
        putStrLn $ "file is: " ++ file
        putStr =<< readFile file

This program has an entirely different failure mode; touch a file named
"foö", then type that name in, and it will complain that "fo�" doesn't exist.

main = getLine >>= readFile >>= putStr

Neither of these failure modes is obvious from any documentation I've seen.
Both of these programs are something a typical developer would expect to
work. (Both also have unexpected failure modes when LANG=C.)
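
Both symptoms are consistent with bytes and characters being mixed up
one-for-one. Here is a minimal sketch of that, using only base. The
names utf8Encode, asLatin1 and truncateToBytes are made up for
illustration; this is an assumption about what happens underneath, not
a look at the runtime's actual code.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode one character as its UTF-8 byte sequence.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, for the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Six payload bits with the 10xxxxxx continuation marker.
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- Treat every byte as one character, as a Latin-1 decoder would.
asLatin1 :: [Word8] -> String
asLatin1 = map (chr . fromIntegral)

-- Keep only the low 8 bits of every character.
truncateToBytes :: String -> [Word8]
truncateToBytes = map (fromIntegral . ord)

main :: IO ()
main = do
  let name = "fo\246"  -- "foö", written with an escape
  -- The UTF-8 bytes of the name: [102,111,195,182]
  print (concatMap utf8Encode name)
  -- Those bytes read as Latin-1 are the code points of "foÃ¶":
  -- [102,111,195,182]
  print (map ord (asLatin1 (concatMap utf8Encode name)))
  -- The name truncated to bytes ends in a lone 0xF6, which is not
  -- valid UTF-8: [102,111,246]
  print (truncateToBytes name)
```

The first mix-up gives the "foÃ¶" of the first program; the second gives
a byte that a UTF-8 terminal can only show as "�".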

Based on data from my own code, probably every thousand lines of perl
has a unicode encoding bug of some sort. Any perl code that uses an XS
module probably has an encoding bug.

I assume that python had some problems with its unicode support too,
since they saw fit to radically change it in python 3. And it sounds
like the python 3 changes will break unicode in many programs ported
over to it, unless file opens etc are audited and fixed. Stackoverflow
has 1600 matches for python unicode questions.

The best case is probably a language that has a restricted enough
interface that most of these problems are avoided.
(But, stackoverflow still has 500 javascript unicode questions.)

> Making all such
> bugs be release critical (which includes the notion that release
> managers may ignore the bug in particular cases) sounds like a good way
> to get things under control.

It would probably be a large load on the RMs. It's easy to pick some
random program that works great with unicode and find an edge case. The RMs
would probably prefer to not have git getting RC bugs filed just because
it sometimes exposes filenames written like "fo\303\266". :)
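
That "fo\303\266" form is just the UTF-8 bytes of "foö" with each
non-ASCII byte written as a backslash and its octal value. A rough
sketch of the idea follows; octalEscape is a hypothetical helper, not
git's actual quoting code, which handles more cases than this.

```haskell
import Data.Char (chr)
import Data.Word (Word8)
import Numeric (showOct)

-- Pass ASCII bytes through; write every other byte as \ooo in octal.
octalEscape :: [Word8] -> String
octalEscape = concatMap esc
  where
    esc b
      | b < 0x80  = [chr (fromIntegral b)]
      -- Bytes >= 0x80 always take exactly three octal digits.
      | otherwise = '\\' : showOct b ""

main :: IO ()
main =
  -- The UTF-8 bytes of "foö"; prints fo\303\266
  putStrLn (octalEscape [0x66, 0x6F, 0xC3, 0xB6])
```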

-- 
see shy jo, who deals with at least 1 unicode bug a week on average. 4 this week
