Lars Wirzenius wrote:
> However, I'm curious: is there a lot of software that is broken with
> Unicode, particularly with the UTF-8 encoding? I can't remember anything
> much in recent times.

We chose an 80% quickfix to get where we are, and so now we have the other 80% to go. It's been whittled away at for the past 10 years or so, but there is still a lot left.

And, that's utf8 support, only. It's probably a pipe dream to expect other unicode encodings to work half as well, and surely other encodings fare even worse overall. If anything, utf8 probably makes the overall situation worse for other encodings, since we expect it to "just work", and give up on handling the other complexity.

> The first Unicode standard was published in 1991. That's twenty years
> ago. Any software that processes text at all and is incapable of dealing
> with UTF-8 should be considered with extreme suspicion.

Most languages still make it easy to get wrong, in my experience. It can be as simple as software written trusting language documentation that says "strings are processed in unicode" and doesn't point out all the exceptions that can let non-unicode data in.

For example, this simple haskell program processes a file's content utf-8 cleanly, but prints its name like "foö".

    import System.Environment

    main = do
            args <- getArgs
            let file = head args
            putStrLn $ "file is: " ++ file
            putStr =<< readFile file

This program has an entirely different failure mode; type in "foö" (touch it first), and it will complain that "fo�" doesn't exist.

    main = getLine >>= readFile >>= putStr

Neither of these failure modes is obvious from any documentation I've seen. Both of these programs are something a typical developer would expect to work. (Both also have unexpected failure modes when LANG=C.)

Probably every thousand lines of perl has a unicode encoding bug of some sort, based on data from my own code. Any perl code that uses an XS module probably has an encoding bug. I assume that python had some problems with its unicode support too, since they saw fit to radically change it in python 3.
And it sounds like the python 3 changes will break unicode in many programs ported over to it, unless file opens etc are audited and fixed. Stackoverflow has 1600 matches for python unicode questions.

The best case is probably a language that has a restricted enough interface that most of these problems are avoided. (But, stackoverflow still has 500 javascript unicode questions.)

> Making all such
> bugs be release critical (which includes the notion that release
> managers may ignore the bug in particular cases) sounds like a good way
> to get things under control.

It would probably be a large load on the RMs. It's easy to pick some random program that works great with unicode and find an edge case. The RMs would probably prefer to not have git getting RC bugs filed just because it sometimes exposes filenames written like "fo\303\266". :)

-- 
see shy jo, who deals with at least 1 unicode bug a week on average. 4 this week
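
For readers wondering where a form like "fo\303\266" comes from, here is a minimal sketch of the byte-level picture, using only the base library. The names encodeUtf8, bytesAsChars and octalEscape are hypothetical helpers made up for this illustration. The sketch does not claim to reproduce what getArgs, readFile or git do internally. It only shows that the UTF-8 bytes of "foö" turn into four characters when each byte is misread as one character, and into backslash-octal escapes when printed byte by byte.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import Numeric (showOct)

-- Encode one Unicode code point as UTF-8 bytes (hand-rolled so that
-- the sketch needs nothing outside base).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Hypothetical helper: UTF-8 encode a whole string.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- Hypothetical helper: misread every byte as one character, which is
-- what a Latin-1 style decode of UTF-8 data amounts to.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

-- Hypothetical helper: keep ASCII bytes, write the rest as
-- backslash-octal escapes.
octalEscape :: [Word8] -> String
octalEscape = concatMap esc
  where
    esc b
      | b < 0x80  = [chr (fromIntegral b)]
      | otherwise = '\\' : showOct b ""

main :: IO ()
main = do
  let name  = "fo\246"          -- "foö", written with a decimal escape
      bytes = encodeUtf8 name
  print bytes                          -- [102,111,195,182]
  print (length name)                  -- 3
  print (length (bytesAsChars bytes))  -- 4
  putStrLn (octalEscape bytes)         -- fo\303\266
```

The name is written as "fo\246" so that the result does not depend on how the source file itself is encoded, and only ASCII is printed so that the output does not depend on the locale.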