Tuomo Valkonen wrote:
> On 2007-10-18, Russell Shaw <[EMAIL PROTECTED]> wrote:
>> What alternative is there to UTF-8? An advantage of monoculturalism is
>> that if the architecture is sufficient, everything can be consistent
>> and easy.
>
> There are problems with locale encoding and wchar_t, but fundamentally
> their abstraction is better than specifying a Single Global Encoding.
> Specifying "everything is UTF-8" is an evolutionary dead end. I think
> it's better to say "here's wchar_t and functions to operate on it. We
> don't actually specify what the actual encoding is, so it's a black
> box that can easily be changed." Much the same goes for LC_CTYPE
> multibyte encodings. Unfortunately, they forgot to provide convenient
> functions for encoding conversions when communicating with the
> external world (those should mostly be in libraries, seldom in
> applications), the libc multibyte routines are a bit too limited, etc.
> That, however, is something that could easily be solved if people
> weren't so intent on creating another problem almost as big as the
> ASCII and Latin-1 assumptions that we're still suffering from. Indeed,
> you need and want that kind of library to conveniently use a Single
> Global Standard too; the difference is that by specifying a particular
> encoding, clean design is not encouraged, and applications can and
> will expect that encoding rather than doing things abstractly through
> a handful of libraries that could easily be changed (or configured).
>
> Another major problem is the Unix and C "untyped" text file and stream
> legacy: you have to assume every file is in some encoding -- ASCII,
> LC_CTYPE, UTF-8, or so -- which it may not be. That could also be
> solved by, e.g., creating a "typed" plain-text file (the MIME type
> could be stored on the filesystem) and stream format, assuming the
> locale encoding for legacy material, and opening text files through
> some library as text streams that then do the conversions to the
> abstract application-internal encoding (either a multibyte encoding --
> not necessarily LC_CTYPE, to allow wider character ranges internally
> in programs than in legacy files -- or wide characters). That's a
> rather big task, but not really much bigger than a transition to a
> global monoculture.
I find it hard to see those problems because I rarely handle non-English
text. In the general-purpose editing applications I've made (like a word
processor), any non-English text is passed out to a "black box" Unicode
layout processor plugin for things like paragraph formatting, and I can
make it UTF-8 or UTF-32 or whatever data encoding is convenient. I see
"all UTF-8" as only applying between completely separate applications on
the PC.

I've done hardly any non-English processing, but IIRC, UTF-8 files may
start with a magic number (the optional byte-order-mark signature). If
all text files were UTF-8, the magic number wouldn't be needed. I'm
probably missing something you mean.

I find it hard to see how all kinds of config files in /etc could be
made non-7-bit-ASCII without major parsing pain. To me, config file
tokens should be in 7-bit ASCII because the content is more like program
code that only programmers should see, and any non-English configuration
should be done through an i18n-ized GUI, IMO (not having thought of
anything better).

_______________________________________________
wm-spec-list mailing list
wm-spec-list@gnome.org
http://mail.gnome.org/mailman/listinfo/wm-spec-list