On 2007-10-18, Russell Shaw <[EMAIL PROTECTED]> wrote:
> I find it hard to see those problems because i rarely handle non-english
> text.

Which problems? The ones with present abstraction implementations
(wchar_t, locale), or the general unknown encoding fuckup?

> In the general-purpose editing applications i've made (like a word processor),
> any non-english text is passed out to a "black box" unicode layout processor
> plugin for things like paragraph formatting, and i can make it UTF-8 or UTF-32
> or whatever data encoding is convenient. I see "all UTF-8" as only applying
> between completely separate applications on the pc.

It applies to any software components trying to communicate. Things
like DBUS (iirc) and Cairo in their monoculturism require the use of
UTF-8 with their API. Those are the ones I have studied and become
disappointed with. There are probably many others (everything gnome
related?) as well.

> I've done hardly any non-english processing, but iirc, UTF-8 files are supposed
> to start with a magic number. If all text files were UTF-8, the magic number
> wouldn't be needed. I'm probably missing something you mean.

Text files on *nix do not tend to carry any information as to their
character encoding, or type in any other way either. They're randomly
assumed to either be ASCII + random bytes with high bit set, locale
encoding, or these days UTF-8, depending on the application. On Windows
AFAIK they do have some kind of unicode markers, and maybe there's some
standard about that, but any random text file on *nix tends to be in the
locale encoding without indicators if it was created on that system (by
that user) when the same locale was in use. But files from elsewhere can
use a different encoding, and some formats stored in plain text files
require a particular encoding to be used without indicating it anywhere
in the file (e.g. YAML).

> I find it hard to see how all kinds of config files in /etc could be made
> non 7-bit ascii without major parsing pain.
> To me, config file tokens should be
> in 7-bit latin because the content is more like program code that only
> programmers should see, and any non-english configuration should be done through
> an i18n-ized gui imo (not having thought of anything better).

A case could probably be made for config file tokens to be 7-bit ASCII.
But the files contain data strings as well, including things like
translations of menu items and such. Their encoding can be
application-specific, but wouldn't it be simpler for the file to specify
its encoding in a standard manner? Then arbitrary text editors can use
the right encoding (or convert to whatever encoding they please).

HTML/XML/etc. do, for example, tend to include a Content-Type or such
encoding specification, but unfortunately few text editors understand it
(and the SG/XML syntax generally sucks anyway and isn't suitable for
editing by text editors -- yet there's nothing better either -- and
could hence be binary). Arbitrary plain text files could include the
same information in a more easily accessible format. One rather hacky
and ugly option might be to use the -*- foo: bar; -*- syntax on the
first line, which some text editors already support. Another, cleaner
option could be based on storage of mime types on the file system.

...

But this is really drifting away from the topic of this thread and
perhaps even the whole list, and should perhaps be taken elsewhere.

-- 
Tuomo

_______________________________________________
wm-spec-list mailing list
wm-spec-list@gnome.org
http://mail.gnome.org/mailman/listinfo/wm-spec-list