Haxe wrote:
> On Tuesday 27 September 2005 18:25, Christian Biere wrote:
> > It's not a bug.
> I think it's a POSIX violation, IIRC (see below).

No, I don't think so. Otherwise, please show me what part of POSIX says
anything about the encoding of filenames.

> > For almost all Unix-like systems the encoding of filenames is
> > completely irrelevant, they are handled as opaque binary byte
> > strings.
> I know, and I completely agree with this.
> This is why it's the application's responsibility to interpret this byte
> string in the way the user wants.

Assuming you mean "display purposes" I agree.

> So there must be a way to tell applications how a given byte string
> should be interpreted. This could theoretically be done by an
> application-specific preference setting. But instead, POSIX suggests to
> use a locale setting for this purpose.

Not quite. POSIX just says that the encoding affects the behaviour of
tolower, toupper, isalpha, isdigit and so on. It also says that any other
uses/effects of the LC_CTYPE variable are implementation-defined. If
LC_CTYPE is unset, the encoding is derived from LC_ALL or LANG, and I read
the former statement as meaning that effects beyond the behaviour of those
ctype functions are still implementation-defined. To the best of my
knowledge, POSIX does not say anything more about filenames other than that
there might be a maximum acceptable length and that they're NUL-terminated -
although the latter is rather implicit than explicit. At the moment, I can't
tell whether '/' slashes, i.e. path separators, are mentioned anywhere.

> > Actually, "[EMAIL PROTECTED]" does not imply any
> > character set. It's just a language preference.
> I think that's not true. A locale also contains information about
> character encoding. This information is supposed to be taken from the
> locale specified in LC_CTYPE (whereas LANG is the "general" language
> preference). [EMAIL PROTECTED] implies iso-8859-15.

That's really implementation-defined, i.e. a property of the OS you use or
possibly even just its current configuration.
That means it could as well have UTF-8 or whatever as default encoding for
"[EMAIL PROTECTED]". If you check the output of "locale -a" or your locale
directory (/usr/share/locale, /usr/lib/locale or the like) you probably
(not necessarily) find directory names like de, de_DE, de_DE.ISO-8859-1,
de_DE.ISO-8859-15. Of course the "@euro" leaves only two options in
practice: UTF-8 or ISO-8859-15. Actually, POSIX does not even define the
name of any encoding, so some OSes accept iso8859_1 while others use
ISO8859-1 or ISO-8859-1 etc.

Funnily enough, isalpha(), isdigit() etc. cannot work reliably with
multi-byte encodings like UTF-8 or EUC-JP because these functions look only
at single bytes. At the moment, I don't know what's the standard to figure
out whether the current locale uses a single-byte or multi-byte encoding
and I don't know the multi-byte equivalents for those functions but suspect
there are none.

Keep in mind that the language specifier is rather about stuff like date
format, monetary format, number format and language use for messages. All
of these can be set independently using the respective LC_* variable. If
you use LANG or LC_ALL instead, the other values will be derived or
overridden.

> > Further this is a very special case in which you're seemingly
> > interested in files with filenames that are compatible with
> > your locale encoding. If you're interested in files that have
> > Asian or Arabic filenames (non-ASCII-fied) for example, those
> > filenames couldn't be converted anyway. So either your other
> > tools still wouldn't handle those properly or you'd have to
> > live with bogus filenames containing mostly underscores or
> > some trash.
> I know. I know that this is bad, but it's my decision.

Sure but if you let me stretch this statement you could as well demand that
we implement a "connection flood" feature because it's technically possible
to a certain degree and "it's your decision" whether you use it or not.
;)

> Or, more
> accurately, it was the default when I installed my OS some years ago.
> Many linux distros still use iso-charset locales for western languages,
> and that's what users expect to "work" as good as possible, i.e. at
> least work for those few characters that are contained in that charset.
> In my case, I want it to work for ä, ö, ü and ß. For the moment, I can
> live with japanese Kanji being converted to underscores. In fact, that
> would even be somewhat helpful, since I can't type arbitrary
> foreign-language filenames into my console.

Well you often don't need to type names either because you just use a glob
pattern or a GUI where you only have to click on items. Also if you use a
shell with tab-completion (like bash) typing is not much of a problem as
long as there are at least some ASCII characters in the filename.

>
> > That sucks but I still think it's better to keep the UTF-8. I'd
> > rather recommend to switch to UTF-8
> ...
> > Not because of Gtk-Gnutella but because UTF-8
> > is the future
>
> Yes, I know, me too. Utf-8 _IS_ the future (at least for unix, on other
> platforms it might be utf-16 or whatever unicode encoding).

Right, I meant Unicode, UTF-8 only in cases like this (where you need
something compatible with NUL-terminated C strings).

> But that's
> not the point here. The point is that it has to work like the user
> expects, and it must be possible to have different applications
> interoperate. That's why standards exist.

I'd happily adopt any reasonable standard but I don't know of any that
covers our problem. I'd happily comply with "best current practice" but
don't know one off-hand which means everyone believes in his own rules or
experience and has his own preferences.

> > and those apps
> > should be fixed to allow UTF-8 *and* the locale encoding
> That is logically impossible.

You're right that sometimes UTF-8 data other than ASCII can look like data
with a different encoding and vice-versa.
Though I think false-positives in UTF-8 detection are sufficiently rare.
There cannot be any perfect detection or solution as long as character sets
are mixed in the wild.

> If a filename contains the byte 0xc3
> followed by a 0xa4, there is no way for the application to know whether
> I mean "ä" (in utf-8) or "Ã?" (in iso-8859-15) or "?" (in eucjp). All
> three are, in theory, perfectly reasonable.

Right, although in practice such strings are usually sufficiently obscure
in ISO-8859-* that they are not used in natural language. The currency
character in the example above is hardly used by anyone in my experience -
it's usually really a case of mis-declared UTF-8 or the effect of converting
the EUR sign (Really who the hell needs it? Thanks a lot EU for coming up
with pointless "standards" that cause more trouble than they solve.) to a
fallback: the unspecified currency character.

> I have to tell the application which encoding to assume, and this is
> in most cases done by a locale setting.

In the general case it might be better to check for the locale encoding
first and then check for UTF-8 encoding. But I really think it depends on
the application. At least for search results in Gnutella, the inverse order
should be used - accepting that many vendors will not support Unicode for
centuries to come. Filenames of your local files could be handled
differently.

> GTKG should convert filenames to the charset of the locale given in
> LC_CTYPE before storing them to the disk. Characters that can't be
> represented in that charset should be replaced by an underscore or the
> like.

I can grant your request to encode to the locale encoding if all characters
are convertible. I'd prefer to keep it in UTF-8 but the locale encoding
seems to cause you fewer problems, and as long as you don't change your
locale encoding, Gtk-Gnutella won't have a problem converting it back and
forth.
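
The two rules discussed above (when reading, try UTF-8 before the locale
encoding; when writing, use the locale encoding only if every character
survives) can be sketched roughly as follows. This is a minimal Python
illustration, not Gtk-Gnutella code, which is written in C. The function
names decode_filename and encode_for_disk are hypothetical, and the
ISO-8859-15 locale encoding is hard-coded as an assumption instead of being
queried from the running system.

```python
def decode_filename(raw: bytes, locale_encoding: str) -> str:
    """Interpret raw filename bytes: strict UTF-8 first, then the locale encoding."""
    try:
        # Strict decoding rejects most non-UTF-8 input, so false positives are rare.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the (assumed) locale encoding.
        return raw.decode(locale_encoding, errors="replace")


def encode_for_disk(name: str, locale_encoding: str) -> bytes:
    """Use the locale encoding only if it is lossless; otherwise keep UTF-8."""
    try:
        # Strict encoding fails as soon as one character is not representable.
        return name.encode(locale_encoding)
    except UnicodeEncodeError:
        # No underscores or other substitutes: the name stays in UTF-8.
        return name.encode("utf-8")


if __name__ == "__main__":
    LOCALE = "iso8859-15"  # assumption for this demo only

    # Fully representable in ISO-8859-15, so the locale encoding is used.
    print(encode_for_disk("Grüße.txt", LOCALE))  # b'Gr\xfc\xdfe.txt'

    # Not representable, so the UTF-8 bytes are kept unchanged.
    print(encode_for_disk("日本語.txt", LOCALE))
```

The ambiguous pair 0xc3 0xa4 is read as "ä" under this ordering, because it
is valid UTF-8, while a lone 0xe4 is not valid UTF-8 and is therefore
interpreted in the locale encoding.
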
However, converting UTF-8 filenames that cannot be represented in full in
your locale encoding is absolutely no option for Gtk-Gnutella. I don't even
want a feature to do that optionally. If you need that please use some
utility to "fixup" the filenames. A world-wide network must use Unicode by
all means.

--
Christian
