Re: [Gtk-gnutella-devel]GTKG chooses wrong charset/encoding for filenames on disk

Christian Biere Wed, 28 Sep 2005 05:09:31 -0700

Haxe wrote:
> On Tuesday 27 September 2005 18:25, Christian Biere wrote:
> > It's not a bug.
 
> I think it's a POSIX violation, IIRC (see below).


No, I don't think so. Otherwise, please show me what part of POSIX
says anything about the encoding of filenames.
 
> > For almost all Unix-like systems the encoding of filenames is
> > completely irrelevant, they are handled as opaque binary byte
> > strings.

> I know, and I completely agree with this.
> This is why it's the application's responsibility to interpret this byte 
> string in the way the user wants.

Assuming you mean "display purposes" I agree.
 
> So there must be a way to tell applications how a given byte string 
> should be interpreted. This could theoretically be done by an 
> application-specific preference setting. But instead, POSIX suggests to 
> use a locale setting for this purpose.

Not quite. POSIX just says that the encoding affects the behaviour
of tolower, toupper, isalpha, isdigit and so on. It also says that
any other uses/effects of the LC_CTYPE variable are
implementation-defined. LC_ALL, if set, overrides LC_CTYPE; if
neither is set, the encoding is derived from LANG. I read the
former statement to mean that effects beyond the behaviour of
those ctype functions are still implementation-defined.
To the best of my knowledge, POSIX does not say anything more
about filenames other than that there might be a maximum acceptable
length and that they're NUL-terminated - although the latter is
implicit rather than explicit. At the moment, I can't tell whether
'/' slashes, i.e. path separators, are mentioned anywhere.
 
> > Actually, "[EMAIL PROTECTED]" does not imply any
> > character set. It's just a language preference.

> I think that's not true. A locale also contains information about 
> character encoding. This information is supposed to be taken from the 
> locale specified in LC_CTYPE (whereas LANG is the "general" language 
> preference). [EMAIL PROTECTED] implies iso-8859-15.

That's really implementation-defined, i.e. a property of the OS
you use or possibly even just its current configuration. That
means it could as well have UTF-8 or whatever as default encoding
for "[EMAIL PROTECTED]". If you check the output of "locale -a" or
your locale directory (/usr/share/locale, /usr/lib/locale or the
like) you will probably (not necessarily) find directory names like
de, de_DE, de_DE.ISO-8859-1, de_DE.ISO-8859-15. Of course the "@euro"
leaves only two options in practice: UTF-8 or ISO-8859-15.
Actually, POSIX does not even define the name of any encoding,
so some OSes accept iso8859_1 while others use ISO8859-1 or
ISO-8859-1 etc.
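
To illustrate the naming issue, here is a minimal Python sketch. It
assumes the common language[_territory][.codeset][@modifier] shape
for locale names, which is a convention rather than something every
system guarantees. split_locale_name() and same_codeset() are
hypothetical helpers made up for this example, and the locale name
used in the usage note is only illustrative.

```python
import codecs
import re

# Assumed shape: language[_territory][.codeset][@modifier].
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def split_locale_name(name):
    """Split a locale name into its parts; absent parts are None."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError("unrecognised locale name: %r" % (name,))
    return match.groupdict()


def same_codeset(a, b):
    """Compare two codeset spellings via Python's codec registry.

    This only reflects the aliases Python knows about, not what any
    particular OS accepts.
    """
    return codecs.lookup(a).name == codecs.lookup(b).name
```

With this, split_locale_name("de_DE.ISO-8859-15@euro") yields the
codeset "ISO-8859-15", while a name without a ".codeset" part yields
None for it, which is exactly the case where the encoding is left to
the system.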

Funnily enough, isalpha(), isdigit() etc. cannot work reliably
with multi-byte encodings like UTF-8 or EUC-JP because these
functions look only at single bytes. At the moment, I don't
know the standard way to figure out whether the current
locale uses a single-byte or multi-byte encoding, and I don't
know the multi-byte equivalents of those functions but
suspect there are none.
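
As a side note to the paragraph above: ISO C does define
wide-character counterparts such as iswalpha() in <wctype.h>, plus
MB_CUR_MAX and mbrtowc() for dealing with multi-byte sequences. The
per-byte problem itself can be shown with a small Python sketch.
Python's bytes.isalpha() knows ASCII letters only, so it serves here
as a stand-in for a per-byte isalpha() loop in the C locale; both
helper names are made up for the example.

```python
raw = "\u00e4".encode("utf-8")  # a-umlaut as two bytes: 0xc3 0xa4


def bytewise_isalpha(data):
    """Classify every byte on its own, as a per-byte loop would."""
    return [bytes([b]).isalpha() for b in data]


def decoded_isalpha(data, encoding):
    """Decode first, then classify whole characters."""
    return [ch.isalpha() for ch in data.decode(encoding)]
```

Looking at the bytes one at a time, neither half of the sequence is
a letter; only after decoding does the single character show up as
alphabetic.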

Keep in mind that the language specifier is rather about stuff
like date format, monetary format, number format and the language
used for messages. All of these can be set independently using the
respective LC_* variable. If you use LANG or LC_ALL instead, the
other values will be derived or overridden.
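
The lookup order for a single category can be sketched as follows.
This is a simplified model of the POSIX rule (LC_ALL first, then the
category's own variable, then LANG; empty values count as unset),
not a replacement for setlocale(). effective_locale() is a
hypothetical helper, and the "POSIX" fallback stands in for what is
really an implementation-defined default.

```python
def effective_locale(category, env):
    """Return the locale name that applies to one LC_* category.

    env is any mapping of environment variables, e.g. os.environ.
    """
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:  # skips both unset and empty variables
            return value
    return "POSIX"  # assumption: the real default is implementation-defined
```

So with LANG=de_DE and LC_CTYPE=en_US.UTF-8 the character handling
follows en_US.UTF-8, while messages, dates and so on still follow
de_DE.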

> > Further this is a very special case in which you're seemingly
> > interested in files with filenames that are compatible with
> > your locale encoding. If you're interested in files that have
> > Asian or Arabic filenames (non-ASCII-fied) for example, those
> > filenames couldn't be converted anyway. So either your other
> > tools still wouldn't handle those properly or you'd have to
> > live with bogus filenames containing mostly underscores or
> > some trash.
 
> I know. I know that this is bad, but it's my decision.

Sure, but if you let me stretch this statement, you could as well
demand that we implement a "connection flood" feature because
it's technically possible to a certain degree and "it's your
decision" whether you use it or not. ;)

> Or, more 
> accurately, it was the default when I installed my OS some years ago. 
> Many linux distros still use iso-charset locales for western languages, 
> and that's what users expect to "work" as good as possible, i.e. at 
> least work for those few characters that are contained in that charset. 
> In my case, I want it to work for ä, ö, ü and ß. For the moment, I can 
> live with japanese Kanji being converted to underscores. In fact, that 
> would even be somewhat helpful, since I can't type arbitrary 
> foreign-language filenames into my console.

Well you often don't need to type names either because you just
use a glob pattern or a GUI where you only have to click on
items. Also if you use a shell with tab-completion (like bash)
typing is not much of a problem as long as there are at least
some ASCII characters in the filename.

> 
> > That sucks but I still think it's better to keep the UTF-8. I'd
> > rather recommend to switch to UTF-8
> ...
> > Not because of Gtk-Gnutella but because UTF-8
> > is the future
> 
> Yes, I know, me too. Utf-8 _IS_ the future (at least for unix, on other 
> platforms it might be utf-16 or whatever unicode encoding).

Right, I meant Unicode; UTF-8 only in cases like this (where you
need something compatible with NUL-terminated C strings).

> But that's 
> not the point here. The point is that it has to work like the user 
> expects, and it must be possible to have different applications 
> interoperate. That's why standards exist.

I'd happily adopt any reasonable standard but I don't know of any
that covers our problem. I'd happily comply with "best current
practice" but don't know one off-hand which means everyone believes
in his own rules or experience and has his own preferences.

> > and those apps
> > should be fixed to allow UTF-8 *and* the locale encoding

> That is logically impossible.

You're right that sometimes UTF-8 data other than ASCII can look
like data with a different encoding and vice versa, though I
think false positives in UTF-8 detection are sufficiently rare.
There cannot be any perfect detection or solution as long as
character sets are mixed in the wild.

> If a filename contains the byte 0xc3 
> followed by a 0xa4, there is no way for the application to know whether 
> I mean "ä" (in utf-8) or "Ã?" (in iso-8859-15) or "?" (in eucjp). All 
> three are, in theory, perfectly reasonable.

Right, although in practice such strings are usually sufficiently
obscure in ISO-8859-* that they are not used in natural language.
The currency character in the example above is hardly used by anyone
in my experience - it's usually really a case of mis-declared UTF-8
or the effect of converting the EUR sign to a fallback: the
unspecified currency character. (Really, who the hell needs the EUR
sign? Thanks a lot, EU, for coming up with pointless "standards"
that cause more trouble than they solve.)
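
To make the 0xc3 0xa4 example concrete, here is a small Python
sketch that decodes the same two bytes under the three encodings
mentioned. The codec names are Python's spellings and
interpretations() is a hypothetical helper.

```python
raw = bytes([0xC3, 0xA4])


def interpretations(data, encodings=("utf-8", "iso-8859-15", "euc_jp")):
    """Decode the same bytes under each candidate encoding."""
    return {enc: data.decode(enc) for enc in encodings}


if __name__ == "__main__":
    for enc, text in interpretations(raw).items():
        # ascii() keeps the output readable on any terminal.
        print(enc, len(text), ascii(text))
```

All three decodes succeed, giving one character in UTF-8, two in
ISO-8859-15 and one in EUC-JP, so the bytes alone cannot tell you
which reading was intended.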

> I have to tell the application which encoding to assume, and this is
> in most cases done by a locale setting.

In the general case it might be better to check for the locale
encoding first and then check for UTF-8 encoding. But I really
think it depends on the application. At least for search results
in Gnutella, the inverse order should be used - accepting that
many vendors will not support Unicode for centuries to come.
Filenames of your local files could be handled differently.
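
The ordering question can be sketched in Python like this. It is
only an illustration, not what Gtk-Gnutella does internally:
decode_with_fallback() is a hypothetical helper and ISO-8859-15 is
merely assumed to be the locale encoding.

```python
def decode_with_fallback(data, encodings):
    """Try each encoding in order; return (text, encoding_used).

    If nothing fits, non-ASCII bytes are replaced and the encoding
    is reported as None rather than guessed.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return data.decode("ascii", errors="replace"), None


# Order for search results: UTF-8 first, then the (assumed) locale encoding.
SEARCH_RESULT_ORDER = ("utf-8", "iso-8859-15")
```

Note that a single-byte charset such as ISO-8859-15 accepts any byte
sequence, so in this simple form trying it first would never let
UTF-8 be detected. A "locale first" order therefore needs some
additional plausibility check to be of any use.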

> GTKG should convert filenames to the charset of the locale given in 
> LC_CTYPE before storing them to the disk. Characters that can't be 
> represented in that charset should be replaced by an underscore or the 
> like.

I can grant your request to encode to the locale encoding if all
characters are convertible. I'd prefer to keep it in UTF-8, but
the locale encoding seems to cause you fewer problems, and as
long as you don't change your locale encoding, Gtk-Gnutella won't
have a problem converting it back and forth.

However, converting UTF-8 filenames that cannot be represented in
full in your locale encoding is absolutely not an option for
Gtk-Gnutella. I don't even want a feature to do that optionally.
If you need that, please use some utility to "fix up" the filenames.
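
The rule I have in mind can be sketched as follows. This is Python
for brevity and not the actual Gtk-Gnutella code; the function name
is made up, and the caller is assumed to pass in the locale charset.

```python
def filename_for_disk(name, locale_charset):
    """Encode a filename for storage on disk.

    Uses the locale charset only if every character is representable;
    otherwise the name is kept as UTF-8, with no underscore or other
    lossy substitution.
    """
    try:
        return name.encode(locale_charset)  # strict: raises on any unmappable char
    except UnicodeEncodeError:
        return name.encode("utf-8")
```

A name containing only ä, ö, ü or ß ends up in ISO-8859-15 on such a
system, while a Japanese name stays in UTF-8 byte for byte.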

A world-wide network must use Unicode by all means.

-- 
Christian
