Haxe wrote:
> But unfortunately, it's a bit more general than "display purposes". In 
> general, you need to know the source encoding whenever a string has to 
> be transformed to a specific destination encoding.

Exactly. And that's where things fall apart. LC_CTYPE has nothing to
do with the encoding of filenames. It's at best a probability
factor - but for example in Mac OS X it actually has no effect; UTF-8
is the one and only filename encoding. And at the moment I wish
every OS vendor had made the same design decision years ago.
Then we wouldn't even need to discuss this.

In GNOME, UTF-8 is the one and only filename encoding too, which is
derived from Gtk+. The environment variable G_BROKEN_FILENAMES
can be set to enable some workarounds for handling non-UTF-8 filenames.
IIRC, with that variable enabled, some Gtk+ routines (or maybe just apps)
will store files using your locale encoding for the filenames. If it
makes you happy we can do the same in Gtk-Gnutella - except when the
UTF-8 filename is incompatible with your encoding.

> > > I know. I know that this is bad, but it's my decision.

> > Sure but if you let me stretch this statement you could as well
> > demand that we implement a "connection flood" feature because
> > it's technically possible to a certain degree and "it's your
> > decision" whether you use it or not. ;)
 
> That is an unfair comparison. I cannot harm the net, i.e. other people, 
> if i locally convert my filenames to a format that enables me to read 
> some characters at the expense of some other characters.

Well it's not as evil as in my bogus example. However, if you change
the filenames to something meaningless, people won't be able to
find the file at your site.

That was actually more a reaction to your reasoning that "it's your
decision". If it's your code, it's your decision. Otherwise it's
not. ;)

> That's what I would do manually, anyway.

*That* is really your decision then.

> Having GTKG do this, like all other apps do it, is just a local
> convenience feature.

Okay, that's another trigger. Convince me but don't reason with
"like all other" because that is no reason.
 
> > I'd happily adopt any reasonable standard but I don't know of any
> > that covers our problem. I'd happily comply with "best current
> > practice" but don't know one off-hand which means everyone believes
> > in his own rules or experience and has his own preferences.
> 
> The best current practice seems to be exactly what I propose. That's why 
> I suggest this behaviour. I'm not inventing this, I steal it from other 
> apps. For example, graphical file managers like konqueror. Or for 
> example Mozilla Firefox.

I wouldn't consider Mozilla a particularly good example. Just because it
handles one or two standards better than MSIE and has tabbed browsing
doesn't make it an idol. Actually, it corrupts our minds. Just recently
someone added a feature to handle the unsolicited favicon requests of
Mozilla.

And with respect to storing files: No, it's bad. I've seen it show the
correct filename in the "save as" dialog and then store some corrupt
filename with question marks in it. And it does many odd things. I
have to admit it (almost) never ever crashes though.

> > > If a filename contains the byte 0xc3
> > > followed by a 0xa4, there is no way for the application to know
> > > whether I mean "ä" (in utf-8) or "Ã€" (in iso-8859-15) or "?" (in
> > > eucjp). All three are, in theory, perfectly reasonable.
> >
> > Right, although in practice such strings are usually sufficiently
> > obscure in ISO-8859-* that they are not used in natural language.
> > The currency character in the example above is hardly used by anyone
> > in my experience
 
> This is probably a misunderstanding caused by email transcoding. The 
> character in the example should be a Euro sign, not a general currency 
> sign.

> It would have been a general currency sign if interpreted as 
> iso-8859-1, which probably somehow happened to the email before you 
> received it (I sent it in utf-8). But in the exemplary iso-8859-15, it 
> is a Euro sign.
 
I'm using mutt from the console which doesn't support UTF-8. Keep
in mind that whine is still mostly water. I only looked at the
hex codes you've given. In ISO-8859-1 0xA4 is the generic currency
sign, in ISO-8859-15 it's the Euro symbol. They should have replaced
the $ instead, that would have been funnier.
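
To make the byte-level ambiguity concrete, here is a minimal sketch in
Python (chosen only for illustration; it is not what Gtk-Gnutella uses).
It decodes the same two bytes under several assumed encodings, using the
codec names from Python's standard library.

```python
# The two bytes from the example, decoded under different assumptions.
raw = bytes([0xC3, 0xA4])

as_utf8 = raw.decode("utf-8")          # one character: a-umlaut
as_latin1 = raw.decode("iso-8859-1")   # A-tilde + generic currency sign
as_latin9 = raw.decode("iso-8859-15")  # A-tilde + Euro sign
as_eucjp = raw.decode("euc_jp")        # one two-byte Japanese character

for label, text in [("utf-8", as_utf8), ("iso-8859-1", as_latin1),
                    ("iso-8859-15", as_latin9), ("euc-jp", as_eucjp)]:
    # ascii() escapes non-ASCII, so this prints safely on any console.
    print(f"{label:12} -> {ascii(text)} ({len(text)} character(s))")
```

All four decodings succeed, so the bytes alone do not tell you which
one was meant.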

> And even if you can reasonably assume a valid utf-8 string to not be 
> meant as iso-8859-* in practice, you can't do such guessing in the 
> aforementioned case of eucjp.

That might explain some of the odd strings Daichi sees. However,
he uses EUC-JP, we tried some test cases and it worked fine. I'm
really no character set expert. Maybe there are really too many
false-positives with respect to UTF-8 detection for some of
them. Just another reason to stick with UTF-8.
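
The kind of UTF-8 detection I have in mind can be sketched roughly as
below. This is a simplified illustration in Python, not the actual
Gtk-Gnutella check, and the function name is made up for the example.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Heuristic: True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# An ISO-8859-15 filename: the lone 0xE4 byte is not valid UTF-8.
latin9_name = "Mäusetanz.ogg".encode("iso-8859-15")
# A valid EUC-JP character that is also well-formed UTF-8.
eucjp_bytes = bytes([0xC3, 0xA4])

print(looks_like_utf8(latin9_name))
print(looks_like_utf8(eucjp_bytes))
```

The first case is rejected as expected, while the second is a false
positive: the heuristic cannot distinguish it from genuine UTF-8.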
 
> > Filenames of your local files could be handled differently.

> Yes, the local on-disk encoding of my file names has nothing to do with 
> the encoding used on the net. They have to be converted in between, if 
> they don't accidentally happen to be the same.

But we cannot know the source character set in any case.

> > I can grant your request to encode to the locale encoding if all
> > characters are convertible. I'd prefer to keep it in UTF-8 but
> > the locale encoding causes you less problems as it seems and as
> > long as you don't change your locale encoding, Gtk-Gnutella won't
> > have a problem to convert it back and forth.
> 
> GTKG even won't have a problem when I change my locale. That's the whole 
> point of the locale setting. When I change the encoding of my local 
> file name encoding, GTKG will know this because it is told by means of 
> a locale setting.

And then you rename all your files? If you have ISO-8859-15 encoded
filenames now and switch to some other locale encoding Gtk-Gnutella
cannot convert the filenames to UTF-8 anymore.
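
A small Python sketch of that failure mode, assuming the application
always interprets on-disk names using the current locale encoding. The
helper name and the choice of KOI8-R as the "other" locale are purely
illustrative.

```python
def disk_name_to_utf8(raw: bytes, locale_encoding: str) -> bytes:
    """Interpret on-disk bytes using the *current* locale encoding,
    then re-encode as UTF-8 for the network (hypothetical helper)."""
    return raw.decode(locale_encoding).encode("utf-8")

# File created while the locale encoding was ISO-8859-15.
on_disk = "Mäusetanz.ogg".encode("iso-8859-15")

ok = disk_name_to_utf8(on_disk, "iso-8859-15")
# Locale switched afterwards, but the file was never renamed.
wrong = disk_name_to_utf8(on_disk, "koi8-r")

print(ok == "Mäusetanz.ogg".encode("utf-8"))
print(wrong == ok)
```

With the original locale the round trip is exact. After the switch the
conversion still "succeeds" but produces a different name; with a UTF-8
locale the decode step would fail outright.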
 
> > A world-wide network must use Unicode by all means.
> 
> I think we still misunderstand each other.
> *Of course* you use unicode on the net. I never said anything different. 
> You just ought to use my local encoding on my local disk.

But as Gnutella user you and your shared files are part of a global
network. Its sole purpose is to share files world-wide. Germany
has many immigrants and also Germans who are just interested in
foreign content. So I'm sure there are people who look for content
from Arabic/Asian countries which have non-ASCII filenames. The
vast majority of people do not know what encoding their system
uses or what "locale encoding" is at all or what effects some
LC_* variable has. So they just use the default (which doesn't
seem to be UTF-8 in most cases) or if they really want a Euro sign,
they probably googled for it and ended up with ISO-8859-15. Thus,
if Gtk-Gnutella converted filenames to the locale encoding by
default, shredding any meaning out of them, Gtk-Gnutella would
effectively become a black hole for these cases.

> Imagine I locally create a file "Mäusetanz.ogg". I store it on disk 
> using iso-8859-15 for the filename, like I always do. Now I want to 
> share the file, so that other people can listen to my great music *g*.

Yes but maybe people from other places in the world would like to
listen to it as well. If they use some encoding which has no "ä" they
end up with a file called "M_usetanz.ogg" on their disk. Maybe
they would see "M?usetanz.ogg" by default otherwise but nothing
is lost. If they find out what causes this, they can just change
the encoding. They don't even need to change it permanently. If
you don't like UTF-8, you can decide this on a process-by-process
basis or at any finer granularity.
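
The difference between the two outcomes can be sketched as follows.
The underscore substitution policy and the helper name are assumptions
for the sake of the example, and KOI8-R merely stands in for "some
encoding without that character".

```python
def to_locale_lossy(name: str, locale_encoding: str) -> bytes:
    """Convert to the locale encoding, replacing each unrepresentable
    character with "_" (assumed substitution policy)."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(locale_encoding)
        except UnicodeEncodeError:
            out += b"_"
    return bytes(out)

name = "Mäusetanz.ogg"
lossy = to_locale_lossy(name, "koi8-r")  # the umlaut is gone for good
kept = name.encode("utf-8")              # stored as-is, nothing lost

print(lossy)
print(kept.decode("utf-8") == name)
```

The lossy name cannot be turned back into the original, whereas the
UTF-8 bytes still carry it even if a given terminal displays them badly.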

> I set LC_CTYPE to an iso-8859-15 locale, for example [EMAIL PROTECTED], so
> that GTKG knows that 0xE4 is supposed to mean "ä". GTKG can now internally 
> convert this name to utf-8, which it will use on the net. That's how it 
> works.

You handle it that way. I for example don't. I really don't care
about the encoding as long as I make something out of the filename.
If I download some music file with Gnutella, it ends up in my
download directory and "mplayer -shuffle *" takes care of my
needs. It doesn't even matter whether Gtk-Gnutella uses UTF-8
or any other encoding but if I share the file, the UTF-8 encoding
is a huge advantage.

My conclusion is:
If your locale encoding is not UTF-8 and a UTF-8 encoded filename
is compatible with your locale encoding and the environment variable
G_BROKEN_FILENAMES is set, Gtk-Gnutella shall convert the filename
to your locale encoding. Otherwise, the filename shall be kept as-is.
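
Spelled out as a rough Python sketch (not Gtk-Gnutella code; the
function name is hypothetical, and treating "set" as "present in the
environment" is my assumption here), the rule would look like this:

```python
import codecs

def filename_for_disk(utf8_name: bytes, locale_encoding: str,
                      env: dict) -> bytes:
    """Apply the proposed rule to a UTF-8 encoded filename."""
    # A UTF-8 locale needs no conversion at all.
    if codecs.lookup(locale_encoding).name == "utf-8":
        return utf8_name
    # Conversion is opt-in via G_BROKEN_FILENAMES.
    if "G_BROKEN_FILENAMES" not in env:
        return utf8_name
    # Convert only if every character is representable; else keep as-is.
    try:
        return utf8_name.decode("utf-8").encode(locale_encoding)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return utf8_name

env = {"G_BROKEN_FILENAMES": "1"}
name = "Mäusetanz.ogg".encode("utf-8")

print(filename_for_disk(name, "iso-8859-15", env))  # converted
print(filename_for_disk(name, "iso-8859-15", {}))   # kept: variable unset
print(filename_for_disk(name, "koi8-r", env))       # kept: incompatible
```

In a real program `os.environ` would be passed as `env`. The point of
the final fallback is that a name is never degraded: it is either
converted losslessly or left untouched.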

I'm not sure whether you care about incomplete files. You can actually
Drag & Drop all files shown in the Downloads pane to other apps like
Mozilla and Xine for example - which makes filename encoding less
of an issue. It would be easier if the conversion happens as a final
operation after the download has finished.

It's not so cool for bulk downloads but we could make the filename
editable from the details pane. That would allow you to save the file
under any name you wish. You'd still have to use G_BROKEN_FILENAMES
to enforce the character set conversion but you could shred all
characters you don't like from there.

-- 
Christian
