Haxe wrote:
> A quote from the manual for the GNU C library (glibc):
> The ISO C standard defines functions to convert strings from a multibyte 
> representation to wide character strings. [...] The character set 
> assumed for the multibyte encoding is not specified as an argument to 
> the functions. Instead the character set specified by the LC_CTYPE 
> category of the current locale is used.

The problem with those functions is that the encoding used for
wide characters is implementation-defined and may even depend
on the current locale. So it might simply be EUC-JP codepoints,
and wchar_t could just as well be a plain 8-bit char on systems
that offer no real Unicode support. As long as Gtk+ 2.x is usable
on such systems, we can still fully support Unicode the
current way.

I really have no idea how to convert wchar_t * (or wint_t *) strings
to UTF-8 without that knowledge. Even if we blindly assume it's Unicode,
it may be UTF-8, UCS-2, UTF-16, UTF-32, or maybe even a system-specific
non-standard encoding of Unicode codepoints. Some systems definitely
use 16-bit integers for wchar_t and wint_t, while GNU glibc uses 32-bit
types and UTF-32. With respect to portability and interoperability, the
ISO C multi-byte support seems to cause more problems than it solves.

> So much for the technical question if a filename CAN be converted. Yes, 
> it can, the C functions obviously know which target encoding to convert 
> to. Now to the question if filenames SHOULD be converted, which is, as 
> it appears, unfortunately not only a technical question.

> Yes, in OS X, filenames are defined to always be utf-8, which is really 
> great. But in other unices, this is unfortunately not yet the common 
> practice. Instead, filenames are encoded in what ISO C defines to be 
> the "external encoding", i.e. the encoding specified by your locale. If 
> that happens to be utf-8, you're a lucky man.

Almost: most people can use UTF-8 as their locale encoding, so it's not a
question of luck. Of course, I would rather blame the OS vendor for
choosing a bad default than the users.

> > Okay, that's another trigger. Convince me but don't reason with
> > "like all other" because that is no reason.
 
> Unfortunately, this is a reason.

The reasons these "all others" have for doing so might be valid reasons,
but I'd never consider the fact that "all others do that" alone as
sufficient, at least for things I care about and disagree with. For
unimportant things you can surely simply do what the majority does.

> Different applications ought to interoperate. This must at least
> be considered.
 
> > Yes but maybe people from other places in the world would like to
> > listen to it as well. If they use some encoding which has "ä" they

Of course I meant there's a "no" missing in that sentence.

> > end up with a file called "M_usetanz.ogg" on their disk.
 
> If they optionally choose so, why not? If an ä can't be represented in 
> their charset, it means they are not using utf-8 locally. They are 
> instead using, for example, koi8-r, which is optimized for russian 
> cyrillic. So russian filenames will be more important to them than 
> german filenames. What would be more convenient for them?

They should use UTF-8 (or Unicode in general) instead, so
that they don't have problems with foreign strings/languages and
foreigners don't have problems with their strings. That's not
just about Gtk-Gnutella but a general issue. Not every application
can reasonably convert strings (e.g., FTP). Everybody can
learn how to read "Mäusetanz", Cyrillic, kanji or whatever, but you
can't really pronounce "M_usetanz". Keeping Gnutella in mind, it's
also bad because now "usetanz" would be added to the database as
a matchable term. If there are several non-convertible characters
in the filename, you can easily end up with search terms that are
matched often and then cause a lot of useless traffic and bogus
results. Sure, *you* rename your files properly, but I assume that
many people do not and that they also share their "completed"
directory, which means such files would be shared after an explicit
"rescan" or during the next session.

> Store all 
> names in koi8-r, losing one foreign "ä", but still be able to correctly 
> see all their downloaded russian files in their file manager? Or store 
> all names in utf-8, technically not losing any information but losing 
> the ability to read these filenames, especially their preferred russian 
> filenames? I bet they won't choose the latter. Why not at least give 
> the option?

Because they don't know what they're missing or doing. A set
LC_CTYPE does not convince me that they actually want their
filenames to be encoded that way. That's also why GLib 2.x seems
to have introduced a new variable, G_FILENAME_ENCODING, for this.
However, if OS vendors decide on their own to set this to the
LC_CTYPE equivalent and use a non-UTF-8 LC_CTYPE by default,
especially if the OS could handle Unicode as well, all effort is
wasted. I think most Unix systems use ASCII as the default, which
is likewise bad and nowadays a stupid "optimization".

> Note that you already have an option "Convert 'evil' characters (like 
> shell meta characters) to underscores in generated filenames".

Which is something "all others" do not. Should I remove it again?
I added it on my own initiative even though, personally, I have disabled
that option as well as the "convert spaces to underscores" option.

However, that has nothing to do with character sets, and with respect
to Gnutella you don't lose much, as Gnutella treats (most of) those
characters as equivalent to spaces anyway, so it doesn't hurt the
searchability of these files.

The reason I added the conversion of "evil characters" and made
it the default is that many (mostly new) Unix users have
absolutely no idea that there are dangerous characters that
must be escaped. The funny thing is, not even Acrobat or
Mozilla developers seem to know about this. Exploits based on
shell-character injection are probably the largest category after
buffer overflows. So I'd rather be safe than sorry and baby the
Gtk-Gnutella users. Turn it off if you don't care or don't
need it; it takes nothing more than 3 clicks. Well, it's
unfortunate for some characters and cases, but I didn't
invent the set of meta-characters used by shells. It's
certainly not a perfect solution/feature, nor can I think of
one.

Actually, that option also covers characters which are not
usable on FAT partitions.

> On Thursday 29 September 2005 05:32, Christian Biere wrote:
> > Then I download a file with an UTF-8 encoded filename with kanji
> > characters. The display is perfectly fine. The save-as dialog
> > shows the correct unmodified name. The download dialog however
> > shows question marks instead of the kanji characters. The
> > filename on the disk looks the same and these are really plain
> > ASCII question marks.

[..]

> 2.
> The fact that mozilla offers you to save your file as something 
> containing Kanji and not give a warning even if it could know better 
> that this will fail in locale "C" is perhaps bad user interface. 
> Perhaps you should be warned if you manually entered an "impossible" 
> filename. _BUT_: What you enter into a graphical save-as dialog are 
> characters, not bytes.

I'm not so sure about this. "Every string is UTF-8 encoded" is
rather a Gtk+ 2.x "limitation" (and that of some other GUI libraries);
Gtk+ 1.2 doesn't care. But X11 is also being extended in that direction,
and that's pretty logical if you keep in mind that X11 is
network-transparent, so server and client could easily disagree about
character sets - or other clients could, with respect to drag & drop or
copy & paste. Originally, however, X11 didn't care about the encoding
either.

> You cannot enter text encoded in utf-8 or in 
> euc-jp, you just enter text. The task to represent these characters 
> using bytes has to be done by the application. And if you told the 
> application (by a locale) to use plain ascii for this task, there is 
> simply no way to encode the Kanji characters. It just can't do better.

Almost. Personally, I never told Mozilla to use any particular encoding
for the filename. So if it had used the filename as given by the server,
that would still be perfectly correct in my opinion.

Well, it obviously converted the filename before storing the file. It's
the usual TOCTOU issue, or rather TOPTOU in this case (time of
check/presentation, time of use): it should simply have used the already
corrupted filename in the save-as dialog. Of course I didn't enter
the filename myself. The filename was suggested by the HTML document
or the web server, and during my checks the indicated character set
was clearly UTF-8, and Mozilla got that right until it actually created
the file. So I certainly won't consider Mozilla's behaviour a good
example of how things should be.

> 3.
> For most users who don't know of different encodings, "æ-¥æ?¬èª?.html" 
> won't be that much better than "___.html". They will later change both 
> filenames  manually anyway.

I think it depends. If they really use a graphical file manager - which
I don't - or pick files from a graphical control in any application,
they just click on filenames anyway, and something funny is just as good
as ______.html - except that the latter has less entropy.

> By the way, I'm experimenting with changing my whole system to utf-8, 
> which is definitely the only future-proof encoding in unix. If I really 
> do that, the whole problem discussed here will no longer be too 
> important for me personally. Of course I only do this because one day I 
> would have done so anyway.

Sure, but if no apps insist on or at least prefer UTF-8, fewer people
will switch to UTF-8. Another issue is that most English speakers,
who still dominate the Internet and software market, couldn't care
less because ASCII is the smallest common denominator, which means
few of them have any reason to use Unicode. Others use specific
character sets "like all others" in their region, and as long as
they don't care about foreign stuff, they are not happy to switch
either, because it seemingly causes nothing but problems without any
benefit. Of course, it's more difficult for this group because
their encodings are not a simple subset of UTF-8, which means
conversion will be required.

> But I still consider this a workaround, since many people in the world
> will continue to have these problems. I would still really love to see
> the proposed option in GTKG.

I can probably live with _______.html too, but I still think it's a
bad idea to do this by default, especially if the search results do
not show __________.html but something meaningful. That's why I
suggested the details pane stuff. We could show the filename that
would be used to store the file in there and possibly place a
checkbox "convert to locale encoding" next to it. Then users
would immediately see the effect [*] of their choice and possibly
decide against conversion. This decision would be permanent, not for
that file alone. I think it's better to place it there instead
of in the configuration dialog, because in the latter you don't see
the effect. Of course it could be mirrored in the dialog, in the
"Download" section.

[*] Of course, Gtk-Gnutella would internally convert the string
to the current locale - and back to UTF-8 for display purposes
in the Gtk+ 2.x case.

> PS: Using utf-8 in bash/konsole is sloooooooooww!!!!

I use this terminal for my Unicode needs:
http://sf.net/projects/gtkterm

xterm supports UTF-8 too, but for some reason the font rendering
is horribly broken here, and I cannot use Vim with UTF-8 in it
either. In gtkterm2 Vim works fine, so it's not Vim's fault. Sure,
the rendering seems to be a little slower, but at least it works.
I'm not a Gtk+ fan, but you can count on Gtk+ 2.x if you want/need
proper Unicode/UTF-8 support. Another drawback is that it uses a
rather huge amount of RAM and doesn't start as quickly as aterm or
xterm. However, its tabbed console feature seems to amortize that.

There's actually very little custom code in gtkterm; it's mostly a
front-end for VTE with a little Gtk+ 2.x code. So if gtkterm were
missing anything and I wanted to code the perfect no-nonsense terminal
emulator, I'd start from there. Other emulators are so old and sometimes
have such horrible coding style that it's no real pleasure to modify
them, or they require some desktop environment.

IMHO, it's a real shame that there are dozens of terminal emulators
with all kinds of features and gimmicks but hardly any that
support something as important (for professional use) as Unicode.

-- 
Christian
