RE: UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

Lars Kristan Wed, 15 Dec 2004 04:17:24 -0800


Edward H. Trager wrote:
> UTF-8's home directory). So both users could probably guess the
> filename they were looking at.
Which, BTW, is true for most of Europe but is not true for some other combinations of locales.

>
> d??claration_des_droits.utf8
>
> The terminal, being set to interpret the legacy locale, does not know
> how to interpret the two bytes that are used for the UTF-8 "é".

This is well known but is only the start of what the thread was discussing.

Your example only shows a difference in interpretation. You are still able to copy and paste the filename, use it in scripts and open it in any program.

Now switch your locale to Latin 1 and create a file with that name in Latin 1. Switch back to UTF-8 and try doing various things with this file. I assume the following happens:

1 - Instead of being misinterpreted, letters are lost, which in extreme cases leads to empty filenames.
2 - You cannot open the file by copying its name from the terminal.
3 - You can probably still specify it in scripts (which need to be edited in Latin 1), but if someone started validating scripts in the UTF-8 locale, you would lose that ability.
4 - Most C programs should be able to process the file. But I would not bet on some more 'advanced' languages. The more strictly they comply with Unicode, the less likely they are to open the file.
5 - Windows is likely to have problems accessing that file.
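
The asymmetry between the two directions can be sketched in a few lines of Python 3. This is only an illustration under assumptions: Python's codecs stand in for whatever a terminal or a Unicode-aware program does with the bytes, and the name is the one from the quoted example with the "é" restored. No files are created.

```python
# Sketch only: Python's codecs stand in for a terminal or program that
# interprets filename bytes. The filename is the one from the quoted example.
name = "déclaration_des_droits.utf8"
utf8_bytes = name.encode("utf-8")      # "é" becomes the two bytes C3 A9
latin1_bytes = name.encode("latin-1")  # "é" becomes the single byte E9

# Direction 1: UTF-8 bytes read in a Latin 1 locale.
# Every byte maps to some character, so the name looks wrong but survives.
misread = utf8_bytes.decode("latin-1")
survives = misread.encode("latin-1") == utf8_bytes

# Direction 2: Latin 1 bytes read in a UTF-8 locale.
# E9 followed by "c" is not a valid UTF-8 sequence, so strict decoding fails.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD, and the original byte is gone for good.
replaced = latin1_bytes.decode("utf-8", errors="replace")
lost = replaced.encode("utf-8") != latin1_bytes

# The "surrogateescape" error handler keeps the bad byte in a reversible form.
escaped = latin1_bytes.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape") == latin1_bytes

print(survives, strict_ok, lost, restored)
```

In the first direction the bytes round-trip, which is why copy and paste still works. In the second direction a strict decoder rejects the name and a lenient one discards information. The last step shows one way a program can avoid the loss; as far as I know, Python 3 itself handles filenames on POSIX systems this way, but that is one language's choice and not something to count on elsewhere.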

And, yes, the solution is still to convert all filenames to UTF-8. That is, if all users on a particular system agree that this is what should be done with their files. But that does not prevent such files from being generated, whatever the reason or cause is.
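
A minimal sketch of such a conversion, again in Python 3 and again under assumptions: `to_utf8_name` is a hypothetical helper, the legacy encoding is taken to be Latin 1, and the function only computes the new name from raw bytes without renaming anything on disk.

```python
def to_utf8_name(raw: bytes, legacy: str = "latin-1") -> bytes:
    """Hypothetical helper: return a UTF-8 form of a raw filename.

    Assumes any name that is not valid UTF-8 was written in `legacy`.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        # Reinterpret the bytes in the legacy encoding, then re-encode.
        return raw.decode(legacy).encode("utf-8")


old = b"d\xe9claration_des_droits"  # Latin 1 bytes
new = to_utf8_name(old)
print(new)
```

The check is a guess, not a proof: a legacy name whose bytes happen to form valid UTF-8 would be left unconverted, and on a system with several legacy locales the choice of `legacy` has to come from the users, which is the agreement mentioned above.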

Lars
