Title: RE: UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

Edward H. Trager wrote:
> UTF-8's home directory).  So both users could probably guess
> the filename
> they were looking at.
Which, BTW, is true for most of Europe but is not true for some other combinations of locales.

>
>       d??claration_des_droits.utf8
>
> The terminal, being set to interpret the legacy locale, does not know
> how to interpret the two bytes that are used for the UTF-8 "é".

This is well known but is only the start of what the thread was discussing.

Your example only shows a difference in interpretation. You are still able to copy and paste the filename, use it in scripts and open it in any program.
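
A minimal sketch in Python of why that direction is harmless, assuming the legacy locale is Latin-1 (the quoted terminal showed "??", so its actual locale may have been a different one). Every byte is a valid Latin-1 character, so decoding never fails and the original bytes can always be recovered:

```python
# The quoted file name, as the bytes a UTF-8 locale would write to disk.
name_utf8 = "déclaration_des_droits.utf8".encode("utf-8")

# A Latin-1 reader maps each byte to one character, so decoding cannot fail.
# The two bytes of "é" (0xC3 0xA9) simply show up as two separate characters.
shown = name_utf8.decode("latin-1")

# Encoding the displayed text back to Latin-1 restores the exact bytes,
# which is why copy and paste still reaches the same file.
restored = shown.encode("latin-1")
```

The name looks wrong on screen, but nothing is lost: `restored` is byte-for-byte identical to `name_utf8`.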

Now switch your locale to Latin 1 and create a file with that name in Latin 1. Switch back to UTF-8 and try doing various things with this file. I assume the following happens:

1 - Instead of being misinterpreted, the letters are lost, which in extreme cases leads to empty filenames.
2 - You cannot open the file by copying its name from the terminal.
3 - You can probably still specify it in scripts (which need to be edited in Latin 1), but if someone started validating the script in a UTF-8 locale, you would lose that ability.
4 - Most C programs should be able to process the file. But I would not bet on some more 'advanced' languages: the more strictly they comply with Unicode, the less likely they are to open the file.
5 - Windows is likely to have problems accessing that file.
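
The first four points can be illustrated with a short Python sketch. It works on byte strings only and does not touch the file system; the file name is the one from the quoted example, and the choice of error handlers is an assumption about how a given tool might behave, not a description of any particular terminal or program:

```python
# The file name as the bytes a Latin-1 locale would write: "é" is the single byte 0xE9.
name_latin1 = "déclaration_des_droits".encode("latin-1")

# A strictly conforming UTF-8 decoder rejects the name outright (point 4).
try:
    name_latin1.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD or drops the byte. Either way the
# displayed text no longer maps back to the bytes on disk (points 1 and 2).
lossy = name_latin1.decode("utf-8", errors="replace")
dropped = name_latin1.decode("utf-8", errors="ignore")

# Extreme case: a name consisting only of such bytes becomes empty.
empty = "é".encode("latin-1").decode("utf-8", errors="ignore")

# Treating names as opaque bytes, as most C programs do, avoids the loss.
# Python's "surrogateescape" handler, which it uses for file names on POSIX,
# is one way to keep the bytes while still producing a str.
escaped = name_latin1.decode("utf-8", errors="surrogateescape")
roundtrip = escaped.encode("utf-8", errors="surrogateescape")
```

Only the last variant round-trips: `roundtrip` equals `name_latin1`, whereas neither `lossy` nor `dropped` can be turned back into the original bytes.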

And, yes, the solution is still to convert all filenames to UTF-8. That is, if all users on a particular system agree that this is what should be done with their files. But that does not prevent such files from being generated, whatever the reason or cause.
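
A rough sketch of such a conversion in Python. The helper names `to_utf8_name` and `convert_directory` are hypothetical, and the legacy encoding is assumed to be Latin-1. A name that is already valid UTF-8 is left alone, so a Latin-1 name whose bytes happen to form valid UTF-8 would be missed:

```python
import os

def to_utf8_name(raw: bytes, legacy_encoding: str = "latin-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it from the legacy encoding."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 was written in legacy_encoding.
        return raw.decode(legacy_encoding).encode("utf-8")

def convert_directory(path: bytes, legacy_encoding: str = "latin-1") -> list:
    """Rename the non-UTF-8 entries directly inside path; return (old, new) pairs."""
    renamed = []
    # Passing a bytes path makes os.listdir return raw bytes names.
    for raw in os.listdir(path):
        new = to_utf8_name(raw, legacy_encoding)
        if new == raw:
            continue
        src = os.path.join(path, raw)
        dst = os.path.join(path, new)
        # Skip rather than overwrite if the UTF-8 name is already taken.
        if not os.path.exists(dst):
            os.rename(src, dst)
            renamed.append((raw, new))
    return renamed
```

This only repairs names that already exist. A later session in a Latin-1 locale can create new non-UTF-8 names, so the conversion would have to be repeated.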


Lars
