RE: Nicest UTF

Lars Kristan Mon, 13 Dec 2004 04:36:26 -0800

Marcin 'Qrczak' Kowalczyk wrote:
> > My my, you are assuming all files are in the same encoding.
>
> Yes. Otherwise nothing shows filenames correctly to the user.
UNIX is a multi-user system. One user can use one locale and might never see files from another user who uses a different locale. And users can even have filenames in the wrong locale in their own home directory, copied from somewhere. Perhaps only a letter here and there does not display correctly, but this doesn't mean the user can't use the file.

>
> > And what about all the references to the files in scripts?
> > In configuration files?
>
> Such files rarely use non-ASCII characters. Non-ASCII characters are
> primarily used in names of documents created explicitly by the user.
Rarely. So only rare systems will not boot after the conversion. And only rare programs will no longer work. Is that acceptable?

Plus, it might not be as rare as you think. It might be far more common in a country where not many people understand English and, on top of that, do not use Latin letters.

Also, a script (a UNIX batch file) may have an ASCII name, but what if it processes some user documents for some purpose and has a set of filenames hardcoded in it? What about MRU lists? What about documents that link to other documents?

Mass renaming is a dangerous thing. It should be done gradually and with utmost care. And during this period, everything should keep working. If not, users won't even start the process.

>
> > Soft links?
>
> They can be fixed automatically.
Ummmm, yes, not a good example. Except in case one decides to allow the user to select an option to use U+FFFD instead of failing the conversion. Then you need to be extra careful, rename any files that convert to a single name, and keep track of everything so you can use the right names for the soft links. But yes, it can be done. If, on the other hand, you adopt the 'broken' conversion concept, you can convert all filenames in a single pass and don't need to build lists of soft links, since you can convert them directly.
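
A minimal sketch of why the U+FFFD option needs that bookkeeping, in Python. The helper names and the sample filenames are made up for illustration: two distinct legacy names collapse to one name once every invalid byte becomes U+FFFD.

```python
from collections import defaultdict

def lossy_utf8_name(raw: bytes) -> str:
    """Decode a filename as UTF-8, replacing each invalid byte with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

def find_collisions(names):
    """Group raw filenames by their lossy conversion; keep groups of 2 or more."""
    groups = defaultdict(list)
    for raw in names:
        groups[lossy_utf8_name(raw)].append(raw)
    return {name: raws for name, raws in groups.items() if len(raws) > 1}

# Two ISO-8859-1 names that differ in one letter, plus a pure ASCII name.
names = [b"caf\xe9.txt", b"caf\xe8.txt", b"plain.txt"]
collisions = find_collisions(names)

for converted, raws in collisions.items():
    print(repr(converted), "<-", raws)
```

Every group reported here would have to be renamed by hand or by some numbering rule, and any soft link pointing at one of the originals would have to follow that rename.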

>
> > If you want to break things, this is definitely the way to do it.
>
> Using non-ASCII filenames is risky to begin with. Existing tools don't
> have a good answer to what should happen with these files when the
> default encoding used by the user changes, or when a user using a
> different encoding tries to access them.
Not really. On UNIX, it is all very well defined. A filename is a sequence of bytes which is only interpreted when it is displayed. You can place a filename in a script or a configuration file and the file will be identified and opened regardless of your locale setting.
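
To illustrate, a small Python sketch, assuming a typical Linux filesystem that accepts arbitrary bytes in names (some other systems reject names that are not valid UTF-8). The function name and the sample filename are hypothetical. The file is created, listed, and reopened purely by its bytes, with no decoding involved.

```python
import os
import tempfile

def roundtrip_raw_name(raw_name: bytes, payload: bytes):
    """Create a file under a raw byte name, list it, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        path = os.path.join(tmp_bytes, raw_name)
        with open(path, "wb") as f:
            f.write(payload)
        # Passing a bytes path makes os.listdir return undecoded bytes names.
        listed = os.listdir(tmp_bytes)
        with open(path, "rb") as f:
            data = f.read()
    return listed, data

# An ISO-8859-1 name that is not valid UTF-8.
listed, data = roundtrip_raw_name(b"caf\xe9.txt", b"hello")
print(listed, data)
```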

People like you and me avoid non-ASCII filenames. But not all users do.

> Mozilla doesn't show such filenames in a directory listing. You
> may consider it a bug, but this is a fact. Producing non-UTF-8 HTML
> labeled as UTF-8 would be wrong too. There is no good solution to
> the problem of filenames encoded in different encodings.
There is no good solution. True. And I am trying to find one. And yes, I would consider that a bug. They should probably use some escaping technique. And, funny thing, you would probably accept the escaping technique. But if you think about it, it is again representing invalid data with valid Unicode characters. And if un-escaping needs to be done, it introduces all the problems that you are pointing out for my 'broken' conversion. So, think of my 128 codepoints as an escaping technique. One with no overhead. One with little possibility of confusion. One that can be standardized, so that whoever comes across it will know exactly what it is. Which is definitely not true if we let each application devise its own escaping, with no way for them to interoperate.
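
A rough sketch of what such an escaping looks like in practice, in Python. It uses Python's built-in `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to one of 128 code points (U+DC80..U+DCFF). That is a different mechanism from the codepoints proposed here and is used only as a stand-in with the same shape; the function names and sample filenames are hypothetical.

```python
def escape_name(raw: bytes) -> str:
    """Decode as UTF-8; each invalid byte becomes one reserved code point."""
    return raw.decode("utf-8", errors="surrogateescape")

def unescape_name(text: str) -> bytes:
    """Reverse escape_name, restoring the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")

latin1_name = b"caf\xe9.txt"              # ISO-8859-1, invalid as UTF-8
other_name = b"caf\xe8.txt"               # differs in one byte
utf8_name = "caf\u00e9.txt".encode("utf-8")  # already valid UTF-8

escaped = escape_name(latin1_name)

# Lossless: the original bytes come back unchanged.
print(unescape_name(escaped) == latin1_name)
# Distinct byte names stay distinct, unlike with U+FFFD.
print(escape_name(other_name) != escaped)
# Valid UTF-8 passes through untouched.
print(escape_name(utf8_name) == "caf\u00e9.txt")
```

One invalid byte becomes one code point, so there is no length overhead in characters, and the conversion can run in a single pass in either direction.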

> > As soon as you realize you cannot convert filenames to UTF-8, you
> > will see that all you can do is start adding new ones in UTF-8.
> > Or forget about Unicode.
>
> I'm not using a UTF-8 locale yet, because too many programs don't
> support it.
Like Mozilla. I am showing you a way programs can be made to work with UTF-8 faster and more easily. And really by fixing them, not by rewriting them. At least some programs, or some portions of programs. Then developers can concentrate on the things that do require extra attention, like strupr and isspace (or their equivalents).

> I'm using ISO-8859-2.
In fact you're lucky. Many ISO-8859-1 filenames display correctly in ISO-8859-2. Not all users are so lucky.

> But almost all filenames are ASCII.
Basically, you are avoiding the problem altogether. A wise decision. But it also means you don't know as much about this problem as I do.

Lars
