Edmund Grimly Evans wrote:

> > From the point of view of the kernel it's just a sequence of bytes (except
> > for '/').  From the point of view of the user the bytes form characters
> > with a specific meaning.  If you use the wrong character set, that meaning
> > is lost.  Conversion is required to keep the meaning.
> 
> The question is: where should this conversion be performed?
> 
> I suggest it should be performed in individual programs, if at all
> (I'm not sure it's worth implementing).

How can a program know which encoding is being used?  Especially on Unix,
where everything looks like a file (mounted floppies, tapes, etc.).
Perhaps with a new ioctl() call?  I wouldn't like that.

It's also a lot of work and hassle to build knowledge about file name
conversion into every program that handles file names.
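To illustrate the per-program approach, here is a minimal Python sketch (purely hypothetical, not from the original discussion): each program would have to guess the encoding, typically from the locale, and decide what to do when a name doesn't fit that guess.

```python
import locale

def decode_filename(raw: bytes) -> str:
    """Decode a raw file name using the locale's encoding.

    The kernel gives us only bytes; the program has to guess the
    character set.  When the bytes are not valid in the guessed
    encoding, fall back to escaping them, since there is no reliable
    way to find the real encoding.
    """
    encoding = locale.getpreferredencoding(False)
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Wrong guess: show the undecodable bytes as \xNN escapes.
        return raw.decode(encoding, errors="backslashreplace")
```

Every file-handling program would need some variant of this, which is exactly the duplication of effort being objected to here.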

> > I think the problem is clear: file names can be encoded in any character
> > set.  We need to know the character set used to do anything with those
> > names.  Thus the character set must be stored with the file system.
> 
> I disagree. What have file systems got to do with it?

It's the lowest common layer that has knowledge of the directory structure.

> For example, files under /home/tom/ might use ISO-8859-1, while files
> under /home/dick/ use UTF-8, but those might both be subdirectories of
> the same NFS file system mounted on /home. On the server /home might
> correspond to /export/home, where /export is an ext2 fs ...

On existing filesystems a mix of encodings can be used.  We can't change
that; the problem already exists, and we can't solve it by introducing
something new.  Even storing the encoding as part of the directory entry
won't help for existing systems, because that would already require a
change.  Conversion to one single encoding would be simpler, then.

> An ideal multi-encoding file browser might allow you to specify
> arbitrarily complex rules for deciding what encoding is in use for
> which file names. But personally, I don't think this is worth
> implementing: just use UTF-8.

True, if we are going to make changes, then let's introduce a filesystem
that has UTF-8 file names only.  That just requires one flag indicating that
all file names on the filesystem are UTF-8.  Simple and effective.

That doesn't solve the problem of conversion, but at least we know the
encoding for these files, so conversion becomes possible for them.  Thus it
makes the problem smaller.
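Once the encoding of a name is known, the conversion itself is trivial.  A sketch (illustrative Python, assuming the old names are known to be ISO-8859-1):

```python
def latin1_to_utf8(name: bytes) -> bytes:
    """Recode a file name from ISO-8859-1 to UTF-8.

    Every byte value is a valid Latin-1 character, so the decode step
    cannot fail; the hard part is *knowing* the name is Latin-1, which
    is exactly what the per-filesystem encoding flag would provide.
    """
    return name.decode("iso-8859-1").encode("utf-8")

# 0xE9 is 'é' in Latin-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
```

The point stands: the mechanics are easy, discovering the source encoding is the problem.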

> CDs are a special case, because they're read-only.

And they use a special kind of filesystem.  But other removable media have
the same problem.  If you write it on one system and read it on another,
hopefully you see the same file names.  If I read a tape I wrote last year
back onto a system that now uses UTF-8, can I still read those files, or do
I just get error messages that the file names are not valid UTF-8?
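The tape scenario boils down to a validity check.  A hypothetical Python sketch of what a UTF-8-only system would have to do with each restored name:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Check whether a file name restored from old media is valid UTF-8.

    A Latin-1 name like b'caf\\xe9' fails this check: 0xE9 starts a
    multi-byte sequence in UTF-8 but is not followed by a valid
    continuation byte.  The system then has to reject the name,
    escape it, or guess a conversion.
    """
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Names that fail the check are exactly the ones the question worries about: without a recorded encoding, the system cannot convert them, only refuse or mangle them.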

-- 
hundred-and-one symptoms of being an internet addict:
128. You can access the Net -- via your portable and cellular phone.

 ///  Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.moolenaar.net  \\\
(((   Creator of Vim - http://www.vim.org -- ftp://ftp.vim.org/pub/vim   )))
 \\\  Help me helping AIDS orphans in Uganda - http://iccf-holland.org  ///
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
