Re: [zfs-discuss] path-name encodings

Marcus Sundman Wed, 05 Mar 2008 00:59:32 -0800

Bart Smaalders <[EMAIL PROTECTED]> wrote:
> Marcus Sundman wrote:
> > Bart Smaalders <[EMAIL PROTECTED]> wrote:
> >> UTF8 is the answer here.  If you care about anything more than
> >> simple ascii and you work in more than a single locale/encoding,
> >> use UTF8. You may not understand the meaning of a filename, but at
> >> least you'll see the same characters as the person who wrote it.
> > 
> > I think you are a bit confused.
> > 
> > A) If you meant that _I_ should use UTF-8 then that alone won't
> > help. Let's say the person who created the file used ISO-8859-1 and
> > named it 'häst', i.e., 0x68e47374. If I then use UTF-8 when
> > displaying the filename my program will be faced with the problem
> > of what to do with the second byte, 0xe4, which can't be decoded
> > using UTF-8. ("häst" is 0x68c3a47374 in UTF-8, in case someone
> > wonders.)
> 
> What I mean is very simple:
> 
> The OS has no way of merging your various encodings.  If I create a
> directory, and have people from around the world create a file
> in that directory named after themselves in their own character sets,
> what should I see when I invoke:
> 
> % ls -l | less
> 
> in that directory?


Either (1) programs can find out what the encoding is, or (2) programs
must assume the encoding is what some environment variable (or
somesuch) is set to.

In case (1) the OS doesn't have to "merge" anything; it just lets the
programs handle any conversions they see fit.

In case (2) the OS must transcode the filenames. If a filename can't be
represented in the target encoding then the offending characters must be
escaped.
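
To make the transcode-and-escape idea concrete, here is a minimal Python
sketch. It is only an illustration, not anything ZFS or Solaris actually
does: the function name transcode_name and the backslash escape format are
assumptions chosen for the example. Note that it only works when the source
encoding is known.

```python
def transcode_name(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Re-encode a filename, escaping characters the target cannot represent.

    Hypothetical helper for illustration; it assumes the caller knows
    which encoding the name was originally written in.
    """
    text = raw.decode(source_encoding)
    # 'backslashreplace' substitutes \xNN or \uNNNN escapes for any
    # character that has no representation in the target encoding.
    return text.encode(target_encoding, errors="backslashreplace")


# 'häst' written by an ISO-8859-1 program, shown to a UTF-8 user.
print(transcode_name(b"h\xe4st", "iso-8859-1", "utf-8"))

# A UTF-8 name containing a character that ISO-8859-15 lacks (U+65E5).
print(transcode_name("h\u00e4st \u65e5".encode("utf-8"), "utf-8", "iso-8859-15"))
```

The first call yields the UTF-8 bytes 0x68c3a47374; in the second, the
character that ISO-8859-15 cannot represent comes out as an escape sequence
while the rest of the name is transcoded normally.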


> If you wish to share filenames across locales, I suggest you and
> everyone else writing to that directory use an encoding that will work
> across all those locales.  The encoding that works well for this on
> Unix systems is UTF8, since it leaves '/' and NULL alone.

Again, that won't work. First of all, there is no way to force
programs to use UTF-8. I can't even force my own programs to do that.
(E.g., unrar or unzip or tar or 7z (can't remember which one(s)) just
dumps the filename data to the fs in whatever encoding it had inside
the archive, and I have at least one collaboration program that does
the same.) Now, if I force the fs to only accept filenames
compatible with UTF-8 (i.e., utf8only) then I risk losing files. I'd
rather have files with incomprehensible filenames than not have them at
all. OTOH, if I allow filenames incompatible with UTF-8 then my
programs can't necessarily access them if I use UTF-8. I could use some
8-bit encoding (e.g., ISO-8859-15), but I'd rather not, since the
world is moving to UTF-8 and I'd just be lagging behind. I would
then also get garbage filenames whenever someone else uses
UTF-8 or some other encoding. Also, I'm quite sure I do have files whose
names contain characters not in ISO-8859-15.
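
The dilemma can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: is_valid_utf8 mimics the idea behind
a utf8only-style check, not the actual ZFS implementation, and display_name
is a hypothetical helper showing one way a UTF-8 program could cope with a
name it cannot decode.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def display_name(name: bytes) -> str:
    """Decode for display, showing undecodable bytes as \\xNN escapes."""
    return name.decode("utf-8", errors="backslashreplace")


latin1_name = bytes.fromhex("68e47374")   # 'häst' in ISO-8859-1
utf8_name = bytes.fromhex("68c3a47374")   # 'häst' in UTF-8

# A strict filesystem would have to reject the first name outright.
print(is_valid_utf8(latin1_name), is_valid_utf8(utf8_name))

# A permissive one keeps the file, but the name is only partly readable.
print(display_name(latin1_name))

# 'surrogateescape' lets a program carry the raw bytes through a str
# and get exactly the same bytes back, so the file stays accessible.
as_text = latin1_name.decode("utf-8", errors="surrogateescape")
assert as_text.encode("utf-8", errors="surrogateescape") == latin1_name
```

Either way the program still cannot tell that 0xe4 was meant to be 'ä',
because nothing records which encoding the name was written in.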

So, you see, there is no way for me to use filenames intelligibly unless
their encodings are knowable. (In fact I'm quite surprised that ZFS
doesn't (and can't even) know the encoding(s) of filenames. Usually Sun
seems to make relatively sane design decisions. This, however, is more
what I'd expect from Linux, with its overpragmatic "who cares if it's
sane, as long as it kinda works" attitude.)


Regards,

Marcus
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
