Re: [zfs-discuss] path-name encodings

Marcus Sundman Tue, 04 Mar 2008 02:46:30 -0800

Bart Smaalders <[EMAIL PROTECTED]> wrote:
> Marcus Sundman wrote:
> > Bart Smaalders <[EMAIL PROTECTED]> wrote:
> >>> I'm unable to find more info about this. E.g., what does "reject
> >>> file names" mean in practice? E.g., if a program tries to create a
> >>> file using an utf8-incompatible filename, what happens? Does the
> >>> fopen() fail? Would this normally be a problem? E.g., do tar and
> >>> similar programs convert utf8-incompatible filenames to utf8 upon
> >>> extraction if my locale (or wherever the fs encoding is taken
> >>> from) is set to use utf-8? If they don't, then what happens with
> >>> archives containing utf8-incompatible filenames?
> >>
> >> Note that the normal ZFS behavior is exactly what you'd expect: you
> >> get the filenames you wanted; the same ones back you put in.
> > 
> > OK, thanks. I still haven't got any answer to my original question,
> > though. I.e., is there some way to know what text the filename is,
> > or do I have to make a more or less wild guess what encoding the
> > program that created the file used?
> 
> How do you expect the filesystem to know this?  Open(2) takes 3 args;
> none of them have anything to do with the encoding.


I don't expect the filesystem to know "this" (whatever you mean by
"this"). I don't expect the filesystem not to either. I just don't know,
and therefore I ask.

> > OK, if I use utf8only then I know that all filenames can be
> > interpreted as UTF-8. However, that's completely unacceptable for
> > me, since I'd much rather have an important file with an
> > incomprehensible filename than not have that important file at all.
> > Also, what about non-UTF-8 encodings? E.g., is it possible to know
> > whether 0xe4 is "ä" (as in iso-8859-1) or "ф" (as in iso-8859-5)?
> > 
> 
> There are two characters not allowed in filenames: NULL and '/'.
> Everything else is meaning imparted by the user, just like the
> contents of text documents.

You are confusing "characters" and "bytes". The former are encoded when
transformed to the latter. '/' is a character, 0x2f is a byte. (Well,
representations of a character and of a byte, respectively, if we're
nitpicking.)
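
To make the distinction concrete, here is a minimal Python sketch (an
illustration only, nothing ZFS-specific). It shows that the same
character turns into different bytes depending on which encoding is
chosen:

```python
# Characters become bytes only once an encoding is chosen.
slash = "/"
ascii_bytes = slash.encode("ascii")      # the single byte 0x2f
utf16_bytes = slash.encode("utf-16-be")  # two bytes: 0x00 0x2f
print(ascii_bytes.hex(), utf16_bytes.hex())    # 2f 002f

a_umlaut = "\u00e4"                            # the character "ä"
latin1_bytes = a_umlaut.encode("iso-8859-1")   # one byte
utf8_bytes = a_umlaut.encode("utf-8")          # two bytes
print(latin1_bytes.hex(), utf8_bytes.hex())    # e4 c3a4
```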

> >> The trick is that in order to support such things as
> >> casesensitivity=false for CIFS, the OS needs to know what
> >> characters are uppercase vs lowercase, which means it needs to
> >> know about encodings, and reject codepoints which cannot be
> >> classified as uppercase vs lowercase.
> > 
> > I don't see why the OS would care about that. Isn't that the job of
> > the CIFS daemon? 
> 
> If my program attempts to open file "fred" in a case-insensitive
> filesystem and "FRED" exists, I would expect to get a handle to
> "FRED".  In order for the filesystem to do this, the OS must be able
> to perform this comparison.

Well, yes, if the case-insensitivity is in the filesystem (and if the
fs is in the kernel), but my point was that it wouldn't _have_to_ be in
the filesystem. It's probably faster if it is, though.
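
As a rough sketch of what I mean by doing it outside the filesystem,
a user-space program could scan the directory and compare names
itself. The helper name find_case_insensitive is made up for this
example, and I am not claiming any CIFS server works this way:

```python
import os

def find_case_insensitive(directory, wanted):
    """Return every entry in `directory` whose name equals `wanted`
    when case is ignored. Hypothetical helper, for illustration only."""
    # With str arguments, os.listdir() decodes the stored bytes using the
    # filesystem encoding Python assumes, which is itself only a guess.
    key = wanted.casefold()
    return sorted(n for n in os.listdir(directory) if n.casefold() == key)
```

On a case-sensitive filesystem holding both "fred" and "FRED" this
returns both names, so the caller still has to pick one. Each lookup
also costs a full directory scan, which is why doing it inside the
filesystem is probably faster.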

> CIFS is in the kernel; case insensitivity is a property of the 
> filesystem, not a layer added on by a daemon.

You probably mean "CIFS is in (Open)Solaris" and "case insensitivity is
a property of ZFS".

> If not, I could create "fred" and "FRED" locally, and then which one
> would I get were I to open "FrEd" via CIFS?

I guess that would be up to the implementation (unless CIFS includes
it in its specification). 

> > As a matter of fact I don't see why the OS would need to
> > know how to decode any filename-bytes to text. However, I firmly
> > believe that user applications should have that opportunity. If the
> > encoding of filenames is not known (explicitly or implicitly) then
> > applications don't have that opportunity.
> 
> The OS doesn't care; the user does.  If a user creates a file named
> წყალსა in his home directory, but my encoding doesn't contain these 
> characters, what should ls -l display?

I assume we're assuming encodings to be known here. (If the encodings
are unknown/unspecified the user can't create a file named any
particular character string, only raw data (bits/bytes).) What a
particular program displays is up to the implementation, I guess. I've
seen programs use escapes (e.g., \uc3\ua5), or '?', or empty squares, or
small squares with hex-numbers in them. (I've also seen programs not
display the text at all (sometimes not displaying any text after the
offending part), or even crash.)
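
For illustration, Python's built-in decoder error handlers reproduce
several of these behaviours. This is only a sketch; the file name is
invented, and other programs will have their own conventions:

```python
raw = b"caf\xe9.txt"  # "café.txt" as written by an ISO-8859-1 program

# Three ways a UTF-8 program might render the undecodable byte 0xe9:
replaced = raw.decode("utf-8", errors="replace")          # U+FFFD marker
escaped = raw.decode("utf-8", errors="backslashreplace")  # caf\xe9.txt
dropped = raw.decode("utf-8", errors="ignore")            # caf.txt
print(replaced, escaped, dropped)

# Strict decoding raises instead; a program that does not catch the
# exception is the kind that crashes.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode:", exc.reason)
```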

However, we always have this problem when programs have to display
text, whether we know the encoding or not. Command-line programs might
pass the problem on to the terminal (as ls in OpenSolaris currently
seems to do), whereas graphical programs have to deal with it themselves.

So, while the OS might not care, the programs certainly do, especially
the graphical ones, since they can't let someone else deal with the
problem. (And yes, I know programs don't like to be anthropomorphized.)

> You also assume that knowing the encoding will transfer meaning...
> but a directory containing files named ᚠᚱᚩᚠᚢᚱ, ᛞᚩᛗᛖᛋ and ᚻᛚᛇᛏᚪᚾ may
> as well be line noise for most of us.

I assume no such thing. However, I firmly believe that knowing the
encoding of a bit sequence is the _only_way_ to _know_ what text that
bit sequence represents.
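
A small Python sketch of the 0xe4 case from earlier in the thread
shows the point: the byte is the same, and only the assumed encoding
decides which character comes out.

```python
raw = b"\xe4"

# Same byte, two different characters, depending on the assumed encoding.
readings = {enc: raw.decode(enc) for enc in ("iso-8859-1", "iso-8859-5")}
print(readings)  # {'iso-8859-1': 'ä', 'iso-8859-5': 'ф'}

# UTF-8 gives no reading at all: 0xe4 alone is an incomplete sequence.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```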

> The OS doesn't care one whit about language or encodings (save
> the optional upper/lower case accommodation for CIFS).  The OS simply
> stores files under names that don't contain either '/' or NULL.

I think you mean "[...]names that don't contain either 0x2F or 0x0",
which rules out characters such as 'A' in UTF-16 (0x0041).
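
Here is a short Python sketch of that byte-level rule. The helper
forbidden_bytes is made up for this example; it just reports which of
the two forbidden byte values show up once a string is encoded:

```python
FORBIDDEN = {0x00, 0x2F}

def forbidden_bytes(text, encoding):
    """Hypothetical helper: forbidden byte values found in the encoded text."""
    return sorted(b for b in set(text.encode(encoding)) if b in FORBIDDEN)

print(forbidden_bytes("A", "utf-8"))           # []
print(forbidden_bytes("A", "utf-16-be"))       # [0], bytes are 0x00 0x41
# U+012F (i with ogonek) is not '/', yet its UTF-16 form holds a 0x2f byte.
print(forbidden_bytes("\u012f", "utf-16-be"))  # [47], bytes are 0x01 0x2f
```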

> UTF8 is the answer here.  If you care about anything more than simple
> ascii and you work in more than a single locale/encoding, use UTF8.
> You may not understand the meaning of a filename, but at least
> you'll see the same characters as the person who wrote it.

I think you are a bit confused.

A) If you meant that _I_ should use UTF-8 then that alone won't help.
Let's say the person who created the file used ISO-8859-1 and named it
'häst', i.e., 0x68e47374. If I then use UTF-8 when displaying the
filename my program will be faced with the problem of what to do with
the second byte, 0xe4, which can't be decoded using UTF-8. ("häst" is
0x68c3a47374 in UTF-8, in case someone wonders.)
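
The same example as a minimal Python sketch, for anyone who wants to
check the bytes:

```python
name = "h\u00e4st"                  # "häst"
latin1 = name.encode("iso-8859-1")
utf8 = name.encode("utf-8")
print(latin1.hex())  # 68e47374
print(utf8.hex())    # 68c3a47374

# Reading the ISO-8859-1 bytes as UTF-8 fails at the second byte.
try:
    latin1.decode("utf-8")
    bad_offset = None
except UnicodeDecodeError as exc:
    bad_offset = exc.start
print(bad_offset)  # 1

# With the right encoding the name comes back intact.
print(latin1.decode("iso-8859-1") == name)  # True
```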

B) If you meant that _everybody_ should use UTF-8 then why would UTF-8
be "the answer"? Surely it's enough that everybody uses the same
encoding, whichever one that is.


Regards,

Marcus
