Q: Filesystem Encoding

2002-07-10 Thread Shlomi Tal

Hello Unicoders, I have a question about filesystems. I never use anything 
but ASCII characters in filenames, and I would like to know if it is still 
justified. Of the various filesystems in use, I know only that the Joliet 
CDFS uses UCS-2BE. What about FAT16, FAT32, NTFS and Linux Ext2?

In short: should I still stick to ASCII alone in filenames, or are there 
filesystems where I really don't have to anymore? Thanks in advance.






Re: Q: Filesystem Encoding

2002-07-10 Thread Jungshik Shin



On Wed, 10 Jul 2002, Shlomi Tal wrote:

 Hello Unicoders, I have a question about filesystems. I never use anything
 but ASCII characters in filenames, and I would like to know if it is still
 justified. Of the various filesystems in use, I know only that the Joliet
 CDFS uses UCS-2BE. What about FAT16, FAT32, NTFS and Linux Ext2?

  NTFS uses UTF-16LE, and Linux ext2, like most other Unix filesystems,
is encoding-neutral (encoding-blind) as long as a certain set of
restrictions is satisfied. UTF-8 satisfies them, which is why it used to
be called UTF-FSS, with FSS standing for 'file system safe'.
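
  To illustrate what 'file system safe' means here, a minimal Python
sketch follows. It assumes the restrictions in question are the usual
Unix ones (a filename component may not contain a NUL byte or a '/'
byte); the helper name is_fs_safe and the sample filename are made up
for the illustration.

```python
def is_fs_safe(encoded: bytes) -> bool:
    """Return True if the bytes contain neither NUL nor '/'.

    Assumption: these are the two bytes a traditional Unix filesystem
    treats specially inside a single filename component.
    """
    return b"\x00" not in encoded and b"/" not in encoded


name = "r\u00e9sum\u00e9-\ud55c\uae00.txt"  # Latin letters with accents plus Hangul

utf8 = name.encode("utf-8")
utf16le = name.encode("utf-16-le")

# In UTF-8 every byte of a non-ASCII character has its high bit set,
# so it can never collide with NUL (0x00) or '/' (0x2F).
utf8_safe = is_fs_safe(utf8)

# In UTF-16LE every ASCII character carries a 0x00 high byte, so the
# encoded name contains NUL bytes and fails the check.
utf16_safe = is_fs_safe(utf16le)

# All bytes of the encoded Hangul syllable are >= 0x80.
hangul_high_bits = all(b >= 0x80 for b in "\ud55c".encode("utf-8"))
```

  This is why an encoding-blind filesystem can store UTF-8 names
unchanged, while UTF-16 needs a filesystem (such as NTFS) that is
designed around it.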

 In short: should I still stick to ASCII alone in filenames, or are there
 filesystems where I really don't have to anymore? Thanks in advance.

  Definitely not on NTFS: there is no need to stick to ASCII there. As
for Linux ext2 (and most other Unix filesystems), unless you mix up UTF-8
and legacy encodings (which you wouldn't, because you have never used
non-ASCII), it's all right to switch to UTF-8 and use non-ASCII chars.
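
  A small Python sketch of what goes wrong when UTF-8 and a legacy
encoding are mixed on an encoding-blind filesystem. Latin-1 stands in
for "a legacy encoding" here, and the filename and the helper
decodes_as_utf8 are made up for the illustration.

```python
name = "caf\u00e9.txt"  # 'café.txt'

latin1_bytes = name.encode("latin-1")  # b'caf\xe9.txt'
utf8_bytes = name.encode("utf-8")      # b'caf\xc3\xa9.txt'


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the raw filename bytes are valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The filesystem stores both byte strings happily, as two different names.
same_bytes = latin1_bytes == utf8_bytes

# A tool that expects UTF-8 cannot decode the Latin-1 name ...
legacy_ok = decodes_as_utf8(latin1_bytes)

# ... and a tool that expects Latin-1 misreads the UTF-8 name.
misread = utf8_bytes.decode("latin-1")  # 'cafÃ©.txt'
```

  Pure-ASCII names are immune to this, because ASCII encodes to the
same bytes in both encodings.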

  Jungshik





Re: Q: Filesystem Encoding

2002-07-10 Thread Barry Caplan

At 08:43 AM 7/10/2002 -0400, Jungshik Shin wrote:
 In short: should I still stick to ASCII alone in filenames, or are there
 filesystems where I really don't have to anymore? Thanks in advance.

  Definitely/unconditionally no for NTFS. As for Linux ext2(and most other
Unix fs'), unless you mix up UTF-8 and legacy encodings (which you
wouldn't because you have never used non-ASCII), it's all right to switch
to UTF-8 and use non-ASCII chars.

But be aware that such filenames may or may not transfer intact *across*
file systems.
Not only that, but, although I haven't tested in detail for a while, I
would not be fully comfortable with middleware that is responsible for
managing file names across systems either, such as FTP, email
attachments, and Samba. Particularly in the case of FTP and email, just
because one client works does not mean another one will.

Also keep in mind that even if the file name transfers correctly, there
is no guarantee, except for ASCII characters, that the system will have
fonts to display the file name.

Barry Caplan
www.i18n.com





Re: Q: Filesystem Encoding

2002-07-10 Thread Frank da Cruz

Barry Caplan wrote:
 But be aware that such filenames may or may not be able to be transferred
 *across* file systems.  Not only that, but, although I haven't tested in
 detail for a while, I would not be fully comfortable with middleware that is
 responsible for managing file names across systems either, such as FTP...
 
For the record, the new Kermit FTP clients:

  http://www.columbia.edu/kermit/ftpclient.html

can be told to handle filenames (and contents of text files) any way you want,
converting them between the client and server character set, including UTF-8
(without the server having to know a thing about it).  Details here:

  http://www.columbia.edu/kermit/ckermit80.html#x3.7

- Frank




Re: Q: Filesystem Encoding

2002-07-10 Thread Jungshik Shin


On Wed, 10 Jul 2002, Barry Caplan wrote:

 At 08:43 AM 7/10/2002 -0400, Jungshik Shin wrote:
  In short: should I still stick to ASCII alone in filenames, or are there
  filesystems where I really don't have to anymore? Thanks in advance.
 
   Definitely/unconditionally no for NTFS. As for Linux ext2(and most other
 Unix fs'), unless you mix up UTF-8 and legacy encodings (which you
 wouldn't because you have never used non-ASCII), it's all right to switch
 to UTF-8 and use non-ASCII chars.

 But be aware that such filenames may or may not be able to be
 transferred *across* file systems.

  You're absolutely right. Another related problem is normalization.
For instance, MacOS X uses one normalization form (NF) while NTFS uses
another. I haven't dug up what's planned about this on the Unix
filesystem and NFS front. Some Unix filesystem-related APIs may have to
be extended to deal with NFs.
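
  The normalization problem can be sketched in Python with the standard
unicodedata module. The sketch makes no claim about which form each
system actually uses; NFC and NFD are picked only to show that the same
visible name can have two different code point sequences. The helper
same_name is hypothetical.

```python
import unicodedata

visible = "\u00e9.txt"  # 'é.txt' with a precomposed e-acute

nfc = unicodedata.normalize("NFC", visible)  # precomposed: U+00E9
nfd = unicodedata.normalize("NFD", visible)  # decomposed: 'e' + U+0301

# Same visible name, different code points and different UTF-8 bytes,
# so a byte-wise filename comparison treats them as two distinct files.
differ = nfc != nfd
nfc_len, nfd_len = len(nfc), len(nfd)


def same_name(a: str, b: str) -> bool:
    """Compare two filenames after normalizing both to one form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


equal_after_normalizing = same_name(nfc, nfd)
```

  A tool that copies files between systems with different conventions
would need a comparison like same_name, or a normalization step on
transfer, to avoid creating apparent duplicates.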

 Not only that, but, although I haven't tested in detail for a while,
 I would not be fully comfortable with middleware that is responsible for
 managing file names across systems either, such as FTP, email attachments,
 and Samba. Particularly in the case of FTP and email, just because one
 client works does not mean another one will.

  Samba 3.0 appears to support Unicode (see
http://sambaaxp.org/xamba_XP_2002/vergeichick.pdf). BTW, from my own
experience, I know that codepage-based (non-Unicode) support
in Samba 2.x works well between Win2k and Unix.

  As for email attachments, one should stick to IETF RFC 2231. Not all
email clients comply with RFC 2231 (Mozilla and Pine are among those
that do), but I think that's the best way to get your filenames across.
Even fewer web clients and servers abide by RFC 2231 when it comes to
the HTTP Content-Disposition header (the same header used for email
attachments). Actually, I haven't seen any: none of Mozilla 1.x,
Lynx 2.8, and MS IE 6 supports it. Hopefully, this will change
(e.g. http://bugzilla.mozilla.org/show_bug.cgi?id=155949).
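
  For illustration, here is roughly what an RFC 2231 filename parameter
looks like, using the email package from the Python standard library.
This is only a sketch: the filename is made up, and the exact header
text (for example, whether the value is quoted) may vary between Python
versions.

```python
from email.message import Message

filename = "na\u00efve file.txt"  # 'naïve file.txt', non-ASCII plus a space

msg = Message()
# Passing a (charset, language, value) tuple asks the email package to
# emit the extended "filename*=" form defined by RFC 2231.
msg.add_header(
    "Content-Disposition",
    "attachment",
    filename=("utf-8", "", filename),
)

# The header carries the charset and percent-encoded UTF-8 bytes,
# along the lines of: attachment; filename*=utf-8''na%C3%AFve%20file.txt
header = msg["Content-Disposition"]

# A compliant reader decodes the parameter back to the original name.
decoded = msg.get_filename()
```

  A client that does not implement RFC 2231 sees only an unfamiliar
"filename*" parameter, which is one reason such filenames get lost or
mangled.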

  Some IETF drafts and RFCs have been written about I18N of FTP
and are available at
http://www.ietf.org/html.charters/ftpext-charter.html.
This is by no means to say that one can use Unicode (UTF-8) for FTP
right now, except when one uses Kermit.

 Also keep in mind that even if the file name transfers exactly correct,
 there is no guarantee, except, for ASCII characters, that the system
 will have fonts to display the file name.

  Well, not being able to display a filename is a problem of a different
dimension than not being able to get it across intact. Moreover,
two parties exchanging filenames in, say, Chinese/Finnish/Thai/...
are likely to have the necessary fonts.

  Jungshik Shin