Q: Filesystem Encoding
Hello Unicoders, I have a question about filesystems. I never use anything but ASCII characters in filenames, and I would like to know if it is still justified. Of the various filesystems in use, I know only that the Joliet CDFS uses UCS-2BE. What about FAT16, FAT32, NTFS and Linux Ext2? In short: should I still stick to ASCII alone in filenames, or are there filesystems where I really don't have to anymore? Thanks in advance.
Re: Q: Filesystem Encoding
On Wed, 10 Jul 2002, Shlomi Tal wrote:

> Hello Unicoders, I have a question about filesystems. I never use anything but ASCII characters in filenames, and I would like to know if it is still justified. Of the various filesystems in use, I know only that the Joliet CDFS uses UCS-2BE. What about FAT16, FAT32, NTFS and Linux Ext2?

NTFS uses UTF-16LE. Linux ext2, like most other Unix filesystems, is encoding-neutral (encoding-blind) as long as a few restrictions are met, chiefly that the bytes for NUL and '/' never occur as part of another character. UTF-8 meets them, which is why it used to be called FSS-UTF, with FSS standing for 'file system safe'.

> In short: should I still stick to ASCII alone in filenames, or are there filesystems where I really don't have to anymore? Thanks in advance.

Definitely and unconditionally no for NTFS: you need not stick to ASCII there. As for Linux ext2 (and most other Unix filesystems), unless you mix UTF-8 and legacy encodings (which you wouldn't, because you have never used non-ASCII), it's all right to switch to UTF-8 and use non-ASCII characters.

Jungshik
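The 'file system safe' property mentioned in the message above can be checked with a small Python sketch. It is an illustration only: the sample filenames and the helper name `reserved_bytes_in` are assumptions made for this example and are not from the thread.

```python
# Sketch: which bytes that a Unix filesystem treats specially show up
# when a filename is encoded? Sample names are illustrative assumptions.
RESERVED = {0x00, 0x2F}  # NUL terminates a name, '/' separates path components

def reserved_bytes_in(name: str, encoding: str) -> list:
    """Return the reserved byte values that occur in the encoded name."""
    return [b for b in name.encode(encoding) if b in RESERVED]

samples = ["résumé.txt", "日本語.txt", "한국어.txt"]

for name in samples:
    # UTF-8 encodes every non-ASCII character using only bytes 0x80-0xFF,
    # so NUL and '/' cannot appear unless the name itself contains them.
    assert reserved_bytes_in(name, "utf-8") == []

# A 16-bit encoding such as UTF-16LE embeds NUL bytes even for ASCII letters,
# which is why an encoding-blind Unix filesystem cannot store it directly.
utf16_hits = reserved_bytes_in("a.txt", "utf-16-le")
print(len(utf16_hits), "NUL bytes in the UTF-16LE form of 'a.txt'")
```

The filesystem itself never interprets the bytes here; the check only shows why UTF-8 names pass through an encoding-blind filesystem untouched.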
Re: Q: Filesystem Encoding
At 08:43 AM 7/10/2002 -0400, Jungshik Shin wrote:

>> In short: should I still stick to ASCII alone in filenames, or are there filesystems where I really don't have to anymore? Thanks in advance.
>
> Definitely/unconditionally no for NTFS. As for Linux ext2(and most other Unix fs'), unless you mix up UTF-8 and legacy encodings (which you wouldn't because you have never used non-ASCII), it's all right to switch to UTF-8 and use non-ASCII chars.

But be aware that such filenames may or may not survive being transferred *across* file systems.

Not only that: although I haven't tested in detail for a while, I would not be fully comfortable with the middleware responsible for carrying file names between systems either, such as FTP, email attachments, and Samba. Particularly in the case of FTP and email, just because one client works does not mean another one will.

Also keep in mind that even if the file name transfers exactly, there is no guarantee, except for ASCII characters, that the receiving system will have fonts to display it.

Barry Caplan
www.i18n.com
Re: Q: Filesystem Encoding
Barry Caplan wrote:

> But be aware that such filenames may or may not be able to be transferred *across* file systems. Not only that, but, although I haven't tested in detail for a while, I would not be fully comfortable with middleware that is responsible for managing file names across systems either, such as FTP...

For the record, the new Kermit FTP clients:

http://www.columbia.edu/kermit/ftpclient.html

can be told to handle filenames (and the contents of text files) any way you want, converting them between the client and server character sets, including UTF-8 (without the server having to know a thing about it). Details here:

http://www.columbia.edu/kermit/ckermit80.html#x3.7

- Frank
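The general idea of converting a filename between a client and a server character set can be sketched in a few lines of Python. This is not how Kermit implements it and it uses no Kermit commands; the function name `convert_filename` and the sample name are assumptions for illustration.

```python
# Sketch of client-side filename conversion between character sets.
# Hypothetical helper; it shows the idea only, not Kermit's implementation.
def convert_filename(raw: bytes, source_charset: str,
                     target_charset: str = "utf-8") -> bytes:
    """Re-encode filename bytes from one character set to another."""
    # Decode with the charset the name was stored in, then encode for the peer.
    return raw.decode(source_charset).encode(target_charset)

# A name as it might be stored on an ISO 8859-1 (Latin-1) system.
latin1_name = "caf\u00e9.txt".encode("iso-8859-1")

# The same name as a UTF-8 system would expect to receive it.
utf8_name = convert_filename(latin1_name, "iso-8859-1")
print(latin1_name, "->", utf8_name)
```

The conversion only works if the source character set is known; decoding bytes with the wrong charset silently produces a different name, which is the "mixing encodings" hazard raised earlier in the thread.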
Re: Q: Filesystem Encoding
On Wed, 10 Jul 2002, Barry Caplan wrote:

> At 08:43 AM 7/10/2002 -0400, Jungshik Shin wrote:
>
>>> In short: should I still stick to ASCII alone in filenames, or are there filesystems where I really don't have to anymore? Thanks in advance.
>>
>> Definitely/unconditionally no for NTFS. As for Linux ext2(and most other Unix fs'), unless you mix up UTF-8 and legacy encodings (which you wouldn't because you have never used non-ASCII), it's all right to switch to UTF-8 and use non-ASCII chars.
>
> But be aware that such filenames may or may not be able to be transferred *across* file systems.

You're absolutely right. Another related problem is normalization. For instance, Mac OS X stores file names in one normalization form (decomposed, on HFS+), while names on NTFS typically arrive in another (precomposed). I haven't dug up what's planned about this on the Unix filesystem and NFS front. Some Unix filesystem-related APIs may have to be extended to deal with normalization forms.

> Not only that, but, although I haven't tested in detail for a while, I would not be fully comfortable with middleware that is responsible for managing file names across systems either, such as FTP, email attachments, and Samba. Particularly in the case of FTP and email, just because one client works does not mean another one will.

Samba 3.0 appears to support Unicode (see http://sambaaxp.org/xamba_XP_2002/vergeichick.pdf). BTW, from my own experience, I know that the codepage-based (non-Unicode) support in Samba 2.x works well between Win2k and Unix.

As for email attachments, one should stick to IETF RFC 2231. Of course, not all email clients comply with RFC 2231 (Mozilla and Pine are among those that do), but I think that's the best way to get your filenames across. Even fewer web clients and servers abide by RFC 2231 when it comes to the HTTP Content-Disposition header (the same header used for email attachments). Actually, I haven't seen any: none of Mozilla 1.x, Lynx 2.8, and MS IE 6 supports it. Hopefully, this will change (e.g. http://bugzilla.mozilla.org/show_bug.cgi?id=155949).

Some IETF drafts and RFCs have been written about I18N of FTP and are available at http://www.ietf.org/html.charters/ftpext-charter.html. This is by no means to say that one can use Unicode (UTF-8) for FTP right now, except with Kermit.

> Also keep in mind that even if the file name transfers exactly correct, there is no guarantee, except, for ASCII characters, that the system will have fonts to display the file name.

Well, not being able to display a filename is a problem of a different dimension than not being able to get it across intact. Moreover, two parties exchanging filenames in, say, Chinese, Finnish, or Thai are likely to have the necessary fonts.

Jungshik Shin
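Two points from the message above, normalization forms and RFC 2231 parameter encoding, can be illustrated with Python's standard library. This is a minimal sketch: the sample filename and the helper `same_name` are assumptions for the example, and the code says nothing about what any particular filesystem or mail client actually does.

```python
import unicodedata
from email.utils import encode_rfc2231

# The same visible name in two normalization forms (sample name is illustrative).
composed = unicodedata.normalize("NFC", "r\u00e9sum\u00e9.txt")  # 'é' as one code point
decomposed = unicodedata.normalize("NFD", composed)              # 'e' + combining acute

# The byte sequences differ, so an encoding-blind filesystem sees two distinct names.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

def same_name(a: str, b: str) -> bool:
    """Compare two names after bringing both to a common form (NFC here)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# RFC 2231 extended parameter value: charset'language'percent-encoded-bytes
param_value = encode_rfc2231(composed, charset="utf-8")
header = "Content-Disposition: attachment; filename*=" + param_value
print(header)
```

Normalizing to one agreed form before comparing or transferring names avoids the mismatch; NFC is used here only as an example, and either form works as long as both sides agree.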