From: "Antoine Leca" <[EMAIL PROTECTED]>
Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a
*nix application that displays filenames needs to know the encoding to use
the correct set of glyphs (but the constraints are much heavier.) Also
Windows NT Unicode applications know it, because it can't be changed :-).

But when it comes to other Windows applications (still the most common) that
happen to operate in 'Ansi' mode, they are subject to the hazard of codepage
translations. Even if Windows 'knows' the encoding used for the filesystem
(as when it uses NTFS or Joliet, or VFAT on NT kernels; in the other cases
it does not even know it, much like with *nix kernels), the only usable set
is the _intersection_ of the set used to write and the set used to read;
that is, usually, it is restricted to US ASCII, very much like the usable
set in *nix cases...

True, but this applies to FAT-only filesystems, which store filenames in an "OEM" charset that is not recorded explicitly on the volume. This is a known caveat even on Unix: look at the tricky details of supporting Windows file sharing through Samba when a client requests a file by its "short" 8.3 name, which a partition used by Windows is supposed to support.


In fact, this nightmare comes from Windows' support for legacy DOS applications, which don't know these details and don't use the Win32 APIs with Unicode support. Note that DOS applications use an "OEM" charset which is part of the user settings, not part of the system settings (see the effect of the CHCP command in a DOS command prompt).

FAT32 and NTFS help reconcile these incompatible charsets because these filesystems also store an "LFN" (Long File Name) for the same files. In that case the short name, encoded in some ambiguous OEM charset, is just an alias: it acts exactly like a Unix hard link created in the same directory and referencing the same file. "LFN" names are UTF-16 encoded and support mostly the same names as on NTFS volumes.

However, on FAT32 volumes the short names are mandatory, unlike on NTFS volumes, where the filesystem driver can create them "on the fly" according to the current user setting for the OEM charset, without storing them explicitly on the volume. Windows includes, in CHKDSK, a way to verify that the short names on FAT32 filesystems are properly encoded with a coherent OEM charset, using the UTF-16 encoded LFN names as a reference. If needed, corrections for the OEM charset can be applied...

This nightmare of incompatible OEM charsets does happen on Windows 98/98SE/ME when the "autoexec.bat" file that defines the current user profile does not run the proper "CHCP" command as it should, or when this autoexec.bat file has been modified or erased: in that case, the default OEM charset (codepage 437) is used, and short filenames are incorrectly encoded.
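To make the failure concrete, here is a minimal Python sketch of my own (nothing Windows does internally). It assumes a short name written while the session used codepage 850 and read back after a fallback to codepage 437. The file name is hypothetical; the codec names are Python's standard ones.

```python
# Sketch: the same short-name bytes interpreted under two OEM codepages.
name = "ÁLBUM.TXT"  # hypothetical file name, for illustration only

# Bytes as a session running "CHCP 850" would store them (assumed setup).
stored = name.encode("cp850")

# The same bytes read back after the session fell back to codepage 437.
misread = stored.decode("cp437")

print(stored)           # raw bytes of the short name
print(ascii(misread))   # the accented letter comes back as another character
```

The ASCII part of the name survives the round trip; only the accented letter is damaged.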

Another complexity is that Win32 applications that use a fixed (not user-settable) "ANSI" charset and don't use the Unicode API depend on the conversion from the ANSI charset to the current OEM charset. But if a file is handled through directory shares by multiple hosts that have distinct ANSI charsets (e.g. Windows hosts running different localizations of Windows, such as a US installation and a French version on the same LAN), these hosts will create incompatible encodings on the same shared volume.
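The ANSI/OEM mismatch can be sketched the same way. The following Python snippet is only an illustration under assumed settings: a Western European host (ANSI codepage 1252, OEM codepage 850) and a second host whose ANSI codepage is 1251. The file name is hypothetical.

```python
# Sketch: bytes produced by an "ANSI" application, seen under other charsets.
name = "résumé.doc"  # hypothetical file name, for illustration only

# Bytes an ANSI application would produce on a cp1252 host (assumption).
ansi_bytes = name.encode("cp1252")

# The same bytes taken as OEM text without any ANSI-to-OEM conversion.
seen_as_oem = ansi_bytes.decode("cp850")

# The same bytes on a host whose ANSI codepage is 1251 (assumption).
seen_as_other_ansi = ansi_bytes.decode("cp1251")

for label, text in [("oem", seen_as_oem), ("other ansi", seen_as_other_ansi)]:
    print(label, ascii(text))
```

In both cases the accented letters come out as different characters, while the ASCII letters and the extension are unchanged.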

So the only "stable" subset for short names, one that is not affected by OS localization or user settings, is the intersection of all possible ANSI and OEM charsets that can be set in all versions of Windows! Needless to say, this leaves only the printable ASCII charset for short 8.3 names. Long filenames are not affected by this problem.
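The intersection can be checked by brute force. The Python sketch below only samples a handful of single-byte OEM and ANSI codepages (my own selection, not the full list that Windows supports) and keeps the byte values that decode to the same character in every one of them.

```python
# Sketch: which byte values mean the same character in every sampled codepage?
# The sample is an assumption; Windows supports many more codepages.
CODEPAGES = ["cp437", "cp850", "cp852", "cp866", "cp1250", "cp1251", "cp1252"]


def decode_byte(value, codepage):
    """Return the character for one byte, or None if the byte is undefined."""
    try:
        return bytes([value]).decode(codepage)
    except UnicodeDecodeError:
        return None


def stable_bytes(codepages):
    """Byte values that decode identically under all the given codepages."""
    stable = []
    for value in range(256):
        chars = {decode_byte(value, cp) for cp in codepages}
        if len(chars) == 1 and None not in chars:
            stable.append(value)
    return stable


stable = stable_bytes(CODEPAGES)
printable = "".join(chr(b) for b in stable if 0x20 <= b < 0x7F)
print(len(stable), printable)
```

With this sample, the printable part of the result is the printable ASCII range, which matches the conclusion above. Note that this says nothing about which ASCII characters are actually legal in an 8.3 name.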

Conclusion: to use international characters outside ASCII in filenames used by Windows, make sure that the name is not in an 8.3 short format, so that a long filename, in UTF-16, will be created on FAT32 filesystems or on SMBFS shares (Samba on Unix/Linux, Windows servers)... Or use NTFS (but then you must resolve the interoperability problems with Linux/Unix client hosts, which cannot yet reliably access these filesystems; nor are they completely emulated by the Unix filesystems used by Samba, due to limitations of the LanMan sharing protocol, and of Unix filesystems as well, which rarely use UTF-8 as their preferred encoding...)
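As a rough aid for applying this advice, here is a simplified Python sketch. The helpers `fits_8_3` and `is_risky` are hypothetical names of my own. The check only looks at the lengths of the base name and extension, and deliberately ignores case handling, reserved characters and reserved device names, so it is not the exact rule Windows applies.

```python
# Sketch: flag names that fit the 8.3 pattern and contain non-ASCII characters.
# Simplified on purpose: case, reserved characters and device names are ignored.


def fits_8_3(name):
    """Rough check: base of 1 to 8 characters, extension of at most 3."""
    if " " in name:
        return False
    base, dot, ext = name.rpartition(".")
    if not dot:                 # no dot at all: the whole name is the base
        base, ext = name, ""
    if "." in base:             # more than one dot cannot be an 8.3 name
        return False
    return 1 <= len(base) <= 8 and len(ext) <= 3


def is_risky(name):
    """True when the name fits 8.3 and contains non-ASCII characters."""
    return fits_8_3(name) and not name.isascii()


for candidate in ["CAFÉ.TXT", "README.TXT", "Café au lait.txt"]:
    print(ascii(candidate), is_risky(candidate))
```

A name flagged by `is_risky` is one where, following the reasoning above, only an OEM-encoded short name might end up being stored.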




