On Tue, 23 Dec 2003, Nick Ing-Simmons wrote:

> Ed Batutis <[EMAIL PROTECTED]> writes:
> >> I don't think we understand common practice (or that such practices
> >> are even established yet) well enough to specify that yet.
Common practice is that file names on 'local disks' are assumed to be in the character encoding of the current locale. Of course, this assumption doesn't always hold and can break things with networked file systems and all sorts of other file systems, but what can Perl do about that other than offer some options/flexibility to let users do what they want? Perl users are supposed to be 'consenting adults' (maybe not in terms of physical age for some young users), so, given a set of options, they have to pick the one most suitable for the task at hand.

> Because we don't know how, because the "common practice" isn't established.

As I wrote, it was established well before Unicode came on the scene. It has little to do with UTF-8 or Unicode.

> If we "just fix it" now the behaviour will be tied down and when the
> "common practice" is established we will not be able to support it.

Let's not 'fix' it (i.e. not carve it in stone), but offer a few well-thought-out options. For instance, Perl might offer (not that these are particularly well thought out) 'just treat this as a sequence of octets', 'locale', and 'unicode'. 'locale' on Unix means the multibyte encoding returned by nl_langinfo(CODESET) or an equivalent. On Windows, it's whatever the 'A' APIs accept or ACP_??() returns. 'unicode' is UTF-8 on Unix-like OSes and BeOS, and UTF-16(LE) on Windows.

> When _I_ want Unicode named things on Linux I just put file names in UTF-8.

In that case, you're mixing two encodings on your file system by creating files with UTF-8 names while still using an en_GB.ISO-8859-1 locale. Why does Perl have to be held responsible for your intentional act, which is bound to break things? Because I didn't want to be restricted by the character repertoire of legacy encodings, I switched over to a UTF-8 locale almost two years ago.

> Suits me fine, but is not going to mesh with my locale setting because
> I am going to leave that as en_GB otherwise piles of legacy C apps get ill.
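(As an aside, you can check from the shell which codeset your locale maps to; `locale charmap` prints the same string a C program gets from nl_langinfo(CODESET). A rough sketch only; the 'café' file name is just an example of mine:)

```shell
# Print the codeset of the current locale -- the same string a C
# program gets from nl_langinfo(CODESET).
locale charmap

# In the POSIX/C locale, glibc reports a plain-ASCII codeset:
LC_ALL=C locale charmap

# A file name created as raw UTF-8 bytes keeps those bytes no matter
# what the locale says; under an ISO-8859-1 locale, 'ls' renders the
# 0xC3 0xA9 pair as two mojibake characters rather than as 'e-acute'.
touch "$(printf 'caf\303\251.txt')"
ls caf*
```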
Well, things are changing rapidly on that front.

> Now when I have samba-mounted a WinXP file system that is wrong, same for

Actually, if your WinXP file system has only characters covered by Windows-1252, you can use 'codepage=cp1252' and 'iocharset=iso8859-1' for smbmount/mount. Obviously, there's still a problem, because iso8859-1 is only a subset of Windows-1252. If you used en_GB.UTF-8 on Linux, there'd be no such problem, because you could use 'codepage=cp1252' and 'iocharset=utf8'.

> CDROMs most likely. This mess will converge some more - I can already
> see that happening.

UDF is the way to go for CD-ROM/DVD-ROM.

> _My_ gut feeling is that on Linux at least the way forward is to
> pass the UTF-8 string through -d - and indeed possibly "upgrade" to UTF-8
> if the string has high-bit octets.
> But you seem to be making the case that UTF-8 should be converted to
> some "local" multi-byte encoding - which is the "common practice" ?

That's because there are a lot of people like you who still use en_GB (ja_JP.eucJP, de_DE.iso8859-1, etc.) instead of en_GB.UTF-8 (ja_JP.UTF-8, de_DE.UTF-8) :-) On Linux their number is dwindling, but on Solaris and other Unixes (not that they don't support UTF-8 locales, but most system admins don't bother to install the necessary locales and support files) it's not decreasing as fast.

Jungshik
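P.S. In case it helps, the smbmount options I mentioned would be used roughly like this. This is only a sketch: the share name //winxp/c and the mount point are made-up examples, and the option names are those understood by the smbfs of current (2.4-era) Linux kernels.

```shell
# Linux side runs an ISO-8859-1 locale: server-side cp1252 names are
# converted to iso8859-1, so names outside Latin-1 cannot round-trip.
smbmount //winxp/c /mnt/winxp -o codepage=cp1252,iocharset=iso8859-1

# Linux side runs a UTF-8 locale: every cp1252 name is representable.
smbmount //winxp/c /mnt/winxp -o codepage=cp1252,iocharset=utf8
```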