On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote:

> > It works because it relies
> > on iconv(3) to convert between the current locale codeset and UTF-16
> > (used internally by Mozilla) if/wherever possible. 'wc*to*mb/mb*to*wc'
> > is used only where iconv(3) is not available. Anyway, yes, that's
> > possible.
> Note that I'm not *opposed* to someone fixing e.g. Win32 being able to
> access Unicode names in NTFS/VFAT. What I'm opposed to is anyone
> thinking there are (a) easy (b) portable solutions. We are talking
> always of very OS and FS specific solutions.

OK. I'm sorry if I misunderstood you. You're absolutely right that we're
talking about very OS/FS-dependent issues.

> Win32 and Mac OS X are probably the most "well-off". For (other) UNIXy
> systems, I don't know.

I guess BeOS is in the same league as Win2k/XP [1] and Mac OS X. There,
everything should be in UTF-8.

> If one is happy
> with just using UTF-8 filenames, Perl 5.8 already can work fine. If one

I wish everybody were :-) on Unix. Fortunately, UTF-8 seems to be
catching on, judging from the 'emergence' of two 'file system
conversion' tools. See, for instance,
<http://osx.freshmeat.net/releases/144059/>.

> > If a user mixes multiple encodings/code sets in her/his file
> > system, that's not Perl's problem but her/his problem so that I don't
> > think that's a valid reason for not doing something reasonable.
>
> wants to use locales and especially some non 8-bit locales, well, Perl
> currently most definitely does not switch its "filename encoding" based
> on locales. Personally I think that's a daft idea... at least without
> a new specific (say) LC_FILENAME control-- overloading the poor LC_CTYPE
> sounds dangerous.

I don't see how introducing a new LC_* would help here. Whether it's
LC_CTYPE or LC_FILENAME, the problem is still there. Perhaps we need a
pragma to indicate which of the following is to be assumed about the
file system character encoding: 'locale', 'native', 'unicode', or
'user-specified'.

On Unix, 'locale' and 'native' would be identical, both meaning that
Perl should convert its internal Unicode to and from the codeset
returned by 'nl_langinfo(CODESET)'.
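
To make the four modes concrete, here is a rough sketch of how the
choice might be resolved on Unix. It is written in Python only for
illustration and is not a proposal for Perl's actual interface. The
helper name filesystem_codeset() and the MODES tuple are hypothetical;
the sketch assumes 'unicode' means UTF-8 on Unix and that
nl_langinfo(CODESET) is available.

```python
import locale

# The four modes named in the message; the tuple itself is hypothetical.
MODES = ("locale", "native", "unicode", "user-specified")


def filesystem_codeset(mode, user_codeset=None):
    """Hypothetical helper: name the codeset to assume for file names on Unix."""
    if mode not in MODES:
        raise ValueError("unknown mode: %r" % (mode,))
    if mode == "unicode":
        # Assumption: on Unix, 'unicode' is taken to mean UTF-8.
        return "UTF-8"
    if mode == "user-specified":
        if not user_codeset:
            raise ValueError("'user-specified' needs an explicit codeset")
        return user_codeset
    # 'locale' and 'native' coincide on Unix: ask the C library.
    try:
        # Adopt the locale named by the environment. This changes
        # process-wide state, which a real implementation might avoid.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale in the environment: keep the current one
    return locale.nl_langinfo(locale.CODESET)
```

Calling filesystem_codeset("locale") returns whatever name the C library
reports for the current locale, so the result depends on the
environment the process was started in.
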
Directly inspecting LC_CTYPE or other environment variables is a BAD
idea and should be used as a fallback only where nl_langinfo(CODESET) is
not supported. When converting to and from the 'native' encoding, Perl
should rely on the iconv(3) available on the system instead of its
internal 'encoding' converter.

However, there's a problem here. A lot of system admins on commercial
Unix install only the minimal set of iconv(3) modules. See
<http://bugzilla.mozilla.org/show_bug.cgi?id=202747#c18>. Therefore,
perhaps we should first try iconv(3) and then fall back to Perl's
'encoding'. There are other problems when using iconv(3) as well (e.g.
<http://bugzilla.mozilla.org/show_bug.cgi?id=197051>).

'unicode' on Unix means 'utf8'. 'user-specified' means whatever a user
wants to use.

On Windows, 'locale' means using the code page of the current system
locale. 'native' is UTF-16LE (but on Win 9x/ME, the character repertoire
would be limited to that of the system codepage). The same is true of
'unicode'.

On Mac OS X, 'locale', 'native' and 'unicode' would all mean the same
thing (UTF-8).

As for 'normalization', I have to think more about it. And so on......

I've been just thinking aloud, so you'll have to bear with some
incoherence.

Jungshik
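
The "try iconv(3) first, then fall back" order suggested above can be
sketched roughly as follows. This is only an illustration under stated
assumptions: it is in Python rather than Perl, it reaches the system
converter through the iconv(1) command-line utility instead of calling
iconv(3) directly, and Python's built-in codecs stand in for Perl's own
'encoding' converter. The helper name convert() is hypothetical.

```python
import subprocess


def convert(data, src, dst):
    """Hypothetical helper: convert bytes from codeset src to codeset dst."""
    try:
        # First choice: the system converter, reached through iconv(1).
        proc = subprocess.run(
            ["iconv", "-f", src, "-t", dst],
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        if proc.returncode == 0:
            return proc.stdout
        # A non-zero exit covers, among other things, a codeset whose
        # iconv module was never installed on this system.
    except OSError:
        pass  # iconv(1) is not installed or could not be started
    # Fallback: the built-in converter (standing in for Perl's own).
    return data.decode(src).encode(dst)
```

One caveat with this kind of fallback is that the two converters do not
always agree on codeset names, so a real implementation would probably
need a small table of aliases between them.
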