Re: perlunicode comment - when Unicode does not happen

Jungshik Shin Tue, 23 Dec 2003 06:46:52 -0800

On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote:

> > It works because it relies
> > on iconv(3) to convert between the current locale codeset and UTF-16
> > (used internally by Mozilla) if/wherever possible. 'wc*to*mb/mb*to*wc'
> > is used only where iconv(3) is not available. Anyway, yes, that's
> > possible.


> Note that I'm not *opposed* to someone fixing e.g. Win32 being able to
> access Unicode names in NTFS/VFAT.  What I'm opposed to is anyone
> thinking there are (a) easy (b) portable solutions.  We are always
> talking about very OS and FS specific solutions.

  OK. I'm sorry if I misunderstood you. You're absolutely right that
we're talking about very OS/FS-dependent issues.

> Win32 and Mac OS X are probably the most "well-off".  For (other) UNIXy
> systems, I don't know.

  I guess BeOS is in the same league as Win2k/XP [1] and Mac OS X.
There, everything should be in UTF-8.

> If one is happy
> with just using UTF-8 filenames, Perl 5.8 already can work fine.  If one

  I wish everybody were :-) on Unix. Fortunately, UTF-8 seems to be
catching on judging from the 'emergence' of two 'file system conversion'
tools. See, for instance, <http://osx.freshmeat.net/releases/144059/>.

> > If a user mixes multiple encodings/code sets in her/his file
> > system, that's not Perl's problem but her/his problem so that I don't
> > think that's a valid reason for not doing something reasonable.

> wants to use locales and especially some non 8-bit locales, well, Perl
> currently most definitely does not switch its "filename encoding" based
> on locales.  Personally I think that's a daft idea... at least without
> a new specific (say) LC_FILENAME control-- overloading the poor LC_CTYPE
> sounds dangerous.

 I don't see how introducing a new LC_* would help here. Whether
it's LC_CTYPE or LC_FILENAME, the problem is still there.

Perhaps we need a pragma to indicate which of the following is to be
assumed about the file system character encoding: 'locale', 'native',
'unicode', or 'user-specified'. On Unix, 'locale' and 'native' would be
identical, both meaning that Perl should convert its internal Unicode
to and from the codeset returned by 'nl_langinfo(CODESET)'. Directly
inspecting LC_CTYPE or other environment variables is a BAD idea and
should be used as a fallback only where nl_langinfo(CODESET) is not
supported. When converting to and from 'native' encoding, it should rely
on the iconv(3) available on the system instead of its internal 'encoding'
converter.  However, there's a problem here. A lot of system admins on
commercial Unix install only the minimal set of iconv(3) modules. See
<http://bugzilla.mozilla.org/show_bug.cgi?id=202747#c18>. Therefore,
perhaps we should first try iconv(3) and then fall back to using
Perl's 'encoding'. There are other problems when using iconv(3)
(e.g. <http://bugzilla.mozilla.org/show_bug.cgi?id=197051>).
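
To make the lookup order above concrete, here is a rough sketch, in Python rather than Perl purely for illustration. It prefers nl_langinfo(CODESET), inspects the locale environment variables only as a fallback, and tries a system converter before a built-in one. The names codeset_from_env, locale_codeset, system_converter, builtin_converter and encode_filename are hypothetical, and system_converter is only a stand-in for an iconv(3) binding with a minimal module set, not a real one.

```python
import codecs
import locale
import os


def codeset_from_env(env):
    """Fallback only: guess the codeset from the POSIX locale variables."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):  # POSIX precedence order
        value = env.get(var)
        if value:
            # "ko_KR.eucKR@modifier" -> "eucKR"; a bare "C" carries no codeset
            if "." in value:
                return value.split(".", 1)[1].split("@", 1)[0]
            return None
    return None


def locale_codeset(env=None):
    """Prefer nl_langinfo(CODESET); look at the environment only as a fallback."""
    if hasattr(locale, "nl_langinfo"):
        try:
            locale.setlocale(locale.LC_CTYPE, "")  # adopt the user's locale
        except locale.Error:
            pass  # unknown locale: stay in the "C" locale
        name = locale.nl_langinfo(locale.CODESET)
        if name:
            return name
    return codeset_from_env(os.environ if env is None else env)


def system_converter(text, codeset):
    """Hypothetical stand-in for iconv(3) with only a minimal set of modules."""
    if codeset.lower() not in ("utf-8", "iso-8859-1"):
        raise LookupError("no iconv module for " + codeset)
    return text.encode(codeset)


def builtin_converter(text, codeset):
    """Stand-in for the interpreter's own converter."""
    return codecs.encode(text, codeset)  # LookupError for unknown names


def encode_filename(text, codeset,
                    converters=(system_converter, builtin_converter)):
    """Try each converter in order; move on when one lacks the codeset."""
    for convert in converters:
        try:
            return convert(text, codeset)
        except LookupError:
            continue
    raise LookupError("no converter supports " + codeset)
```

With these stand-ins, encode_filename(name, "euc-kr") is rejected by the minimal system converter and handled by the built-in one. That is the case of a commercial Unix with few iconv(3) modules installed.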

  'unicode' on Unix means 'utf8'.  'user-specified' means whatever a
user wants to use. On Windows, 'locale' means using the code page of
the current system locale. 'native' is UTF-16LE (but on Win 9x/ME, the
character repertoire would be limited to that of the system codepage).
The same is true of 'unicode'.  On Mac OS X, locale, native and unicode
would all mean the same (UTF-8). As for 'normalization', I have to think
more about it. And so on......  I've just been thinking aloud, so
you'll have to bear with some incoherence.
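
The per-platform meanings of the four settings can be tabulated roughly as follows. This is only a sketch in Python of the mapping described here, not a proposal for the actual pragma interface. The function name filesystem_codeset, its parameters and the platform labels are all hypothetical, and the locale codeset or code page is assumed to be supplied by the caller.

```python
def filesystem_codeset(mode, platform, locale_codeset=None, user_codeset=None):
    """Map a pragma setting to the codeset assumed for file names.

    mode:     'locale', 'native', 'unicode' or 'user-specified'
    platform: 'unix', 'windows' or 'macosx' (hypothetical labels)
    """
    if mode == "user-specified":
        if not user_codeset:
            raise ValueError("'user-specified' needs an explicit codeset")
        return user_codeset
    if mode not in ("locale", "native", "unicode"):
        raise ValueError("unknown mode: " + mode)

    if platform == "macosx":
        return "UTF-8"  # locale, native and unicode all mean the same

    if platform == "windows":
        if mode == "locale":
            if not locale_codeset:
                raise ValueError("system code page not supplied")
            return locale_codeset  # code page of the current system locale
        return "UTF-16LE"  # 'native' and 'unicode'

    if platform == "unix":
        if mode == "unicode":
            return "UTF-8"
        if not locale_codeset:
            raise ValueError("locale codeset not supplied")
        return locale_codeset  # 'locale' and 'native' are identical

    raise ValueError("unknown platform: " + platform)
```

For example, filesystem_codeset("native", "unix", locale_codeset="EUC-KR") gives "EUC-KR", while the same mode on "windows" gives "UTF-16LE".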

   Jungshik
