Re: Always setting UTF-8 flag - am I bad?

2004-08-05 Thread Jungshik Shin
On Thu, 5 Aug 2004, Nick Ing-Simmons wrote:

 Alright, I failed to say that this is an XS module, so I convert with
 WideCharToMultiByte, a Windows routine(*), put the result in an SV, and
 then say SvUTF8_on.

 The possible danger here is if the multi byte encoding for
 user's environment is not UTF-8 but (say) a Japanese one.

  Almost always (99.999% of the time, unless SetACP() or something similar
is used to change it), the default system code page on Windows is not
UTF-8 (it is Windows-1252 on Western European Windows, Windows-1251
on Russian Windows, Windows-932/936/949/950 on East Asian Windows, etc.).
However, you can specify the code page for the 'multibyte' encoding
to use when invoking WideCharToMultiByte (i.e. WideCharToMultiByte is
different from wcstombs() on a POSIX system). See

  
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp
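
  A rough Perl-level sketch of the distinction (hypothetical data; the XS
code itself is not shown in this thread): if the bytes came back in the
default ANSI code page rather than UTF-8, decoding them is the safe fix,
whereas turning the UTF8 flag on is only legitimate when CP_UTF8 was
explicitly requested from WideCharToMultiByte.

    use Encode qw(decode);

    # Hypothetical bytes from WideCharToMultiByte() with the *default*
    # ANSI code page on a Korean Windows box (cp949): U+AC00 is 0xB0 0xA1.
    my $bytes = "\xB0\xA1";

    # Safe: decode from the code page that was actually used.
    my $chars = decode('cp949', $bytes);

    # Unsafe: Encode::_utf8_on($bytes) / SvUTF8_on() would mark non-UTF-8
    # bytes as UTF-8.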


  Jungshik



Re: Unicode filenames on Windows with Perl >= 5.8.2

2004-06-22 Thread Jungshik Shin
Jan Dubois wrote:
On Mon, 21 Jun 2004, Steve Hay wrote:
 


I must confess that 2 doesn't really bother me since the 9x type
systems are now a thing of the past (XP onwards are all NT type
systems, even XP Home Edition).
   

While I also wish that Win 9x would just cease to exist, I don't think
any core Perl patches would be accepted if they would render Perl
inoperable on those systems. You would have to provide at least a
fallback solution, even if it means creating separate binaries for 9x
and NT Windows systems.
 

 JFYI, if using MSLU is problematic (for some reason), we may consider
http://libunicows.sourceforge.net/
It's released under MIT license.
 Jungshik



Re: AL32UTF8

2004-05-01 Thread Jungshik Shin
Tim Bunce wrote:
On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote: 

IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of 
Unicode) because they were storing higher plane codes using the 
surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in 
2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single 
char of 4+ bytes. There is no real trouble doing it that way since 
anyone can convert between the 'wrong' UTF-8 and the correct form. But 
they found that if you do a simple binary based sort of a string in 
AL32UTF8 and compare it to a sort in true UTF-8 you end up with a subtly 
different order. On this basis they made request to the UTC to have 
AL32UTF8 added as a kludge and out of the kindness of their hearts the 
UTC agreed thus saving Oracle from a whole heap of work. But all are 
agreed that UTF-8 and not AL32UTF8 is the way forward.


Um, now you've confused me.

The Oracle docs say "In AL32UTF8, one supplementary character is
represented in one code point, totalling four bytes.", which you
say is the correct UTF-8 way. So the old Oracle ``UTF8'' charset
is what's now called CESU-8, and what Oracle calls ``AL32UTF8''
is the correct UTF-8 way.
 So did you mean CESU-8 when you said AL32UTF8?

I guess so.

Thank you for reminding me of this. I used to know that, but forgot it 
and was about to write my colleague to use 'UTF8' (instead of 
'AL32UTF8') when she creates a database with Oracle for our project.

Oracle is notorious for using 'incorrect' and confusing character 
encoding names. Their 'AL32UTF8' is the true and only UTF-8 while 
__their__ 'UTF8' is CESU-8 (a beast that MUST be confined within Oracle 
and MUST NOT be leaked out to the world at large. Needless to say, it'd 
be even better had it not been born.)

Oracle has no excuse whatsoever for failing to get their 'UTF8' right 
in the first place because Unicode had been extended beyond the BMP a long 
time before they introduced UTF8 into their product(s) (let alone the 
fact that ISO 10646 had non-BMP planes from the very beginning in the 1980s 
and that UTF-8 was devised to cover the full set of ISO 10646). However, 
they failed, and in their 'UTF8' a single character beyond the BMP was (and 
still is) encoded as a pair of 3-byte representations of surrogate code 
points. Apparently for the sake of backward compatibility (I wonder how 
many instances of Oracle databases existed with non-BMP characters 
stored in their 'UTF8' when they decided to follow this route), they 
decided to keep the designation 'UTF8' for CESU-8 and came up with a new 
designation, 'AL32UTF8', for the true and only UTF-8.
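
For illustration only (this is not Oracle code), here is how a single
supplementary character comes out in real UTF-8 versus the CESU-8 form
that Oracle's old 'UTF8' charset produces:

    use Encode qw(encode);

    my $char  = "\x{10000}";                  # first character beyond the BMP
    my $utf8  = encode('UTF-8', $char);       # "\xF0\x90\x80\x80" (4 bytes)
    # CESU-8: the UTF-16 surrogate pair D800/DC00, each encoded in 3 bytes
    my $cesu8 = "\xED\xA0\x80\xED\xB0\x80";   # 6 bytes
    printf "UTF-8: %vX / CESU-8: %vX\n", $utf8, $cesu8;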

Jungshik



Re: Status of -C

2004-01-11 Thread Jungshik Shin
Paul Hoffman wrote:

Er, never mind. I found that I was doing something quite silly with 
the -C. All is OK, and it is now causing STDIN to be UTF8ish.
Would you mind sharing your experience? That way, others will be able to 
avoid repeating your mistake.

Jungshik



Re: perlunicode comment - when Unicode does not happen

2003-12-25 Thread Jungshik Shin
On Tue, 23 Dec 2003, Nick Ing-Simmons wrote:
 Ed Batutis [EMAIL PROTECTED] writes:
  I don't think we understand common practice (or that such practices
  are even established yet) well enough to specify that yet.

  Common practice is that file names on 'local disks' are assumed to be
in the character encoding of the current locale. Of course, this
assumption doesn't always hold and can break things with networked file
systems and all sorts of different file systems, but what could Perl do
about it other than offer some options/flexibility to let users do
what they want? Perl users are supposed to be 'consenting adults' (maybe
not in terms of physical age for some young users), so, given a set
of options, they can pick the one most suitable for them for a given
task.

 Because we don't know how, because the common practice isn't established.

  As I wrote, it was established well before Unicode came onto the
scene. It has little to do with UTF-8 or Unicode.

 If we just fix it now the behaviour will be tied down and when the
 common practice is established we will not be able to support it.

   Let's not 'fix' it (not carve it in stone), but offer a few
well-thought-out options. For instance, Perl may offer (not that these
are particularly well thought out) 'just treat this as a sequence of
octets', 'locale', and 'unicode'. 'locale' on Unix means the multibyte
encoding returned by nl_langinfo(CODESET) or equivalent.  On Windows,
it's whatever the 'A' APIs accept or is returned by ACP_??().  'unicode'
is utf8 on Unix-like OSes and BeOS, and 'utf-16(le)' on Windows.
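
  As a sketch of what the 'locale' option would boil down to on Unix
(using the core I18N::Langinfo module; none of this is an existing Perl
pragma):

    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);

    setlocale(LC_CTYPE, '');            # adopt the user's locale settings
    my $codeset = langinfo(CODESET);    # e.g. "UTF-8", "EUC-KR", "ISO-8859-1"
    print "file names assumed to be in $codeset\n";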

 When _I_ want Unicode named things on Linux I just put file names in UTF-8.

  In that case, you're mixing two encodings on your file system by
creating files with UTF-8 names while still using en_GB.ISO-8859-1
locale. Why does Perl have to be held responsible for your intentional act
that is bound to break things? Because I don't want to be restricted by
the character repertoire of legacy encodings, I switched over to UTF-8
locale almost two years ago.

 Suits me fine, but is not going to mesh with my locale setting because
 I am going to leave that as en_GB otherwise piles of legacy C apps get ill.

  Well, things are changing rapidly on that front.

 Now when I have samba-mounted a WinXP file system that is wrong, same for

  Well, actually, if your WinXP file system has only characters covered
by Windows-1252, you can use 'codepage=cp1252' and 'iocharset=iso8859-1'
for smbmount/mount.  Obviously, there's a problem because iso8859-1 is only a
subset of Windows-1252 (characters present in Windows-1252 but not in
iso8859-1 cannot be represented). If you use en_GB.UTF-8 on Linux, there'd be
no such problem because you can use 'codepage=cp1252' and 'iocharset=utf8'.

 CDROMs most likely. This mess will converge some more - I can already
 see that happening.

 UDF is the way to go in CD-ROM/DVD-ROM.


 _My_ gut feeling is that on Linux at least the way forward is to
 pass the UTF-8 string through -d - and indeed possibly upgrade to UTF-8
 if the string has high-bit octets.
 But you seem to be making the case that UTF-8 should be converted to
 some local multi-byte encoding - which is the common practice ?

  That's because there are a lot of people like you who still use en_GB
(ja_JP.eucJP, de_DE.iso8859-1, etc) instead of en_GB.UTF-8 (ja_JP.UTF-8,
de_DE.UTF-8) :-) On Linux, the number is dwindling, but on Solaris
and other Unix (not that they don't support UTF-8 locales but that most
system admins. don't bother to install necessary locales and support
files), it's not decreasing as fast.

   Jungshik


Re: perlunicode comment - when Unicode does not happen

2003-12-25 Thread Jungshik Shin

On Thu, 25 Dec 2003, Jarkko Hietaniemi wrote:

  locale. Why does Perl have to be held responsible for your intentional
  act that is bound to break things?

 Whoa!  It's the other way round here.  Nick is using a locale that suits
 him for other reasons (e.g. getting time and data formats in proper
 British ways), but why should he be constrained not to use for his
 filenames  whatever he wants?

  Then, he should switch to en_GB.UTF-8. Besides, he implied that
he still uses ISO-8859-1 for files whose names can be covered by
ISO-8859-1, which is why I wrote about mixing up two encodings
in a single file system _under_ his control.

  Moreover, why would you think that the en_GB.UTF-8 locale gives him
time and date formats NOT suitable for him? You're making the mistake of
binding locale to encoding. Encoding should never be a part of the
locale definition. The fact that it is on Unix is just an artifact of
the Unix file system, and we want to leave it behind us if possible. Of course,
we have to live with that for a long while to come, unfortunately.

Well, actually, if your WinXP file system has only characters covered
  by Windows-1252,

 And how would Nick know that, or he could he guarantee that, if the
 Windows share is in multiuser use?

  Of course, he can't. That's why I wrote 'if'.

 PLEASE, PEOPLE: stop thinking of this in terms of an environment
 controlled
 solely by one user.

  Before writing that, please read the man pages of 'smbmount' and
'mount' if a Linux system is available to you. Those are mount options,
not environment variables.

  Jungshik


Re: perlunicode comment - when Unicode does not happen

2003-12-25 Thread Jungshik Shin
On Thu, 25 Dec 2003, Jarkko Hietaniemi wrote:

 What I wish is that the whole current locale system would curl up and
 die.

  As you'd agree, it's only the 'encoding' part that has to die. Everybody
should switch to UTF-8 on Unix, and end-users should never have to worry about
'encoding'.  In an ideal world, 'encoding' would never be a part of
'locale'.  We're getting there, although very slowly.

nl_langinfo(CODESET) is rather well supported where it's
  available (i.e. SUS-compliant modern Unix platforms).

 That's not good enough for Perl.  Perl must also deal with
 non-SUS-compliant older UNIX or -like platforms.

  Sure, I'm well aware of that. Otherwise, I'd not have gone on to
mention gnulib and such.

  Jungshik


Re: perlunicode comment - when Unicode does not happen

2003-12-25 Thread Jungshik Shin
On Thu, 25 Dec 2003, Jungshik Shin wrote:

 locale definition. The fact that it is on Unix is just an artifact of
 Unix file system and we want to leave it behind us if possible. Of course,

 Of course, it's rather a whole lot of different
things that bind locale and encoding on Unix, from which we want to get
away asap.


Re: perlunicode comment - when Unicode does not happen

2003-12-25 Thread Jungshik Shin

On Thu, 25 Dec 2003, Jarkko Hietaniemi wrote:

  What I wish is that the whole current locale system would curl up and
  die.
 
As you'd agree, it's only 'encoding' part that has to die.

 Oh no, there are plenty of parts in it that I wish would die :-)

 Wishing it to die is different from finding a lot of defects
that you want to fix, isn't it? Sure, there are a lot of
things that can be done better. For quite a lot of them (not
all of them) ICU offers solutions.

 list of things to fix . snipped

  Everybody should switch to UTF-8 on Unix

 Yes.  UTF-8 and NFD, I would say.

  As much as I like NFD (well, I'd like it even better if
Korean NFD hadn't been made permanently broken between Unicode 2.x and
3.0), I don't think people will ever agree on the NFD part.

  Jungshik


Re: perlunicode comment - when Unicode does not happen

2003-12-25 Thread Jungshik Shin

On Thu, 25 Dec 2003, Jarkko Hietaniemi wrote:

  Whoa!  It's the other way round here.  Nick is using a locale that
  suits him for other reasons (e.g. getting time and data formats in
  proper British ways), but why should he be constrained not to use for his
  filenames  whatever he wants?
 
Then, he should switch to en_GB.UTF-8.

 That will work if there's en_GB.UTF-8 available for him in his
 particular Unixes and assuming using UTF-8 locales won't break other
 things.

 IIRC, he explicitly mentioned 'Linux' in his message. Besides,
Solaris, Compaq Tru64, AIX, and HP/UX [1] have all supported UTF-8 locales
for a 'long' time (some of them far, far longer than Linux/glibc has). In
the past, not all of those locales came free, but these days they all come
at no extra charge, so whether they're available depends on the
'will'/'policy' of the system administrators. Sure, there are a
number of other Unixes, old and new, and many old ones don't support UTF-8
locales.

I do want to respect people's wish to make UTF-8 file names on their file
systems even if their version of Unix doesn't support UTF-8 locales.
Otherwise, I wouldn't have come up with a set of 'options' Perl can
offer them.  However, people doing so should be aware that there's a
price to pay.  For instance, in their shell, file names would not be
shown correctly (i.e. 'ls' would show them garbled characters), and they
can't use the usual set of Unix tools (e.g. 'find' wouldn't work as intended).

  ISO-8859-1, which is why I wrote about mixing up two encodings
  in a single file system _under_ his control.

 I think we are here talking past each other :-)  I'm assuming that
 not all file systems (like Samba mounts) are necessarily under
 his control, you are assuming they

 Well, I think that's a different story. He explicitly wrote why
he still uses en_GB.ISO-8859-1 (like some old programs breaking under
UTF-8 locale).

Moreover, why would you think that en_GB.UTF-8 locale gives him the
  time and date format NOT suitable for him?

 I'm not thinking that.  What I think his point is is that plain
 en_GB.iso88591 is _enough_ for him to get time/date formats etc
 working right, but  en_GB.UTF-8 brings in _too much_ (such as some
 programs not yet being UTF-8 aware enough,

 What you had in parentheses was what he wrote in his original message,
but what you wrote didn't sound like that to me. At least, you picked a
bad example with the time/date format.

 or him wanting to use iso8859-1 file names in some directories, but in
 some directories not).

  Yes, that's what I meant. He made a conscious decision to
mix up two encodings (read his message. 'If I want Unicode characters
in file names, I'd just use UTF-8' or something like that), for which
he has to pay whatever price he has to pay.  If Perl offers a set of
options as I outlined in my previous message, he has to be careful when
opening files in different directories.  For some directories, he has to
use one option while for other directories, he has to use another option.


  You're making a mistake of binding locale and encoding.

 I'm not-- many UNIX vendors do, and I have to live with that fact.  If Linux
 and glibc are doing the Right Thing, that's marvelous, but not all the
 world is Linux and glibc.

 I never implied that, let alone said it. (I always prefer to say
Unix in place of Linux. To me, Linux is just one of many Unixes.) And
please check out recent commercial Unixes. They DO offer UTF-8 locales as I
wrote above (Solaris and AIX had offered solid UTF-8 locales years before
Linux/glibc did - actually, back when Linux/glibc 1.x had almost __zero__
locale support, UTF-8 or not). Whether they're installed by the system
admin is a different story. Anyway, exactly because of the unavailability
of UTF-8 locales for whatever reason, we've been discussing this issue
(converting Perl's internal Unicode to and from the 'native' encoding
in file I/O).

  The fact that it is on Unix is just an artifact of Unix file system

 Not quite.  UNIX doesn't care.  In traditional UNIX filenames are just
 bytes.

 You're absolutely right. I didn't mean to say 'file system' there
as I corrected in my subsequent email.


  PLEASE, PEOPLE: stop thinking of this in terms of an environment
  controlled solely by one user.
 
Before writing that, please read the man page of 'smbmount' and
  'mount' if Linux system is available to you. They're not environment
  variables.

 Please read my sentence again to see that I had no variable in it :-)
 Just environment.

 OK. Sorry for misreading it. Anyway, Perl can't help resolve that problem.
It can only offer a set of flexible options (as I listed in 'a few
messages ago') that help people solve the problem for themselves.

  Jungshik

[1] SGI Irix seems to lag behind in this area. FreeBSD was slow, but
seems to have caught up recently.


Re: perlunicode comment - when Unicode does not happen

2003-12-24 Thread Jungshik Shin
On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote:

   I don't see how introducing a new LC_* would help here. Whether

 Limit the mess of CTYPE controlling Yet Another Feature.

  I don't think it's yet another feature. It's one of the features
that's commonly assigned to it. Well, I guess you'd ask how
'commonly'...

 Anyway, introducing a new env. variable is not a solution to
the mess.  By doing so, you just add another problem, because a new
variable would be meaningful only to Perl, at least at the beginning.

  it's LC_CTYPE or LC_FILENAME, the problem is still there.

  to and from the codeset returned by 'nl_langinfo(CODESET)'.

 Don't get me started how suckily and brokenly nl_langinfo() is supported
 across platforms :-)  Well, CODESET may be on the average better
 supported.
 May.

  nl_langinfo(CODESET) is rather well supported where it's
available (i.e. SUS-compliant modern Unix platforms). The encoding/codeset
name mess is another issue, though.

 If Perl could use gnulib (a collection of small code snippets
that are meant to be included in the source code), 'nl_langinfo(CODESET)'
could be emulated where it's not available. However, I guess it can't
because GPL/LGPL is not suitable for Perl according to you.


  Directly inspecting LC_CTYPE or other environment variables is a BAD
  idea

 I can optimize that for ya: s/Directly inspecting/Using/ :-)

  I intentionally used the phrase because 'nl_langinfo(CODESET)'
is 'the' _indirect_ way to get to it (plus the resolution of LC_*/LANG
environment variable priority).
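
  A crude sketch of that fallback path (only for platforms without
nl_langinfo(); the 'US-ASCII' default is a placeholder, not anything Perl
actually does):

    # POSIX priority: LC_ALL overrides LC_CTYPE, which overrides LANG.
    my $locale = $ENV{LC_ALL} || $ENV{LC_CTYPE} || $ENV{LANG} || 'C';

    # "en_GB.UTF-8" -> "UTF-8"; "ko_KR.eucKR@dict" -> "eucKR"
    my ($codeset) = $locale =~ /\.([^@]*)/;
    $codeset = 'US-ASCII' unless defined $codeset && length $codeset;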

   Jungshik


Re: perlunicode comment - when Unicode does not happen

2003-12-23 Thread Jungshik Shin
On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote:

  (AFAIK) W2K and later _are able_ to use UTF-16LE encoded Unicode for
  filenames,
  but because of backward compatibility reasons using 8-bit codepages is
  much
  more likely.
 
No. _Both_ NTFS (only supported by Win 2k/XP) and VFAT (supported by
  Win 2k/XP and Win 9x/ME) use UTF-16LE **exclusively**. In that respect,

 (and that's probably well docum^Wpatented by Microsoft... :-)

   Well, the _internals_ of NTFS and VFAT are not well documented (and
are probably patented as well), so NTFS developers for the Linux kernel
have had to reverse-engineer them. However, the APIs for 'casually' accessing them
(including the fact that they use 'Unicode', with their use of 'Unicode' usually
meaning UTF-16LE or at least UCS-2LE) are documented well enough AFAIK.

 (How about CIFS?)

  I believe it, too,  uses UTF-16LE (or at least UCS-2). Samba developers
will know that well.

   FYI, Mac OS X 10.3 (or 10.2) or later has APIs for the conversion
  between NFC and NFD.

 I'm not worried about the various Unicode APIs being available.

  I just mentioned it because even on Mac OS X, you have to do
things differently (before 10.2 and after 10.2). After 10.2(?), you
can rely on OS APIs while before that you have to roll your own.

  Jungshik


Re: perlunicode comment - when Unicode does not happen

2003-12-23 Thread Jungshik Shin
On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote:

  It works because it relies
  on iconv(3) to convert between the current locale codeset and UTF-16
  (used internally by Mozilla) if/wherever possible. 'wc*to*mb/mb*to*wc'
  is only used  only where iconv(3) is not available. Anyway, yes, that's
  possible.

 Note that I'm not *opposed* to someone fixing e.g. Win32 being able to
 acces Unicode names in NTFS/VFAT.  What I'm opposed to is anyone
 thinking there are (a) easy (b) portable solutions.  We are talking
 always of  very OS and FS specific solutions.

  OK. I'm sorry if I misunderstood you. You're absolutely right that
we're talking about very OS/FS-dependent issues.

 Win32 and Mac OS X are probably the most well-off.  For (other) UNIXy
 systems, I don't know.

  I guess BeOS is in the same league as Win2k/XP [1] and Mac OS X.
There, everything should be in UTF-8.

 If one is happy
 with just using UTF-8 filenames, Perl 5.8 already can work fine.  If one

  I wish everybody were :-) on Unix. Fortunately, UTF-8 seems to be
catching on judging from the 'emergence' of two 'file system conversion'
tools. See, for instance, http://osx.freshmeat.net/releases/144059/.

  If a user mixes multiple encodings/code sets in her/his file
  system, that's not Perl's problem but her/his problem so that I don't
  think that's a valid reason for not doing something reasonable.

 wants to use locales and especially some non 8-bit locales, well, Perl
 currently most definitely does not switch its filename encoding based
 on locales.  Personally I think that's a daft idea... at least without
 a new specific (say) LC_FILENAME control-- overloading the poor LC_CTYPE
 sounds dangerous.

 I don't see how introducing a new LC_* would help here. Whether
it's LC_CTYPE or LC_FILENAME, the problem is still there.

Perhaps we need a pragma to indicate which of the following is to be
assumed about the file system character encoding: 'locale', 'native',
'unicode', or 'user-specified'. On Unix, 'locale' and 'native' would be
identical, both meaning that Perl should convert its internal Unicode
to and from the codeset returned by 'nl_langinfo(CODESET)'. Directly
inspecting LC_CTYPE or other environment variables is a BAD idea and
should be used as a fallback only where nl_langinfo(CODESET) is not
supported. When converting to and from the 'native' encoding, it should rely
on the iconv(3) available on the system instead of its internal 'encoding'
converter.  However, there's a problem here. A lot of system admins on
commercial Unix install only the minimal set of iconv(3) modules. See
http://bugzilla.mozilla.org/show_bug.cgi?id=202747#c18. Therefore,
perhaps, we should first try iconv(3) and then fall back to using
Perl's 'encoding'. There are other problems with using iconv(3)
(e.g. http://bugzilla.mozilla.org/show_bug.cgi?id=197051).

  'unicode' on Unix means 'utf8'.  'user-specified' means whatever a
user wants to use. On Windows, 'locale' means using the code page of
the current system locale. 'native' is UTF-16LE (but on Win 9x/ME, the
character repertoire would be limited to that of the system codepage).
The same is true of 'unicode'.  On Mac OS X, 'locale', 'native' and 'unicode'
would all mean the same thing (UTF-8). As for 'normalization', I have to think
more about it. And so on...  I've been just thinking aloud, so
you have to bear with some incoherency.
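
  Thinking aloud a bit further, on a Unix box the proposal might reduce
to something like this (no such pragma exists; this is only a sketch of
the mapping just described, with a made-up helper name):

    use Encode qw(encode);
    use I18N::Langinfo qw(langinfo CODESET);

    sub filename_encoding {
        my ($mode, $user_enc) = @_;     # $user_enc only for 'user-specified'
        return langinfo(CODESET) if $mode eq 'locale' || $mode eq 'native';
        return 'UTF-8'           if $mode eq 'unicode';
        return $user_enc;               # 'user-specified'
    }

    # Perl's internal (Unicode) view of a name -> bytes handed to open()
    my $enc   = filename_encoding('locale');
    my $bytes = encode($enc, "\x{AC00}\x{AC01}.txt");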

   Jungshik


Re: perlunicode comment - when Unicode does not happen

2003-12-23 Thread Jungshik Shin
On Tue, 23 Dec 2003, Nick Ing-Simmons wrote:

 Jungshik Shin [EMAIL PROTECTED] writes:
 On Mon, 22 Dec 2003, Jarkko Hietaniemi wrote:
 
  (AFAIK) W2K and later _are able_ to use UTF-16LE encoded Unicode for
  filenames,
  but because of backward compatibility reasons using 8-bit codepages is
  much
  more likely.
 
   No. _Both_ NTFS (only supported by Win 2k/XP) and VFAT (supported by
 Win 2k/XP and Win 9x/ME) use UTF-16LE **exclusively**.

 But those OSes also support older file systems (e.g. floppies),
 and shares where things are not as clear (at least to me).

  In the case of floppies (FAT), I guess we're just back to the old days :-)
In the case of CIFS, I really have to check. Then, Windows also supports
(although not for free) NFS and other file sharing ... things become fuzzy.


 In that respect,
 Windows filesystems are 'saner' than Unix file systems.  APIs for accessing
 them come in two flavors, 'A' APIs and 'W' APIs, though as I explained
 in another message of mine.

 In that message you mentioned a .dll - should perl look for and
 link to that DLL ?

 Actually, I mentioned three different possibilities. Only one of them
relies on MSLU (Microsoft Layer for Unicode). If you do that, you just
need a single binary that works across Win32 platforms. However,
the presence of MSLU is required.

 The second strategy is to do what Mozilla does: 1. write a set of
wrapper functions that emulate the Windows 'W' APIs, 2. detect the OS at
run-time (Windows 9x/ME vs Windows 2k/XP), 3. call either the emulated
versions of the 'W' APIs or the native 'W' APIs (I'm omitting details
here, but you should get the idea). This is actually similar to what's
done by MSLU, but you don't have to rely on MSLU.

 The final approach is to build two separate binaries, one for Win 9x/ME
(with 'A' APIs) and the other for Win 2k/XP (with 'W' APIs).

  In all three cases, the character repertoire (that can be used for
file names) on Win 9x/ME is limited to that of the system codepage. It
may sound odd because VFAT can cover the whole Unicode repertoire. Don't
ask me why, but that's the way Win 9x/ME works. That can explain why
Jarkko got confused.  If somebody hacks VFAT and writes her own VFAT I/O
functions, the full range of Unicode can be used even on Win 9x/ME.
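
  At the Perl level, the run-time check in the second strategy is simple
(the id value 2 for the NT family is from the Win32 module documentation;
the dispatching itself is only sketched):

    use Win32;

    my ($desc, $major, $minor, $build, $id) = Win32::GetOSVersion();
    my $has_wide_apis = ($id == 2);   # 2 => Windows NT/2000/XP family

    # A wrapper would then call the native 'W' API when $has_wide_apis is
    # true and fall back to the code-page-bound 'A' API (or an emulation)
    # on Win 9x/ME.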

  Jungshik


Re: perlunicode comment - when Unicode does not happen

2003-12-22 Thread Jungshik Shin
On Mon, 22 Dec 2003, Ed Batutis wrote:

 Jarkko Hietaniemi [EMAIL PROTECTED] wrote in message
 news:[EMAIL PROTECTED]

  You do know that ...
 Yes.

 If wctomb or mbtowc are to be used, then Perl's Unicode must be converted
 either to the locale's wide char or to its multibyte. This isn't trivial,
 but Mozilla solved this same problem. It can portably work. (Are you
 listening Brian Stell!). It wasn't easy for them, but they did it.

  You're probably talking about nsNativeCharsetUtils.cpp in Mozilla
(http://lxr.mozilla.org/seamonkey/source/xpcom/io/nsNativeCharsetUtils.cpp).
I'm familiar with that part because I made a few changes there in the last
6 months.  Mozilla doesn't use wc*mb/mb*wc() because it can't possibly
know _what_ 'wchar_t' actually is in the current locale. Note that
'wchar_t' is not only locale-dependent (i.e. a run-time dependency) on a
single platform but also compiler-dependent. It works because it relies
on iconv(3) to convert between the current locale codeset and UTF-16
(used internally by Mozilla) if/wherever possible. 'wc*to*mb/mb*to*wc'
is used only where iconv(3) is not available. Anyway, yes, that's
possible. If a user mixes multiple encodings/code sets in her/his file
system, that's not Perl's problem but her/his problem, so I don't
think that's a valid reason for not doing something reasonable.

  Imagine ...

 I don't have to imagine. But I think that where a Perl script opens its
 files is its own business. I don't see why Perl would have to do anything in
 that regard. Even if it did, I don't see that feature as blocking the
 simpler feature of just doing a conversion to/from multibyte before/after a
 system call. If I'm dealing with just Japanese on a Japanese system, that's
 all I need.

  Uhhh... from a Win32 API bug workaround you deduce that ... SJIS should
  work?

 Well, Win32 has an API to test whether a backslash is the second byte
of a 'multibyte character'. That is, the code snippet given by Ed could
have been written better with that API.


 Here's my dilemma: utf-8 doesn't work as an argument to -d and neither does
 Shift-JIS (at least with certain Shift-JIS characters). Those are my only
 choices. So you are saying basically 'Shift-JIS be damned  - write a
 module'? I hope you'll understand if I find it hard to sympathize with that

 Win32 is troublesome because it has two tiers of APIs, code-page
dependent 'A' APIs and Unicode-based 'W' APIs. If 'W' APIs were guaranteed
to be available everywhere (from Win95 to WinXP), Perl could just convert
whatever legacy encodings into UTF-16LE and call 'W' APIs. Actually,
you don't have to call 'W' APIs directly; just using the 'generic'
APIs would be translated into 'W' APIs if a macro (whose name is escaping
me) is defined at compile time. Now the question is whether 'W'
APIs are available on old Win95/98/ME. They're available if MS IE 5.x
or later and/or a relatively new version of MS Word/Office is installed,
because they come with the MSLU (Microsoft Layer for Unicode) dll. So, for
the majority of cases, the above should work.  However, there are a
small number of cases where MSLU is not available on Win 9x/ME. In that
case, you have to fall back to 'A' APIs. Even with MSLU installed, on
Win9x/ME you're limited to the character repertoire of the legacy code
page (i.e. Shift_JIS on Japanese Windows, Windows-936 on Chinese Windows,
Windows-1252 on Western European Windows). Therefore, a better approach
might be to do the OS detection and use 'A' APIs on Win 9x/ME and 'W'
APIs on Win 2k/XP. That's what Mozilla does.  Unfortunately, this code is
not yet deployed in the file I/O part of Mozilla, which is the cause of
several bugs. (See http://bugzilla.mozilla.org/show_bug.cgi?id=162361)
Still another approach is to build two separate binaries of Win32 Perl,
one for Win 9x/ME and the other for Win 2k/XP.

  Jungshik



Re: perlunicode comment - when Unicode does not happen

2003-12-22 Thread Jungshik Shin
On Mon, 22 Dec 2003, Jarkko Hietaniemi wrote:

 (AFAIK) W2K and later _are able_ to use UTF-16LE encoded Unicode for
 filenames,
 but because of backward compatibility reasons using 8-bit codepages is
 much
 more likely.

  No. _Both_ NTFS (only supported by Win 2k/XP) and VFAT (supported by
Win 2k/XP and Win 9x/ME) use UTF-16LE **exclusively**. In that respect,
Windows filesystems are 'saner' than Unix file systems.  APIs for accessing
them come in two flavors, 'A' APIs and 'W' APIs, though as I explained
in another message of mine.


 The Apple HFS handles Unicode using _normalized_ (NFC, IIRC) UTF-8.

  The Mac OS X file system uses not NFC (precomposed unicode) but NFD
(decomposed Unicode).

 There we have two different Unicode encodings, both in use.

 FYI, Mac OS X 10.3 (or 10.2) or later has APIs for the conversion
between NFC and NFD.
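
  In Perl itself, the two forms can be produced with the core
Unicode::Normalize module; for example, the same Hangul syllable in both
normalizations:

    use Unicode::Normalize qw(NFC NFD);

    my $nfc = "\x{AC00}";     # HANGUL SYLLABLE GA, precomposed
    my $nfd = NFD($nfc);      # "\x{1100}\x{1161}", decomposed
    printf "NFC: %d character, NFD: %d characters\n",
           length($nfc), length($nfd);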

  Jungshik


Re: Mixing Unicode and Byte output on a Unicode enabled Perl 5.8.0

2003-10-09 Thread Jungshik Shin
On Thu, 9 Oct 2003, Frank Smith wrote:

 I am trying to use the £ (pound sterling) symbol in a script that
 produces both TEXT and HTML. The html handles the Unicode fine, all the
 browsers seem to work. However, once the text file arrives on the Windowz
 box the Unicode £ screws Excel.

 Can you help by suggesting a way to force a specific script to produce
 'plain text' (That bit more than ASCII) or preferably to specifically
 output, via the IO layer, 'plain text' on specific occasions.

  Well, there's nothing that prevents you from using UTF-8 for *plain
text*.  I've got tens of thousands of UTF-8 plain text files and am
making one now (because I'm gonna send this email in 'text/plain;
charset=UTF-8').

  Anyway, what you want is to get your output in Windows-1252
(or its subset ISO-8859-1) so that Excel running under the English version
of Windows 9x/ME or Windows 2k/XP with the default locale set to English
can read your text output with £.  The man page of 'Encode' should help
you (see the section "Encoding via PerlIO").
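
  A minimal sketch of that approach (the file name is made up):

    # Write the text output in Windows-1252 so that Excel on a Western
    # European Windows box sees a single-byte pound sign.
    open my $out, '>:encoding(cp1252)', 'report.txt' or die "report.txt: $!";
    print {$out} "Total: \x{A3}42.00\n";    # U+00A3 POUND SIGN
    close $out;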

  Alternatively, if you're on Win2k/XP (and don't care about Win9x/ME),
you can prepend your UTF-8 plain text output with UTF-8 BOM (that is, at
the very beginning of your plain text output file, print out \x{feff}).
With UTF-8 BOM present, Win2k/XP should be able to detect that your
plain text file is in UTF-8 instead of legacy 'code pages'.

  Jungshik


Re: Mixing Unicode and Byte output on a Unicode enabled Perl 5.8.0

2003-10-09 Thread Jungshik Shin
On Thu, 9 Oct 2003, Guido Flohr wrote:

 BTW, Windows editors also insert that BOM at the beginning when writing
 XML files encoded in UTF-8.  In other words: If you edit a UTF-8 XML
 file with Windows Notepad, it will be corrupted.  MSIE and Mozilla (!)
 still treat it as well-formed XML but a standards compliant parser will
 of course reject it.

  Well, I am not fond of UTF-8 BOM at all, but it's not a violation
of the standard to prepend an XML file in UTF-8 with UTF-8 BOM
(see http://www.w3.org/TR/REC-xml#sec-guessing).

  Jungshik


Re: Quick question: viscii vs. iscii? NEVERMIND

2003-06-03 Thread Jungshik Shin
On Mon, 2 Jun 2003, David Graff wrote:

 Does 5.8 have any conversion functionality for ISCII?  If not, is
 anyone working on this (and is there a notion when it may be ready)?

   Encode doesn't support ISCII yet (there may be a separate module for
ISCII, though). I'm planning to work on it (see my message to the
list sent on May 17th; we also need a TSCII converter), but you (or anyone
else) are welcome to go ahead because I'm not gonna do it very soon.

  Jungshik


Encode::_utf8_on and output

2003-05-30 Thread Jungshik Shin
On Sat, 18 Jan 2003, Jarkko Hietaniemi wrote:

 Now Perl-5.8.1-to-be has been changed to

 (1) not to do any implicit UTF-8-ification of any filehandles unless
 explicitly asked to do so (either by the -C command line switch
 or by setting the env var PERL_UTF8_LOCALE to a true value, the switch
 wins if both are present) (and if the locale settings do not indicate


 Note that the above do not change the fact that if a *programmer* wants
 their code to be UTF-8 aware, they need to think about the evil binmode().

Recently, I came across something curious. From this thread, we all know
that perl 5.8.0 does implicit 'UTF-8-ification' when it's run under a
UTF-8 locale and perl 5.8.1 won't. The following script produces
five output files. 가각 (U+AC00 U+AC01) in EUC-KR is '0xb0 0xa1 0xb0 0xa2'.
Under a UTF-8 locale and perl 5.8.0, default.out has

  c2 b0 c2 a1 c2 b0 c2 a2

while bytes.out, binmod.out, encode.out and default2.out have

  b0 a1 b0 a2

What made me curious is default2.out. I'm wondering how setting the UTF8
flag on what is an invalid UTF-8 string ($output) with Encode::_utf8_on
effectively made the output filehandle behave as if 'binmode' were set or
the 'bytes' layer were used. Needless to say, I wouldn't rely on that, but
I am interested to know how this happens.

Jungshik

P.S. BTW, is there any way to specify 'CHECK' for 'encoding' layer?


#!/usr/bin/perl -w
use Encode;

$input  = "\x{ac00}\x{ac01}";                            # two Hangul syllables
$output = encode("euc-kr", $input, Encode::FB_PERLQQ);   # EUC-KR bytes

open $ofh, "> default.out";                              # no explicit layer
print $ofh $output;
close $ofh;

open $ofh, ">:bytes", "bytes.out";
print $ofh $output;
close $ofh;

open $ofh, "> binmod.out";
binmode($ofh);
print $ofh $output;
close $ofh;

open $ofh, "> default2.out";
Encode::_utf8_on($output);                               # lie about the flag
print $ofh $output;
close $ofh;

open $ofh, ">:encoding(euc-kr)", "encode.out";
print $ofh $input;
close $ofh;
---




Re: How to name CJK ideographs

2002-10-25 Thread Jungshik Shin



On Sat, 26 Oct 2002, Dan Kogai wrote:

 On Saturday, Oct 26, 2002, at 03:55 Asia/Tokyo, Jungshik Shin wrote:
Another possibility is 'meaning-pronunciation' index. I believe
  this is one of a few ways to refer to CJK characters (say, over the
  phone)
  in all CJK countries. However, to do this, we need much more raw data
  (more or less like a small dictionary) than UniHan DB provides because
  it lists meanings of characters in English only.

 That's one thing I wish I could do -- Dan as in Bomb because I
 can't go like YOU five ef three ee :)  I know that's difficult but it

  Until such a time as you can do that or somebody with an infinite amount
of free time volunteers :-), how about \N{life:sheng1} for zh and
\N{life:saeng} for ko and so forth? Nothing fancy, just using
what's available in the UniHan DB.  Then, I came to wonder, in this
age of Unicode, why we have to bother to use '\N{}' when we can just
directly use 生 in perl. I know there are some cases where '\N{...}'
is necessary and useful. Another question came up: do we really
need a meaning-pronunciation index in native languages?  If one can enter
meaning-pronunciation inside '\N{...}', there would be really no reason
not to directly type the character in question. Therefore, '\N{...}'
is kind of a fallback for those who can't enter CJK characters directly, and
'meaning-pronunciation' in English and Romanized form is all we need for
'\N{}', isn't it?
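
  For what it's worth, something close to that fallback can already be
faked per script with a custom charnames alias (the alias name below is
made up, and I use an underscore since ':' may not be accepted in an
alias name):

    use charnames ':full', ':alias' => { life_saeng => 0x751F };

    print "\N{life_saeng}";    # U+751F, the character under discussion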

  Just my two hundredths of €  .


  Jungshik




RFC 2231 (was Re: Encode::MIME::Header...)

2002-10-08 Thread Jungshik Shin




On Mon, 7 Oct 2002, Dan Kogai wrote:

 As I said, Encode::MIME::Header has those restrictions;

 * the Encode API
 * RFC 2047

 I'm not sure if Encode::MIME::Header is the best place to
implement RFC 2231, because RFC 2231 encoding/decoding involves two
parameters, 'MIME charset' and 'language'.  RFC 2231 is used not only
for email/news messages but also in HTTP headers.

 Implementing RFC 2231 in Encode::MIME::Header would help
dynamically generated attachments (on the web) have standard-compliant
Content-Disposition headers (RFC 2183). Currently, most C-D headers
generated by CGI programs use either raw 8-bit characters in an unspecified
encoding or RFC 2047 encoding for the value of the 'name' parameter of the
C-D header. Neither of these behaviors is standard-compliant.
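
  For reference, an RFC 2231-encoded parameter carries the charset and
language inside the value itself. A rough sketch of building one by hand
(URI::Escape is from the URI distribution, not core, and this is not an
Encode::MIME::Header feature):

    use Encode qw(encode);
    use URI::Escape qw(uri_escape);

    my $filename = "\x{AC00}\x{B098}.txt";                 # a Korean file name
    my $encoded  = uri_escape(encode('UTF-8', $filename)); # %EA%B0%80%EB%82%98.txt
    my $header   =
        "Content-Disposition: attachment; filename*=UTF-8'ko'$encoded";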

  Jungshik





Re: README.cjk?

2002-05-06 Thread Jungshik Shin

On Tue, 7 May 2002, Dan Kogai wrote:

 Hi Dan,

 pumpking is calling for the (hopefully) last chance to update 
 README.cjk.
 
 On Tuesday, May 7, 2002, at 02:48 , Jarkko Hietaniemi wrote:
  Do I have the latest versions of the README.{cn,jp,ko,tw}?
 
 I do think so but I am calling for the last possible update anyhow.

  It seems like my latest version was lost somewhere (sent around April
18th) :-). Here's another try. I took this chance to correct a couple
of typos.

  Cheers,

  Jungshik


If you read this file _as_is_, just ignore the funny characters you
see. It is written in the POD format (see perlpod manpage) which is
specially designed to be readable as is.

This file is in Korean encoded in EUC-KR. 

If you are reading this document directly, without perldoc, ignore the
=head, =item, 'L' and so on used to mark the role of each part.
This document is written in the POD format, which can be read without
much trouble even without perldoc.  See the perlpod manual for
more detail.


=head1 NAME

perlko - Perl and Korean encodings

=head1 DESCRIPTION

Welcome to the world of Perl!


Starting with version 5.8.0, Perl has extensive support for Unicode/ISO 10646.
As part of that Unicode support, it supports a great number of encodings
that were in use before Unicode, and still are widely used, in Korea,
China, Japan and other countries around the world.  Unicode aims to
encompass the writing systems of all the languages used in the world -
the Latin, Cyrillic and Greek alphabets of Europe, the Brahmi-derived
scripts of India and Southeast Asia, the Arabic script, the Hebrew script,
the Han characters of China, Japan and Korea, Korean Hangul, Japanese Kana,
the writing systems of North American Indians, and so on - so it includes
not only all the characters usable in the character sets and encodings
specific to each language, country and operating system, but also a
great many characters not supported by those legacy character sets.


Perl uses Unicode internally to represent characters. More concretely,
UTF-8 strings can be used inside Perl scripts, and the various functions
and operators (for example, regular expressions, index, substr) operate
on Unicode characters instead of bytes. (See the perlunicode manual for
more detail.) For input and output in the national encodings that were
widely used before Unicode spread, and still are, and for handling data
and documents in those encodings, 'Encode' is used. Above all, with
'Encode' you can easily convert between a great number of encodings.

'Encode' supports the following Korean encodings.

=over 4

=item euc-kr

A multibyte encoding that uses US-ASCII and KS X 1001 together (commonly
called Wansung). See KS X 2901 and RFC 1557.

=item cp949

The extended Wansung used on MS-Windows 9x/ME.  It adds 8,822 Hangul
syllables to euc-kr.  Its aliases are uhc, windows-949, x-windows-949 and
ks_c_5601-1987. The last is not an appropriate name, but it is used by
Microsoft products with the meaning of CP949.

=item johab

The Johab encoding specified in Annex 3 of KS X 1001:1998.  As with cp949,
its character repertoire is US-ASCII and KS X 1001 plus 8,822 Hangul
syllables.  The encoding scheme is entirely different.

=item iso-2022-kr

The encoding for Korean Internet mail exchange specified in RFC 1557.
Like euc-kr, its repertoire is US-ASCII and KS X 1001, but the encoding
scheme is different.  It was used until around 1997-8 but is no longer
used for mail exchange.

=item ksc5601-raw

KS X 1001 (KS C 5601) as placed in GL (that is, with the MSB set to 0).
Except for its use as the X11 font encoding (ksc5601.1987-0, where '0'
denotes GL), it is hardly ever used on its own without US-ASCII.  KS C
5601 was renamed KS X 1001 in 1997.  In 1998, two characters (the Euro
sign and the registered trademark sign) were added to it.

=back

 A few usage examples are shown below.

For example, to convert a file in the euc-kr encoding to UTF-8, you can
do the following.


perl -Mencoding=euc-kr,STDOUT,utf8 -pe1  < file.euckr  > file.utf8

The reverse conversion can be done as follows.

perl -Mencoding=utf8,STDOUT,euc-kr -pe1  < file.utf8  > file.euckr

  To make such conversions more convenient, piconv, written purely in
Perl using the Encode module, is included with Perl.
As its name suggests, piconv is modeled on the iconv found on Unix.
Its usage is as follows.

   piconv -f euc-kr -t utf8  < file.euckr  > file.utf8
   piconv -f utf8 -t euc-kr  < file.utf8  > file.euckr

  Also, with the 'PerlIO::encoding' module you can easily do
character-oriented (rather than byte-oriented) processing while using
Korean encodings.

  #!/path/to/perl

  use encoding 'euc-kr', STDIN => 'euc-kr',
     STDOUT => 'euc-kr', STDERR => 'euc-kr';

  print length("가나");   # 2  (double quotes mean character-oriented processing)
  print length('가나');   # 4  (single quotes mean byte-oriented processing)
  print index("한강, 대동강", "염");   # -1 ('염' does not occur)
  print index('한강, 대동강', '염');   # 7 (the 8th and 9th bytes match the
code value of '염'.)



Re: http://bleedperl.dan.co.jp:8080/

2002-04-27 Thread Jungshik Shin

On Sat, 27 Apr 2002, Dan Kogai wrote:

 I have set up an experimental mod_bleedperl server which URI is shown in
 the subject.
 To demonstrate the power of Perl 5.8, I have written a small cgi/pl (.pl
 runs on Apache::Registry) called piconv.pl, a web version of piconv(1).

 http://bleedperl.dan.co.jp:8080/piconv/
 (Don't forget :8080; it's not run on root!)

 What's so funny is that this service can be used to 'asciify' non-ascii
 web pages.  Bart's idea of HTMLCREF is fully exploited here.  To find it
 out, try

  Wow, this is great and very timely!! Yesterday, I wrote to Werner
Lemberg (the maintainer of the CJK package for LaTeX and freetype/ttf2tfm,
among other things) and Ross Moore (the maintainer of the LaTeX2html
converter) that the upcoming Perl 5.8 would include this great Encoding
module. With it, I told them, it'd be trivial to represent characters
outside the repertoire of the target encoding (for html output) as
NCRs. Today, Werner expressed his interest in this feature because he
wants to make use of it in groff. Now you put up this page...

  This feature will also help reduce ill-tagged (mislabeled) pages. For
instance, a lot of Korean web pages are mislabeled as EUC-KR while they
contain characters outside EUC-KR. If Encoding is widely used in CGI
programs behind those Web bulletin boards, or mod_bleedperl is used along
with Apache (I'm assuming that mod_bleedperl can do an encoding conversion
behind the scenes..), all of a sudden a number of mistagged pages
will disappear :-)
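
  The fallback being referred to is Encode's FB_HTMLCREF; a small sketch
of the NCR behavior (the Hebrew letter stands in for any character outside
EUC-KR's repertoire):

    use Encode;

    my $text = "\x{AC00}\x{05D0}";    # HANGUL GA + HEBREW ALEF
    my $out  = Encode::encode('euc-kr', $text, Encode::FB_HTMLCREF);
    # $out holds the EUC-KR bytes for the Hangul plus the literal "&#1488;"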

  Jungshik





Re: README.jp, README.tw, README.cn, README.kr

2002-04-15 Thread Jungshik Shin


 Hi,

  Attached is README.ko (per Jarkko's suggestion, I used
'ko' instead of 'kr') in the EUC-KR encoding. North Korea has its own 94
x 94 coded character set (KPS 9566-97: ISO-IR 202), but a few web pages
set up for/by North Korean companies (and possibly the government?) whose
URLs I happen to know use EUC-KR.

  I also added what Autrijus added to README.tw.

  Cheers,

  Jungshik 



README.ko
Description: README  in Korean in EUC-KR 


piconv and EUC :-)

2002-04-10 Thread Jungshik Shin

On Sun, 31 Mar 2002, Dan Kogai wrote:

Hi Dan, 

 piconv -- iconv(1), reinvented in perl
 
 piconv is a perl version of iconv, a character encoding
 converter widely available for various unixen today.   This
 script was primarily a technology demonstrator for Perl
 5.8.0, but you can use piconv in place of iconv for
 virtually any case.

  Well, I'm afraid 'virtually any case' is a bit of an exaggeration. glibc's
iconv and the iconv in libiconv can also deal with 'transliteration', but
I'm afraid piconv can't do that, yet :-)

  Another minor note about documentation. I forgot to mention
that 'EUC' in EUC-JP/EUC-KR stands for 'Extended Unix Code'.  I'm sure
it's the original term used by AT&T, because I've seen 'Extended Unix
Code' on many occasions over many years.  There's at least one place in
the Encode docs where 'Extended Unix Character' is used in place of that.

  Cheers,

  Jungshik




Re: [PATCH] Supported.pod: cleanup/UTF-16/CJK.inf + an invasion tothe Glossary

2002-04-05 Thread Jungshik Shin

On Fri, 5 Apr 2002, Anton Tagunov wrote:

Hi Anton,

 Speaking of the patch..
 
 
 AT +=item Jungshik Shin's Hangul FAQ
 AT +Lhttp://jshin.net/faq
.
 AT +Lhttp://jshin.net/faq/qa8.html
 AT +has a comprehensive overview of the CKS * (Korean) standards.
 AT +Tha author claims however that the document needs
 AT +some modernisation :-)
 I'm sorry, I haven't been to bed too long, so not sure if my
 writings are okay.
 Jungshik, is this a proper recommendation for
 you cite? Drop the line on modernization? (Not the best place
 for jokes :-(

  No, it's perfectly all right with me. I don't think it's
inconsistent with Larry's putting some nice jokes in his
Perl books :-)

  +The modern successor of the CCJK.inf.
  +The book of choice for everyone interested.
  +
  +Features a comprehensive coverage on CJKV character sets and encodings
  +along with many other issues faced by anyone trying to better support
  +CJKV languages/scripts in all the areas of information processing.

  Looks good. 




Re: [Encode] UCS/UTF mess and Surrogate Handlings

2002-04-05 Thread Jungshik Shin

On Fri, 5 Apr 2002, Jarkko Hietaniemi wrote:

  P.S.  Does utf8 support surrogates?  Surrogate pair is definitely the 
 
 No.  Surrogates are solely for UTF-16.  There's no need for surrogates
 in UTF-8 -- if we wanted to encode U+D800 using UTF-8, we *could* --
 BUT we should not.  Encoding U+D800 as UTF-8 should not be attempted,
 the whole surrogate space is a discontinuity in the Unicode code point
 space reserved for the evils of UTF-16.

  I can't agree more with you on this. Unfortunately, people
at Oracle and PeopleSoft think differently. Actually, what happened was
that they made a serious design mistake by making their DBs understand
only UTF-8 sequences up to 3 bytes long, although when they added UTF-8
support it was plainly clear that ISO 10646/Unicode was not just the BMP.
When the planes beyond the BMP finally began to be filled with actual
characters, they came up with that stupid idea of using two 3-byte-long
UTF-8 units (for surrogate pairs) to represent those characters.

  A lot of people on the Unicode mailing list voiced a very strong
and technically solid objection against this, but Oracle and PeopleSoft
went on to publish DUTR #26: Compatibility Encoding Scheme for UTF-16
(CESU-8) (http://www.unicode.org/unicode/reports/tr26). Does Encode
need to support this monster?  I hope not.

   Jungshik Shin




Re: [Encode] Encode::Supported revised

2002-04-04 Thread Jungshik Shin

On Thu, 4 Apr 2002, Dan Kogai wrote:


  Konnichiha !
  (hope I got this one right).

 On Thursday, April 4, 2002, at 03:06 , Jungshik Shin wrote:
  o The MIME name as defined in IETF RFCs.
 UCS-2 ucs2, iso-10646-1[IANA, et al]
 UCS-2le
 UTF-8 utf8 [RFC2279]
 
 
How about UCS-2BE? Of course, if UCS-2 is network byte order
  (big endian), it's not necessary. In that case, you may alias UCS-2
  to UCS-2BE.
 
And UCS2-NB (Network Byte order)?  Unicode terminology is confusing 
 sometimes.
I've checked http://www.unicode.org/glossary/ and it seems that the 
 canonical - alias order should be as follows.
 
 UCS-2 ucs2, iso-10646-1, utf-16be
 UTF-16LE  ucs2-le
 UTF-8 utf8
 
I left UCS-2 as is because it is IANA registered. UCS-2 is indeed a 
 name of encoding as the URL above clearly states.  It is also less 
 confusing than UTF-16.
ucs2-le will be fixed.

  IETF RFC 2781 also 'defines' (for IETF purposes) UTF-16LE, UTF-16BE, and
UTF-16. It's at http://www.faqs.org/rfcs/rfc2781.html  among other places.
BTW, how does Encode deal with the BOM in UTF-16? It's trivial to add the
BOM at the beginning by hand (with perl), but you may consider
adding an option (??) to add/remove the BOM automatically when converting
to/from UTF-16(LE|BE).
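
  The by-hand approach is just prepending U+FEFF before encoding; for
example (the file name is made up):

    use Encode qw(encode);

    my $doc = "\x{FEFF}" . "A UTF-16LE document\n";   # BOM + content
    open my $fh, '>:raw', 'doc-utf16le.html' or die $!;
    print {$fh} encode('UTF-16LE', $doc);
    close $fh;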



Could you please just say 'Encoding vs Character Set'
  and remove parenthetical 'charset for short' or 'just charset' following
  'character set'?  I agree to your distinction between 'encoding' and
  'character set', but what is bothering me is that you treat 'charset'
  as a synonym to 'character set'.

 Now I agree.  charset is more appropriate for coded character set 
 and that was MIME header's first intention.  EUC is indeed a coded 
 character set but charset=ISO-2022-(JP|KP|CN)(-\d+)?  is absolutely 
 confusing --  it is a character encoding scheme at best.  I am thinking 
 of adding a small glossary to this document as follows.

 And  Here is a glossary I manually parsed out of 
 http://www.unicode.org/glossary/ , right after the signature.

  Thank you. BTW, you may also want to take a look at W3C's charmod
TR at http://www.w3.org/TR/charmod and 'charset' part of html4 spec
at http://www.w3.org/TR/REC-html40/charset.html 


 In a strict sense, the concept of 'raw' or 'as-is' (which you
  apparently use to mean a coded character set invoked on GL)  is not
  appropriate. Because JIS X 0208, JIS X 0212 and KS X 1001 don't map
  characters to their GL position when enumerating characters in their
  charts. The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001
  are row (ku) and column(ten?)  while GB 2312-80 appears to use GL
  codepoints. That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
  and ksx1001-raw. 'gl' doesn't have a risk of being mistaken for row and
  column numbers.
 
I wonder whether ku-ten form is canonical or derived.  JIS X 0208 was 
 clearly designed to be ISO-2022 compliant.  Technically speaking 
 0x21-0x7e should the original and 1 - 94 is derived to make decimal 
 people happier.  But you've got a point.

  Maybe you're right. It may have made 'decimal-oriented people'
happier, but it's a pain in the ass to 'hexadecimal-oriented people'
like us, isn't it?


Speaking of '-raw'  that's a BSD sense of calling unprocessed data and 
 for a Deamon freak it came out naturally.

  All right. It's your decision :-)
  

  are IANA-registered (CUTF-16 even as a preferred MIME name)
  but probably should be avoided as encoding for web pages due to
  the lack of browser supports.
 
Not that I'd encourage people to use UTF-16 for their web pages,
  but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
  and Mozilla.
 
The problem is not just browsers.  As a network consultant I would 
 advised against UTF-16 or any text encoding that may croak cat(1) and 
 more(1) (We can go frank on Mojibake  For cases like mojibake, the 
 text goes to EOF).  After all, we have UTF-8 already that good old cat 
 of ours can read till EOF with no problem.

  Sure, I like UTF-8 much more than UTF-16 and any byte-order-dependent
and 'cat-breaking' :-) transformation format of Unicode. I can assure
you that I'm certainly on your side!  Microsoft products generate UTF-8
with a **totally redundant** BOM (byte order mark) at the beginning. I don't
know whether there's a conspiracy to break the time-honored Unix tradition
of command line filtering, but it's certainly annoying to deal with UTF-8
files with a BOM. For example, 'cat f1 f2 f3' wouldn't work as it is. 'cat'
and many other Unix tools need to be modified to remove the 'BOM'.

  L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
  Somewhat obsolete (last update in 1996), but still useful.  Also try
 
Is there any rule against mentioning a book in print as opposed
  to online

Re: [PATCH] Re: [Encode] Encode::Supported revised

2002-04-04 Thread Jungshik Shin

On Thu, 4 Apr 2002, Anton Tagunov wrote:

 Hi Anton,

 Thanks a lot.

 - changes status of KOI8-U on Jungshik's comment
   (sorry, I have never tested that myself :-(

  I haven't tested it either :-), but both Mozilla/Netscape6 and MS IE
list it in the View|Encoding menu, which I interpret as their having support
for it.


UTF-16 
 -  KOI8-U(http://www.faqs.org/rfcs/rfc2319.html)
  
 -are IANA-registered (CUTF-16 even as a preferred MIME name)
 +=for comment
 +waiting for comments from Jungshik Shin to soften this - Anton
 +
 +is a IANA-registered preferred MIME name
  but probably should be avoided as encoding for web pages due to 
 -the lack of browser supports.
 +the lack of browser support.

   The reason your test didn't work with MS IE was probably that
you didn't prepend your UTF-16 html doc. with a BOM (byte order mark).
It's to be noted that the conventional way of informing web browsers
of the MIME charset by putting in a meta tag doesn't work for UTF-16/UTF-32.
Either you have to configure your web server to emit a C-T header with
'charset=UTF-16(LE|BE)' or you have to put a BOM at the beginning.
When a BOM is present, MS IE 5/6, Mozilla/Netscape6 and Netscape4
have no problem rendering UTF-16(LE|BE) encoded pages. I put
up a couple of test pages at

   http://jshin.net/i18n/utf16le_kr2.html
   http://jshin.net/i18n/utf16be_kr2.html

For more details on UTF-16 and HTML, you can refer to HTML4 spec. at
 
  http://www.w3.org/TR/html4/charset  (see section 5.2.1)

As I wrote before, I have no intention to encourage use of UTF-16 over
UTF-8 although some people  whose primary script  has a more 'economical'
(in terms of file size) representation in UTF-16 than in UTF-8 may want
to use it.


 +=head2 Microsoft-related naming mess
 +
 +Microsoft products misuse the following names:
 +
 +=over 2
 +
 +=item KS_C_5601-1987
 +
 +Microsoft extension to CEUC-KR.
 +
 +Proper name: CCP949.
 +
 +See
 +http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
 +for details.

 Wow, I didn't know that Martin wrote this. Thanks a lot for
digging this up.  He 'rediscovered' what a lot of people in Korea had
complained about. One thing I don't agree with him on is what designation
to use for CP949. I think it'd better be 'windows-949' because that's
more in line with other MS code pages such as windows-125x (for European
scripts). By the same token, the MS version of Shift_JIS can be labeled as
'windows-932'. At the moment, Mozilla uses 'x-windows-949' for CP949/UHC
because it's not yet registered with IANA. Probably, I have to contact
Martin and discuss this issue.

 +Encode aliases CKS_C_5601-1987 to Ccp949 to reflect
 +this common misusage. 

 If my patch is accepted, cp949 gets a couple more aliases,
'uhc' and '(x-)windows-949'. CP949 is commonly known as 
'통합 완성형' (Unified Hangul Code) in Korea.


 +IRaw CKS_C_5601-1987 encoding is available as Ckcs5601-raw.

  ksc5601-raw had better be renamed ksx1001-raw, and ksc5601-raw
can be made an alias to ksx1001-raw. Please note that what's now called
ksc5601-raw has two new characters which were only added in Dec. 1998,
over a year after the name change (KS C 5601 -> KS X 1001).

 +=item GB2312
 +
 +Encode aliases CGB2312 to Ceuc-cn in full agreement with
 +IANA registration. Ccp936 is supported separately.
 +IRaw CGB_2312-80 encoding is available as Ckcs5601-raw.

  Oops... You meant gb2312-raw, didn't you? :-)


 Jungshik, I would have certainly advocated linking not only to
 http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
 but also to your comments on the KS_C_5601-1987 in the list archive,
 but all your mails were on several subjects each.
 
 Jungshik ... refer to Ken Lunde's CJKV Information Processing
 Jungshik about that 'epic war' between two camps. (see p.197 of
 Jungshik the book and http://jshin.net/faq/qa8.html)
 Jungshik We even set up a web page to prevent M$ from spreading that
 Jungshik ill-defined name.
 
 maybe we could link to this page? What is the address?

  The campaign web site has since disappeared. It was almost 5 years
ago :-). However, subject 8 of my Hangul FAQ deals with the issue
(http://jshin.net/faq/qa8.html), so you may add a link to it.
Be aware, though, that it's been untouched for a few years (if not longer)
and needs a complete overhaul.







Re: - charset + character set + coded character set + CCS (?) (was:[Encode] Encode::Supported revised)

2002-04-04 Thread Jungshik Shin

On Thu, 4 Apr 2002, Anton Tagunov wrote:

 Hi Anton !!

AT Our comments go in the same direction, but will you
AT let me strengthen your statements a bit?

  Thank you !

JS On the other hand, no one with *sufficient understanding*
JS of the issue uses 'character set' to mean encoding.

AT [ECMA-35, (equivalent of ISO 2022?)]:

  Yes, I think they're a verbatim equivalent of ISO 2022. I'd never
have been able to read ISO 2022 if ECMA hadn't released it for free as ECMA 35.

AT coded character set; code
AT   A set of unambiguous rules that establishes a
AT   character set and the one-to-one relationship between the 
AT   characters of the set and their coded representation.

AT [RFC 1345]:
AT   The ISO definition of the term coded character set is as
AT   follows: A set of unambiguous rules that establishes a 
AT   character set and the one-to-one relationship between the 
AT   characters of the set and their coded representation.

AT Hmmm... can this potentially lead to mistaking character set for
AT a short form of coded character set (in the ISO meaning)?

AT I see that these definitions themselves make a distinction between a
AT character set   (= repertoire) and
AT coded character set (= CCS + encoding = CCS + CES),

 Jungshik?

  Hmm, I feel like I'm being treated as 'the' ultimate something here, which
I'm certainly not and never wanted to be :-)

  I think Dan is right when he wrote that EUC-JP, EUC-KR, EUC-CN,
EUC-TW and even UTF-8 could be regarded as both CCS and CES. Even though
they involve multiple character set standards, the mapping from abstract
characters in those multiple character set standards to integers (despite
being of multiple 'lengths') is strictly one-to-one.  I didn't realize
that it's possible to view things that way until he wrote that. On the
other hand, as he wrote, any encoding that utilizes any form of escape
sequence (locking/single shift, designator, etc.), whether defined in
ISO 2022 or not (I have HZ in mind here), cannot be called a CCS because
providing the mapping alone cannot fully specify the way actual
text in that encoding is 'serialized' into an octet sequence. Therefore,
I believe the statement below doesn't hold true for all encodings we have
to deal with, although it's the case for some encodings.

AT coded character set (= CCS + encoding = CCS + CES),

Then, I realized that RFC 1345 has the following after quoting the
ISO definition of 'coded character set' which you quoted above.

1345 This memo does not put further
1345 restrictions on the term of coded character set than the following:
1345  A coded character set is a set of rules that unambiguously and
1345  completely determines which sequence of characters, if any, is
1345  represented by each possible sequence of n-bit bytes for a certain
1345  value of n. This implies that e.g. a coded character set extended
1345  with one or more other coded character sets by means of the extension
1345  techniques of ISO 2022 constitutes a coded character set in its own
1345  right.  In this memo the term charset is used to refer to the above
1345  interpretation of the ISO term coded character set.

However, even RFC 1345 came up with a new term 'charset' for its
*extended* definition of 'coded character set'  to distinguish it from
the original ISO definition. The definition of 'charset' in RFC 1345
is actually in line with RFC 2130/2278. Therefore, what I wrote about
the statement that coded character set (= CCS + encoding = CCS + CES)
is still the case, IMO.
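
Coming back to the escape-sequence point, here is a tiny illustration with
Encode (mine; the hex shown is what I would expect, not output quoted from
anywhere): the same two kanji come out as bare GR bytes in EUC-JP, while the
ISO-2022-JP stream also carries the designating escapes that no per-character
mapping table can describe by itself.

  use Encode qw(encode);

  my $str = "\x{65E5}\x{672C}";    # 'Japan' in kanji
  printf "euc-jp      : %s\n", unpack 'H*', encode('euc-jp',      $str);
  printf "iso-2022-jp : %s\n", unpack 'H*', encode('iso-2022-jp', $str);

  # Expected output (roughly):
  #   euc-jp      : c6fccbdc
  #   iso-2022-jp : 1b2442467c4b5c1b2842   (ESC $ B ... ESC ( B around the same JIS codes)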



DOC Is a collection of characters in which each character is distinguished
DOC by a unique ID (in most cases, the ID is a number).

JS   Some people like to distinguish between a mere collection of characters
JS and a collection of characters with unique (numeric) IDs/code points.
JS The former is sometimes referred to as a character repertoire
JS or a character set whereas the latter is called a 'coded character set'.

AT or rather CCS to rule out the ISO understanding

  I don't see any conflict between RFC 2130 CCS and the ISO coded character
set _quoted_ in RFC 1345. It's not the original ISO definition of 'coded
character set' but RFC 1345's extension of the definition that made
things complicated. However, even RFC 1345 gave it a new term, 'charset',
to distinguish it from the original ISO definition.


DOC =item Character I<Encoding>
DOC A character encoding may also encode character set as-is (also called
DOC a I<raw> encoding, i.e. US-ascii) or processed (i.e. EUC-JP; US-ascii is

JS   In a strict sense, the concept of 'raw' or 'as-is' (which you
JS apparently use to mean a coded character set invoked on GL) is not
JS appropriate, because JIS X 0208 and KS X 1001 don't map
JS characters to their GL position when enumerating characters in their
JS charts.
AT Looks like RFC 1345 has made one big pile:

AT   JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
AT   GB_1988-80
AT   KS_C_5601-1987
AT   
AT are all listed in a similar manner there. Does this RFC change
AT anything?

  As we 

Re: let's cook it!

2002-03-27 Thread Jungshik Shin

On Wed, 27 Mar 2002, Nick Ing-Simmons wrote:

 Autrijus Tang [EMAIL PROTECTED] writes:
 On Tue, Mar 26, 2002 at 06:28:07PM -0500, Jungshik Shin wrote:
Microsoft products use 'ks_c_5601-1987' as an encoding name/MIME
  charset/character set encoding scheme. That's a very strange use
  of KS C 5601-1987. Because, what they mean by 'ks_c_5601-1987'
  is actually CP949/Unified Hangul Code(UHC)/X-Windows-949,
  an upward-compatible proprietary extension of EUC-KR.
 
 Just a quick note: exactly the same thing has happened with Microsoft's
 use of 'gb2312' to mean 'gbk', and 'big5' to mean 'cp950'. In Encode.pm,
 I've been carefully avoiding this misbehaviour; it has been fortunate that
 'ks_c_5601_1987' has a distinct name from 'ksc5601'. :-)
 
 At least they are consistently wrong across the world, most MS things
 claiming to be iso-8859-1 are really cp1252

  Well, not really. MS registered Windows-125x with IANA and uses
Windows-125x in their products consistently. It's NOT MS products (MS OE, IE,
Frontpage) BUT broken programs like Eudora (with very little notion of
I18N and MIME charset) that run under MS Windows that label Windows-125x
documents as ISO-8859-x. I don't like MS, but they shouldn't be blamed
for what's not their fault.

  MS should have registered CP949/950 as Windows-949/950
instead of labeling them misleadingly as ks_c_5601-1987 and big5. In the case
of gb2312, gbk should have been registered and used. I don't know about big5,
but in the Korean case, apparently they tried to pretend that they follow the
Korean nat'l std. while they extended it in a proprietary way.

  Jungshik Shin




Re: Encoding vs Charset

2002-03-27 Thread Jungshik Shin

On Wed, 27 Mar 2002, Dan Kogai wrote:

 On Wednesday, March 27, 2002, at 11:22 , Jungshik Shin wrote:
IMHO, you're also misusing the term 'charset' here. MIME charset
  can be used synonymously with 'encodings' (or
  character set encoding scheme: see CJKV Information Processing,
  IETF RFC 2130 and RFC 2278). What has to be distinguished
  is 'coded character set' on the one hand (JIS X 0208, JIS X 0212,
  KS X 1001, KS X 1003, GB 2312, CNS 11xxx, ISO 10646, ISO 646, US-ASCII,
  ISO-8859-x) and 'encoding/character
  set encoding scheme/MIME charset on the other hand (EUC-JP,
  EUC-KR, EUC-TW, EUC-CN, ISO-2022-JP, ISO-2022-KR, ISO-2022-CN,
  ISO-8859-x, UTF-8, UTF-32, UTF-7, UTF-16, Big5, UHC)
 
 I do not think so.   This time I can confidently say it is IANA that 
 has goofed.  To make my point clear, let me define Charset and Encoding 
 once again.
 
 Character Set:
 
 a collection of characters in which each character is distinguished 
 by a unique ID (in most cases, the ID is a number).
 
 Character Encoding:
 
 A way to represent characters in a byte stream.  A given character 
 encoding may contain a single character set (i.e. US-ascii) or multiple 
 character sets (i.e. EUC-JP, which contains US-ascii, JIS X 0201 Kana, JIS 
 X 0208 and JIS X 0212).  A given character encoding may also encode a 
 character set as-is (raw; US-ascii) or processed (for EUC-JP, US-ascii 
 is as-is, JIS X 0201 is prepended with \x8E, JIS X 0208 is offset by 
 0x8080, and JIS X 0212 is offset by 0x8080 then prepended with \x8F).

  You got me wrong. I don't have any objection to 'coded character set'
and 'encoding' defined this way. The problem is that you're using '(coded)
character set' and 'charset' interchangeably.  They're two different
things depending on where you come from. My point is that because
'charset' is already overloaded with two or more different meanings (as a
MIME Content-Type header parameter, it means 'encoding' as you defined
above), you'd better not use it when comparing coded character set on the
one hand and encoding/character set encoding scheme on the other hand.
Simply, it'd be much better for you to say '(coded) character set vs
encoding' instead of 'charset vs encoding'.

  Jungshik Shin

P.S. I'm wondering why you posted this to the Unicode list (where it's not
very relevant) without posting to perl-unicode.  I was forced to
post my response to the Unicode list, but I'd rather keep this thread (if
there's a need to continue it) where it began (perl-unicode).




Re: let's cook it!

2002-03-26 Thread Jungshik Shin


Dan,

I'm sorry for dropping in this late, but I've just joined
the list and found this. 

 * rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw

  What do you mean by ksc5601-raw and gb2312-raw? If it's
KS C 5601-1987 and GB2312 put in GL, how about ksc5601-gl
and gb2312-gl?  Please also note that KS C 5601-1992
was reissued and renamed as KS X 1001:1998. Therefore,
it'd be better to use ksx1001 in place of ksc5601 and
make ksc5601-* aliases to ksx1001-*.

 * and alias gb2312 and ksc5601 to euc-(cn|kr)

 I agree. :)

 Oh, my gosh! Please remove this alias of ksc5601 to EUC-KR. That's the
last thing we need. KS C 5601-1987 is NOT an encoding (or
character set encoding scheme or MIME charset) BUT just
a coded character set which is used in encodings/MIME charsets/
character set encoding schemes like EUC-KR and ISO-2022-KR.
By aliasing ksc5601 to EUC-KR, the only thing we achieve
is to encourage the confusion and mistakes which have to be
avoided at all costs.

 Well, at least almost every other program (hc, iconv, mozilla...) does
 that anyway.

  No, Mozilla doesn't do that. Neither does yudit. Mozilla's
character coding menu does NOT have KS C 5601.

   I wonder how this mistaking of charset as encoding started.  Well, in 
 the majority of encodings, charsets are applied uncooked, so that may be the 
 reason.

  Wait a moment. You have to be careful here. 'charset' is an overloaded
term. In the MIME sense, 'charset' means the same thing as 'encoding'
(e.g. ISO-2022-JP, ISO-2022-KR, US-ASCII, UTF-8, EUC-KR, EUC-JP,
EUC-CN, ISO-8859-x, etc.) and it DOES NOT mean the same thing as
coded character set (JIS X 0208, JIS X 0201, KS X 1001, GB 2312,
CNS 1, US-ASCII, ISO-8859-x).

  It's unfortunate that GB2312 has been so firmly established in place
of EUC-CN. In the case of EUC-KR, the name has much stronger support than
EUC-CN does, despite Microsoft's continuous assault on it, and people
do know that EUC-KR is different from KS X 1001/KS C 5601.

  Microsoft products use 'ks_c_5601-1987' as an encoding name/MIME
charset/character set encoding scheme. That's a very strange use
of KS C 5601-1987, because what they mean by 'ks_c_5601-1987'
is actually CP949/Unified Hangul Code (UHC)/X-Windows-949,
an upward-compatible proprietary extension of EUC-KR. No Korean
standard specifies it. However, apparently they didn't want to
give the impression that they came up with something proprietary
(not specified in the Korean nat'l standard) by using 'X-Windows-949',
and decided to use 'ks_c_5601-1987' as the MIME charset for it
although it has no place in the Korean nat'l standard. Mozilla
has to accept 'ks_c_5601-1987' as an alias to 'X-Windows-949'
because MS IE, OE and Frontpage are so widely used.

  Jungshik Shin




Re: Encode: CJK-Guide

2002-03-26 Thread Jungshik Shin


Here's some feedback.

 Republic of
 Korea (South Korea; simply Korea as follows) has set KS C 5601 in
 1989.  They are both based upon JIS C 6226, could be one of the

  KS C 5601 was first issued in 1987 and revised in 1989 and
1992. Then, it was renamed and reissued as KS X 1001:1998 in
1998. 


 Though there are escape-based encodings for these two (ISO-2022-CN
 and ISO-2022-KR, respectively), they are hardly used in favor of EUC.

  ISO-2022-KR used to be widely used for Korean email exchange,
as ISO-2022-JP still is. Now ISO-2022-KR is hardly used, but
it was used widely until the late 1990's (see IETF RFC 1557).

 When you say gb2312 and ksc5601, EUC-based encoding is assumed.

  Please don't help spread this misuse. It might be all right
for the (ignorant) public to say KS C 5601 in place of EUC-KR, but Perl
programmers should learn the difference between KS C 5601/KS X 1001 (coded
character set) and encoding/MIME charset/character set encoding scheme/
character coding.

  As I wrote before, GB 2312 has been so widely (mis)used that there's
no way to replace it with EUC-CN. The Korean situation is much better,
although not as good as the Japanese case.

  BTW, I don't find any reference to Microsoft code pages
(CP949 for Korean, CP950, CP936, and CP932), JOHAB (Korean), and
Big5-HKSCS. Is that because they're not yet supported (well, Shift_JIS
and Big5 are supported)?

  Another BTW: don't you think your description of Unicode
and Han Unification is a bit too negative and biased?
I know you feel strongly about the subject, but I'm not
sure CJK-Guide is the best place to express your personal
opinion on it. If you don't want to tone it down or change
it, you may add a disclaimer like 'some people have
reservations about Han Unification and Unicode because ...'
or 'the following is my personal opinion, shared by
some people but not universally accepted'.


 As a result, something funny has happened.  For example, U+673A means a
 machine in Simplified Chinese but a desk in Japanese.  A machine
 in Japanese is U+6A5F.

  Do you really believe this is a strong case against Han Unification?
I don't see any problem with this.  There are a number of
Chinese characters with multiple meanings  even without Han
Unification. Do those 'meanings' have to be assigned separate
code points? 

 So you can't tell what it means just by looking at the code.

  Why does coded character set have to care about what computational
linguists have to do? You can't tell the meaning of 
any English word with multiple meanings by just looking at
its computer representation without context/grammatical/linguistic/lexical
analysis, can you? How do you know what 'fly' means without context? 

  Jungshik Shin




Re: Encode::CJKguide

2002-03-26 Thread Jungshik Shin

On Wed, 27 Mar 2002, Markus Kuhn wrote:

 Dan Kogai wrote on 2002-03-26 22:35 UTC:
 Side note: I still think, Encode should have used the encoding tables
 that are already provided by the operating system where available. For
 example on Linux, the iconv() function with glibc 2.2 or newer does
 already provide access to all the necessary tables. I observe at the
 moment, that almost a dozen different programming language communities
 reinvent the recoding wheel simultaneously and independently, even
 though portable C libraries such as libiconv are already available for
 exactly the same purpose.

  I certainly feel the same way as you do. I thought
a portable implementation of iconv() in libiconv would prevent
the proliferation of (potentially incompatible) encoding converters.
I was wrong.  I found myself
having to check and contribute to/correct, if necessary, all the
incarnations of encoding converters (involving Korean
and sometimes other CJK) in Perl, Java, ICU, PHP, Mozilla, X11,
libiconv/glibc and so forth. It would be much better if
libiconv/glibc were used everywhere.  Encode doesn't support
a lot of encodings, all of which are available in iconv() (glibc's
and libiconv's).

 please clarify that this text represents Dan Kogai's personal and
 possibly uninformed opinion on character encodings and their history,
 and not some consens of everyone involved in the Perl 5.8 release.
 I think this text is still in very early alpha testing ...

  As I wrote already, this disclaimer absolutely needs to be put in. 

 Many of which have a rather Japan-specific and sometimes semi-informed
 view of Unicode and often do not at all represent Chinese or Korean
 views on issues such as Han unification. Please remember: CJK != Japan
 and there are also many good or better Korean and Chinese web pages on
 these issues.

   Koreans are for Unicode almost unanimously.  Han Unification
has never been as large an issue in Korea as in Japan. 

 You should definitely also add a pointer to the Unihan database, which
 is the most comprehensive existing source of cross-reference and
 encoding conversion data between the different Han encodings:
 
 http://www.unicode.org/Public/UNIDATA/Unihan.txt

  I'd also like to add that ISO 10646-1:2000 and ISO 10646-2:2001
need to be consulted before making any premature judgement on
Han Unification. As you or someone else mentioned in another
forum, TUS 3.0 gave some misconceptions about Han Unification
by listing a single glyph for each Han ideograph. On the other hand,
ISO 10646-1:2000 and ISO 10646-2:2001 list five glyphs (SC, TC,
K, J, and V), and browsing through the table, one realizes how little
difference there is among them (sure, there are differences, but
I don't think those differences warrant so much fuss about Han
Unification). More often than not, I thought the IRG didn't go
far enough in Han Unification because some characters appear,
to my eyes, to need to be unified (perhaps the source separation
rule kept them distinct).

   Jungshik Shin




Re: Encode: CJK-Guide

2002-03-26 Thread Jungshik Shin

On Wed, 27 Mar 2002, Jarkko Hietaniemi wrote:

BTW, I don't find any reference to Microsoft code pages
  (CP949 for Korean, CP950, CP 936 , and CP932), JOHAB(Korean), and 
  Big5-HKSCS Is that because they're not yet supported (well, Shift-JIS 
  and Big5 are supported)? 
 
 AFAIK, they're not yet supported, since we have not had Korean
 expertise.

  Well, CJKV Information Processing by Ken Lunde provides
more than enough information to support JOHAB and CP949/UHC/X-Windows-949 :-).
In addition to that, there are existing implementations: glibc, libiconv,
Mozilla and so forth. I'm not blaming anyone here for the lack of
support for Johab and CP949 (that's the last thing I'd do). Anyway,
I'll try to help you with Korean encodings and other CJK encodings if
necessary.

  For Johab, no new table is necessary because Hangul precomposed
syllable mapping (to Unicode) is algorithmic while Hanjas and symbols can 
be mapped to KS X 1001 algorithmically and then mapped to Unicode
using KS X 1001 mapping table. 
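
The 'algorithmic' part is just the usual arithmetic on the precomposed
syllable block. A sketch of the Unicode side of it (the Johab-specific
packing of the three indices into 5-bit fields per KS C 5861 is not shown):

  use strict;
  use warnings;

  # Decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into its
  # leading consonant / vowel / trailing consonant indices.
  sub hangul_lvt {
      my ($cp) = @_;
      my $s = $cp - 0xAC00;
      die "not a precomposed Hangul syllable\n" if $s < 0 || $s > 11171;
      my $l = int($s / 588);          # 19 leading consonants
      my $v = int(($s % 588) / 28);   # 21 vowels
      my $t = $s % 28;                # 27 trailing consonants + "none"
      return ($l, $v, $t);
  }

  my ($l, $v, $t) = hangul_lvt(0xD55C);   # U+D55C
  print "L=$l V=$v T=$t\n";               # L=18 V=0 T=4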

  BTW, how about Big5-HKSCS (Hong Kong), GBK, and GB18030 (PRC)?

  Jungshik Shin




Re: Encoding vs Charset

2002-03-26 Thread Jungshik Shin

On Tue, 26 Mar 2002, Jungshik Shin wrote:

  really means euc-cn and charset=ks_c_5601-1987 really means euc-kr.
  Sadly this misconception is embedded in popular browsers.

 M$ OE and M$ Frontpage keep producing such HTML docs. However,
 it also has to be noted that the encoding
 designated as 'ks_c_5601-1987' by M$ is NOT the same as
 EUC-KR BUT their proprietary extension of EUC-KR, namely
 CP949/UHC/(X-)Windows-949.

  Therefore, I'd like to suggest (or rather do) the following for Korean encodings:

  - Add an X-Windows-949 converter
  - Make 'ks_c_5601-1987', 'X-UHC', 'UHC',
and 'CP949' aliases to 'X-Windows-949'
  - Add a JOHAB converter
  - Remove 'ksc5601' aliased to 'euc-kr'.

Since there is some existing data in X-Windows-949 mislabeled
as EUC-KR, it might be necessary to make the 'euc-kr' -> Unicode
converter generous and have it act as an 'X-Windows-949' -> Unicode
converter (whether or not this is desirable and necessary depends
on what applications Encode may be used for).
However, in the other direction (Unicode -> euc-kr)
it has to be strictly compliant with the standard.
See http://bugzilla.mozilla.org/show_bug.cgi?id=131388
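
In other words, something along these lines; a rough sketch of the suggested
policy, not a description of what Encode does today:

  use Encode qw(decode encode);

  # Decoding: be generous; if the octets don't survive a strict euc-kr
  # decode, fall back to the cp949/UHC superset (mislabeled data).
  sub decode_euckr_lenient {
      my ($octets) = @_;
      my $text = eval { decode('euc-kr', $octets, Encode::FB_CROAK) };
      return defined $text ? $text : decode('cp949', $octets, Encode::FB_CROAK);
  }

  # Encoding: be strict; die rather than emit anything outside EUC-KR proper.
  sub encode_euckr_strict {
      my ($text) = @_;
      return encode('euc-kr', $text, Encode::FB_CROAK);
  }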

  Jungshik Shin





Re: GB2312 and EUC-CN : IANA registry

2002-03-26 Thread Jungshik Shin

On Wed, 27 Mar 2002, Anton Tagunov wrote:

 Hi, Anton,

 Very glad to hear you on this list :-)

  Me, too :-)

  When you say gb2312 and ksc5601, EUC-based encoding is assumed.
 
 JS   Please, don't help spread this misuse.

 Well, that was not meant to be applied to GB2312 :-). Below
is a more extensive excerpt from where I wrote that sentence:

JS  Please, don't help spread this misuse. It might be all right
JS for the (ignorant) public to say KS C 5601 in place of EUC-KR, but Perl 
JS programmers should learn the difference between KS C 5601/KS X 1001 (coded 
JS character set) and encoding/MIME charset/character set encoding scheme/
JS character coding. 

JS   As I wrote before, GB 2312 has been so widely (mis)used that there's
JS no way to replace it with EUC-CN. The Korean situation is much better
JS although not as good as the Japanese case.

  It could have been misunderstood.


 Jungshik, one little point on GB2312.. Maybe I misunderstand
 something, but

  No, you're absolutely right about IANA. See below.


 IANA registry (http://www.iana.org/assignments/character-sets)
 has
 
 Name: GB2312  (preferred MIME name)
 MIBenum: 2025
 Source: Chinese for People's Republic of China (PRC) mixed one byte, 
 two byte set: 
   20-7E = one byte ASCII 
   A1-FE = two byte PRC Kanji 
 See GB 2312-80 
 PCL Symbol Set Id: 18C
 Alias: csGB2312
 
 I do not know when that was put in, but it looks like EUC-CN. Is it?
 And if yes, then GB2312 is a perfectly valid charset, isn't it?

  Yes, it's EUC-CN. I was about to add that although
EUC-CN is a better name than GB2312, the former has never been registered
with IANA while the latter was, as the 'preferred MIME name'. You got there
first :-).  It's unfortunate that the PRC decided to do it this way, but that's
what we've got and I think we have to respect their decision.

 And thank you for explaining how it happened that Koreans
 misuse the name of a CCS for a charset :-)

  You're welcome :-)

Actually, I told you only half the story :-). The other half happened
before the widespread use of the Internet in Korea (i.e. the late 1980's and
early 1990's), when people typically referred to what's now called EUC-KR
as 'KS C 5601 Wansung' (= US-ASCII in GL and KS C 5601 in GR). It was
not technically correct, but didn't do much harm because there was no
need to exchange data over the Internet. EUC (Extended Unix Code;
it's not Extended Unix Character) for Korean was first specified in KS
C 5861-1992 (now KS X 2901), but the name EUC-KR first appeared in RFC
1557, where ISO-2022-KR was defined. It would have been better if RFC
1557 had been more explicit in its description of EUC-KR so that the IANA
entry for EUC-KR would be patterned after that for EUC-JP (and GB2312/EUC-CN),
with all the code sets and their octet ranges. Perhaps they
thought just referring to KS C 5861-1992 was sufficient.
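
Incidentally, that 'US-ASCII in GL and KS C 5601 in GR' structure is easy to
see with Encode; a toy snippet of mine:

  use Encode qw(decode);

  my $octets = "abc \xB0\xA1";          # ASCII in GL plus one KS X 1001 character in GR
  my $text   = decode('euc-kr', $octets);
  printf "decoded %d characters; the last one is U+%04X\n",
         length($text), ord substr($text, -1);    # 5 characters, U+AC00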

--
Name: EUC-KR  (preferred MIME name) [RFC1557,Choi]
MIBenum: 38
Source: RFC-1557 (see also KS_C_5861-1992)
Alias: csEUCKR

--
Name: Extended_UNIX_Code_Packed_Format_for_Japanese
MIBenum: 18
Source: Standardized by OSF, UNIX International, and UNIX Systems
Laboratories Pacific.  Uses ISO 2022 rules to select
   code set 0: US-ASCII (a single 7-bit byte set)
   code set 1: JIS X0208-1990 (a double 8-bit byte set)
   restricted to A0-FF in both bytes
   code set 2: Half Width Katakana (a single 7-bit byte set)
   requiring SS2 as the character prefix
   code set 3: JIS X0212-1990 (a double 7-bit byte set)
   restricted to A0-FF in both bytes
   requiring SS3 as the character prefix
Alias: csEUCPkdFmtJapanese
Alias: EUC-JP  (preferred MIME name)
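
As a concrete instance of the code set 2 rule spelled out above (my own
snippet, using byte values from JIS X 0201): the SS2 byte 0x8E in front of a
JIS X 0201 kana byte yields the corresponding halfwidth katakana.

  use Encode qw(decode);

  my $octets = "\x8E\xB1";                   # SS2 + JIS X 0201 0xB1
  my $char   = decode('euc-jp', $octets);
  printf "U+%04X\n", ord $char;              # U+FF71, HALFWIDTH KATAKANA LETTER A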



  Jungshik Shin




Re: Encode: CJK-Guide

2002-03-26 Thread Jungshik Shin

On Wed, 27 Mar 2002, Jarkko Hietaniemi wrote:

  Mozilla and so forth. I'm not blaming any one here for the lack of
  support for Johab and CP949. (that's the last thing I'd do). Anyway, 
  I'll try to help you with Korean encodings and other CJK encodings if 
  necessary. 
 
 Excellent, thanks.  You may download the latest Perl developer snapshot
 (which contains the latest Encode, 0.99) from:
 
   http:[EMAIL PROTECTED]
 
 and look at the documentation under perl/ext/Encode/

  I've looked around ext/Encode and I found that CP949 is supported.
So, what has to be added is JOHAB, and what needs to be modified
is EUC-KR, to support the 8-byte sequence representation of Hangul syllables
(see http://jshin.net/i18n/euckr2.html or
http://bugzilla.mozilla.org/show_bug.cgi?id=128587).

For Johab, no new table is necessary because Hangul precomposed
  syllable mapping (to Unicode) is algorithmic while Hanjas and symbols can 
  be mapped to KS X 1001 algorithmically and then mapped to Unicode
  using KS X 1001 mapping table. 

 Before going further, I have a question or two. It appears that
euc-kr, ksc5601-raw (ksc5601-gl or whatever) and cp949 have their own
mapping tables although they're closely related. Is there any reason
for this? In the case of Johab, the easiest way to add support for it is to
just generate the mapping table for it, but I feel uncomfortable bloating
the code when it can be done algorithmically if I can make use of the
mapping table for euc-kr or ksc5601(-raw). It appears that euc-jp and
shift_jis don't share a mapping table either, although shift_jis and
euc-jp can be more or less algorithmically converted to/from each other.
I must be missing something here. There should be a way to do it and
I'd be glad if someone could tell me where to look for an example case
(e.g. shift_jis and euc-jp).
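
For what it's worth, the euc-kr/ksc5601-raw pair really is only a bit flip
apart (for the double-byte part; ASCII passes through unchanged), which is
the kind of reuse I had in mind. A sketch, without having looked at how
Encode's tables are generated:

  use Encode qw(encode);

  my $syllable = "\x{AC00}";                            # first precomposed Hangul syllable
  my $raw      = encode('ksc5601-raw', $syllable);      # GL bytes
  my $euc      = join '', map { chr(ord($_) | 0x80) } split //, $raw;   # set the high bits

  printf "ksc5601-raw : %s\n", unpack 'H*', $raw;       # 3021
  printf "euc-kr      : %s\n", unpack 'H*', $euc;       # b0a1
  print  "matches Encode's euc-kr? ",
         $euc eq encode('euc-kr', $syllable) ? "yes\n" : "no\n";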


BTW, how about Big5-HKSCS(Hongkong), GBK, and GB18030(PRC)?
 
 I *think* (but me speekee no Chineese) we do support those in Encode,
 but for space considerations one has to install an additional module,
 Encode::HanExtra.

  I found that Big5-HKSCS is included in 'plain Encode' and GBK, GB18030,
EUC-TW, and Big5plus are in HanExtra.

   Jungshik Shin