Re: Always setting UTF-8 flag - am I bad?
On Thu, 5 Aug 2004, Nick Ing-Simmons wrote: Alright, I failed to say that this is an XS module, so I convert with WideCharToMultiByte, a Windows routine(*), put the result in an SV, and then say SvUTF8_on. The possible danger here is if the multibyte encoding for the user's environment is not UTF-8 but (say) a Japanese one. Almost always (99.999% of the time, unless SetACP() or something similar is used to change it), the default system code page on Windows is not UTF-8 (Windows-1252 on Western European Windows, Windows-1251 on Russian Windows, Windows-932/936/949/950 on East Asian Windows, etc.). However, you can specify the code page for the 'multibyte' encoding to use when invoking WideCharToMultiByte (i.e. WideCharToMultiByte is different from wcstombs() on a POSIX system). See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp Jungshik
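[Editor's sketch, in pure Perl rather than XS: if the UTF-16LE buffer that came back from a Windows 'W' API is converted with Encode instead of WideCharToMultiByte(CP_ACP, ...) plus SvUTF8_on, the UTF8 flag ends up set only on data that really is UTF-8 internally. The byte string below is a made-up stand-in for whatever the API returned.]

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Pretend this buffer came back from a Windows 'W' API:
# U+3042 U+3044 (hiragana "a i") as UTF-16LE bytes.
my $utf16le = "\x42\x30\x44\x30";

# Decoding with Encode yields a character string whose UTF8 flag matches
# its contents -- no need for SvUTF8_on, and no risk of flagging a
# legacy-code-page byte string as UTF-8.
my $chars = decode('UTF-16LE', $utf16le);
printf "U+%04X U+%04X\n", map { ord } split //, $chars;

# Equivalent of calling WideCharToMultiByte with CP_UTF8 explicitly:
my $utf8_bytes = encode('UTF-8', $chars);
printf "%d UTF-8 bytes\n", length $utf8_bytes;
```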
Re: Unicode filenames on Windows with Perl >= 5.8.2
Jan Dubois wrote: On Mon, 21 Jun 2004, Steve Hay wrote: I must confess that 2 doesn't really bother me since the 9x type systems are now a thing of the past (XP onwards are all NT type systems, even XP Home Edition). While I also wish that Win 9x would just cease to exist, I don't think any core Perl patches would be accepted if they would render Perl inoperable on those systems. You would have to provide at least a fallback solution, even if it means creating separate binaries for 9x and NT Windows systems. JFYI, if using MSLU is problematic (for some reason), we may consider http://libunicows.sourceforge.net/ It's released under the MIT license. Jungshik
Re: AL32UTF8
Tim Bunce wrote: On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote: IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of Unicode) because they were storing higher plane codes using the surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in 2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single char of 4+ bytes. There is no real trouble doing it that way since anyone can convert between the 'wrong' UTF-8 and the correct form. But they found that if you do a simple binary-based sort of a string in AL32UTF8 and compare it to a sort in true UTF-8, you end up with a subtly different order. On this basis they made a request to the UTC to have AL32UTF8 added as a kludge, and out of the kindness of their hearts the UTC agreed, thus saving Oracle from a whole heap of work. But all are agreed that UTF-8 and not AL32UTF8 is the way forward. Um, now you've confused me. The Oracle docs say: In AL32UTF8, one supplementary character is represented in one code point, totalling four bytes. which you say is the correct UTF-8 way. So the old Oracle 'UTF8' charset is what's now called CESU-8, and what Oracle calls 'AL32UTF8' is the correct UTF-8 way. So did you mean CESU-8 when you said AL32UTF8? I guess so. Thank you for reminding me of this. I used to know that, but forgot it and was about to write my colleague to use 'UTF8' (instead of 'AL32UTF8') when she creates a database with Oracle for our project. Oracle is notorious for using 'incorrect' and confusing character encoding names. Their 'AL32UTF8' is the true and only UTF-8, while __their__ 'UTF8' is CESU-8 (a beast that MUST be confined within Oracle and MUST NOT be leaked out to the world at large. Needless to say, it'd be even better had it never been born.)
Oracle has no excuse whatsoever for failing to get their 'UTF8' right in the first place, because Unicode had been extended beyond the BMP a long time before they introduced UTF8 into their product(s) (let alone the fact that ISO 10646 had non-BMP planes from the very beginning in the 1980s and that UTF-8 was devised to cover the full set of ISO 10646). However, they failed, and in their 'UTF8' a single character beyond the BMP was (and still is) encoded as a pair of 3-byte representations of surrogate code points. Apparently for the sake of backward compatibility (I wonder how many instances of Oracle databases existed with non-BMP characters stored in their 'UTF8' when they decided to follow this route), they decided to keep the designation 'UTF8' for CESU-8 and came up with a new designation, 'AL32UTF8', for the true and only UTF-8. Jungshik
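[Editor's sketch: the difference is easy to see in a few lines of Perl. Encode ships no stock CESU-8 encoding, so the CESU-8 form is built by hand here from the UTF-16 surrogate pair, which is exactly what the old Oracle 'UTF8' stores; U+10400 is just an arbitrary supplementary-plane example.]

```perl
use strict;
use warnings;
use Encode qw(encode);

my $cp   = 0x10400;                    # a supplementary character
my $utf8 = encode('UTF-8', chr $cp);   # correct UTF-8: F0 90 90 80

# CESU-8: split the code point into a UTF-16 surrogate pair, then give
# each surrogate its own 3-byte, UTF-8-style encoding.
my $hi = 0xD800 + (($cp - 0x10000) >> 10);    # 0xD801
my $lo = 0xDC00 + (($cp - 0x10000) & 0x3FF);  # 0xDC00
my $cesu8 = join '', map {
    pack 'C3', 0xE0 | ($_ >> 12), 0x80 | (($_ >> 6) & 0x3F), 0x80 | ($_ & 0x3F);
} $hi, $lo;                                   # ED A0 81 ED B0 80

printf "UTF-8  (%d bytes): %s\n", length $utf8,
    join ' ', map { sprintf '%02X', ord } split //, $utf8;
printf "CESU-8 (%d bytes): %s\n", length $cesu8,
    join ' ', map { sprintf '%02X', ord } split //, $cesu8;
```

A byte-wise sort of strings in the two forms can produce different orders, which is the anomaly that sent Oracle to the UTC in the first place.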
Re: Status of -C
Paul Hoffman wrote: Er, never mind. I found that I was doing something quite silly with the -C. All is OK, and it is now causing STDIN to be UTF8ish. Would you mind sharing your experience? That way, others will be able to avoid repeating your mistake. Jungshik
Re: perlunicode comment - when Unicode does not happen
On Tue, 23 Dec 2003, Nick Ing-Simmons wrote: Ed Batutis [EMAIL PROTECTED] writes: I don't think we understand common practice (or that such practices are even established yet) well enough to specify that yet. Common practice is that file names on 'local disks' are assumed to be in the character encoding of the current locale. Of course, this assumption doesn't always hold and can break things with networked file systems and all sorts of different file systems, but what could Perl do about it other than offer some options/flexibility to let users do what they want? Perl users are supposed to be 'consenting adults' (maybe not in terms of physical age for some young users), so given a set of options, they have to pick the one most suitable for a given task. Because we don't know how, because the common practice isn't established. As I wrote, it was established well before Unicode came onto the scene. It has little to do with UTF-8 or Unicode. If we just fix it now the behaviour will be tied down and when the common practice is established we will not be able to support it. Let's not 'fix' it (i.e. not carve it in stone), but offer a few well-thought-out options. For instance, Perl might offer (not that these are particularly well thought out) 'just treat this as a sequence of octets', 'locale', and 'unicode'. 'locale' on Unix means the multibyte encoding returned by nl_langinfo(CODESET) or equivalent. On Windows, it's whatever the 'A' APIs accept or is returned by ACP_??(). 'unicode' is utf8 on Unix-like OSes and BeOS, and 'utf-16(le)' on Windows. When _I_ want Unicode named things on Linux I just put file names in UTF-8. In that case, you're mixing two encodings on your file system by creating files with UTF-8 names while still using the en_GB.ISO-8859-1 locale. Why does Perl have to be held responsible for your intentional act that is bound to break things?
Because I don't want to be restricted by the character repertoire of legacy encodings, I switched over to a UTF-8 locale almost two years ago. Suits me fine, but is not going to mesh with my locale setting because I am going to leave that as en_GB, otherwise piles of legacy C apps get ill. Well, things are changing rapidly on that front. Now when I have samba-mounted a WinXP file system that is wrong, same for Well, actually, if your WinXP file system has only characters covered by Windows-1252, you can use 'codepage=cp1252' and 'iocharset=iso8859-1' for smbmount/mount. Obviously, there's a problem because iso8859-1 is a subset of Windows-1252. If you used en_GB.UTF-8 on Linux, there would be no such problem, because you could use 'codepage=cp1252' and 'iocharset=utf8'. CDROMs most likely. This mess will converge some more - I can already see that happening. UDF is the way to go for CD-ROMs/DVD-ROMs. _My_ gut feeling is that on Linux at least the way forward is to pass the UTF-8 string through -d - and indeed possibly upgrade to UTF-8 if the string has high-bit octets. But you seem to be making the case that UTF-8 should be converted to some local multi-byte encoding - which is the common practice ? That's because there are a lot of people like you who still use en_GB (ja_JP.eucJP, de_DE.iso8859-1, etc.) instead of en_GB.UTF-8 (ja_JP.UTF-8, de_DE.UTF-8) :-) On Linux, the number is dwindling, but on Solaris and other Unixes (not that they don't support UTF-8 locales, but most system admins don't bother to install the necessary locales and support files), it's not decreasing as fast. Jungshik
Re: perlunicode comment - when Unicode does not happen
On Thu, 25 Dec 2003, Jarkko Hietaniemi wrote: locale. Why does Perl have to be held responsible for your intentional act that is bound to break things? Whoa! It's the other way round here. Nick is using a locale that suits him for other reasons (e.g. getting time and date formats in proper British ways), but why should he be constrained not to use for his filenames whatever he wants? Then, he should switch to en_GB.UTF-8. Besides, he implied that he still uses ISO-8859-1 for files whose names can be covered by ISO-8859-1, which is why I wrote about mixing up two encodings in a single file system _under_ his control. Moreover, why would you think that the en_GB.UTF-8 locale gives him time and date formats NOT suitable for him? You're making the mistake of binding locale and encoding. Encoding should never be a part of the locale definition. The fact that it is on Unix is just an artifact of the Unix file system, and we want to leave it behind us if possible. Of course, we have to live with that for a long while to come, unfortunately. Well, actually, if your WinXP file system has only characters covered by Windows-1252, And how would Nick know that, or how could he guarantee that, if the Windows share is in multiuser use? Of course, he can't. That's why I wrote 'if'. PLEASE, PEOPLE: stop thinking of this in terms of an environment controlled solely by one user. Before writing that, please read the man pages of 'smbmount' and 'mount' if a Linux system is available to you. They're not environment variables. Jungshik
Re: perlunicode comment - when Unicode does not happen
On Thu, 25 Dec 2003, Jarkko Hietaniemi wrote: What I wish is that the whole current locale system would curl up and die. As you'd agree, it's only the 'encoding' part that has to die. Everybody should switch to UTF-8 on Unix, and end-users should never worry about 'encoding'. In an ideal world, 'encoding' would never be a part of 'locale'. We're getting there, although very slowly. nl_langinfo(CODESET) is rather well supported where it's available (i.e. SUS-compliant modern Unix platforms). That's not good enough for Perl. Perl must also deal with non-SUS-compliant older UNIX or UNIX-like platforms. Sure, I'm well aware of that. Otherwise, I'd not have gone on to mention gnulib and such. Jungshik
Re: perlunicode comment - when Unicode does not happen
On Thu, 25 Dec 2003, Jungshik Shin wrote: locale definition. The fact that it is on Unix is just an artifact of Unix file system and we want to leave it behind us if possible. Of course, Of course, it's rather a whole lot of different things that bind locale and encoding on Unix, from which we want to get away asap.
Re: perlunicode comment - when Unicode does not happen
On Thu, 25 Dec 2003, Jarkko Hietaniemi wrote: What I wish is that the whole current locale system would curl up and die. As you'd agree, it's only the 'encoding' part that has to die. Oh no, there are plenty of parts in it that I wish would die :-) Wishing it to die is different from finding a lot of defects that you want to fix, isn't it? Sure, there are a lot of things that can be done better. For quite a lot of them (not all of them) ICU offers solutions. [list of things to fix snipped] Everybody should switch to UTF-8 on Unix Yes. UTF-8 and NFD, I would say. As much as I like NFD (well, I'd like it even better if Korean NFD hadn't been permanently broken between Unicode 2.x and 3.0), I don't think people will ever agree on the NFD part. Jungshik
Re: perlunicode comment - when Unicode does not happen
On Thu, 25 Dec 2003, Jarkko Hietaniemi wrote: Whoa! It's the other way round here. Nick is using a locale that suits him for other reasons (e.g. getting time and date formats in proper British ways), but why should he be constrained not to use for his filenames whatever he wants? Then, he should switch to en_GB.UTF-8. That will work if en_GB.UTF-8 is available on his particular Unixes, and assuming that using UTF-8 locales won't break other things. IIRC, he explicitly mentioned 'Linux' in his message. Besides, Solaris, Compaq Tru64, AIX, and HP/UX [1] have all supported UTF-8 locales for a 'long' time (some of them far, far longer than Linux/glibc has). In the past, not all the locales came free, but these days they all come at no extra charge, so whether they're available depends on the 'will'/'policy' of the system administrators. Sure, there are a number of other Unixes, old and new, and many old ones don't support UTF-8 locales. I do want to respect people's wish to make UTF-8 files on their file systems even if their version of Unix doesn't support UTF-8 locales. Otherwise, I wouldn't have come up with a set of 'options' Perl can offer to them. However, people doing so should be aware that there's a price to pay. For instance, in their shell, file names would not be shown correctly (i.e. 'ls' would show them garbled characters), and they can't use the usual set of Unix tools (e.g. 'find' wouldn't work as intended). ISO-8859-1, which is why I wrote about mixing up two encodings in a single file system _under_ his control. I think we are here talking past each other :-) I'm assuming that not all file systems (like Samba mounts) are necessarily under his control, you are assuming they Well, I think that's a different story. He explicitly wrote why he still uses en_GB.ISO-8859-1 (like some old programs breaking under a UTF-8 locale). Moreover, why would you think that en_GB.UTF-8 locale gives him the time and date format NOT suitable for him?
I'm not thinking that. What I think his point is, is that plain en_GB.iso88591 is _enough_ for him to get time/date formats etc. working right, but en_GB.UTF-8 brings in _too much_ (such as some programs not yet being UTF-8 aware enough, What you had in parentheses was what he wrote in his original message, but what you wrote didn't sound like that to me. At least, you took a bad example with the time/date format. or him wanting to use iso8859-1 file names in some directories, but in some directories not). Yes, that's what I meant. He made a conscious decision to mix up two encodings (read his message: 'If I want Unicode characters in file names, I'd just use UTF-8', or something like that), for which he has to pay whatever price there is to pay. If Perl offers a set of options as I outlined in my previous message, he has to be careful when opening files in different directories: for some directories he has to use one option, while for others he has to use another. You're making a mistake of binding locale and encoding. I'm not -- many UNIX vendors do, and I have to live with that fact. If Linux and glibc are doing the Right Thing, that's marvelous, but not all the world is Linux and glibc. I never implied that, let alone said that. (I always prefer to say Unix in place of Linux. To me, Linux is just one of many Unixes.) And, please check out recent commercial Unixes. They DO offer UTF-8 locales, as I wrote above (Solaris and AIX had offered solid UTF-8 locales years before Linux/glibc did - actually, back when Linux/glibc 1.x had almost __zero__ locale support, UTF-8 or not). Whether they're installed by the system admin is a different story. Anyway, exactly because of the unavailability of UTF-8 locales for whatever reason, we've been discussing this issue (to convert Perl's internal Unicode to and from the 'native' encoding in file I/O). The fact that it is on Unix is just an artifact of Unix file system Not quite. UNIX doesn't care.
In traditional UNIX, filenames are just bytes. You're absolutely right. I didn't mean to say 'file system' there, as I corrected in my subsequent email. PLEASE, PEOPLE: stop thinking of this in terms of an environment controlled solely by one user. Before writing that, please read the man pages of 'smbmount' and 'mount' if a Linux system is available to you. They're not environment variables. Please read my sentence again to see that I had no variable in it :-) Just environment. OK. Sorry for misreading it. Anyway, Perl can't help resolve that problem. It can only offer a set of flexible options (as I listed a few messages ago) that help people solve the problem for themselves. Jungshik [1] SGI Irix seems to lag behind in this area. FreeBSD was slow, but seems to have caught up recently.
Re: perlunicode comment - when Unicode does not happen
On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote: I don't see how introducing a new LC_* would help here. Whether Limit the mess of CTYPE controlling Yet Another Feature. I don't think it's yet another feature. It's one of the features that are commonly assigned to it. Well, I guess you'd ask how 'commonly'... Anyway, introducing a new env. variable is not a solution to the mess. By doing so, you just add another problem, because the new variable would be meaningful only to Perl, at least at the beginning. it's LC_CTYPE or LC_FILENAME, the problem is still there. to and from the codeset returned by 'nl_langinfo(CODESET)'. Don't get me started on how suckily and brokenly nl_langinfo() is supported across platforms :-) Well, CODESET may be on average better supported. May. nl_langinfo(CODESET) is rather well supported where it's available (i.e. SUS-compliant modern Unix platforms). The encoding/codeset name mess is another issue, though. If Perl could use gnulib (a collection of small code snippets that are meant to be included in the source code), 'nl_langinfo(CODESET)' could be emulated where it's not available. However, I guess it can't, because GPL/LGPL is not suitable for Perl according to you. Directly inspecting LC_CTYPE or other environment variables is a BAD idea I can optimize that for ya: s/Directly inspecting/Using/ :-) I intentionally used the phrase because 'nl_langinfo(CODESET)' is 'the' _indirect_ way to get to it (plus the resolution of the LC_*/LANG environment variable priority). Jungshik
Re: perlunicode comment - when Unicode does not happen
On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote: (AFAIK) W2K and later _are able_ to use UTF-16LE encoded Unicode for filenames, but because of backward compatibility reasons using 8-bit codepages is much more likely. No. _Both_ NTFS (only supported by Win 2k/XP) and VFAT (supported by Win 2k/XP and Win 9x/ME) use UTF-16LE **exclusively**. In that respect, (and that's probably well docum^Wpatented by Microsoft... :-) Well, the _internals_ of NTFS and VFAT are not well documented (and are probably patented as well), so NTFS developers for the Linux kernel have had to reverse-engineer them. However, the APIs for 'casually' accessing them (including the fact that they use 'Unicode', with 'Unicode' usually meaning UTF-16LE or at least UCS-2LE) are documented well enough afaik. (How about CIFS?) I believe it, too, uses UTF-16LE (or at least UCS-2). Samba developers will know that well. FYI, Mac OS X 10.3 (or 10.2) or later has APIs for the conversion between NFC and NFD. I'm not worried about the various Unicode APIs being available. I just mentioned it because even on Mac OS X, you have to do things differently before 10.2 and after 10.2. After 10.2(?), you can rely on OS APIs, while before that you have to roll your own. Jungshik
Re: perlunicode comment - when Unicode does not happen
On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote: It works because it relies on iconv(3) to convert between the current locale codeset and UTF-16 (used internally by Mozilla) if/wherever possible. 'wc*to*mb/mb*to*wc' is used only where iconv(3) is not available. Anyway, yes, that's possible. Note that I'm not *opposed* to someone fixing e.g. Win32 to be able to access Unicode names in NTFS/VFAT. What I'm opposed to is anyone thinking there are (a) easy (b) portable solutions. We are always talking of very OS- and FS-specific solutions. OK. I'm sorry if I misunderstood you. You're absolutely right that we're talking about very OS/FS-dependent issues. Win32 and Mac OS X are probably the best off. For (other) UNIXy systems, I don't know. I guess BeOS is in the same league as Win2k/XP [1] and Mac OS X. There, everything should be in UTF-8. If one is happy with just using UTF-8 filenames, Perl 5.8 already can work fine. If one I wish everybody were :-) on Unix. Fortunately, UTF-8 seems to be catching on, judging from the 'emergence' of two 'file system conversion' tools. See, for instance, http://osx.freshmeat.net/releases/144059/. If a user mixes multiple encodings/code sets in her/his file system, that's not Perl's problem but her/his problem, so I don't think that's a valid reason for not doing something reasonable. wants to use locales and especially some non 8-bit locales, well, Perl currently most definitely does not switch its filename encoding based on locales. Personally I think that's a daft idea... at least without a new specific (say) LC_FILENAME control -- overloading the poor LC_CTYPE sounds dangerous. I don't see how introducing a new LC_* would help here. Whether it's LC_CTYPE or LC_FILENAME, the problem is still there. Perhaps we need a pragma to indicate which of the following is to be assumed for the file system character encoding: 'locale', 'native', 'unicode', or 'user-specified'.
On Unix, 'locale' and 'native' would be identical, both meaning that Perl should convert its internal Unicode to and from the codeset returned by 'nl_langinfo(CODESET)'. Directly inspecting LC_CTYPE or other environment variables is a BAD idea and should be used as a fallback only where nl_langinfo(CODESET) is not supported. When converting to and from the 'native' encoding, Perl should rely on the iconv(3) available on the system instead of its internal 'encoding' converter. However, there's a problem here: a lot of system admins on commercial Unixes install only the minimal set of iconv(3) modules. See http://bugzilla.mozilla.org/show_bug.cgi?id=202747#c18. Therefore, perhaps we should first try iconv(3) and then fall back to using Perl's 'encoding'. There are other problems when using iconv(3) (e.g. http://bugzilla.mozilla.org/show_bug.cgi?id=197051). 'unicode' on Unix means 'utf8'. 'user-specified' means whatever a user wants to use. On Windows, 'locale' means using the code page of the current system locale. 'native' is UTF-16LE (but on Win 9x/ME, the character repertoire would be limited to that of the system codepage). The same is true of 'unicode'. On Mac OS X, 'locale', 'native' and 'unicode' would all mean the same thing (UTF-8). As for 'normalization', I have to think more about it. And so on... I've been just thinking aloud, so you have to bear with some incoherency. Jungshik
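[Editor's sketch of what the 'locale' option above might look like from Perl-space, using the core I18N::Langinfo binding of nl_langinfo(CODESET). The helper name and the Latin-1 fallback are assumptions for illustration, not anything Perl provides; a real implementation would fall back to inspecting LC_*/LANG, and would prefer the system iconv(3) as discussed above.]

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: pick the encoding object for the 'locale' option.
sub locale_encoding {
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };
    # Fallback assumption: treat a missing nl_langinfo, or a codeset name
    # Encode doesn't recognize, as Latin-1.
    return find_encoding($codeset || '') || find_encoding('iso-8859-1');
}

my $enc = locale_encoding();
print "filename encoding: ", $enc->name, "\n";

# Filenames read from the OS would then be decoded like this:
opendir my $dh, '.' or die $!;
my @names = map { $enc->decode($_) } readdir $dh;
closedir $dh;
print scalar(@names), " entries decoded\n";
```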
Re: perlunicode comment - when Unicode does not happen
On Tue, 23 Dec 2003, Nick Ing-Simmons wrote: Jungshik Shin [EMAIL PROTECTED] writes: On Mon, 22 Dec 2003, Jarkko Hietaniemi wrote: (AFAIK) W2K and later _are able_ to use UTF-16LE encoded Unicode for filenames, but because of backward compatibility reasons using 8-bit codepages is much more likely. No. _Both_ NTFS (only supported by Win 2k/XP) and VFAT (supported by Win 2k/XP and Win 9x/ME) use UTF-16LE **exclusively**. But those OSes also support older file systems (e.g. floppies), and shares where things are not as clear (at least to me). In the case of floppies (FAT), I guess we're just back to the old days :-) In the case of CIFS, I really have to check. Then Windows also supports NFS (although not for free) and other file sharing, and there things become fuzzy. In that respect, Windows filesystems are 'saner' than Unix file systems. APIs for accessing them come in two flavors, 'A' APIs and 'W' APIs, though, as I explained in another message of mine. In that message you mentioned a .dll - should perl look for and link to that DLL ? Actually, I mentioned three different possibilities. Only one of them relies on MSLU (Microsoft Layer for Unicode). If you do that, you just need a single binary that works across Win32 platforms. However, the presence of MSLU is required. The second strategy is to do what Mozilla does: 1. write a set of wrapper functions that emulate the Windows 'W' APIs, 2. detect the OS at run-time (Windows 9x/ME vs Windows 2k/XP), 3. call either the emulated versions of the 'W' APIs or the native 'W' APIs (I'm omitting details here, but you should get the idea). This is actually similar to what's done by MSLU, but you don't have to rely on MSLU. The final approach is to build two separate binaries, one for Win 9x/ME (with 'A' APIs) and the other for Win 2k/XP (with 'W' APIs). In all three cases, the character repertoire (that can be used for file names) on Win 9x/ME is limited to that of the system codepage. It may sound odd, because VFAT can cover the whole Unicode repertoire.
Don't ask me why, but that's the way Win 9x/ME works. That may explain why Jarkko got confused. If somebody hacked VFAT and wrote her own VFAT I/O functions, the full range of Unicode could be used even on Win 9x/ME. Jungshik
Re: perlunicode comment - when Unicode does not happen
On Mon, 22 Dec 2003, Ed Batutis wrote: Jarkko Hietaniemi [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] You do know that ... Yes. If wctomb or mbtowc are to be used, then Perl's Unicode must be converted either to the locale's wide char or to its multibyte. This isn't trivial, but Mozilla solved this same problem. It can portably work. (Are you listening, Brian Stell!) It wasn't easy for them, but they did it. You're probably talking about nsNativeCharsetUtils.cpp in Mozilla (http://lxr.mozilla.org/seamonkey/source/xpcom/io/nsNativeCharsetUtils.cpp). I'm familiar with that part because I made a few changes there in the last 6 months. Mozilla doesn't use wc*mb/mb*wc() because it can't possibly know _what_ 'wchar_t' actually is in the current locale. Note that 'wchar_t' is not only locale-dependent (i.e. a run-time dependency) on a single platform but also compiler-dependent. It works because it relies on iconv(3) to convert between the current locale codeset and UTF-16 (used internally by Mozilla) if/wherever possible. 'wc*to*mb/mb*to*wc' is used only where iconv(3) is not available. Anyway, yes, that's possible. If a user mixes multiple encodings/code sets in her/his file system, that's not Perl's problem but her/his problem, so I don't think that's a valid reason for not doing something reasonable. Imagine ... I don't have to imagine. But I think that where a Perl script opens its files is its own business. I don't see why Perl would have to do anything in that regard. Even if it did, I don't see that feature as blocking the simpler feature of just doing a conversion to/from multibyte before/after a system call. If I'm dealing with just Japanese on a Japanese system, that's all I need. Uhhh... from a Win32 API bug workaround you deduce that ... SJIS should work? Well, Win32 has an API to test whether a backslash is the second byte of a 'multibyte character'. That is, the code snippet given by Ed could have been written better with that API.
Here's my dilemma: utf-8 doesn't work as an argument to -d and neither does Shift-JIS (at least with certain Shift-JIS characters). Those are my only choices. So you are saying basically 'Shift-JIS be damned - write a module'? I hope you'll understand if I find it hard to sympathize with that Win32 is troublesome because it has two tiers of APIs, the code-page-dependent 'A' APIs and the Unicode-based 'W' APIs. If the 'W' APIs were guaranteed to be available everywhere (from Win95 to WinXP), Perl could just convert whatever legacy encodings into UTF-16LE and call the 'W' APIs. Actually, you don't have to call the 'W' APIs directly: just using the 'generic' APIs gets translated into 'W' APIs if a macro (whose name is escaping me) is defined at compile time. Now the question is whether the 'W' APIs are available on old Win95/98/ME. They're available if MS IE 5.x or later and/or a relatively new version of MS Word/Office is installed, because those come with the MSLU (Microsoft Layer for Unicode) dll. So, for the majority of cases, the above should work. However, there is a small number of cases where MSLU is not available on Win 9x/ME. In that case, you have to fall back to the 'A' APIs. Even with MSLU installed, on Win9x/ME you're limited to the character repertoire of the legacy code page (i.e. Shift_JIS on Japanese Windows, Windows-936 on Simplified Chinese Windows, Windows-1252 on Western European Windows). Therefore, a better approach might be to do the OS detection and use the 'A' APIs on Win 9x/ME and the 'W' APIs on Win 2k/XP. That's what Mozilla does. Unfortunately, this code is not yet deployed in the file I/O part of Mozilla, which is the cause of several bugs. (See http://bugzilla.mozilla.org/show_bug.cgi?id=162361) Still another approach is to build two separate binaries of Win32 Perl, one for Win 9x/ME and the other for Win 2k/XP. Jungshik
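[Editor's sketch: the Shift-JIS trouble Ed ran into is easy to reproduce, because some double-byte characters have 0x5C, the ASCII backslash, as their trail byte, so byte-wise path parsing sees a bogus directory separator. U+8868 is a commonly cited example. The Win32 API alluded to above is presumably IsDBCSLeadByte(); that name is my assumption, not stated in the thread.]

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{8868}";                  # CJK ideograph whose Shift_JIS
my $sjis = encode('shiftjis', $char);   # encoding is the bytes 0x95 0x5C

printf "Shift_JIS bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $sjis;

# Byte-wise code that splits paths on '\' would cut this character in half:
print "trail byte looks like a path separator\n"
    if substr($sjis, 1, 1) eq "\\";
```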
Re: perlunicode comment - when Unicode does not happen
On Mon, 22 Dec 2003, Jarkko Hietaniemi wrote: (AFAIK) W2K and later _are able_ to use UTF-16LE encoded Unicode for filenames, but because of backward compatibility reasons using 8-bit codepages is much more likely. No. _Both_ NTFS (only supported by Win 2k/XP) and VFAT (supported by Win 2k/XP and Win 9x/ME) use UTF-16LE **exclusively**. In that respect, Windows filesystems are 'saner' than Unix file systems. APIs for accessing them come in two flavors, 'A' APIs and 'W' APIs, though, as I explained in another message of mine. The Apple HFS handles Unicode using _normalized_ (NFC, IIRC) UTF-8. The Mac OS X file system uses not NFC (precomposed Unicode) but NFD (decomposed Unicode). There we have two different Unicode encodings, both in use. FYI, Mac OS X 10.3 (or 10.2) or later has APIs for the conversion between NFC and NFD. Jungshik
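[Editor's sketch: from the Perl side, the NFC/NFD difference is directly visible with the core Unicode::Normalize module. A filename stored in NFD form (as on the Mac OS X file system) will not compare equal to its NFC spelling until both are normalized the same way.]

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $nfc = "caf\x{E9}";   # 'cafe' with U+00E9 (precomposed e-acute)
my $nfd = NFD($nfc);     # 'cafe' + U+0301 COMBINING ACUTE ACCENT

printf "NFC length: %d, NFD length: %d\n", length $nfc, length $nfd;
print "not byte/char equal\n" unless $nfc eq $nfd;
print "equal after normalizing both to NFC\n" if NFC($nfd) eq $nfc;
```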
Re: Mixing Unicode and Byte output on a Unicode enabled Perl 5.8.0
On Thu, 9 Oct 2003, Frank Smith wrote: I am trying to use the (pound sterling) symbol in a script that produces both TEXT and HTML. The HTML handles the Unicode fine; all the browsers seem to work. However, once the text file arrives on the Windowz box the Unicode screws Excel. Can you help by suggesting a way to force a specific script to produce 'plain text' (that bit more than ASCII) or preferably to specifically output, via the IO layer, 'plain text' on specific occasions. Well, there's nothing that prevents you from using UTF-8 for *plain text*. I've got tens of thousands of UTF-8 plain text files and am making one now (because I'm gonna send this email in 'text/plain; charset=UTF-8'). Anyway, what you want is to get your output in Windows-1252 (or its subset ISO-8859-1) so that Excel, running under the English version of Windows 9x/ME or Windows 2k/XP with the default locale set to English, can read your text output. The man page of 'Encode' should help you (see the section Encoding via PerlIO). Alternatively, if you're on Win2k/XP (and don't care about Win9x/ME), you can prepend your UTF-8 plain text output with the UTF-8 BOM (that is, at the very beginning of your plain text output file, print out \x{feff}). With the UTF-8 BOM present, Win2k/XP should be able to detect that your plain text file is in UTF-8 instead of a legacy 'code page'. Jungshik
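[Editor's sketch of both suggestions; the file names are made up for the example.]

```perl
use strict;
use warnings;

my $text = "Total: \x{A3}42\n";   # U+00A3 POUND SIGN

# Option 1: write the Excel-bound file in Windows-1252 via a PerlIO layer.
open my $cp1252, '>:encoding(cp1252)', 'report-cp1252.txt' or die $!;
print {$cp1252} $text;
close $cp1252;

# Option 2: keep UTF-8, but lead with a BOM so Win2k/XP apps autodetect it.
open my $utf8, '>:encoding(UTF-8)', 'report-utf8.txt' or die $!;
print {$utf8} "\x{FEFF}", $text;
close $utf8;
```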
Re: Mixing Unicode and Byte output on a Unicode enabled Perl 5.8.0
On Thu, 9 Oct 2003, Guido Flohr wrote: BTW, Windows editors also insert that BOM at the beginning when writing XML files encoded in UTF-8. In other words: If you edit a UTF-8 XML file with Windows Notepad, it will be corrupted. MSIE and Mozilla (!) still treat it as well-formed XML, but a standards-compliant parser will of course reject it. Well, I am not fond of the UTF-8 BOM at all, but it's not a violation of the standard to prepend a UTF-8 BOM to an XML file in UTF-8 (see http://www.w3.org/TR/REC-xml#sec-guessing). Jungshik
Re: Quick question: viscii vs. iscii? NEVERMIND
On Mon, 2 Jun 2003, David Graff wrote: Does 5.8 have any conversion functionality for ISCII? If not, is anyone working on this (and is there a notion of when it may be ready)? Encode doesn't support ISCII yet (there may be a separate module for ISCII, though). I'm planning to work on it (see my message to the list sent on May 17th; we also need a TSCII converter), but you (or anyone else) are welcome to go ahead, because I'm not gonna do it very soon. Jungshik
Encode::_utf8_on and output
On Sat, 18 Jan 2003, Jarkko Hietaniemi wrote: Now Perl-5.8.1-to-be has been changed to (1) not do any implicit UTF-8-ification of any filehandles unless explicitly asked to do so (either by the -C command line switch or by setting the env var PERL_UTF8_LOCALE to a true value; the switch wins if both are present) (and if the locale settings do not indicate ...) Note that the above does not change the fact that if a *programmer* wants their code to be UTF-8 aware, they need to think about the evil binmode(). Recently, I came across something curious. From this thread, we all know that perl 5.8.0 does implicit 'UTF-8-ification' when it's run under a UTF-8 locale and perl 5.8.1 won't. The following script produces five output files. Under a UTF-8 locale and perl 5.8.0, default.out has c2 b0 c2 a1 c2 b0 c2 a2 ((U+AC00 U+AC01) in EUC-KR is 0xb0 0xa1 0xb0 0xa2) while bytes.out, binmod.out, encode.out and default2.out have b0 a1 b0 a2. What made me curious is default2.out. I'm wondering how setting the UTF8 flag on what's an invalid UTF-8 string ($output) with Encode::_utf8_on effectively made the output filehandle behave as if 'binmode' were set or the 'bytes' layer were used. Needless to say, I wouldn't rely on that, but am interested to know how this happens. Jungshik P.S. BTW, is there any way to specify 'CHECK' for the 'encoding' layer?

#!/usr/bin/perl -w
use Encode;
$input = "\x{ac00}\x{ac01}";
$output = encode("euc-kr", $input, Encode::FB_PERLQQ);
open $ofh, '>', 'default.out';  print $ofh $output; close $ofh;
open $ofh, '>:bytes', 'bytes.out';  print $ofh $output; close $ofh;
open $ofh, '>', 'binmod.out';  binmode($ofh); print $ofh $output; close $ofh;
open $ofh, '>', 'default2.out';  Encode::_utf8_on($output); print $ofh $output; close $ofh;
open $ofh, '>:encoding(euc-kr)', 'encode.out';  print $ofh $input; close $ofh;
---
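The bytes in default.out can be reproduced outside Perl. Sketched in Python (whose codecs cover the same encodings): perl 5.8.0 under a UTF-8 locale treated each EUC-KR byte of $output as a separate character U+00B0, U+00A1, ... and UTF-8-encoded it, which is exactly a latin-1-decode followed by a UTF-8 encode.

```python
# U+AC00 U+AC01 in EUC-KR, as in the script above.
euckr = "\uac00\uac01".encode("euc_kr")
assert euckr == b"\xb0\xa1\xb0\xa2"

# Implicit UTF-8-ification of the byte string: byte -> U+00xx -> UTF-8.
doubly = euckr.decode("latin-1").encode("utf-8")
assert doubly == b"\xc2\xb0\xc2\xa1\xc2\xb0\xc2\xa2"   # the default.out bytes
```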
Re: How to name CJK ideographs
On Sat, 26 Oct 2002, Dan Kogai wrote: On Saturday, Oct 26, 2002, at 03:55 Asia/Tokyo, Jungshik Shin wrote: Another possibility is a 'meaning-pronunciation' index. I believe this is one of a few ways to refer to CJK characters (say, over the phone) in all CJK countries. However, to do this, we need much more raw data (more or less like a small dictionary) than the UniHan DB provides, because it lists meanings of characters in English only. That's one thing I wish I could do -- Dan as in Bomb, because I can't go like YOU five ef three ee :) I know that's difficult, but it Until such a time as you can do that, or somebody with an infinite amount of free time volunteers :-), how about \N{life:sheng1} for zh and \N{life:saeng} for ko and so forth? Nothing fancy, but it uses what's available in the UniHan DB. Then, I came to wonder why, in this age of Unicode, we have to bother with '\N{...}' when we can just directly use 生 in perl. I know there are some cases where '\N{...}' is necessary and useful. Another question came up: do we really need a meaning-pronunciation index in native languages? If one can enter the meaning-pronunciation inside '\N{...}', there would be really no reason not to directly type the character in question. Therefore, '\N{...}' is kinda a fallback for those who can't enter CJK characters directly, and 'meaning-pronunciation' in English and Romanized form is all we need for '\N{}', isn't it? Just my two hundredths of € . Jungshik
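Why a meaning-pronunciation index is attractive can be seen from the formal Unicode names that \N{...} would otherwise fall back on. Illustrated in Python's unicodedata (the data is the same UCD): Hangul syllables get algorithmic, pronounceable names, but a CJK ideograph's name is just its code point restated.

```python
import unicodedata

# Hangul: the algorithmic name encodes the pronunciation, so it is usable.
assert unicodedata.name("\uac00") == "HANGUL SYLLABLE GA"

# 生 (U+751F, the character used in the post): the name carries no meaning
# or pronunciation at all, hence the wish for \N{life:sheng1}-style lookup.
assert unicodedata.name("\u751f") == "CJK UNIFIED IDEOGRAPH-751F"
assert unicodedata.lookup("CJK UNIFIED IDEOGRAPH-751F") == "\u751f"
```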
RFC 2231 (was Re: Encode::MIME::Header...)
On Mon, 7 Oct 2002, Dan Kogai wrote: As I said, Encode::MIME::Header has those restrictions; * the Encode API * RFC 2047 I'm not sure if Encode::MIME::Header is the best place to implement RFC 2231, because RFC 2231 encoding/decoding involves two parameters, 'MIME charset' and 'language'. RFC 2231 is used not only for email/news messages but also in HTTP headers. Implementing RFC 2231 in Encode::MIME::Header would help dynamically generated attachments (on the web) have a standards-compliant Content-Disposition header (RFC 2183). Currently, most C-D headers generated by CGI programs use either raw 8-bit characters in an unspecified encoding or RFC 2047 encoding for the value of the 'name' parameter of the C-D header. Neither of these behaviors is standards-compliant. Jungshik
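The two extra parameters are visible in the RFC 2231 wire format itself: a value is written as charset'language'percent-encoded-octets. A Python sketch of the format (the filename is a made-up example; Python's email.utils happens to implement this encoder):

```python
from email.utils import encode_rfc2231
from urllib.parse import unquote

# charset and language ride along with the value -- this is why it doesn't
# fit an API that only maps strings to octets and back.
encoded = encode_rfc2231("résumé.txt", charset="utf-8", language="en")
assert encoded == "utf-8'en'r%C3%A9sum%C3%A9.txt"

# Decoding splits the three fields and percent-decodes per the charset.
charset, language, value = encoded.split("'", 2)
assert (charset, language) == ("utf-8", "en")
assert unquote(value, encoding=charset) == "résumé.txt"
```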
Re: README.cjk?
On Tue, 7 May 2002, Dan Kogai wrote: Hi Dan, the pumpking is calling for (hopefully) the last chance to update README.cjk. On Tuesday, May 7, 2002, at 02:48 , Jarkko Hietaniemi wrote: Do I have the latest versions of the README.{cn,jp,ko,tw}? I do think so but I am calling for the last possible update anyhow. It seems my latest version was lost somewhere (sent around April 18th) :-). Here's another try. I took this chance to correct a couple of typos. Cheers, Jungshik If you read this file _as_is_, just ignore the funny characters you see. It is written in the POD format (see the perlpod manpage) which is specially designed to be readable as is. This file is in Korean encoded in EUC-KR. If you are reading this document directly rather than through perldoc, ignore the =head, =item, 'L' and similar markers used to indicate the role of each part; this document is written in POD format, which reads fine without perldoc. See the perlpod manual for details.

=head1 NAME

perlko - Perl and Korean encodings

=head1 DESCRIPTION

Welcome to the world of Perl! Since version 5.8.0, Perl has had extensive support for Unicode/ISO 10646. As part of that Unicode support, it also supports the numerous encodings that were used before Unicode, and are still widely used, in Korea, China, Japan and other countries. Unicode aims to accommodate the writing systems of every language in the world - the Latin, Cyrillic and Greek alphabets of Europe, the Brahmi-derived scripts of India and Southeast Asia, the Arabic and Hebrew scripts, the Han ideographs of China, Japan and Korea, Korean Hangul, Japanese Kana, the North American syllabics, and so on - so it contains not only every character usable in the character sets and encodings specific to each existing language, country and operating system, but also a great many characters those character sets never supported. Perl uses Unicode internally to represent characters. More concretely, you can use UTF-8 strings in Perl scripts, and the various functions and operators (for example regular expressions, index and substr) operate on Unicode characters rather than on bytes. (See the perlunicode manual for details.)
To help with input and output in the national/per-language encodings that were widely used before Unicode spread, and still are, and to help handle data and documents in those encodings, 'Encode' is provided. Above all, with 'Encode' you can easily convert between a large number of encodings. 'Encode' supports the following Korean encodings:

=over 4

=item euc-kr

A multibyte encoding that uses US-ASCII together with KS X 1001 (commonly called Wansung). See KS X 2901 and RFC 1557.

=item cp949

The extended Wansung used by MS-Windows 9x/ME. It is euc-kr plus 8,822 additional Hangul syllables. Aliases are uhc, windows-949, x-windows-949 and ks_c_5601-1987; the last is an improper name, but it is used in Microsoft products to mean CP949.

=item johab

The Johab encoding specified in Annex 3 of KS X 1001:1998. Its character repertoire is the same as cp949's - US-ASCII and KS X 1001 plus 8,822 Hangul syllables - but the encoding method is entirely different.

=item iso-2022-kr

The encoding for Korean Internet mail exchange specified in RFC 1557, with the same repertoire as euc-kr (US-ASCII and KS X 1001) but a different encoding method. It was used until around 1997-8, but is no longer used for mail exchange.

=item ksc5601-raw

KS X 1001 (KS C 5601) placed in GL (that is, with the MSB set to 0). It is almost never used standalone, without being combined with US-ASCII, except as an X11 font encoding (ksc5601.1987-0, where '0' means GL). KS C 5601 was renamed KS X 1001 in 1997; in 1998, two characters (the Euro sign and the registered trademark sign) were added.

=back

A few usage examples follow. For instance, to convert a file in the euc-kr encoding to UTF-8, you can do the following:

perl -Mencoding=euc-kr,STDOUT,utf8 -pe1 < file.euckr > file.utf8

The reverse conversion can be done with:

perl -Mencoding=utf8,STDOUT,euc-kr -pe1 < file.utf8 > file.euckr

To make such conversions more convenient, piconv, written purely in Perl on top of the Encode module, is included with Perl. As its name suggests, piconv is modeled on the iconv found on Unix. It is used as follows:
piconv -f euc-kr -t utf8 < file.euckr > file.utf8
piconv -f utf8 -t euc-kr < file.utf8 > file.euckr

Also, with the 'PerlIO::encoding' module you can easily do character-oriented (rather than byte-oriented) processing while using Korean encodings:

#!/path/to/perl
use encoding 'euc-kr', STDIN => 'euc-kr', STDOUT => 'euc-kr', STDERR => 'euc-kr';
print length("가나");            # 2 (double quotes mean character-oriented processing)
print length('가나');            # 4 (single quotes mean byte-oriented processing)
print index("한강, 대동강", "염");  # -1 (there is no '염')
print index('한강, 대동강', '염');  # 7 (the 8th and 9th bytes match the code value of '염')
Re: http://bleedperl.dan.co.jp:8080/
On Sat, 27 Apr 2002, Dan Kogai wrote: I have set up an experimental mod_bleedperl server whose URI is shown in the subject. To demonstrate the power of Perl 5.8, I have written a small cgi/pl (.pl runs on Apache::Registry) called piconv.pl, a web version of piconv(1). http://bleedperl.dan.co.jp:8080/piconv/ (Don't forget :8080; it's not run on root!) What's so funny is that this service can be used to 'asciify' non-ascii web pages. Bart's idea of HTMLCREF is fully exploited here. To find it out, try Wow, this is great and very timely!! Yesterday, I wrote to Werner Lemberg (the maintainer of the CJK package for LaTeX and freetype/ttf2tfm among other things) and Ross Moore (the maintainer of the LaTeX2html converter) that the upcoming Perl 5.8 would include this great Encode module. With it, I told them, it'd be trivial to represent characters outside the repertoire of the target encoding (for HTML output) as NCRs. Today, Werner expressed his interest in this feature because he wants to make use of it in groff. Now you put up this page... This feature will also help reduce ill-tagged (mislabeled) pages. For instance, a lot of Korean web pages are mislabeled as EUC-KR while they contain characters outside EUC-KR. If Encode is widely used in the CGI programs behind those web bulletin boards, or mod_bleedperl is used along with Apache (I'm assuming that mod_bleedperl can do an encoding conversion behind the scenes..), all of a sudden a number of mistagged pages will disappear :-) Jungshik
Re: README.jp, README.tw, README.cn, README.kr
Hi, Attached is README.ko (per Jarkko's suggestion, I used 'ko' instead of 'kr') in the EUC-KR encoding. North Korea has its own 94 x 94 coded character set (KPS 9566-97: ISO-IR 202), but a few web pages set up for/by North Korean companies (and possibly the government?) whose URLs I happen to know use EUC-KR. I also added what Autrijus added to README.tw. Cheers, Jungshik README.ko Description: README in Korean in EUC-KR
piconv and EUC :-)
On Sun, 31 Mar 2002, Dan Kogai wrote: Hi Dan, piconv -- iconv(1), reinvented in perl piconv is a perl version of iconv, a character encoding converter widely available for various unixen today. This script was primarily a technology demonstrator for Perl 5.8.0, but you can use piconv in place of iconv for virtually any case. Well, I'm afraid 'virtually any case' is a bit of an exaggeration. glibc iconv and the iconv in libiconv can also deal with 'transliteration', but I'm afraid piconv can't do that, yet :-) Another minor note about documentation. I forgot to mention that 'EUC' in EUC-JP/EUC-KR stands for 'Extended Unix Code'. I'm sure it's the original term used by AT&T because I've seen 'Extended Unix Code' on many occasions over many years. There's at least one place in the Encode documentation where 'Extended Unix Character' is used in place of that. Cheers, Jungshik
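piconv's core -f/-t operation is just decode-then-encode. A minimal sketch of that behavior in Python (piconv itself is Perl, built on Encode; function and variable names here are made up for illustration):

```python
def piconv_bytes(data: bytes, from_enc: str, to_enc: str) -> bytes:
    """What `piconv -f FROM -t TO` does to each chunk: decode, re-encode."""
    return data.decode(from_enc).encode(to_enc)

euckr = b"\xb0\xa1\xb0\xa2"                     # U+AC00 U+AC01 in EUC-KR
utf8 = piconv_bytes(euckr, "euc_kr", "utf-8")
assert utf8 == b"\xea\xb0\x80\xea\xb0\x81"      # the same text in UTF-8
assert piconv_bytes(utf8, "utf-8", "euc_kr") == euckr   # lossless round trip
```

Transliteration, as the post notes, is the piece this simple pipeline lacks: it needs fallback rules, not just a mapping.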
Re: [PATCH] Supported.pod: cleanup/UTF-16/CJK.inf + an invasion tothe Glossary
On Fri, 5 Apr 2002, Anton Tagunov wrote: Hi Anton, Speaking of the patch.. AT +=item Jungshik Shin's Hangul FAQ AT +L<http://jshin.net/faq> . AT +L<http://jshin.net/faq/qa8.html> AT +has a comprehensive overview of the KS * (Korean) standards. AT +The author claims however that the document needs AT +some modernisation :-) I'm sorry, I haven't been to bed for too long, so I'm not sure if my writings are okay. Jungshik, is this a proper recommendation for us to cite? Drop the line on modernization? (Not the best place for jokes :-( No, it's perfectly all right with me. I don't think it's inconsistent with Larry's putting some nice jokes in his Perl books :-) +The modern successor of the C<CJK.inf>. +The book of choice for everyone interested. + +Features comprehensive coverage of CJKV character sets and encodings +along with many other issues faced by anyone trying to better support +CJKV languages/scripts in all the areas of information processing. Looks good.
Re: [Encode] UCS/UTF mess and Surrogate Handlings
On Fri, 5 Apr 2002, Jarkko Hietaniemi wrote: P.S. Does utf8 support surrogates? No. Surrogates are solely for UTF-16. There's no need for surrogates in UTF-8 -- if we wanted to encode U+D800 using UTF-8, we *could* -- BUT we should not. Encoding U+D800 as UTF-8 should not be attempted; the whole surrogate space is a discontinuity in the Unicode code point space reserved for the evils of UTF-16. I can't agree more with you on this. Unfortunately, people at Oracle and PeopleSoft think differently. Actually, what happened was that they made a serious design mistake by making their DBs understand only UTF-8 sequences up to 3 bytes long, although when they added UTF-8 support it was plainly clear that ISO 10646/Unicode was not just for the BMP. When planes beyond the BMP finally began to be filled with actual characters, they came up with that stupid idea of using two 3-byte-long UTF-8 units (for surrogate pairs) to represent those characters. A lot of people on the Unicode mailing list voiced a very strong and technically solid objection against this, but Oracle and PeopleSoft went on to publish DUTR #26: Compatibility Encoding Scheme for UTF-16 (CESU-8) (http://www.unicode.org/unicode/reports/tr26). Does Encode need to support this monster? I hope not. Jungshik Shin
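The "two 3-byte-long UTF-8 units" mistake is easy to reproduce. A Python illustration (the 'surrogatepass' error handler deliberately permits what a strict UTF-8 codec forbids, which is exactly the CESU-8 form):

```python
c = "\U00010400"                       # a character beyond the BMP

# Correct UTF-8: one 4-byte sequence.
assert c.encode("utf-8") == b"\xf0\x90\x90\x80"

# Its UTF-16 form is the surrogate pair D801 DC00 ...
assert c.encode("utf-16-be") == b"\xd8\x01\xdc\x00"

# ... and CESU-8 encodes each surrogate as a 3-byte unit: 6 bytes total.
cesu8 = "\ud801\udc00".encode("utf-8", "surrogatepass")
assert cesu8 == b"\xed\xa0\x81\xed\xb0\x80"

# A strict UTF-8 decoder rightly rejects the CESU-8 bytes.
try:
    cesu8.decode("utf-8")
except UnicodeDecodeError:
    pass
else:
    raise AssertionError("strict UTF-8 must reject surrogate bytes")
```

The 6-byte form also starts with 0xED, not 0xF0..0xF4, which is why binary sorting CESU-8 and real UTF-8 gives different orders.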
Re: [Encode] Encode::Supported revised
On Thu, 4 Apr 2002, Dan Kogai wrote: Konnichiha ! (hope I got this one right). On Thursday, April 4, 2002, at 03:06 , Jungshik Shin wrote: o The MIME name as defined in IETF RFCs. UCS-2 ucs2, iso-10646-1 [IANA, et al] UCS-2le UTF-8 utf8 [RFC2279] How about UCS-2BE? Of course, if UCS-2 is network byte order (big endian), it's not necessary. In that case, you may alias UCS-2 to UCS-2BE. And UCS2-NB (Network Byte order)? Unicode terminology is confusing sometimes. I've checked http://www.unicode.org/glossary/ and it seems that the canonical-to-alias order should be as follows. UCS-2 ucs2, iso-10646-1, utf-16be UTF-16LE ucs2-le UTF-8 utf8 I left UCS-2 as is because it is IANA registered. UCS-2 is indeed the name of an encoding, as the URL above clearly states. It is also less confusing than UTF-16. ucs2-le will be fixed. IETF RFC 2781 also 'defines' (for IETF purposes) UTF-16LE, UTF-16BE, and UTF-16. It's at http://www.faqs.org/rfcs/rfc2781.html among other places. BTW, how does Encode deal with the BOM in UTF-16? It's trivial to add a BOM at the beginning by hand (with perl), but you may consider adding an option (??) to add/remove the BOM automatically when converting to/from UTF-16(LE|BE). Could you please just say 'Encoding vs Character Set' and remove the parenthetical 'charset for short' or 'just charset' following 'character set'? I agree with your distinction between 'encoding' and 'character set', but what is bothering me is that you treat 'charset' as a synonym of 'character set'. Now I agree. charset is more appropriate for coded character set, and that was the MIME header's first intention. EUC is indeed a coded character set, but charset=ISO-2022-(JP|KP|CN)(-\d+)? is absolutely confusing -- it is a character encoding scheme at best. I am thinking of adding a small glossary to this document as follows. And here is a glossary I manually parsed out of http://www.unicode.org/glossary/ , right after the signature. Thank you.
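The BOM question raised here ("how does Encode deal with the BOM in UTF-16?") comes down to which codec variant is used. Python's codecs show the three behaviors side by side, as a cross-check: the byte-order-specific names neither add nor expect a BOM, while the plain name adds one on encode and uses it to choose the byte order on decode.

```python
assert "A".encode("utf-16-be") == b"\x00A"            # no BOM, fixed order
assert "A".encode("utf-16-le") == b"A\x00"            # no BOM, fixed order

with_bom = "A".encode("utf-16")                       # BOM + native order
assert with_bom in (b"\xff\xfeA\x00", b"\xfe\xff\x00A")

# On decode, the BOM selects the byte order and is consumed.
assert b"\xff\xfeA\x00".decode("utf-16") == "A"       # LE detected
assert b"\xfe\xff\x00A".decode("utf-16") == "A"       # BE detected
```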
BTW, you may also want to take a look at W3C's charmod TR at http://www.w3.org/TR/charmod and the 'charset' part of the html4 spec at http://www.w3.org/TR/REC-html40/charset.html In a strict sense, the concept of 'raw' or 'as-is' (which you apparently use to mean a coded character set invoked on GL) is not appropriate. Because JIS X 0208, JIS X 0212 and KS X 1001 don't map characters to their GL position when enumerating characters in their charts. The numeric IDs used in JIS X 0208, JIS X 0212 and KS X 1001 are row (ku) and column (ten?) while GB 2312-80 appears to use GL codepoints. That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw and ksx1001-raw. 'gl' doesn't have a risk of being mistaken for row and column numbers. I wonder whether the ku-ten form is canonical or derived. JIS X 0208 was clearly designed to be ISO-2022 compliant. Technically speaking, 0x21-0x7e should be the original, and 1 - 94 is derived to make decimal people happier. But you've got a point. Maybe you're right. It may have made 'decimal-oriented people' happier, but it's a pain in the ass to 'hexadecimal-oriented people' like us, isn't it? Speaking of '-raw', that's a BSD sense of calling unprocessed data, and for a Daemon freak it came out naturally. All right. It's your decision :-) are IANA-registered (UTF-16 even as a preferred MIME name) but probably should be avoided as encodings for web pages due to the lack of browser support. Not that I'd encourage people to use UTF-16 for their web pages, but UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE and Mozilla. The problem is not just browsers. As a network consultant I would advise against UTF-16 or any text encoding that may croak cat(1) and more(1) (we can be frank about mojibake: for cases like mojibake, the text at least still goes to EOF). After all, we have UTF-8 already, which that good old cat of ours can read till EOF with no problem.
Sure, I like UTF-8 much more than UTF-16 and any byte-order-dependent and 'cat-breaking' :-) transformation formats of Unicode. I can assure you that I'm certainly on your side ! Microsoft products generate UTF-8 with a **totally redundant** BOM (byte order mark) at the beginning. I don't know whether there's a conspiracy to break the time-honored Unix tradition of command line filtering, but it's certainly annoying to deal with UTF-8 files with a BOM. For example, 'cat f1 f2 f3' wouldn't work as it is. 'cat' and many other Unix tools need to be modified to remove the 'BOM'. L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> Somewhat obsolete (last update in 1996), but still useful. Also try Is there any rule against mentioning a book in print as opposed to online?
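The 'cat f1 f2 f3' complaint can be made concrete. A small Python sketch (file contents are made-up examples): naive concatenation of BOM-carrying UTF-8 leaves stray U+FEFFs mid-stream, and the fix is exactly the per-file BOM stripping the post says tools would need.

```python
BOM = b"\xef\xbb\xbf"
# Three 'files', the first two written MS-style with a leading BOM.
files = [BOM + b"alpha\n", BOM + b"beta\n", b"gamma\n"]

naive = b"".join(files)
assert naive.count(BOM) == 2          # BOMs embedded in the middle of the text

def bom_stripping_cat(chunks):
    """cat(1) that drops a leading UTF-8 BOM from each input."""
    return b"".join(c[len(BOM):] if c.startswith(BOM) else c for c in chunks)

assert bom_stripping_cat(files) == b"alpha\nbeta\ngamma\n"
```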
Re: [PATCH] Re: [Encode] Encode::Supported revised
On Thu, 4 Apr 2002, Anton Tagunov wrote: Hi Anton, Thanks a lot. - changes status of KOI8-U on Jungshik's comment (sorry, I have never tested that myself :-( I haven't tested it either :-), but both Mozilla/Netscape6 and MS IE list it in the view|encoding menu, which I interpret as having support for it. UTF-16 - KOI8-U(http://www.faqs.org/rfcs/rfc2319.html) -are IANA-registered (C<UTF-16> even as a preferred MIME name) +=for comment +waiting for comments from Jungshik Shin to soften this - Anton + +is an IANA-registered preferred MIME name but probably should be avoided as encoding for web pages due to -the lack of browser supports. +the lack of browser support. The reason your test didn't work with MS IE was probably that you didn't prepend your UTF-16 HTML doc. with a BOM (byte order mark). It's to be noted that the conventional way of informing web browsers of the MIME charset by putting in a meta tag doesn't work for UTF-16/UTF-32. Either you have to configure your web server to emit a C-T header with 'charset=UTF-16(LE|BE)' or you have to put a BOM at the beginning. When a BOM is present, MS IE 5/6, Mozilla/Netscape6 and Netscape4 have no problem rendering UTF-16(LE|BE) encoded pages. I put up a couple of test pages at http://jshin.net/i18n/utf16le_kr2.html http://jshin.net/i18n/utf16be_kr2.html For more details on UTF-16 and HTML, you can refer to the HTML4 spec. at http://www.w3.org/TR/html4/charset (see section 5.2.1) As I wrote before, I have no intention to encourage use of UTF-16 over UTF-8, although some people whose primary script has a more 'economical' (in terms of file size) representation in UTF-16 than in UTF-8 may want to use it. +=head2 Microsoft-related naming mess + +Microsoft products misuse the following names: + +=over 2 + +=item KS_C_5601-1987 + +Microsoft extension to C<EUC-KR>. + +Proper name: C<CP949>. + +See +http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html +for details. Wow, I didn't know that Martin wrote this. Thanks a lot for digging this up.
He 'rediscovered' what a lot of people in Korea had complained about. One thing I don't agree with him on is what designation to use for CP949. I think it'd better be 'windows-949' because that's more in line with other MS code pages such as windows-125x (for European scripts). By the same token, the MS version of Shift_JIS can be labeled as 'windows-932'. At the moment, Mozilla uses 'x-windows-949' for CP949/UHC because it's not yet registered with IANA. Probably, I have to contact Martin and discuss this issue. +Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect +this common misusage. If my patch is accepted, cp949 has a couple more aliases, 'uhc' and '(x-)windows-949'. CP949 is commonly known as '통합 완성형' (Unified Hangul Code) in Korea. +I<Raw> C<KS_C_5601-1987> encoding is available as C<kcs5601-raw>. ksc5601-raw had better be renamed ksx1001-raw, and ksc5601-raw can be made an alias to ksx1001-raw. Pls note that what's now called ksc5601-raw has two new characters which were only added in Dec. 1998, over a year after the name change (KS C 5601 -> KS X 1001). +=item GB2312 + +Encode aliases C<GB2312> to C<euc-cn> in full agreement with +the IANA registration. C<cp936> is supported separately. +I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>. Oops... You meant gb2312-raw, didn't you? :-) Jungshik, I would have certainly advocated linking not only to http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html but also to your comments on KS_C_5601-1987 in the list archive, but all your mails were on several subjects each. Jungshik ... refer to Ken Lunde's CJKV Information Processing Jungshik about that 'epic war' between two camps. (see p.197 of Jungshik the book and http://jshin.net/faq/qa8.html) Jungshik We even set up a web page to prevent M$ from spreading that Jungshik ill-defined name. maybe we may link to this page? What is the address? The campaign web page has disappeared since. It was almost 5 years ago :-).
However, my Hangul FAQ subject 8 deals with the issue (http://jshin.net/faq/qa8.html) so that you may add the link to it. Well, be aware that it's been untouched for a few years (if not longer) and needs a complete overhaul.
Re: - charset + character set + coded character set + CCS (?) (was:[Encode] Encode::Supported revised)
On Thu, 4 Apr 2002, Anton Tagunov wrote: Hi Anton !! AT Our comments go in the same direction, but will you AT let me strengthen your statements a bit? Thank you ! JS On the other hand, no one with *sufficient understanding* JS of the issue uses 'character set' to mean encoding. AT [ECMA-35, (equivalent of ISO 2022?)]: Yes, I think they're a verbatim equivalent of ISO 2022. I'd never have been able to read ISO 2022 had ECMA not released it for free as ECMA 35. AT coded character set; code AT A set of unambiguous rules that establishes a AT character set and the one-to-one relationship between the AT characters of the set and their coded representation. AT [RFC 1345]: AT The ISO definition of the term coded character set is as AT follows: A set of unambiguous rules that establishes a AT character set and the one-to-one relationship between the AT characters of the set and their coded representation. AT Hmmm... can this potentially lead to mistaking character set for AT a short form of coded character set (in the ISO meaning)? AT I see that these definitions themselves make a distinction between a AT character set (= repertoire) and AT coded character set (= CCS + encoding = CCS + CES), Jungshik? Hmm, I feel like I'm being treated as 'the' ultimate something here, which I'm certainly not and never wanted to be :-) I think Dan is right when he wrote that EUC-JP, EUC-KR, EUC-CN, EUC-TW and even UTF-8 could be regarded as both CCS and CES. Even though they involve multiple character set standards, the mapping from abstract characters in those multiple character set standards to integers (despite being of multiple 'lengths') is strictly one-to-one. I didn't realize that it's possible to view things that way until he wrote that.
On the other hand, as he wrote, any encoding that utilizes any form of escape sequence (locking/single shift, designator, etc.), whether defined in ISO 2022 or not (I have HZ in mind here), cannot be called a CCS, because providing the mapping alone cannot fully specify the way actual text in that encoding is 'serialized' into an octet sequence. Therefore, I believe the below doesn't hold true for all encodings we have to deal with, although it's the case for some encodings. AT coded character set (= CCS + encoding = CCS + CES), Then, I realized that RFC 1345 has the following after quoting the ISO definition of coded character set which you quoted above. 1345 This memo does not put further 1345 restrictions on the term of coded character set than the following: 1345 A coded character set is a set of rules that unambiguously and 1345 completely determines which sequence of characters, if any, is 1345 represented by each possible sequence of n-bit bytes for a certain 1345 value of n. This implies that e.g. a coded character set extended 1345 with one or more other coded character sets by means of the extension 1345 techniques of ISO 2022 constitutes a coded character set in its own 1345 right. In this memo the term charset is used to refer to the above 1345 interpretation of the ISO term coded character set. However, even RFC 1345 came up with a new term, 'charset', for its *extended* definition of 'coded character set', to distinguish it from the original ISO definition. The definition of 'charset' in RFC 1345 is actually in line with RFC 2130/2278. Therefore, what I wrote about the statement that coded character set (= CCS + encoding = CCS + CES) is still the case, IMO. DOC Is a collection of characters in which each character is distinguished DOC with a unique ID (in most cases, the ID is a number). JS Some people like to distinguish between a mere collection of characters JS and a collection of characters with unique (numeric) IDs/code points.
JS The former is sometimes referred to as a character repertoire JS or a character set, whereas the latter is called a 'coded character set'. AT or rather CCS, to rule out the ISO understanding I don't see any conflict between the RFC 2130 CCS and the ISO coded character set _quoted_ in RFC 1345. It's not the original ISO definition of 'coded character set' but RFC 1345's extension of the definition that made things complicated. However, even RFC 1345 gave it a new term, 'charset', to tell it apart from the original ISO definition. DOC =item Character I<Encoding> DOC A character encoding may also encode a character set as-is (also called DOC a I<raw> encoding, i.e. US-ascii) or processed (i.e. EUC-JP; US-ascii is JS In a strict sense, the concept of 'raw' or 'as-is' (which you JS apparently use to mean a coded character set invoked on GL) is not JS appropriate. Because JIS X 0208, JIS X 0212 and KS X 1001 don't map JS characters to their GL position when enumerating characters in their JS charts. AT Looks like RFC 1345 has made one big pile: AT JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983 AT GB_1988-80 AT KS_C_5601-1987 AT AT are all listed in a similar manner there. Does this RFC change AT anything? As we
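The HZ example invoked in this message makes the "mapping alone is not enough" argument concrete. A Python cross-check (its 'hz' codec implements the same RFC 1843 scheme): the serialized bytes interleave shift sequences (~{ and ~}) with the GB 2312 code values, so no pure character-to-integer mapping describes the byte stream.

```python
s = "\u4e2d"                      # a GB 2312 character (U+4E2D)
wire = s.encode("hz")

# The GB 2312 bytes are bracketed by mode-shift sequences, not emitted bare.
assert wire.startswith(b"~{") and wire.endswith(b"~}")
assert wire.decode("hz") == s     # the round trip still works, of course
```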
Re: let's cook it!
On Wed, 27 Mar 2002, Nick Ing-Simmons wrote: Autrijus Tang [EMAIL PROTECTED] writes: On Tue, Mar 26, 2002 at 06:28:07PM -0500, Jungshik Shin wrote: Microsoft products use 'ks_c_5601-1987' as an encoding name/MIME charset/character set encoding scheme. That's a very strange use of KS C 5601-1987. Because, what they mean by 'ks_c_5601-1987' is actually CP949/Unified Hangul Code(UHC)/X-Windows-949, an upward compatible proprietary extension of EUC-KR. Just a quick note: exactly the same thing has happened with Microsoft's use of 'gb2312' to mean 'gbk', and 'big5' to mean 'cp950'. In Encode.pm, I've been carefully avoiding this misbehaviour; it has been fortunate that 'ks_c_5601_1987' has a distinct name from 'ksc5601'. :-) At least they are consistently wrong across the world; most MS things claiming to be iso-8859-1 are really cp1252. Well, not really. MS registered Windows-125x with IANA and uses Windows-125x in their products consistently. It's NOT MS products (MS OE, IE, Frontpage) BUT broken programs like Eudora (with very little notion of I18N and MIME charsets) that run under MS Windows that label Windows-125x documents as ISO-8859-x. I don't like MS, but they shouldn't be blamed for what's not their fault. MS should have registered CP949/950 as Windows-949/950 instead of labeling them misleadingly as ks_c_5601-1987 and big5. In the case of gb2312, gbk should be registered and used. I don't know about big5, but in the Korean case, apparently they tried to pretend that they follow the Korean nat'l std. while they extended it in a proprietary way. Jungshik Shin
Re: Encoding vs Charset
On Wed, 27 Mar 2002, Dan Kogai wrote: On Wednesday, March 27, 2002, at 11:22 , Jungshik Shin wrote: IMHO, you're also misusing the term 'charset' here. MIME charset can be used synonymously with 'encodings' (or character set encoding scheme: see CJKV Information Processing, IETF RFC 2130 and RFC 2278). What has to be distinguished is 'coded character set' on the one hand (JIS X 0208, JIS X 0212, KS X 1001, KS X 1003, GB 2312, CNS 11xxx, ISO 10646, ISO 646, US-ASCII, ISO-8859-x) and 'encoding'/character set encoding scheme/MIME charset on the other hand (EUC-JP, EUC-KR, EUC-TW, EUC-CN, ISO-2022-JP, ISO-2022-KR, ISO-2022-CN, ISO-8859-x, UTF-8, UTF-32, UTF-7, UTF-16, Big5, UHC). I do not think so. This time I can confidently say it is IANA that has goofed. To make my point clear, let me define Charset and Encoding once again. Character Set: a collection of characters in which each character is distinguished with a unique ID (in most cases, the ID is a number). Character Encoding: A way to represent characters in a byte stream. A given character encoding may contain a single character set (i.e. US-ascii) or multiple character sets (i.e. EUC-JP, which contains US-ascii, JIS X 0201 Kana, JIS X 0208 and JIS X 0212). A given character encoding may also encode a character set as-is (raw; US-ascii) or processed (for EUC-JP, US-ascii is as-is, JIS X 0201 is prepended with \x8E, JIS X 0208 is added by 0x8080, and JIS X 0212 is added by 0x8080 then prepended with \x8F). You got me wrong. I don't have any objection to 'coded character set' and 'encoding' defined this way. The problem is that you're using '(coded) character set' and 'charset' interchangeably. They're two different things depending on where you come from.
My point is that because 'charset' is already overloaded with two or more different meanings (as a MIME Content-Type header parameter, it means 'encoding' as you defined above), you'd better not use it when comparing coded character set on the one hand and encoding/character set encoding scheme on the other hand. Simply, it'd be much better for you to say '(coded) character set vs encoding' instead of 'charset vs encoding'. Jungshik Shin P.S. I'm wondering why you posted this to the Unicode list (where it's not very much relevant) without posting to perl-unicode? I was forced to post my response to the Unicode list, but I'd rather keep this thread (if there's a need to continue) where it began (perl-unicode).
Re: let's cook it!
Dan, I'm sorry for dropping in this late, but I've just joined the list and found this. * rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw What do you mean by ksc5601-raw and gb2312-raw? If it's KS C 5601-1987 and GB2312 put in GL, how about ksc5601-gl and gb2312-gl? Please also note that KS C 5601-1992 was reissued and renamed as KS X 1001:1998. Therefore, it'd be better to use ksx1001 in place of ksc5601 and make ksc5601-* aliases to ksx1001-*. * and alias gb2312 and ksc5601 to euc-(cn|kr) I agree. :) Oh, my gosh ! Please, remove this alias of ksc5601 to EUC-KR. That's the last thing we need. KS C 5601-1987 is NOT the encoding (or character set encoding scheme or MIME charset) BUT just a coded character set which is used in encodings/MIME charsets/character set encoding schemes like EUC-KR and ISO-2022-KR. By aliasing ksc5601 to EUC-KR, the only thing we achieve is to encourage the confusion and mistakes which have to be avoided at all cost. Well, at least almost every other program (hc, iconv, mozilla...) does that anyway. No, Mozilla doesn't do that. Neither does yudit. Mozilla's character coding menu does NOT have KS C 5601. I wonder how this practice of mistaking the charset for the encoding started. Well, in the majority of encodings, charsets are applied uncooked, so that may be the reason. Wait a moment. You have to be careful here. 'charset' is an overloaded term. In the MIME sense, 'charset' means the same thing as 'encoding' (e.g. ISO-2022-JP, ISO-2022-KR, US-ASCII, UTF-8, EUC-KR, EUC-JP, EUC-CN, ISO-8859-x, etc.) and it DOES NOT mean the same thing as coded character set (JIS X 0208, JIS X 0201, KS X 1001, GB 2312, CNS 1, US-ASCII, ISO-8859-x). It's unfortunate that GB2312 has been so firmly established in place of EUC-CN. In the case of EUC-KR, it has much stronger support than EUC-CN despite Microsoft's continuous assault on it, and people do know that EUC-KR is different from KS X 1001/KS C 5601.
Microsoft products use 'ks_c_5601-1987' as an encoding name/MIME charset/character set encoding scheme. That's a very strange use of KS C 5601-1987, because what they mean by 'ks_c_5601-1987' is actually CP949/Unified Hangul Code (UHC)/X-Windows-949, an upward-compatible proprietary extension of EUC-KR. No Korean standard specifies it. However, they apparently didn't want to give the impression that they had come up with something proprietary (not specified in the Korean national standard) by using 'X-Windows-949', and decided to use 'ks_c_5601-1987' as the MIME charset for it although it has no place in the Korean national standard. Mozilla has to accept 'ks_c_5601-1987' as an alias of 'X-Windows-949' because MS IE, OE and FrontPage are so widely used. Jungshik Shin
Re: Encode: CJK-Guide
Here's some feedback. Republic of Korea (South Korea; simply Korea as follows) has set KS C 5601 in 1989. They are both based upon JIS C 6226... KS C 5601 was first issued in 1987 and revised in 1989 and 1992. Then it was renamed and reissued as KS X 1001:1998 in 1998. Though there are escape-based encodings for these two (ISO-2022-CN and ISO-2022-KR, respectively), they are hardly used, in favor of EUC. ISO-2022-KR used to be widely used for Korean email exchange, as ISO-2022-JP still is for Japanese. Now ISO-2022-KR is hardly used, but at least it was used widely until the late 1990s (see IETF RFC 1557). When you say gb2312 and ksc5601, EUC-based encoding is assumed. Please don't help spread this misuse. It might be all right for the (ignorant) public to say KS C 5601 in place of EUC-KR, but Perl programmers should learn the difference between KS C 5601/KS X 1001 (a coded character set) and an encoding/MIME charset/character set encoding scheme/character coding. As I wrote before, GB 2312 has been so widely (mis)used that there's no way to replace it with EUC-CN. The Korean situation is much better, although not as good as the Japanese case. BTW, I don't find any reference to Microsoft code pages (CP949 for Korean, CP950, CP936, and CP932), JOHAB (Korean), or Big5-HKSCS. Is that because they're not yet supported (well, Shift-JIS and Big5 are supported)? Another BTW: don't you think your description of Unicode and Han Unification is a bit too negative and biased? I know you feel strongly about the subject, but I'm not sure CJK-Guide is the best place to express your personal opinion on it. If you don't want to tone it down or change it, you may add a disclaimer like 'some people have reservations about Han Unification and Unicode because ...' or 'the following is my personal opinion, shared by some people but not universally accepted'. As a result, something funny has happened. For example, U+673A means a machine in Simplified Chinese but a desk in Japanese. 
a machine in Japanese is U+6A5F. Do you really believe this is a strong case against Han Unification? I don't see any problem with this. There are a number of Chinese characters with multiple meanings even without Han Unification. Do those 'meanings' have to be assigned separate code points? So you can't tell what it means just by looking at the code. Why does a coded character set have to care about what computational linguists have to do? You can't tell the meaning of an English word with multiple meanings just by looking at its computer representation without contextual/grammatical/linguistic/lexical analysis, can you? How do you know what 'fly' means without context? Jungshik Shin
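Coming back to the encoding side of the earlier message: the relationship between ISO-2022-KR (escape-based, character kept in GL) and EUC-KR (character moved to GR) can be sketched with Python's codecs, used here purely as an illustration of the two Korean encodings discussed above:

```python
# One KS X 1001 character in the two Korean encodings: ISO-2022-KR
# designates the set and shifts with SO/SI, keeping GL bytes;
# EUC-KR simply sets the high bit on the same GL bytes.
text = '\ud55c'                      # U+D55C HANGUL SYLLABLE HAN
iso = text.encode('iso2022_kr')
euc = text.encode('euc_kr')

assert iso.startswith(b'\x1b$)C')    # ESC $ ) C: designate KS C 5601 into G1
assert b'\x0e' in iso                # SO: shift to the Korean set
assert euc == b'\xc7\xd1'
# The GL form inside the ISO-2022-KR stream is the EUC bytes minus 0x80:
assert bytes(b - 0x80 for b in euc) in iso
```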
Re: Encode::CJKguide
On Wed, 27 Mar 2002, Markus Kuhn wrote: Dan Kogai wrote on 2002-03-26 22:35 UTC: Side note: I still think Encode should have used the encoding tables that are already provided by the operating system where available. For example, on Linux the iconv() function with glibc 2.2 or newer already provides access to all the necessary tables. I observe at the moment that almost a dozen different programming language communities reinvent the recoding wheel simultaneously and independently, even though portable C libraries such as libiconv are already available for exactly the same purpose. I certainly feel the same way as you do. I thought a portable implementation of iconv() in libiconv would prevent the proliferation of (potentially incompatible) encoding converters. I was wrong. I found myself having to check and contribute to/correct, if necessary, all the incarnations of encoding converters (involving Korean and sometimes other CJK) in Perl, Java, ICU, PHP, Mozilla, X11, libiconv/glibc and so forth. It would be much better if libiconv/glibc were used everywhere. Encode doesn't support a lot of encodings, all of which are available in iconv() (glibc's and libiconv's). Please clarify that this text represents Dan Kogai's personal and possibly uninformed opinion on character encodings and their history, and not some consensus of everyone involved in the Perl 5.8 release. I think this text is still in very early alpha testing ... As I wrote already, this disclaimer absolutely needs to be put in. Many of which have a rather Japan-specific and sometimes semi-informed view of Unicode and often do not at all represent Chinese or Korean views on issues such as Han unification. Please remember: CJK != Japan, and there are also many good or better Korean and Chinese web pages on these issues. Koreans are for Unicode almost unanimously. Han Unification has never been as large an issue in Korea as in Japan. 
You should definitely also add a pointer to the Unihan database, which is the most comprehensive existing source of cross-reference and encoding conversion data between the different Han encodings: http://www.unicode.org/Public/UNIDATA/Unihan.txt I'd also like to add that ISO 10646-1:2000 and ISO 10646-2:2001 need to be consulted before making any premature judgement on Han Unification. As you or someone else mentioned in another forum, TUS 3.0 gave some misconceptions about Han Unification by listing a single glyph for each Han ideograph. On the other hand, ISO 10646-1:2000 and ISO 10646-2:2001 list five glyphs (SC, TC, K, J, and V), and browsing through the table, one realizes how little difference there is among them (sure, there are differences, but I don't think those differences warrant so much fuss about Han Unification). More often than not, I thought the IRG didn't go far enough in Han Unification, because some characters appear to my eyes to need to be unified. (Perhaps the source separation rule kept them distinct.) Jungshik Shin
Re: Encode: CJK-Guide
On Wed, 27 Mar 2002, Jarkko Hietaniemi wrote: BTW, I don't find any reference to Microsoft code pages (CP949 for Korean, CP950, CP936, and CP932), JOHAB (Korean), and Big5-HKSCS. Is that because they're not yet supported (well, Shift-JIS and Big5 are supported)? AFAIK, they're not yet supported, since we have not had Korean expertise. Well, CJKV Information Processing by Ken Lunde provides more than enough information to support JOHAB and CP949/UHC/X-Windows-949 :-). In addition to that, there are existing implementations: glibc, libiconv, Mozilla and so forth. I'm not blaming anyone here for the lack of support for Johab and CP949 (that's the last thing I'd do). Anyway, I'll try to help you with Korean encodings and other CJK encodings if necessary. For Johab, no new table is necessary because the precomposed Hangul syllable mapping (to Unicode) is algorithmic, while Hanja and symbols can be mapped to KS X 1001 algorithmically and then mapped to Unicode using the KS X 1001 mapping table. BTW, how about Big5-HKSCS (Hong Kong), GBK, and GB18030 (PRC)? Jungshik Shin
Re: Encoding vs Charset
On Tue, 26 Mar 2002, Jungshik Shin wrote: really means euc-cn and charset=ks_c_5601-1987 really means euc-kr. Sadly, this misconception is embedded in popular browsers: MS OE and MS FrontPage keep producing HTML docs labeled this way. However, it also has to be noted that the encoding designated as 'ks_c_5601-1987' by MS is NOT the same as EUC-KR BUT their proprietary extension of EUC-KR, namely CP949/UHC/(X-)Windows-949. Therefore, I'd like to suggest (or rather do) the following for Korean encodings:
- Add an X-Windows-949 converter
- Make 'ks_c_5601-1987', 'X-UHC', 'UHC', and 'CP949' aliases of 'X-Windows-949'
- Add a JOHAB converter
- Remove 'ksc5601' aliased to 'euc-kr'.
Since some existing data is in X-Windows-949 but mislabeled as EUC-KR, it might be necessary to make the 'euc-kr' → Unicode converter generous and have it act as an 'X-Windows-949' → Unicode converter (whether this is desirable and necessary depends on what applications Encode may be used for). However, in the other direction (Unicode → euc-kr), it has to be strictly compliant with the standard. See http://bugzilla.mozilla.org/show_bug.cgi?id=131388 Jungshik Shin
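The "generous decoder, strict encoder" idea above can be sketched as a small fallback wrapper. This uses Python's `euc_kr`/`cp949` codecs as stand-ins for the converters being discussed, and the helper name is mine, not anything in Encode:

```python
def decode_euckr_lenient(raw: bytes) -> str:
    """Decode data labeled euc-kr, tolerating CP949-mislabeled input."""
    try:
        return raw.decode('euc_kr')   # strictly conformant data
    except UnicodeDecodeError:
        return raw.decode('cp949')    # mislabeled UHC/Windows-949 data

# Bytes that often circulate under a 'euc-kr' label but are really CP949:
raw = '\ub620'.encode('cp949')        # U+B620, outside KS X 1001
assert decode_euckr_lenient(raw) == '\ub620'
```

The encoding direction deliberately has no such fallback, matching the point that Unicode → euc-kr must stay strictly standard-compliant.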
Re: GB2312 and EUC-CN : IANA registry
On Wed, 27 Mar 2002, Anton Tagunov wrote: Hi, Anton, Very glad to hear you on this list :-) Me, too :-) When you say gb2312 and ksc5601, EUC-based encoding is assumed. JS Please, don't help spread this misuse. Well, that was not meant to be applied to GB2312 :-). Below is a more extensive excerpt from where I wrote that sentence:
JS Please, don't help spread this misuse. It might be all right
JS for the (ignorant) public to say KS C 5601 in place of EUC-KR, but Perl
JS programmers should learn the difference between KS C 5601/KS X 1001 (a coded
JS character set) and an encoding/MIME charset/character set encoding scheme/
JS character coding.
JS As I wrote before, GB 2312 has been so widely (mis)used that there's
JS no way to replace it with EUC-CN. The Korean situation is much better,
JS although not as good as the Japanese case.
It could have been misunderstood. Jungshik, one little point on GB2312... Maybe I misunderstand something, but No, you're absolutely right about IANA. See below. The IANA registry (http://www.iana.org/assignments/character-sets) has:
Name: GB2312 (preferred MIME name)
MIBenum: 2025
Source: Chinese for People's Republic of China (PRC) mixed one byte, two byte set: 20-7E = one byte ASCII A1-FE = two byte PRC Kanji See GB 2312-80 PCL Symbol Set Id: 18C
Alias: csGB2312
I do not know when that was put in, but it looks like EUC-CN. Is it? And if yes, then GB2312 is a perfectly valid charset, isn't it? Yes, it's EUC-CN. I was about to add that although EUC-CN is a better name than GB2312, the former has never been registered with IANA while the latter was, as the 'preferred MIME name'. You got there first :-). It's unfortunate that the PRC decided to do it this way, but that's what we've got and I think we have to respect their decision. And thank you for explaining how it happened that Koreans misuse the name of a CCS for a charset :-) You're welcome :-) Actually, I told you only half the story :-). 
The other half happened before the widespread use of the Internet in Korea (i.e., the late 1980s and early 1990s), when people typically referred to what's now called EUC-KR as 'KS C 5601 Wansung' (= US-ASCII in GL and KS C 5601 in GR). It was not technically correct, but it didn't do much harm because there was little need to exchange data over the Internet. EUC (Extended Unix Code; it's not Extended Unix Character) for Korean was first specified in KS C 5861-1992 (now KS X 2901), but the name EUC-KR first appeared in RFC 1557, where ISO-2022-KR was defined. It would have been better if RFC 1557 had been more explicit in its description of EUC-KR, so that the IANA entry for EUC-KR were patterned after that for EUC-JP (cf. GB2312 - EUC-CN), with all the code sets and their octet ranges. Perhaps they thought just referring to KS C 5861-1992 was sufficient.
--
Name: EUC-KR (preferred MIME name) [RFC1557,Choi]
MIBenum: 38
Source: RFC-1557 (see also KS_C_5861-1992)
Alias: csEUCKR
--
Name: Extended_UNIX_Code_Packed_Format_for_Japanese
MIBenum: 18
Source: Standardized by OSF, UNIX International, and UNIX Systems Laboratories Pacific. Uses ISO 2022 rules to select
  code set 0: US-ASCII (a single 7-bit byte set)
  code set 1: JIS X0208-1990 (a double 8-bit byte set) restricted to A0-FF in both bytes
  code set 2: Half Width Katakana (a single 7-bit byte set) requiring SS2 as the character prefix
  code set 3: JIS X0212-1990 (a double 7-bit byte set) restricted to A0-FF in both bytes requiring SS3 as the character prefix
Alias: csEUCPkdFmtJapanese
Alias: EUC-JP (preferred MIME name)
--
Jungshik Shin
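The four EUC-JP code sets described in the registry entry above can be exercised directly. A quick illustration with Python's `euc_jp` codec (only the first three code sets are shown; code set 3 works the same way with an SS3 prefix):

```python
# Code set 0: plain US-ASCII bytes pass through unchanged.
assert 'A'.encode('euc_jp') == b'A'

# Code set 1: JIS X 0208 in GR; U+4E9C (the first Level-1 kanji,
# JIS row-cell 16-01) becomes GL bytes 0x30 0x21 with the high bit set.
assert '\u4e9c'.encode('euc_jp') == b'\xb0\xa1'

# Code set 2: halfwidth katakana behind the SS2 prefix byte 0x8E.
assert '\uff71'.encode('euc_jp') == b'\x8e\xb1'   # HALFWIDTH KATAKANA A
```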
Re: Encode: CJK-Guide
On Wed, 27 Mar 2002, Jarkko Hietaniemi wrote: Mozilla and so forth. I'm not blaming anyone here for the lack of support for Johab and CP949 (that's the last thing I'd do). Anyway, I'll try to help you with Korean encodings and other CJK encodings if necessary. Excellent, thanks. You may download the latest Perl developer snapshot (which contains the latest Encode, 0.99) from: http:[EMAIL PROTECTED] and look at the documentation under perl/ext/Encode/ I've looked around ext/Encode and found that CP949 is supported. So what has to be added is JOHAB, and what needs to be modified is EUC-KR, to support the 8-byte sequence representation of Hangul syllables (see http://jshin.net/i18n/euckr2.html or http://bugzilla.mozilla.org/show_bug.cgi?id=128587). For Johab, no new table is necessary because the precomposed Hangul syllable mapping (to Unicode) is algorithmic, while Hanja and symbols can be mapped to KS X 1001 algorithmically and then mapped to Unicode using the KS X 1001 mapping table. Before going further, I have a question or two. It appears that euc-kr, ksc5601-raw (ksc5601-gl or whatever) and cp949 have their own mapping tables although they're closely related. Is there any reason for this? In the case of Johab, the easiest way to add support for it is to just generate the mapping table, but I feel uncomfortable bloating the code when it can be done algorithmically if I can make use of the mapping table for euc-kr or ksc5601(-raw). It appears that euc-jp and shift_jis don't share a mapping table either, although shift_jis and euc-jp can be more or less algorithmically converted to/from each other. I must be missing something here. There should be a way to do it, and I'd be glad if someone could tell me where to look for an example case (e.g. shift_jis and euc-jp). BTW, how about Big5-HKSCS (Hong Kong), GBK, and GB18030 (PRC)? 
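The "more or less algorithmic" Shift_JIS ↔ EUC-JP relationship mentioned above can be made concrete: both are byte-level rearrangements of the same JIS X 0208 row/cell values, so one table could in principle serve both. A sketch of the standard formulas, with Python's codecs used only to check the results:

```python
def jis_to_euc(j1: int, j2: int) -> bytes:
    """JIS X 0208 GL bytes -> EUC-JP: set the high bit on both bytes."""
    return bytes([j1 | 0x80, j2 | 0x80])

def jis_to_sjis(j1: int, j2: int) -> bytes:
    """JIS X 0208 GL bytes -> Shift_JIS, by the standard packing formula."""
    s1 = (j1 + 1) // 2 + (0x70 if j1 < 0x5F else 0xB0)
    if j1 % 2:                                       # odd row: first half
        s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)    # skip over 0x7F
    else:                                            # even row: second half
        s2 = j2 + 0x7E
    return bytes([s1, s2])

# U+4E9C (the first Level-1 kanji) is JIS row-cell 16-01 = GL 0x30 0x21:
assert jis_to_euc(0x30, 0x21)  == '\u4e9c'.encode('euc_jp')     # b'\xb0\xa1'
assert jis_to_sjis(0x30, 0x21) == '\u4e9c'.encode('shift_jis')  # b'\x88\x9f'
```

So the answer to the table-sharing question is that nothing in the encodings themselves prevents it; keeping separate tables is an implementation choice of each converter library.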
I *think* (but me speekee no Chineese) we do support those in Encode, but for space considerations one has to install an additional module, Encode::HanExtra. I found that Big5-HKSCS is included in 'plain Encode' and GBK, GB18030, EUC-TW, and Big5plus are in HanExtra. Jungshik Shin