Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1
Hi,

I received the following mail personally. The writer permitted me to cite it to linux-utf8.

From: [EMAIL PROTECTED]
Subject: Re: relevance of "[PATCH] tty utf8 mode" in linux-kernel 2.6.4-rc1
Date: Tue, 02 Mar 2004 09:34:37 -0500

(I can't post to this list right now, it's refusing my ISP's email relay, so I'm writing to you directly)

Tomohiro KUBOTA wrote:

Why do you think Kanji support is somewhat "fanciful" while the real Linux kernel has been supporting Latin/Cyrillic/Arabic/Greek and UTF-8? Is it because East Asian people are less important than European people?

This is a good point; however, it may be impractical to load full-featured Unicode settings and options, an input method, and a conversion engine very early in the kernel bootstrap process. Even if it was added to the kernel, the resulting size might still be too much to get meaningful support into LILO or GRUB, for example.

A compromise might be to use half-width katakana for kernel startup messages. English has accepted a considerable amount of change from the world of typewriters and computers, such that the language has been adapted to accommodate them as much as they to it. For very small embedded systems and kernel bootstrap routines, half-width katakana or a similar language compromise is more practical in my opinion.

Once the full, general-purpose operating system has been loaded, a proper and full-featured language interface would of course become available.

I think this is a reasonable compromise: a user who was not interested in the guts of the operating system would never see this stuff anyway; instead they would be presented with a nice shiny graphic while the system started up.

Yoroshiku ("regards"),

In my opinion, i18n support of the Linux console is important primarily for reading translated messages from various administrative commands.
In the Japanese case, translated messages are written in normal Japanese (a mixture of Hiragana and Kanji, with Katakana for transliteration of foreign words), not in Katakana alone. It is impossible to transliterate normal Hiragana-Kanji Japanese text into Katakana easily. (It needs a dictionary of the whole Japanese vocabulary, which is apparently much larger than a set of Japanese fonts.)

To read Japanese translated messages, support for Hiragana, Katakana, and Kanji (CJK Ideographs) is needed. What can be discussed as a compromise is the range of CJK Ideographs to be supported. In the case of Japanese, JIS X 0208 (fewer than 7000 characters) would be a moderate choice. The JIS X 0212 set (also fewer than 7000 characters) is likewise included in the "CJK Unified Ideographs" block (U+4E00 - U+9FAF), but it could be optional for the Linux console.

It may be feasible to limit Japanese *input* support on the Linux console to Hiragana or Katakana, because a full Japanese input system needs a dictionary of the whole Japanese vocabulary and a grammatical analysis system. (In the future, when such a large amount of data is relatively "small" compared to average disk/network capacity, there might be a real need to support full Japanese input.)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
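The code-point range cited above can be checked mechanically. A minimal Python sketch (the function name is my own, purely illustrative) that tests membership in the basic CJK Unified Ideographs block mentioned in the message:

```python
def in_cjk_unified_ideographs(ch: str) -> bool:
    """True if ch lies in the basic CJK Unified Ideographs block
    (U+4E00..U+9FAF, the range cited in the message above)."""
    return 0x4E00 <= ord(ch) <= 0x9FAF
```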
Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1
Hi,

From: Bruno Haible [EMAIL PROTECTED]
Subject: Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1
Date: Tue, 2 Mar 2004 15:17:32 +0100

No, you don't need a cursor at the middle position of a doublewidth character. There are two use-cases of terminals: ...

a) The applications which assume a line-oriented display and don't care about the line width. For these a line-oriented (or paragraph-oriented) terminal model is suitable. This terminal can decide about character widths on its own, do bidi and ligatures, possibly use proportional fonts. In this case there is no use for | for line drawing, or for block graphics.

Right.

b) The applications which assume a cell matrix. Examples: vim 6, GNU readline, X/Open curses. These applications know what is represented on the screen, and where, because they keep their own cell matrix. When such an application wants to put a | at position (x, y), it can do (gotoxy x-1 y) space space backspace | or (gotoxy x-1 y) space space (gotoxy x y) | instead of the simplistic (gotoxy x y) | that you propose.

Software has to be implemented that way. Otherwise, it fails.

If you are thinking about the far future, please think about a completely different system, instead of modifying the existing tty system.

No, the tty system has to be modified where needed.

When it is modified, compatibility must be kept.
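Bruno's cell-matrix trick can be made concrete. A hedged Python sketch of the second variant, `(gotoxy x-1 y) space space (gotoxy x y) |` (the function names and the use of ANSI CUP sequences are my own illustration, not from the mail):

```python
def gotoxy(x: int, y: int) -> bytes:
    """ANSI CUP sequence: ESC [ row ; col H, with 1-based coordinates."""
    return f"\x1b[{y + 1};{x + 1}H".encode("ascii")

def put_cell(x: int, y: int, ch: str) -> bytes:
    """Safely write a single-width character at cell (x, y), even when
    that cell currently holds the right half of a doublewidth character:
    blank the two cells starting one column to the left, then reposition
    and write (the '(gotoxy x-1 y) space space (gotoxy x y) ch' variant)."""
    return gotoxy(x - 1, y) + b"  " + gotoxy(x, y) + ch.encode("utf-8")
```

Blanking both cells first guarantees that no half of a doublewidth glyph survives, so the cell matrix kept by the application stays consistent with the screen.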
Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1
Hi,

You always make light of compatibility with non-European-language environments. Even if it were not an adequate choice, standards need to keep compatibility with popular past environments.

The kernel will have to handle wcwidth() anyway:
- to display doublewidth characters on the console
- to calculate the cursor position on the console after processing a 0x08 (in your case; if a 0x08 moves one *cell* in any case, the calculation does not need wcwidth())
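To illustrate what a cell-based renderer needs from wcwidth(), here is a rough width calculation in Python, approximated with the stdlib unicodedata tables; a real wcwidth() differs in corner cases (control characters, ambiguous-width characters), so treat this as a sketch only:

```python
import unicodedata

def char_width(ch: str) -> int:
    """Approximate wcwidth(): 0 for combining marks, 2 for East Asian
    Fullwidth/Wide characters, 1 for everything else."""
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("F", "W"):
        return 2
    return 1

def display_width(s: str) -> int:
    """wcswidth() analogue: number of terminal cells the string occupies."""
    return sum(char_width(ch) for ch in s)
```

With such a table the console can both render doublewidth characters and know how many cells the cursor sits past after any string.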
Re: Linux console internationalization
Hi,

From: Innocenti Maresin [EMAIL PROTECTED]
Subject: Re: Linux console internationalization
Date: Wed, 06 Aug 2003 03:22:27 +0400

Tomohiro KUBOTA wrote: Interesting, but any plan to support more than 512 characters?

Not within VGA text modes. 2^9 is a hardware restriction based on the text framebuffer's data semantics.

I see. It was with MS/PC-DOS version 6.x (so-called DOS/V) that IBM-compatible PCs became able to display Japanese characters on the text screen. (Before then, local Japanese PCs with hardware Japanese support were used.) I imagine that MS/PC-DOS used a VGA graphics mode. (I heard that the "V" in the name DOS/V came from VGA.)

And I think that 9x16 (this is the largest glyph size usable in VGA text) is apparently much less than is needed to read Japanese glyphs without eye strain. Even for a 12-year-old Japanese person ;-)

Right. On a tty, Japanese characters are displayed using two columns. For example, when ASCII characters are 8x16, Japanese characters are 16x16. So VGA text does not seem to be an acceptable solution for East Asia.

Right.
Re: Linux console internationalization
Hi,

From: Innocenti Maresin [EMAIL PROTECTED]
Subject: Linux console internationalization
Date: Wed, 06 Aug 2003 02:25:32 +0400

P.S. I have just made a Web page describing my view of Linux console i18n and further plans. There is also a glossary of the terms used. http://www.comtv.ru/~av95/linux/console/

Interesting, but is there any plan to support more than 512 characters? 512 is apparently much less than East Asian people's need. (For example, the basic Japanese character set (JIS X 0208) has several thousand characters. A 12-year-old Japanese person should know roughly one thousand characters, and adults should know many more.)

And how about fullwidth characters (i.e., the return value of wcwidth() is 2) and combining characters (wcwidth() is 0), as xterm supports them?

I am looking forward to the linuxconsole project http://linuxconsole.sourceforge.net/ Do you know the project?
Re: gtk2
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: gtk2
Date: Tue, 1 Apr 2003 22:02:36 -0500

gnome-terminal and multi-gnome-terminal are fairly lightweight. Also, the user interface used to configure, interact with, and use the input method has to use some toolkit. I'd say gtk2 is as good a choice as any.

As Glenn wrote, gnome-terminal is not very lightweight. And are you saying that speakers of non-European languages don't need to have choices? For example, there are people who like Eterm, Aterm, Wterm, Rxvt, Xterm, and so on. (Note that all of them support XIM.) Is it a privilege of European-language speakers to have such preferences? That is what I wanted to call ethnocentrism.
alias in fontconfig (Re: supporting XIM)
Hi,

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Mon, 31 Mar 2003 11:08:53 +0900

- a word processor whose menus and messages are in English but can input/display/print text in your native language

Which is better? The first one is completely unusable and the second one is inconvenient but usable.

I agree with you on this point. That's why I compared the status of KDE in 1999-2000 with that in 2003. Back in 1999-2000, KDE/Qt people thought that translating messages is I18N, but they don't any more, and KDE/Qt supports 'genuine I18N' much better now.

I am glad there are people who understand this point. Several years ago, even when I said this tens of times, I was ignored.

- Xmms cannot display non-8bit languages (music titles and so on).

Are you sure? It CAN display Chinese/Japanese/Korean id3 v1 tags as long as the codeset of the current locale is the codeset used in the ID3 v1 tag.

I'll test this further. However, please note I won't be satisfied by i18n which requires specific configuration other than setting the LANG variable (and installing the required software and resources).

- Xft/Xft2-based software cannot display Japanese and Korean at the same time even though Xft and Xft2 are UTF-8-based, because there are no fonts which contain both Japanese and Korean. This should not be regarded as a font-side problem, because (1) font-style principles differ among scripts (there is no courier font for Japanese)

You can use 'alias' in fontconfig if some programs use 'Courier' or 'Arial' instead of generic font names like 'monospace', 'serif', 'sans-serif', and so forth.

I want such aliases to be automated. If I have one Korean font installed, it is obvious that the renderer must use that font for all Korean text. It is not a good idea for the renderer to fail to display Korean when the user hasn't configured the alias.
Since typography differs among scripts (Latin, Cyrillic, Greek, Han, Hangul, Hiragana, Katakana, Arabic, Hebrew, Thai, ...), we cannot expect there to be many fonts which cover all the world's scripts (except for a few basic fonts like 'misc' or 'sans-serif'). I cannot imagine a courier Hiragana font or a mincho Arabic font. This is why the alias mechanism is not a makeshift but a naturally needed mechanism.

- There is no lightweight, i18n-ed GUI web browser like dillo.

I think that w3m-m17n is an excellent lightweight browser that supports I18N well.

Well, I meant a lightweight GUI browser. Though I haven't checked, I imagine dillo and so on use the 8bit font mechanism. There is another i18n extension of w3m: w3mmee. I don't know which is better.

- The FreeType mode of XFree86 Xterm doesn't support doublewidth characters.

Well, it sort of does. Anyway, I submitted a patch to Thomas and I expect he'll apply it sooner or later. After that, I'll add a '-faw' option (similar to the '-fw' option).

Fantastic! May I ask for more? Xterm can automatically search for a good (corresponding) doublewidth font in non-FreeType mode. How about your patch?

I already mentioned this issue. Programs like 'fmt' have to be modified, but there's already an alternative to 'fmt' that supports the Unicode line-breaking algorithm.

When I wrote this sentence, I thought about Text::Wrap in Perl.
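The fontconfig alias mechanism discussed above looks roughly like the following fonts.conf fragment. This is a hypothetical example: the choice of 'Courier' and the Japanese font name 'Kochi Gothic' are illustrative, not taken from the mails; the point is only the shape of the `<alias>`/`<prefer>` elements that Kubota would like to see generated automatically.

```xml
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <!-- When a program asks for "Courier", fall back to the generic
       monospace family, and let an installed CJK font (here "Kochi
       Gothic", an illustrative name) cover the Japanese text that
       "Courier" cannot. -->
  <alias>
    <family>Courier</family>
    <prefer>
      <family>monospace</family>
      <family>Kochi Gothic</family>
    </prefer>
  </alias>
</fontconfig>
```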
Re: gtk2 + japanese; gnome2 and keyboard layouts
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: Re: gtk2 + japanese; gnome2 and keyboard layouts
Date: Mon, 31 Mar 2003 01:44:15 -0500

Well, I for one have been placated for now by im-ja. It's precisely what I've been looking for, and extensive googling didn't root it out.

I also tested the im-ja Debian package with gnome-terminal. I feel it will surely be a convenient tool after more development. There are some points:

- Japanese input methods need user preference configuration. For example, some (not a small part of) Japanese people want Ctrl+U to be Hiragana conversion, Ctrl+I to be Katakana conversion, Ctrl+O to be Hankaku conversion, and Ctrl+P to be alphabet conversion (reverse romaji/kana conversion), without Kakutei (determination). These key bindings are from the popular commercial input engine ATOK. (It was my first input engine, 15 years ago, and I used it for about 8 years. After that, I configured all input methods (other than SKK) to ATOK-like key bindings.)

- Japanese input methods have a key sequence to switch between no-conversion and kanji conversion. In im-ja, Shift+Space or the Henkan key (available on Japanese keyboards) switches no-conversion - Hiragana - Katakana - Canna - Kanjipad - no-conversion. This is not suitable for Japanese people who want to input a large amount of Japanese text as a mother tongue (or first language). Usually, such omnibus switching (Hiragana - Katakana - Kanji - Kanjipad - JIS table - ...) is bound to the F10 key. I think it should be configurable, too. (I don't know why (from what analogy) the Henkan key was originally used for this purpose.)

- Canna mode seems not to show some important information, such as the conversion border (Bunsetsu border) and the currently converting Bunsetsu.

- Canna mode seems not to supply various conversion keys (for example, making the conversion border larger/smaller, Hiragana conversion, Katakana conversion, and so on). I may be wrong because I have not tested very thoroughly. (How about dictionary handling, the JIS character table, and so on?)
Does the GTK+2 Input Method Framework supply ways for input methods to provide configurators? Are there any Japanese members among the im-ja developers? Japanese people know many tiny but important points for achieving a convenient input method and user interface.

Anyway, I imagine most Japanese people will continue to use XIM for a while, because (1) changing input method is like changing keyboard layout from QWERTY to Dvorak, (2) GTK+2 input methods are not supported by popular software (you can imagine it is confusing to use multiple input methods with different user interfaces; it is like using QWERTY for one program and Dvorak for another), and (3) a conversion dictionary which a user has taught many words, along with the conversion order of homonyms, is a valuable thing, and changing input method may mean losing that data.
Re: gtk2 + japanese; gnome2 and keyboard layouts
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: Re: gtk2 + japanese; gnome2 and keyboard layouts
Date: Mon, 31 Mar 2003 21:27:29 -0500

Yes, that is a good point, but it brings up a question: how is this going to interact with applications which already have meanings for CTRL+O (File Open), CTRL+P (Print), etc.?

Key bindings of Japanese input methods are classified into (at least) three categories:
- keys which must be available at all times
- keys which must be available only when the input method is active
- keys which must be available only when there is an undetermined string

My examples of CTRL+O and CTRL+P are in the third category, because they convert the current undetermined string into Hiragana, Katakana, and so on. In other cases, the input method can pass these key sequences on to the application. Only the first-category keys are fatal for collisions. However, that category includes only one key: input method activation (like Shift+Space or Henkan in im-ja). Keys like mode changes among Hiragana/Katakana/Kanjipad are in the second category in ordinary input methods, though im-ja assigns Shift+Space or Henkan (the same as input method activation) to this function.

As a primary input method for a native speaker: I think it needs perhaps a bit more work, and of course evolution, mozilla, vim, etc., have to complete their transitions to gtk2.

To be popular among native (Japanese) speakers, popular software must support GTK2 input methods: for example, mule/emacs/xemacs, kterm, rxvt, xterm, and KDE software. Especially mule/emacs/xemacs, which is overwhelmingly popular among Japanese users because it has been the only way to write Japanese in both X and non-X environments for tens of years.
Re: supporting XIM
whitespace between words.

I feel that CJK people constantly have to keep watch on software which is already i18n-ed, because the i18n support of such software is sometimes broken when new versions are released. (Xedit often changes its status (can use XIM or cannot use XIM). What happens?) This is fatal if a translation is already supplied (as in the OpenOffice.org case). I think a certain part of CJK developers' time is wasted on this.
Pango tutorial? (Re: supporting XIM)
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Sun, 30 Mar 2003 19:25:41 -0500

If the theme engine uses pango for layout, and a desired language context is understood, I think this would work fine. Pango can always substitute fonts for missing glyphs...

Unfortunately, there are no tutorials for Pango. A developer of Xplanet and I sent mails to Pango developers (Evan Martin and Noah Levitt) to ask about that, but they think Pango is not intended to be used from applications directly, only from an upper toolkit layer. However, GTK2 is too heavy to be recommended for *all* software that displays some text.
Re: supporting XIM
Hi,

From: H. Peter Anvin [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: 30 Mar 2003 17:02:58 -0800

Perhaps not double-width, but there are plenty of non-ASCII, non-ISO-8859-1 characters in the Unicode set that should be interesting to U.S. programmers.

This is good information. However, I am afraid such people will hard-code UTF-8 support only up to two bytes. Though I haven't found such software myself, I have heard that it exists. We have to continue keeping watch on the i18n implementations of software.

How about the em-dash, or ligatures such as fi or ffl? Are they doublewidth?
Re: supporting XIM
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Sun, 30 Mar 2003 19:25:41 -0500

- Tcl/Tk's XIM support is unstable even now. (Every time I try to input Japanese, it sticks.) When I read Tcl/Tk's roadmap in the version 8.0 age, I was really surprised that XIM support (essential for CJK, as you know) was very low priority.

eh, XIM needs to be dropped imo. From personal observation, building tools such as XIM and IIIMF which are integrated into the X server is the wrong way to go, and GTK+ input methods seem to work much better.

Why wrong? Anyway, CJK people have been waiting for years. No more vaporware. Note that Tcl/Tk-based software which needs text input is not usable at all because of this problem.

- Text line wrapping. Chinese and Japanese (not Korean) don't use whitespace between words.

Ooh, that makes me curious: is there a good discussion of how to line-break Japanese text? I wonder how browsers are doing it...

The (non-)usage of spaces in Chinese and Japanese causes problems in text search systems such as mnoGoSearch. The mnoGoSearch developer team now seems to be thinking about using ChaSen to analyze Japanese text (though ChaSen doesn't support Chinese). Also, I cannot imagine a Japanese dictionary for ispell.

Line breaks in Japanese can occur at almost any place, except around several symbols (like kuten and touten, which are like the period and comma in English sentences). Also, Japanese sentences often contain Latin letters (for example, there are many companies whose names are written in Latin letters, like SONY, NEC, and so on) and whitespace. Note that an LF code in the original Japanese text must not be regarded as a space (don't insert a space when joining Japanese lines).

However, Thai is much more difficult. It doesn't use whitespace between words, but line breaking must be done at word boundaries. That means a Thai dictionary is needed to achieve correct line breaking for Thai.
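The Japanese rule described above ("break almost anywhere, except around certain symbols") can be sketched in a few lines of Python. The character sets below are a tiny illustrative subset of the real kinsoku shori rules, not a complete list:

```python
# Characters that must not start a line (small illustrative subset of
# kinsoku shori): touten, kuten, small kana, prolonged-sound mark,
# closing brackets, etc.
NO_BREAK_BEFORE = set("、。，．・ーぁぃぅぇぉっゃゅょ」』）")

# Characters that must not end a line: opening brackets.
NO_BREAK_AFTER = set("「『（")

def can_break_between(prev: str, nxt: str) -> bool:
    """Japanese allows a line break between almost any two characters,
    but not before closing punctuation such as touten (、) and kuten (。),
    and not after opening brackets."""
    return prev not in NO_BREAK_AFTER and nxt not in NO_BREAK_BEFORE
```

Unlike Thai, no dictionary is needed for this: the decision is purely local to the two adjacent characters.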
Re: supporting XIM
Hi,

From: Glenn Maynard [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Fri, 28 Mar 2003 16:49:31 -0500

Stop using the word racist. It's like saying if you don't support a feature I want, you're supporting terrorism; it makes people groan and stop paying attention. It's inflammatory, doesn't help your case at all, and injures your credibility.

I see. I didn't know the subtle nuance of the word. (Dictionaries never teach us about such nuances.)

However, I am often annoyed by people who think supporting European languages is more important than supporting Asian languages even when there is no technical problem in achieving such support. They don't hold racist ideas. They just feel non-European languages are somewhat exotic and that support for such languages is a special feature of software.

To be fair, I should mention that typical Japanese developers and users don't think about non-Japanese/English language support either. I don't think they are racists. They just forget there are languages other than Japanese and English.

How should I describe such people? I know they are not racists in the original meaning of the word. Note that even if they are not racists, the result (that there is little internationalized software) is almost the same as if they really were. The difference is that I have a little hope of persuading these developers not to forget about non-European-language speakers. Real racists, on the other hand, are those who explicitly know about non-European-language speakers and think they should be discriminated against.
Re: supporting XIM
Hi,

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Thu, 27 Mar 2003 18:38:51 -0500 (EST)

That's not a problem at all because there are Korean, Japanese and Chinese input modules that can coexist with other input modules and be switched to and from each other. With them, you don't need to use XIM. ...

One point: many Japanese texts include Latin letters, so Japanese people want to input not only Hiragana, Katakana, Kanji, and numerals but also the Latin alphabet. I imagine Korean people do, too. In such a case, switching between alphabet (no-conversion) mode and conversion mode has to be achievable by a simple key stroke like Shift+Space. The switch must be between conversion mode and no-conversion mode; it must not cycle among all installed input methods. Is this possible in GTK applications? (This is achieved in Windows: Alt-Esc switches between conversion and no-conversion, while Alt-Shift switches among installed input methods.)

Another point: I want to purge all non-internationalized software. Today, internationalization (such as Japanese character support) is regarded as a special feature. However, I think that lack of internationalization should be regarded as a bug as severe as racist software. However, GTK is a relatively heavy toolkit, and developers who want to write lightweight software won't use it. I never think that one internationalized program (for example, gnome-terminal) is enough. If developers want to develop other programs in the same category (xterm, rxvt, eterm, aterm, ...), it means users have the freedom to choose. Such freedom of choice must not be a privilege of English-speaking (or European-language-speaking) people. Do you have any idea how to solve this problem?

There is at least one Japanese gtk2 input module, as I wrote above. You just have to install it because it doesn't come by default with gnome 2.x.

Japanese people need multiple input modules.
This is because Japanese conversion is too complex for software to achieve perfectly. Since complexity itself sometimes confuses users, there are input methods which aim to be simple so as not to surprise users. (However, such simplicity is achieved by requiring more information or more keyboard input from users for conversion.) People who don't want to keep watching the screen or keyboard while inputting a sentence (expert users) tend to prefer such simple methods with less need to watch the screen to confirm the conversion result. SKK is one such method. It cannot convert multiple words at a time (unlike most modern input methods), but that means it never (wrongly) converts one word into multiple words. T-Code is a much more spartan input method, with a one-to-one mapping from a keyboard sequence to a kanji. Though a user has to remember thousands of such mappings because the Japanese language needs thousands of kanji, such input methods are popular among a certain number of (not many) Japanese people.

Of course, several Japanese companies are competing in the input method area on Windows. These companies are doing research into better input methods: larger and better-tuned dictionaries with newly coined words and phrases, better grammatical and semantic analyzers, and so on. I imagine this is one of the areas where Open Source people cannot compete with commercial software built by full-time developer teams.

How about Korean?
Re: supporting XIM
Hi,

From: Edward H Trager [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Wed, 26 Mar 2003 12:29:30 -0500 (EST)

I'd also like to be able to see instantaneous, on-the-fly switching of language/locale without having to restart KDE or Gnome or the program being used. I want to be able to just hit a button or key combination to switch everything from, say, English to French, or Chinese, or Japanese... It would be similar to using Yudit where I can easily assign function keys for changing the keyboard map/input method.

Is it possible to implement an XIM server as a wrapper for other XIM servers and input method engines/libraries? It would also wrap the locale, so that UTF-8 software would be able to connect with Canna.

BTW, mlterm (http://mlterm.sourceforge.net) can switch XIM servers on the fly. Since it manages the XIM-connecting locale independently from its main locale, it can (for example) use Canna and so on from the en_US.UTF-8 locale. I think you can customize mlterm so that you can switch input methods with function keys or other keys.

However, such an application-side solution is not very good, because it depends on the application, and most software would have poor input method support because most developers in the world don't know very much about input methods. I want all software, including lightweight programs, to be able to input/output not only ASCII or 8bit characters but also my mother tongue (Japanese). I don't want to have to say "Hey, I am lucky! At last I found a Japanese-capable program!".
Re: supporting XIM
Hi,

From: Maiorana, Jason [EMAIL PROTECTED]
Subject: RE: supporting XIM
Date: Mon, 24 Mar 2003 11:11:31 -0500

I think it should be much more stateless, allowing the client library to do the romaji/kana conversions, and simply having the server answer queries for possible Kanji, of course all in UTF-8. The state of the client's interface should be kept on the client side, imo.

FYI: Anthy is designed as a library-based input method. All tasks, including not only romaji/kana conversion but also kanji conversion, are done in the library. A GTK+ module and an XIM module are provided. (I have not tested Anthy.) I heard that Anthy stopped providing an IIIMF module because the developers thought the IIIMF protocol has a security problem, but I don't know the details. Hiura-san, do you know something about this?

Canna and Wnn (now FreeWnn) are designed as client-server systems. They have their own protocols. Emacs (tamago), XEmacs (with Mule-Canna-Wnn), and kinput2 are well-known clients for Canna and Wnn servers. As you know, kinput2 is an XIM server for Canna and Wnn.

Thus, if you don't like XIM but don't hesitate to use Canna or FreeWnn, there might be a way to develop a GTK+ module for Canna and FreeWnn. The problem with this solution is that it is valid only for GTK2-based software: not for basic software such as xterm, rxvt, and emacs, not for KDE software, and not for slow computers whose users don't want to use GTK2.

The problem with IIIMF is --- as far as I tested --- that it is neither easy to compile nor very stable. Hiura-san, do you have any plans to provide easy-to-test .rpm and .deb packages of IIIMF-related software, which might make users and developers become interested in IIIMF, want to study it, and want to develop IIIMF software?
Re: supporting XIM
Hi,

From: Juliusz Chroboczek [EMAIL PROTECTED]
Subject: Re: supporting XIM [was: lamerpad]
Date: 13 Mar 2003 01:27:47 +0100

The problem with IM support under X11 is that the XIM framework doesn't make sense. It defines an overly complex protocol that requires both the client and the XIM server to perform dozens of useless activities. Additionally, it defines four only remotely related protocols (``styles''), all of which need to be tested against.

I don't know about the XIM protocol itself, but I don't think it is difficult to implement an XIM client, nor is it more complicated than needed. One or two styles are enough. In particular, the over-the-spot style is relatively simple to implement and useful for users. What points do you think are useless in XIM? I don't know why you think so: whether because you really understand XIM, or because you don't know about the complexity and features needed for CJK support.
Re: FYI: lamerpad
Hi,

From: [EMAIL PROTECTED] (Janusz S. Bie)
Subject: Re: FYI: lamerpad
Date: 12 Mar 2003 08:02:59 +0100

The crucial question: does lamerpad work for you or anybody else? It doesn't work for me, see below.

You are right. I tested lamerpad and it failed, in several respects. First, it could not show any Kanji candidates. Second, I couldn't check whether it works as an XIM server such that an XIM client can connect to it. Third, it could not use GNU Unifont properly.
Re: FYI: lamerpad
Hi,

> From: Glenn Maynard [EMAIL PROTECTED]
> Subject: Re: FYI: lamerpad
> Date: Tue, 11 Mar 2003 21:13:16 -0500
>
> > Of course, adoption of Unicode alone cannot make your software
> > support CJK languages (more effort is needed). I hope Lamerpad
> > will help in testing software and will lead to more software
> > supporting CJK languages.
>
> What more is needed? Combining (Korean) and double-width characters
> (in the case of console apps) are two things that need special
> attention, but they're both just parts of supporting Unicode. Other
> than that, and input method support (which is unreasonably difficult
> at the moment --- based on conversations on this list --- except in
> Windows, where it's merely annoying), what more is needed in the
> general case?

If you are talking about full support of Unicode, including the
technical reports and so on, you are right. However, there is a lot of
software which claims to support Unicode but cannot handle bidi,
combining characters, double-width characters, UTF-8 characters longer
than two or three bytes, multiple fonts for multiple scripts, and so
on.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: supporting XIM
Hi,

> From: Maiorana, Jason [EMAIL PROTECTED]
> Subject: RE: supporting XIM
> Date: Mon, 24 Mar 2003 11:11:31 -0500
>
> > Which points of XIM do you think are useless? I don't know why you
> > think so --- whether because you really understand XIM, or because
> > you don't know about the complexity and features needed for CJK
> > support.
>
> Well, I find most XIM methods to be unstable, and they crash a lot.
> Plus, they are far too dependent upon locale. I don't see why an XIM
> method should have such fragile dependencies upon the locale. I like
> to operate under en_US.UTF-8, but I like to enter Japanese and
> Vietnamese sometimes. The Vietnamese input method implemented under
> GTK+ works fine no matter which locale I'm logged into; the XIM
> method for Japanese seems only to work under ja_JP.eucjp.

You can send mail to ask for an improvement to support UTF-8 locales:

  Canna:   http://canna.sourceforge.jp/
  FreeWnn: http://www.freewnn.org/
  Anthy:   http://anthy.sourceforge.jp/
  SKK:     http://openlab.ring.gr.jp/skk/
  XCIN:    http://xcin.linux.org.tw/

However, locale-dependence itself is not a bad thing. For example,
XCIN supports both traditional and simplified Chinese depending on the
locale. We can imagine an improvement where the default mode is
determined by the locale even when run-time switching between
traditional and simplified Chinese is supported.

> Also it crashes a lot, probably due to Canna being somewhat unstable
> under RH8. (Start Japanese input and type wildly for a second, and
> cannaserver will lock up.)

There seem to be poorly implemented XIM clients which cause XIM servers
to lock up. These are bugs in either the XIM clients or the servers;
please contact their developers.

> I think it should be much more stateless, allowing the client library
> to do the romaji/kana conversions, and simply having the server
> answer queries for possible kanji, of course all in UTF-8. The state
> of the client's interface should be kept on the client side, IMO.

I think support for UTF-8 locales is a good improvement. However,
romaji/kana conversion is not as simple as you think, because the
conversion table is configurable in modern conversion engines. In SKK,
romaji/kana conversion and kanji conversion are strongly connected from
the user's point of view, and I don't think such a separation can be
achieved. Kana-kanji conversion is much more complex: it is never a
simple one-to-one or one-to-many mapping. The timing of romaji-kana
and kana-kanji conversion is also a target of improvement for input
method developers, as SKK shows. There are also input methods, such as
T-Code, which use neither romaji nor kana.

It is not a good idea to impose a standardized communication step in
the middle of conversion. There are many input methods embodying
various ideas, user interfaces, and algorithms. Input method protocols
must be as extensible as possible, to allow input method developers to
realize their ideas.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
FYI: lamerpad
Hi,

I hope there are people here who are interested in internationalization
and in Unicode support including kanji, but I fear that it is difficult
for non-CJK developers to test kanji font/display/input/print support.

Lamerpad, http://www.debian.org.hk/~ypwong/lamerpad.html, seems to be a
good way for developers who don't know any CJK language to test whether
their own software supports kanji input. Of course, adoption of
Unicode alone cannot make your software support CJK languages (more
effort is needed). I hope Lamerpad will help in testing software and
will lead to more software supporting CJK languages.

Note that I have not tested Lamerpad yet.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Pango tutorial?
Hi,

I am now interested in Pango, because it says it:

- can output anti-aliased text,
- can handle multilingual text including CJK, bidi, combining
  characters, and complex Indic scripts,
- can choose proper fonts for the language (script) of (portions of) a
  given text, which means it doesn't force users to configure font
  settings just to display non-Latin text,
- can use multiple fonts for a multilingual text (one font per
  language/script), which means it can display a mixture of Japanese
  and Cyrillic when the system has a Japanese font and a Cyrillic font
  (even without a single font that covers both Japanese and Cyrillic),
  and
- is free (meets the Open Source Definition).

However, I have no idea how to use it. Are there any Pango tutorials?
Or are there any other text rendering engines which meet the above
conditions?

Concretely, I am now interested in the beta version of xplanet, which
uses FreeType. However, FreeType is a low-level renderer: it supports
neither bidi nor combining characters, and it does not take care of the
supported codepoint/language/script range of fonts. Thus, I think
FreeType is not suitable for application software directly; it should
be regarded as a basis for higher-level rendering engines. So the main
developer of xplanet and I are searching for a good text rendering
engine, and we are interested in Pango.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: How to read mail with #nnnn
Hi,

> P class=3DnormalB=E0i n=E0y #273;FONT face=3DTimes New =
> Roman#259;/FONTng kh=E1 l=E2u tr=EAn t#7901;=20
> b=E1o b#7841;n. #272;#7895; th=F4ng Minh th#7853;t s#7921; kh=F4ng =
> xa l#7841; g=EC v#7899;i ch=FAng t=F4ị Anh t#7915; Nh#7853;t khi=20
> #273;FONT face=3DTimes New Roman#7871;n Hoa th#7883;nh =
> #272;#7889;n th#432;#7901;ng #273;#7871;n nh=E0/FONT ch=FAng =
> t=F4i=20

This is quoted-printable-encoded HTML with numeric character
references. Most modern HTML rendering engines can decode it, so you
can read it with Mozilla and so on.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
"Introduction to I18N"
http://www.debian.org/doc/manuals/intro-i18n/
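The recovery an HTML engine performs on such a mail can be sketched in
a few lines: decode the quoted-printable `=XX` escapes, then expand the
numeric character references. The fragment below is a hypothetical
reconstruction of a few words from the message above (with the leading
`&` of each reference, lost in the quotation, restored), assuming the
windows-1258 Vietnamese codepage.

```python
import html
import quopri

# Hypothetical reconstruction of a fragment of the mail quoted above.
raw = b"B=E0i n=E0y &#273;&#259;ng kh=E1 l=E2u"

decoded = quopri.decodestring(raw).decode("cp1258")  # =E0 -> U+00E0, etc.
text = html.unescape(decoded)                        # &#273; -> U+0111, etc.
print(text)  # Bài này đăng khá lâu
```

The same two steps (transfer-encoding decode, then entity expansion)
are what a mail reader or browser applies automatically.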
Re: Linux Console in UTF-8 - current state
Hi,

At 29 Sep 2002 08:51:55 +0200, Eric Streit wrote:

> > It is probably safe to assume that anybody who wants to avoid
> > framebuffers will not need UTF-8 support, though, so a config
> > option for a stripped-down console that way might be useful.
>
> if we implement a complete graphical environment in the
> framebuffer... it's a way to reinvent X11 ;)

Well, Linux already has a framebuffer-based console. Our hope is just
to expand it (or develop something similar) for better Unicode support.

Though Markus may stick to the MES-* set, I think more characters
(including CJK and other Asian scripts) are a good choice. For
example, 18x18ja.bdf in XFree86's CVS today is about 4 MB, and about
600 kB compressed. Knowing that this font (or a similar one, of
similar size) is mandatory for CJK users, I imagine nobody would think
this size too large to be included in the Linux source code. Other
fonts (Arabic, Hebrew, Thai, Khmer, ...) will be considerably smaller
than CJK fonts, and they must be included.

On the other hand, italics and bold can be omitted, because they are
not mandatory for any country or language in the world. Of course I
never insist that Linux *should not* support italics and bold; I am
just pointing out that they have lower priority.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
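A back-of-envelope check of those sizes (the glyph count below is an
assumption, roughly the JIS X 0208 repertoire, not a figure taken from
18x18ja.bdf itself):

```python
# Rough storage estimate for an 18x18 CJK console bitmap font.
width = height = 18
glyphs = 7000                             # ~ JIS X 0208 repertoire (assumed)
bytes_per_row = (width + 7) // 8          # rows padded to whole bytes -> 3
bytes_per_glyph = bytes_per_row * height  # 54 bytes of raw bitmap per glyph
raw_kib = glyphs * bytes_per_glyph / 1024
print(f"{bytes_per_glyph} B/glyph, {raw_kib:.0f} KiB raw")
```

The raw bitmap payload comes out to a few hundred KiB, consistent with
the ~600 kB compressed figure cited above (BDF stores bitmaps as hex
text plus per-glyph metadata, which is why the uncompressed file is
several times larger).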
Re: Linux Console in UTF-8 - current state
Hi,

At Sun, 29 Sep 2002 03:30:47 -0500 (CDT), [EMAIL PROTECTED] wrote:

> It _is_ too large to be included. The kernel should include a
> Latin-1 font (for backwards-compatibility) and let the user load a
> large font if they want.

Though I don't understand where the borderline between "too large" and
"not too large" lies, I understand your idea of limiting the built-in
font to the backward-compatibility range, i.e., Latin-1. In this
scheme, the kernel itself would only have the ability to handle UTF-8,
and fonts would be supplied in separate packages (like the Linux
Console Tools) for users who need more than Latin-1 (the euro sign, for
example). Since most Linux distributions will have such a package, I
think this is reasonable. I hope the kernel's ability will include
support for zero-width and double-width characters.

Anyway, what I hate is dividing people into two classes: people who
don't need additional files and settings, and people who do. Japanese
users have always been forced to read books in order to configure
software to handle Japanese. I strongly hope that Unicode will make
the peoples of the world equal. To achieve this, we should not spoil
Unicode's advantage over ISO 2022 --- the unified character set --- by
splitting the code space and saying "this part is needed, that part is
optional".

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
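The zero-width and double-width support mentioned above amounts to a
wcwidth()-style cell-width function. A minimal sketch (not the
kernel's or glibc's actual implementation), using Python's Unicode
tables to stand in for wcwidth():

```python
import unicodedata

def cell_width(ch: str) -> int:
    """Terminal cells one character occupies: 0, 1, or 2."""
    if unicodedata.combining(ch):
        return 0                      # combining marks overstrike the base
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # East Asian Wide / Fullwidth forms
    return 1

def line_width(s: str) -> int:
    return sum(cell_width(ch) for ch in s)

print(line_width("abc"), line_width("\u6f22\u5b57"), line_width("e\u0301"))
# 3 4 1  ("abc"; two kanji; "e" plus combining acute accent)
```

A real implementation also has to handle control characters and
non-spacing format characters, but this is the core of what the console
needs beyond one-cell-per-byte Latin-1.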
Re: Linux and UTF8 filenames
Hi,

At Thu, 19 Sep 2002 12:15:09 -0400, Maiorana, Jason [EMAIL PROTECTED]
wrote:

> Do you know of any non-graphical input support for japanese?

As Mike said, there are several programs. Indeed, I am using GNU Emacs
to input Japanese over an ssh login.

> I was wondering if such software exists:
> - a text-terminal (non-X, non-GUI) Japanese input method system

There are no such text terminals which do both display and input.
Kon2 is a Linux kanji console which enables *display* of Japanese
(EUC-JP or Shift_JIS), but it has no Japanese input ability. Jfbterm
is a Linux framebuffer multilingual console based on ISO 2022; it also
has no Japanese input ability.

On the other hand, there are several Japanese-input wrappers which run
on a tty: uum for Wnn, canuum for Canna, and skkfep for SKK. Since
they are tty programs, they themselves have no display ability; they
leave that to the terminal.

> - a batch kanji picker: it's easy to take a quantity of romaji and
>   turn it into kana, but is there a command line tool, or anything,
>   which could take kana and produce kanji?

Impossible, because various kanji can be candidates for the same kana.
(It is possible with user interaction: display the candidates and let
the user choose one. The conversion process needs still more work,
because it has to divide a series of letters into words, which requires
grammatical analysis --- the Japanese language does not use whitespace
to separate words.)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
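The ambiguity can be made concrete with a toy candidate table (a
hypothetical illustration, not a real conversion engine): a batch tool
has no way to pick among the candidates, which is why interaction is
required.

```python
# One kana reading maps to several kanji spellings; the readings and
# candidates below are real Japanese words, but the table is a toy.
CANDIDATES = {
    "かがく": ["科学", "化学"],            # kagaku: science / chemistry
    "こうえん": ["公園", "講演", "公演"],  # kouen: park / lecture / performance
}

def convert(reading: str) -> list[str]:
    # A real engine must additionally segment the input into words via
    # grammatical analysis, since Japanese text has no word separators.
    return CANDIDATES.get(reading, [reading])

print(convert("かがく"))  # ['科学', '化学'] -- no unique answer exists
```

Only context (or the user) can decide which candidate is meant, so a
non-interactive kana-to-kanji pipe is impossible in the general case.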
Re: Linux and UTF8 filenames
Hi,

At Thu, 19 Sep 2002 10:17:35 -0400, Maiorana, Jason [EMAIL PROTECTED]
wrote:

> I don't think that IIIMF is really going to address the console issue
> at that level. (Also, it uses UTF-16 internally; anyone else find
> that weird for Unix software?)

Though I don't know how good or bad IIIMF is, I don't know of any
alternative which can input Chinese and Japanese. I agree that UTF-16
is a bad choice, but it is not fatal, whereas having no possibility of
support for Chinese and Japanese (no keymap-like approach can ever
support these languages) is fatal.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
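A small illustration of why UTF-16 reads as "weird for Unix software":
ASCII text acquires embedded NUL bytes, which byte-oriented Unix
interfaces (C strings, pipes, most of the era's toolchain) cannot pass
through, while UTF-8 leaves ASCII byte-for-byte unchanged.

```python
s = "kanji: \u6f22\u5b57"            # "kanji: " followed by two kanji
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")

print(b"\x00" in utf8)   # False: UTF-8 never emits NUL for non-NUL text
print(b"\x00" in utf16)  # True: every ASCII character carries a 0x00 byte
print(utf8[:7])          # b'kanji: ' -- the ASCII prefix is unchanged
```

This is the practical reason UTF-8, not UTF-16, became the Unix-side
interchange form, whatever a library uses internally.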
Re: Perl 5.8 with significantly improved UTF-8 support is out
Hi,

At Tue, 23 Jul 2002 13:54:59 +0100, markus kuhn wrote:

> Perl 5.8 is out!

Good news. I will have to try it... Does it support LC_CTYPE?

> Another major milestone reached ... I guess the emacs-unicode is now
> the only one left ...

The Linux console's Unicode support is also very poor: it can handle
only a few hundred characters, and cannot handle combining or
double-width characters. It has no API for CJK input methods.

Another one is Tcl/Tk: I cannot input Japanese into entry and text
widgets using XIM. Something must still be wrong, though it may be a
problem specific to the Debian package.

Extended input methods are also needed. For example, I cannot input
both Japanese and Korean in one xterm session, because there is no XIM
server which supports both Japanese and Korean, while xterm cannot
switch its XIM connection. (mlterm can do this, but I think all
software should be able to.)

Much software should be rewritten on top of internationalized rendering
libraries such as Pango in order to support complex scripts.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
XTerm patch to call luit (2)
Hi,

Here is a new patch.

- The default of the "locale" resource is changed from true to false.
  (I still have no idea which is best... see below.)
- Locale-related resource set-up is separated from VTInitialize() into
  a new function, VTInitialize_locale().
- Added vi (Vietnamese) to the luit-using locales in medium mode.
- Use nl_langinfo(CODESET) if available. (The definition of
  HAVE_LANGINFO_CODESET is not implemented yet. Could you help me,
  Bruno?)
- Use MB_CUR_MAX if available.
- Implemented mystrcasecmp() instead of using locale_str.

I heard from a Japanese person that the "locale" resource should
default to false for some time, until the resource becomes well known,
to avoid confusion. Once it is well known, he said, many people would
think the default should be true, and then the default could be changed
to true without annoying people. I think this opinion can be
integrated with Juliusz's opinion that the default should be changed to
true once some new font mechanism becomes dominant. So, what do you
think about a default of false?

The following is the new patch. Please note that it applies on top of
my previous patch.

begin 644 xterm-20020604-luit2.diff.gz
[uuencoded gzip attachment; the encoded data is corrupted in this
archive and has been omitted]
end

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: [I18n][call for comments] XTerm patch to invoke luit
Hi,

At Fri, 7 Jun 2002 15:06:09 +0200 (CEST), Bruno Haible wrote:

> CP1255 (Hebrew); CP1258, TCVN (Thai): either you hardwire them, or
> you document that xterm should not be used with 8-bit fonts in these
> encodings. (Are there 8-bit fonts for CP1255, CP1258, TCVN at all??)

For TIS-620 (ISO 8859-11) Thai, I don't like the documentation route,
because luit already supports TIS-620 and Thai people apparently
benefit from it. For CP1258 and TCVN Vietnamese, I think luit can
easily support them, though it does not yet. For Hebrew, I don't think
we have to care so far, because XTerm does not support bidi and we have
not yet agreed whether to support bidi at all. I could add ISCII to
the list of complex 8-bit encodings; however, since XTerm does not
support complex Indic scripts, I think it can be neglected for now.

IMO, the documentation route should be avoided as far as possible. If
we need to write documentation for a language, speakers of that
language will probably need to read tens of documents to use tens of
programs. That is exactly the situation Japanese people are in, and I
imagine people from other countries, such as Thailand and Vietnam, are
as well. Thus, I think hard-coding "th" and "vi" is a good way for
now.

Also, I heard that systems without locale support (built with X_LOCALE)
do not have MB_CUR_MAX. If that is true, we also need a fallback for
this case.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
[call for comments] XTerm patch to invoke luit
[The body of this message was a uuencoded patch attachment; the encoded
data is corrupted in this archive and cannot be recovered.]

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: [I18n][call for comments] XTerm patch to invoke luit
Hi,

At Thu, 6 Jun 2002 18:53:34 +0200 (CEST), Bruno Haible wrote:

> The default should follow the locale settings. In detail:
> - If MB_CUR_MAX == 1: Look at the specified main font. If it is an
>   8-bit font, use mode 1. Otherwise use mode 3.
> - If MB_CUR_MAX > 1: If nl_langinfo(CODESET) is UTF-8, use mode 2.
>   Otherwise use mode 3.

I think your opinion is to use this algorithm for the medium mode, and
to make this mode the default. This algorithm is better because it
does not hard-code any locale names. However, it does not work well
for Thai, for which I'd like to use mode 3, the "UTF-8 with luit"
behavior. Do you have any idea how to include 8-bit encodings which
need special processing, such as combining characters?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
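Bruno's rule, plus the exception asked for above, can be sketched as
follows. The real code is C inside xterm; here MB_CUR_MAX, the
codeset, and the font type are passed in as parameters, and the
exception set is a hypothetical illustration, not part of Bruno's
proposal.

```python
# Modes, per the thread: 1 = plain 8-bit, 2 = native UTF-8, 3 = UTF-8 via luit.
COMPLEX_8BIT = {"TIS-620"}  # 8-bit encodings needing combining support (assumed)

def pick_mode(mb_cur_max: int, codeset: str, main_font_is_8bit: bool) -> int:
    if codeset in COMPLEX_8BIT:           # the Thai exception
        return 3
    if mb_cur_max == 1:                   # single-byte locale
        return 1 if main_font_is_8bit else 3
    return 2 if codeset == "UTF-8" else 3

print(pick_mode(1, "ISO-8859-1", True),   # 1: 8-bit locale, 8-bit font
      pick_mode(1, "TIS-620", True),      # 3: Thai forced through luit
      pick_mode(3, "UTF-8", False),       # 2: native UTF-8
      pick_mode(3, "EUC-JP", True))       # 3: multibyte non-UTF-8 -> luit
```

The open question in the thread is exactly the first branch: whether
such a list of "complex" 8-bit codesets can be avoided.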
Re: ASCII and JIS X 0201 Roman - the backslash problem
Hi,

At Fri, 10 May 2002 14:58:21 +0200 (CEST), Bruno Haible wrote:

> Why is it more harmful if U+00A5 is an escape character than if
> U+005C is an escape character? In both cases you just double it to
> get the original character.

I think you mean that software which treats U+005C as an escape
character should be modified to treat U+00A5 as an escape character as
well. Am I right? The problem is that there already exists data
containing U+00A5 where it is not intended as an escape character.

> So it is a minor annoyance over the time of a few months, but by far
> not the costs that you are estimating.

For personal users, I think most people would accept the costs.
However, Unicode is used not only by individuals but also by companies,
and they won't accept such costs. Think about the Y2K problem:
companies --- especially banks, electricity and gas utilities, and so
on --- had to take extreme care, at huge cost.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
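A toy sketch of the "just double it" convention and of the migration
hazard raised above: once U+00A5 becomes an escape character, legacy
text which used it as a plain yen sign is silently reinterpreted. The
quoting scheme here is hypothetical, for illustration only.

```python
ESCAPES = {"\u005c", "\u00a5"}   # backslash and yen sign both escape

def quote(s: str) -> str:
    """Double every escape character so it survives unquoting."""
    return "".join(ch * 2 if ch in ESCAPES else ch for ch in s)

def unquote(s: str) -> str:
    """An escape character makes the following character literal."""
    out, i = [], 0
    while i < len(s):
        if s[i] in ESCAPES and i + 1 < len(s):
            out.append(s[i + 1])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(unquote(quote("C:\\dir")))    # C:\dir -- newly quoted data is fine
print(unquote("price: \u00a5100"))  # price: 100 -- legacy yen sign vanishes
```

New data round-trips, but every pre-existing literal yen sign now eats
the character after it, which is the class of silent corruption that
corporate users cannot accept.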
Re: ASCII and JIS X 0201 Roman - the backslash problem
Hi,

At Fri, 10 May 2002 14:17:04 -0400, Glenn Maynard wrote:

> The problem isn't the conversion costs, it's the fact that Windows
> will continue to use the characters incorrectly, and will reintroduce
> the problem continuously.

Right. Microsoft will *never* change their modified version of
Unicode. What we can do is call that encoding non-Unicode, even though
they call it Unicode.

> It wouldn't help people that actually need to *use* the yen symbol,
> since there'd still be no way to input the real single-width yen
> symbol, though it might be possible to add that to the input method.

I think input methods are not the problem now, because (1) in the
Japanese version of Windows, *only* the subset of Unicode which has a
conversion to CP932 is used, since Unicode is limited to internal
processing and the text files users handle are almost always CP932, and
(2) if the encoding or the mapping table is changed, then the input
method should be modified as a matter of course.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: ASCII and JIS X 0201 Roman - the backslash problem
Hi,

At Fri, 10 May 2002 15:33:13 -0400, Glenn Maynard wrote:

> Out of curiosity, Tomohiro, is full-width Yen commonly used? (I'd
> guess 円 would be a more obvious choice for full-width.)

If you mean Unicode U+FFE5 by "full-width Yen", I cannot give an
answer, because Unicode itself is not yet very popular in Japan.
However, the full-width yen of Shift_JIS and EUC-JP, i.e., 0x216F in
JIS X 0208, is widely used in Japan.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
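That mapping can be checked with any JIS X 0208-aware codec; here
Python's euc_jp codec stands in, assuming it follows the standard
JIS X 0208 table for this code point.

```python
# JIS X 0208 0x216F (kuten 01-79) is the full-width yen sign; in EUC-JP
# it is encoded as bytes 0xA1 0xEF, which map to U+FFE5.
fullwidth_yen = b"\xa1\xef".decode("euc_jp")
print(hex(ord(fullwidth_yen)))                   # 0xffe5
print("\uffe5".encode("euc_jp") == b"\xa1\xef")  # True: round-trips
```

This U+FFE5 FULLWIDTH YEN SIGN is a separate character from the
disputed U+00A5 / U+005C code point of the backslash problem, which is
why its wide use in Japan does not settle that dispute.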
Re: Switching to UTF-8
Hi,

At Mon, 6 May 2002 07:46:33 +0200, Pablo Saratxaga wrote:

> > In Hiragana/Katakana, processing of "n" is complex (though it may
> > be less complex than Hangul).
>
> No. The N is just a kana like any other, no complexity at all
> involved. Complexity only happens when typing in Latin letters.
> That is why the use of transliteration typing will always require an
> input method anyway; it cannot be handled with just Xkb.

In my sentence above, "n" is a Latin letter. It may correspond to
HIRAGANA/KATAKANA LETTER N, *or* be the first keystroke of n-a, n-i,
n-u, n-e, n-o, n-y-a, n-y-u, or n-y-o. (The keystrokes n-y-a should
give HIRAGANA/KATAKANA LETTER NI followed by HIRAGANA/KATAKANA LETTER
SMALL YA.)

Anyway, I understand your point that Latin to Hiragana/Katakana
conversion cannot be implemented as xkb.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
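The ambiguity of "n" can be made concrete with a toy romaji-to-hiragana
converter. The table below is a tiny hypothetical excerpt, not a real
engine's configurable table: the point is only that the converter must
look ahead past "n" before deciding between a にゃ-style syllable and
the standalone ん.

```python
TABLE = {
    "a": "あ", "i": "い", "ka": "か", "ki": "き",
    "na": "な", "ni": "に", "nya": "にゃ", "nn": "ん", "ya": "や",
}
VOWELS_OR_Y = set("aiueoy")

def to_kana(romaji: str) -> str:
    out, i = [], 0
    while i < len(romaji):
        for length in (3, 2, 1):          # longest keystroke match first
            chunk = romaji[i:i + length]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += length
                break
        else:
            if romaji[i] == "n" and (i + 1 == len(romaji)
                                     or romaji[i + 1] not in VOWELS_OR_Y):
                out.append("ん")          # "n" before a consonant or at end
                i += 1
            else:
                out.append(romaji[i])     # unmapped keystroke: pass through
                i += 1
    return "".join(out)

print(to_kana("kani"), to_kana("kanki"), to_kana("kanya"), to_kana("kannya"))
# かに かんき かにゃ かんや
```

The same keystroke "n" produces に, ん, or the start of にゃ depending
on what follows it, which is exactly the lookahead that a plain
keysym-mapping mechanism such as xkb cannot express.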
Re: Switching to UTF-8
Hi,

At 02 May 2002 23:54:37 +1000, Roger So wrote:

> Note that the source from Li18nux will try to use its own encoding
> conversion mechanisms on Linux, which is broken. You need to tell it
> to use iconv instead.

I didn't know that, because I am not a user of IIIMF or other Li18nux
products. How is it broken?

> Maybe I should attempt to package it for Debian again, now that woody
> is almost out of the way. (I have the full IIIMF stuff working well
> on my development machine.)

I found that Debian has an iiimecf package. Do you know what it is?

> I don't think xkb is sufficient because (1) there's a large number of
> different Chinese input methods out there, and (2) most of the input
> methods require the user to choose from a list of candidates after
> preedit. I _do_ think xkb is sufficient for Japanese though, if you
> limit Japanese to only hiragana and katakana. ;)

I believe you are joking about such a limitation. The Japanese
language has far fewer vowels and consonants than Korean, which results
in many more homonyms than in Korean. Thus, I think native Japanese
speakers will never decide to abolish kanji. (Please don't joke on an
international mailing list, because people who don't know Japanese may
take you seriously.)

Even if we limit input to hiragana/katakana, xkb may not be sufficient.
For a one-key-one-kana method, I think xkb can be used. However, more
than half of Japanese computer users use romaji-kana conversion, a
two-keys-one-kana method. The complexity of that algorithm is like a
two- or three-key Hangul input method, I think. Do you think such an
algorithm can be implemented as xkb? If yes, then I think romaji-kana
conversion (whose complexity is like a Hangul input method) can be
implemented as xkb.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: Switching to UTF-8
Hi,

At Sun, 5 May 2002 19:12:31 -0400 (EDT), Jungshik Shin wrote:

> > I believe you are joking about such a limitation. The Japanese
> > language has far fewer vowels and consonants than Korean, which
> > results in many more homonyms than in Korean. Thus, I think ...
>
> Well, actually it's due not so much to the difference in the number
> of consonants and vowels as to the fact that Korean has both closed
> and open syllables while Japanese has only open syllables; that is
> what makes Japanese have a lot more homonyms than Korean.

You may be right. Anyway, the real reason is that Japanese has a lot
of words from old Chinese; words which are not homonyms in Chinese
become homonyms in Japanese. (They may or may not be homonyms in
Korean; I believe Korean also has a lot of Chinese-origin words.)
Since the way new words are coined is based on the kanji system, the
Japanese language would lose vitality without kanji.

> I don't think Japanese will ever do [abolish kanji], either.
> However, I'm afraid having too many homonyms is a little too 'feeble'
> a 'rationale' for not being able to convert to all-phonetic scripts
> like Hiragana and Katakana. ...

Since I don't represent the Japanese people, I won't say whether it is
a good idea or not to have many homonyms. You are right that there are
many other reasons for and against using kanji, and I cannot explain
everything.

Japanese pronunciation does cause trouble, though it is largely helped
by accent and rhythm. In some cases, however, neither accent nor
context can help. For example, both "science" and "chemistry" are
"kagaku" in Japanese, so we sometimes call chemistry "bakegaku", where
"bake" is another reading of the "ka" of chemistry. Another famously
confusing pair is "private (organization)" and "municipal
(organization)", both called "shiritsu"; so private is sometimes called
"watakushiritsu" and municipal "ichiritsu", these alias names again
coming from different readings of the kanji. If you listen to Japanese
news programs every day, you will hear such examples some day.

These days, more and more Japanese people want to learn more kanji, to
use their abundant power of expression, though I am not one of these
kanji learners.

> I would also like to know whether it's possible with Xkb. BTW, if we
> use three-set keyboards (where leading consonants and trailing
> consonants are assigned separate keys) and use the U+1100 Hangul
> Conjoining Jamos, Korean Hangul input is entirely possible with Xkb
> alone.

A note for xkb experts who don't know Hiragana/Katakana/Hangul: input
methods for these scripts need backtracking. For example, in Hangul,
imagine I hit keys in the sequence c-v-c-v (c: consonant, v: vowel).
After c-v-c, the input should represent one Hangul syllable, c-v-c.
However, when I hit the next v, it should become two Hangul syllables,
c-v c-v. In Hiragana/Katakana, the processing of "n" is complex
(though it may be less complex than Hangul).

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
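The three-set-keyboard arithmetic referred to above follows directly
from the Unicode Hangul syllable composition formula. The jamo indices
below use the standard ordering of the 19 modern leading consonants, 21
vowels, and trailing consonants (with 0 meaning "no trailing
consonant"); the c-v-c-v example is the backtracking case described in
the note above.

```python
def compose(lead: int, vowel: int, tail: int = 0) -> str:
    """Precomposed syllable = U+AC00 + (lead*21 + vowel)*28 + tail."""
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

# c-v-c-v backtracking: after keys for ㅎ(lead 18), ㅏ(vowel 0), ㄴ(tail 4)
# the engine shows 한; a following vowel steals the ㄴ, giving 하 + 나
# (ㄴ re-read as a leading consonant, index 2).
print(compose(18, 0, 4))               # 한
print(compose(18, 0) + compose(2, 0))  # 하나
```

Because the syllable is a pure function of the jamo indices, a
three-set keyboard (which never re-reads a trailing consonant as a
leading one) needs no backtracking, which is why it could plausibly be
done in xkb alone while two-set input cannot.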
Re: Switching to UTF-8
Hi, At Thu, 2 May 2002 02:14:29 -0400 (EDT), Jungshik Shin wrote: You mean IIIMF, didn't you? If there's any actual implementation, I'd love to try it out. We need to have Windows 2k/XP or MacOS 9/X style keyboard/IM switching mechanism/UI so that keyboard/IM modules targeted at/customized for each language can coexist and be brought up as necessary. It appears that IIIMF seems to be the only way unless somebody writes a gigantic one-fits-all XIM server for UTF-8 locale(s). I heard that IIIMF has some security problems from Project HEKE people http://www.kmc.gr.jp/proj/heke/ . I don't know whether it is true or not, nor the problem (if any) is solved or not. There _is_ already an implementation of IIIMF. You can download it from Li18nux site. However, I could not succeeded to try it. Since I have heard several reports of IIIMF users, it is simply my fault. There seems to be some XIM-based implementations which can input multiple complex languages. One is ximswitch software in Kondara Linux distribution. http://www.kondara.org . I downloaded it but I didn't test it yet. Another is mlterm http://mlterm.sourceforge.net/ which is entirely client-side solution to switch multiple XIM servers. Though I don't think it is a good idea to require clients to have such mechanisms, it is the only practical way so far to realize multiple language input. How about just running your favorite XIM under ja_JP.EUC-JP while all other applications are launched under ja_JP.UTF-8? As you know well, it just works fine although the character repertoire you can enter is limited to that of EUC-JP. Of course, this is not full-blown UTF-8 support, but at least it should give you the same degree of Japanese input support under ja_JP.UTF-8 as under ja_JP.EUC-JP. Well, then you would say what the point of moving to UTF-8 is. You can at least display more characters under UTF-8 than under EUC-JP, can't you? :-) There are, so far, no conversion engine which requires over-EUC-JP character set. 
Thus, EUC-JP is enough for now. If someone wants to develop an input engine which supports more characters, he/she will want to use UTF-8. However, I think nobody in Japan feels a strong necessity for that, beyond pure technical interest in Unicode itself.

> BTW, Xkb may work for Korean Hangul, too, and we don't need XIM if we use a 'three-set keyboard' instead of a 'two-set keyboard' and can live without Hanjas. I have to know more about Xkb to be certain, though.

I see. This is not true for Japanese: Japanese people do need grammar and context analysis software to get Kanji text. How about Chinese?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
"Introduction to I18N"
http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: Switching to UTF-8
Hi,

At Wed, 01 May 2002 20:02:57 +0100, Markus Kuhn wrote:

> I have for some time now been using UTF-8 more frequently than ISO 8859-1. The three critical milestones that still keep me from moving entirely to UTF-8 are

How about bash? Do you know of any improvement? Please note that tcsh has already supported east Asian EUC-like multibyte encodings; I don't know whether it also supports UTF-8. How about zsh?

For Japanese, character width problems and mapping table problems must be solved before migration to UTF-8 can even _start_. (This is why several Japanese localization patches are available for UTF-8-based software such as Mutt. We should find ways to make such localization patches unnecessary.)

Also, I want people who develop UTF-8-based software to adopt the custom of specifying the range of their UTF-8 support. For example:

* range of codepoints: U+ - U+2fff? all of the BMP? SMP/SIP?
* special processing: combining characters? bidi? Arabic shaping? Indic scripts? Mongolian (which needs vertical writing)? How about wcwidth()?
* input methods: any way to input complex languages which cannot be supported by the xkb mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?) Or any software-specific input methods (like Emacs or Yudit)?
* font availability: though each piece of software is not responsible for this, "this software is designed to require the Times font" means that it cannot use non-Latin/Greek/Cyrillic characters.

Though people in the ISO-8859-1/2/15 region don't have to care about these points, other people can easily believe that a piece of software supports UTF-8 and then be disappointed when they use it. Then he/she will come to distrust software that claims UTF-8 support. We should avoid creating many such people.
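The character-width problem raised above can be probed with Python's unicodedata module, which exposes the Unicode EastAsianWidth class; the "A" (ambiguous) characters are exactly the ones whose terminal cell width is disputed. A minimal sketch (my own illustration, not part of the original mail):

```python
import unicodedata

# EastAsianWidth classes: "Na" = narrow, "W" = wide, "A" = ambiguous.
# Ambiguous characters are the problem case: one cell in Western use,
# two cells in CJK legacy-compatible use.
for ch in "A\u00b1\u2460\u65e5":          # 'A', '±', '①', '日'
    print(hex(ord(ch)), unicodedata.east_asian_width(ch))
```

A wcwidth() implementation that ignores the ambiguous class will misalign terminal output under CJK locales, which is the migration blocker the mail describes.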
Re: Switching to UTF-8
Hi,

At Thu, 2 May 2002 00:16:25 -0400, Glenn Maynard wrote:

> > * input methods: any way to input complex languages which cannot be supported by the xkb mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?) Or any software-specific input methods (like Emacs or Yudit)?
>
> How much extra work do X apps currently need to do to support input methods?

Much work. I think this is one problematic point of XIM, and it is why so few programs (only those written by the few XIM-knowledgeable developers) can accept CJK input. The X.org distribution (and the XFree86 distribution) includes a specification of the XIM protocol. However, it is difficult; at least I could not understand it. So, for practical use by developers, http://www.ainet.or.jp/~inoue/im/index-e.html would be useful for developing XIM clients. I have not seen a good introductory article on developing XIM servers.

I think a low-level API should integrate XIM (or other input method protocol) support, so that XIM-innocent developers (well, almost all developers in the world) can use it without annoying CJK people. Gnome2 seems to take this approach. However, I wonder why Xlib doesn't have wrapper functions which hide the troubles of XIM programming.

> It's little enough to add it easily to programs, but the fact that it exists at all means that I can't enter CJK into most programs. Since the regular 8-bit character message is in the system codepage, it's impossible to send CJK through.

Well, I am talking about Unicode-based software. More and more developers in the world are starting to understand that 8 bits are not enough, because that is a universal fact. I am optimistic in this field; many developers will realize in the near future that an 8-bit character type is a bad idea. However, it is unlikely that many developers will recognize the need for XIM (or other input method) support any time soon, because it is needed only for CJK languages. My concern is how to get these XIM-innocent people to develop CJK-supporting software.
> How does this compare with the situation in X?

Though I don't know about Windows programming, I often use Windows for my work. Imported software usually cannot handle Japanese because of font problems. However, the input method (IME?) seems to be invoked even in such imported software.

> > * font availability: though each piece of software is not responsible for this, "this software is designed to require the Times font" means that it cannot use non-Latin/Greek/Cyrillic characters.
>
> I can't think of ever using an (untranslated, English) X program and having it display anything but Latin characters. When is this actually a problem?

For example, XCreateFontSet("-*-times-*") cannot display Japanese, because there are no Japanese fonts which match that name. (Instead, mincho and gothic are the popular Japanese typefaces.) Such implementations are often seen in window managers and their theme files.
Renewed my Unicode/JIS page
Hi,

I revised my Unicode/JIS web page. http://www.debian.or.jp/~kubota/unicode-symbols.html

I used the new EastAsianWidth and mapping tables which are downloadable from the Internet. I rewrote my documents on the basis that the Unicode Consortium has never released official mapping tables between Unicode and east Asian encodings. I also mentioned the VARIATION SELECTORS which are introduced in Unicode 3.2. Please read and check it.
Re: 3.2 MAPPINGS/EASTASIA
Hi,

At Thu, 4 Apr 2002 11:58:57 +0200 (CEST), Bruno Haible wrote:

> Thanks a lot for these pointers! With this information, I can write a JISX0213 converter for glibc and libiconv.

Please note that these tables may be unofficial. Though jisx0213code.txt insists that it is built from the official JIS X 0213 standard, it also says that 1-1-29 is changed from U+2015 to U+2014 because of the JIS X 0221 standard; the JISX0213 InfoCenter web page insists that this should be considered a bug in the JIS X 0213 standard. Also, since the JIS X 0213 standard was released in 2000, the official mapping table should still have unmapped characters. According to the README.txt file in the IBM1394 archive, it seems to be related to CP932. Thus, I don't think it is a good source for an official JIS X 0213 mapping table.

I think you can use either of them (or a combination of them). However, it comes with a risk: I imagine a new version of JIS X 0213 will be available in a few years, and it will have a complete official mapping table. In that case, the mapping tables of glibc and libiconv will have to be changed. You can wait for the official mapping table, or you can implement a tentative table from jisx0213code.txt and IBM1394. Either will be OK.

> I'll make use of these 59 compatibility ideographs in the converter. That's the whole reason why they were introduced in Unicode 3.2.

Right. The problem is, there are no official mapping tables which use them yet.
Re: 3.2 MAPPINGS/EASTASIA
Hi,

At Tue, 2 Apr 2002 15:36:16 +0200 (CEST), Bruno Haible wrote:

> Does this also apply to JISX0213:2000? Do you know where to find the conversion tables for this character encoding? The PDF file in the ISO-IR registry contains only the pictures of each glyph, but no conversion table.

I found http://www.jca.apc.org/~earthian/aozora/0213.html http://www.jca.apc.org/~earthian/aozora/jisx0213code.zip but I don't know whether this is an authorized one (or an informative part of the JIS standard) or merely prepared by one person.

Also, I found http://www.cse.cuhk.edu.hk/~irg/ http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip It apparently includes IBM extended characters.

Strictly speaking, JIS X 0213:2000 *cannot* be given a definitive mapping table against ISO 10646, because JIS X 0213's Han unification rule is different from ISO 10646's. (You know, Unicode added several tens of compatibility ideographs which are different characters from JIS X 0213's point of view and different glyphs of the same character from Unicode's point of view.)
sorting order of Kanji
Hi,

At Mon, 25 Feb 2002 17:24:20 -0500, Glenn Maynard wrote:

> Kanji appear to be getting collated, however:
>
>   05:13pm [EMAIL PROTECTED]/2 [~] sort
>   日本
>   [a second Kanji word]
>   日本
>   (eof)
>   日本
>   日本
>   [a second Kanji word]
>
> (I couldn't tell if that's the correct collation order, but it's clear they're being reordered, where the hiragana above are not.)

It is impossible to collate Kanji by using simple functions such as strcoll(), because one Kanji has several readings depending on context (or word) in most cases. (This is the Japanese case.) It is technically virtually impossible: it would need a natural-language-understanding algorithm.

For Korean, one Kanji (Hanja) has one reading in most cases, though there are exceptions. However, if we ignore such exceptions, strcoll() would work by using a reading table for all ideograph characters. (Though it is technically possible, it would need a large dictionary.) I don't know about Chinese.

Thus, strcoll() simply works as strcmp().
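The point above — that a simple byte- or codepoint-wise sort is not a reading-order sort — can be shown in a few lines of Python. This is my own illustration, not from the original mail: in Japanese dictionary (kana) order, 東京 (*toukyou*) comes before 日本 (*nihon*), but 日 is U+65E5 and 東 is U+6771, so a plain codepoint sort puts them the other way around:

```python
# Codepoint order disagrees with reading order for Kanji words:
# 日 (U+65E5) < 東 (U+6771), so "日本" sorts before "東京" even though
# by reading, とうきょう (東京) precedes にほん (日本).
words = ["東京", "日本"]
print(sorted(words))
```

This is exactly the "strcoll() simply works as strcmp()" behavior: the sort is deterministic but linguistically meaningless.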
Re: Variation selectors for narrow/wide EastAsian glyphs
Hi,

At Mon, 04 Feb 2002 11:40:53 +0000, Markus Kuhn wrote:

> One potential alternative is that, given Unicode 3.2 has just introduced the notion of variation selectors, we ask the UTC and WG2 to consider the addition of two special variation selectors for single-width and double-width selection of glyphs in the East Asian ambiguous class.

Interesting. I have a few comments.

1. The range of characters for which I want to use the doublewidth version is not limited to the EastAsianAmbiguous class. The list of such characters depends on the Unicode - local encoding mapping tables, and we don't have authorized reference mapping tables; thus, I cannot show an exact list of such characters. However, if we want to support the Japanese mapping tables in http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA which are widely used now, the characters listed under "Width problems" in http://www.debian.or.jp/~kubota/unicode-symbols.html should be supported by your width selector. (I checked the Japanese mapping tables only; checking the Chinese and Korean tables may add characters to the list.) Thus, I think it is a good idea not to limit the range for which the width selector is effective. (Another idea is to change the EastAsianWidth definition. However, my proposal to change EastAsianWidth has failed...)

2. I am afraid that your proposal (or a proposal to change ISO 6429) may take a long time to be realized. That does not mean it is a bad idea to propose the width selector; I mean, we need some temporary solution, because this is a practical problem rather than a standardization problem.

3. I think this proposal is better than your SCW proposal because it is STATELESS, though the SCW proposal could be simplified to be stateless.

> That would be most easy to implement with existing font display engines that feature ligature substitution.
> That would be a way of allowing applications or encoding translation filters to have tight control over the width of a character on a character cell terminal, without the introduction of new ESC sequences. Then a font could easily contain both narrow (CP437) and wide (JIS) versions of the U+25xx box drawing characters, etc.

I don't think the introduction of a new 'character' is better than the introduction of new ESC sequences. I think they are equivalent.
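The width-selector idea discussed in this message can be sketched as a filter. Everything below is hypothetical: no such selector was ever assigned, so WIDE_SEL is a placeholder codepoint standing in for the proposed "double-width" variation selector (here I reuse the U+FE00-FE0F block mentioned in the thread purely for illustration):

```python
import unicodedata

# Hypothetical: WIDE_SEL is NOT a real standardized "wide" selector; it is
# a placeholder from the variation-selector block for this sketch only.
WIDE_SEL = "\ufe0f"

def mark_wide(s: str) -> str:
    """Append the imaginary wide selector to each Ambiguous-width char,
    leaving Narrow and Wide characters untouched (stateless, per point 3)."""
    return "".join(c + WIDE_SEL if unicodedata.east_asian_width(c) == "A"
                   else c for c in s)

print(repr(mark_wide("a\u00b1b")))   # only '±' (class "A") gets the selector
```

The statelessness the mail praises is visible here: each selector affects only the single preceding character, so a filter needs no mode tracking.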
Li18nux Locale Name Guideline Public Review
Hi,

I found that the 2nd public review of the Li18nux Locale Name Guideline has started. http://www.hauN.org/ml/b-l-j/a/800/840.html http://www.li18nux.org/subgroups/sa/locnameguide/index.html The page says that comments are welcome until 14 Feb 2002.

Any additions from Li18nux insiders?
Re: [I18n]Li18nux Locale Name Guideline Public Review
Hi,

At Mon, 21 Jan 2002 19:18:09 +0900, Tomohiro KUBOTA wrote:

> I found that the 2nd public review of the Li18nux Locale Name Guideline has started. http://www.hauN.org/ml/b-l-j/a/800/840.html http://www.li18nux.org/subgroups/sa/locnameguide/index.html

One important note: I am not a member of Li18nux. Thus, people who have opinions should send them to Li18nux directly. The above web page describes how to comment.
Re: Unicode, character ambiguities
Hi,

At Sun, 13 Jan 2002 03:38:55 -0600 (CST), [EMAIL PROTECTED] wrote:

> > Not allowing any upgrade path from CP932 to Unicode is going to encourage them to stick with CP932, and that hurts *everyone*.
>
> There is an upgrade path; intelligently convert the character. I think fixing the problem now is better than everyone dealing with it for the next 40 years.

If you think so, please persuade Microsoft.

By the way, it is Unicode which introduced the distinction between Shift_JIS and CP932 and confused us. Without Unicode, the only difference between Shift_JIS and CP932 is that CP932 has some additional characters. Thus, it is wrong to say "this is a problem of CP932, and Unicode is not responsible."
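The claim that "the only difference between Shift_JIS and CP932 is that CP932 has some additional characters" is easy to check today, since Python ships both a strict shift_jis codec and a cp932 codec. A minimal sketch (my own illustration, not from the original mail):

```python
# CP932 = Shift_JIS plus vendor extensions.  The NEC row-13 characters
# decode only under cp932, not under strict shift_jis.
extra = b"\x87\x40"                       # NEC extension: CIRCLED DIGIT ONE
assert extra.decode("cp932") == "\u2460"  # '①'

try:
    extra.decode("shift_jis")             # not in plain JIS X 0208
    in_plain_sjis = True
except UnicodeDecodeError:
    in_plain_sjis = False
print("decodable in plain Shift_JIS:", in_plain_sjis)
```

(The *mapping* differences for shared codepoints, which the thread discusses elsewhere, are a separate matter from this repertoire difference.)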
Re: Unicode, character ambiguities
Hi,

At Sat, 12 Jan 2002 03:13:00 -0600 (CST), [EMAIL PROTECTED] wrote:

> > Some people apparently think there's a need (or at least, in the reverse). My preference, as a native speaker of neither of these languages, would be to display Japanese with a Japanese font and Chinese with a Chinese font, and I would be surprised if there were very few people with this preference.
>
> I'd prefer my KISS CDs to be displayed in a KISS font, too. That doesn't necessarily mean that it's feasible, or worthwhile to be put in a spec.

How many times have I heard such ignorance about Han characters... OK, it is natural that all of us are basically ignorant of non-native languages unless we study them.

The concept of Han variants is nothing like such a personal preference. It is nearly like a difference of characters. The terms "font" and "glyph" are merely based on Unicode's view that Han variants are the same character, and thus that the distinction between Han variants is to be achieved _technically_ by a change of fonts and glyphs. For example, do you think "good" (an English word) and "guten" (a German word) are the same word or different words? Han variants are like that.
Re: Unicode, character ambiguities
Hi,

At Fri, 11 Jan 2002 11:42:56 +0100, Kent Karlsson wrote:

> No it's not. And I was speaking as a matter of principle. If you are talking about the reference glyphs, then it is the responsibility of whoever is complaining about them to point to the *actual* reference glyphs, not some other glyphs that may or may not be the same as the reference glyphs. It should not be necessary for the *reader* to try to find out if the glyph referred to is sufficiently the same as the reference glyph(s) or not for the argument put forward.

You are basically right. However, the concept of unification is that the reference glyphs (printed in the standard book) have no more importance than the other unified glyphs.

I noticed I had one wrong idea. I am very sure that the low-resolution image I suggested is more than enough as a basis for discussing Han unification. However, I had not noticed that I can say that only because I am a native Japanese speaker and have trained for tens of years to read Han ideographs. Now I realize it is natural that you cannot tell whether the low-resolution image is enough or not. In reality, the differences between Han variants are clearly distinguishable even in the 16x16 pixel fonts which we often use with the X Window System.
Re: Unicode, character ambiguities
Hi,

At Fri, 11 Jan 2002 04:51:35 -0800, Edward Cherlin wrote:

> > For example, I can write "the cost is \100" and "the file is C:\text\abc.txt" or,
>
> How is such code executed, then? It appears severely broken. No compiler can tell from this code fragment which is supposed to be which, since \100 is a legitimate filespec in Windows. Fixing the source code at the source is a lot cleaner than inflicting your fix on the rest of the world. It's as bad as Oracle's attempt to define a standard for its variant UTF-8 (CESU-8, which apparently should be pronounced 'sezyu' in English). Their stated reason is the same, that it's too much work to fix all of their databases, and their cure is to lay even more work off on the rest of the world.

This is not code. Assume it is a message intended for human reading.

First, this problem affects not only source code but also many end users' texts. You can easily imagine that end users' text files contain many \ as a currency sign AND many \ as an element of file names. Even if you could persuade every Japanese Windows programmer to modify their source code, you would not succeed in persuading Japanese business users to modify files like accounts.xls.

In the case of Oracle, the problem was limited to the _internal_ encoding of the database (which end users don't care about), so end users need never notice any trouble if Oracle does a good job. Moreover, conversion from CESU-8 to correct UTF-8 can be done with a simple algorithm. On the other hand, the meaning of \ depends on context, and ultimately only the writer of the \ knows whether it should be U+005C or U+00A5.
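A context-guessing converter for the two meanings of 0x5C can be sketched, but only as a heuristic. This is my own hypothetical illustration, not from the thread, and the mail's point stands: any such rule must guess, because only the writer knows the intended character:

```python
import re

# Hypothetical heuristic: a backslash followed by digits is probably a
# yen amount ("\100"); otherwise treat it as a path separator.  This is
# a guess, not a correct algorithm -- that is the whole problem.
def guess_0x5c(text: str, i: int) -> str:
    if re.match(r"\d", text[i + 1:]):
        return "\u00a5"                 # YEN SIGN
    return "\u005c"                     # REVERSE SOLIDUS

s = r"the cost is \100 and the file is C:\text\abc.txt"
fixed = "".join(guess_0x5c(s, i) if c == "\\" else c
                for i, c in enumerate(s))
print(fixed)   # price becomes U+00A5, path keeps U+005C
```

Even this tiny example has failure modes (a path component starting with a digit), which is exactly why the mail concludes there is no general solution.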
Re: Unicode, character ambiguities
Hi,

At Fri, 11 Jan 2002 22:19:40 -0500, Glenn Maynard wrote:

> You have to assume that most Japanese systems will display \ as a Yen symbol, because they will.

Japanese Windows systems always display \ (0x5c in CP932 — or, as almost everyone calls it, Shift JIS) and U+005C with a yen symbol. However, most Linux/BSD/UNIX systems display \ (0x5c in EUC-JP, the most popular encoding for Linux/BSD/UNIX systems) and U+005C as a backslash, even in Japan.

> Now, translation tables for CP932 on these systems could translate backslash and the yen symbol both to the yen symbol;

What is "both"? I think you are talking about both the backslash and the yen symbol. However, what do you think the codepoints for them are in CP932? Answer: CP932 has the following yen signs and backslash:

  CP932 (Shift JIS)                Unicode (mapped by CP932 table)
  -------------------------------  ----------------------------------
  0x5C (yen sign)                  U+005C (yen sign glyph in Windows)
  0x81 0x5F (fullwidth backslash)  U+FF3C (fullwidth backslash)
  0x81 0x8F (fullwidth yen sign)   U+FFE5 (fullwidth yen sign)

Note that CP932 0x5C (yen sign) is derived from JIS X 0201, while CP932 0x81 0x5F and 0x81 0x8F are derived from JIS X 0208. Thus, if you modify the CP932 table so that 0x5C maps to U+00A5, it doesn't break round-trip compatibility with CP932.

In the case of Ogg, I think this can be a solution, because the strings are never parsed as filenames. However, this cannot be a general solution.
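The three rows of the table above can be checked against Python's built-in cp932 codec, which follows the Microsoft mapping table the mail describes (this check is my addition, not part of the original mail):

```python
# The three CP932 "yen/backslash" codepoints from the table above:
assert b"\x5c".decode("cp932") == "\u005c"       # drawn as yen by JP Windows fonts
assert b"\x81\x5f".decode("cp932") == "\uff3c"   # FULLWIDTH REVERSE SOLIDUS
assert b"\x81\x8f".decode("cp932") == "\uffe5"   # FULLWIDTH YEN SIGN
print("all three CP932 mappings as described")
```

Note the first row: the codec maps 0x5C to U+005C, and the "yen" appearance on Japanese Windows is purely a font matter, which is the crux of the whole thread.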
Re: Unicode, character ambiguities
Hi,

For glyph references, I am using http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=f9b1 and so on. Otherwise, the displayed glyphs depend on the system and we cannot discuss the same glyph.

At 9 Jan 2002 23:52:49 -0800, H. Peter Anvin [EMAIL PROTECTED] wrote:

> My wife's name is Suzi (Susan). Since it happens to phoneticize pretty poorly into Japanese, she has chosen to use the name Suzuran ("lily of the valley") in Japanese rather than spelling her name in Katakana. "Suzuran" is U+9234 U+862D (鈴蘭); however, I could personally not have told the reference glyph for U+9234 was the same character. I actually found a "compatibility form", U+F9B1 (鈴), which looks a lot more like I thought the character should look like, but that one is apparently only supposed to be used for Korean.

I feel U+F9B1 is a glyph for printing. Japanese people use the U+9234 reference glyph for handwriting, and we can read it; however, we never use it for printing, and U+9234 in print looks somewhat funny to me. Please refer to U+F9A8 vs U+4EE4 for a clearer example of this difference.

There are a few such exceptional cases. For example, U+8A00: the top element is written as a "dot" in the reference image; however, we use a "vertical stroke" for handwriting and a "horizontal stroke" for printing, and we never use the "dot". (I could not find images for these.) The image for U+5165 is also like handwriting; I could not find an image of the printing glyph.

Thus, I cannot say which is "Japanese", U+9234 or U+F9B1. Average Japanese people (who don't know Chinese or Korean) don't think that the difference between U+9234 and U+F9B1 is related to Chinese, Japanese, and Korean. The fonts on my system look like U+F9B1.

I think there are a few more examples. It is difficult to show "all" examples, like it is difficult for a native English speaker to show "all" verbs (s)he knows. It is also difficult even for me to list "all" irregular English verbs (like go-went-gone and come-came-come). However, I feel the number of examples would be very small.
Note that "Kyokasho-tai" (textbook typeface) is designed to be similar to handwriting, but this typeface is rarely used outside Japanese textbooks for elementary school.

> Interestingly, at least on my system U+9234 is displayed in the Japanese glyph rather than the reference glyph.

My system also shows both U+9234 and U+F9B1 like the U+F9B1 image.

---
久保田智広 Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Re: Unicode, character ambiguities
At Thu, 10 Jan 2002 01:06:22 -0500, Glenn Maynard wrote:

> How major a problem is this in practice, right now? One temporary solution I could suggest is having specs (in this case, Ogg tags) choose a specific vendor's translation tables for these, and saying "until Unicode standardizes these tables, use these, not your system's". That would at least (try to) guarantee that until that happens, if a user enters text on one system in SJIS, and moves it to another via UTF-8, he'll get the same SJIS output.

I think it is a good idea. I'd like you to ask the Unicode Consortium to follow your idea. However, the problem is that the Unicode Consortium doesn't have enough political power to define one standardized table, and it doesn't have the will to release one authorized mapping table. Do you think vendors like MS, Sun, IBM, Apple, and so on (all of them members of the Unicode Consortium) will throw away their private mapping tables and follow a common one, even though it means these vendors will lose compatibility with their previous products? It is almost impossible.

However, I think such vendors' interests are against users' interests. Thus I want many people to send mails requesting one standard mapping table. There may be a possibility that some private table will become popular enough to be a de-facto standard; I imagine many vendors are thinking their own private table will win the status of de-facto standard. Though I don't like the MS private table (CP932), because it has many more differences from the other tables, I will welcome it if it can end this confusing situation.

See the chapter "Conversion tables differ between vendors" in http://www.debian.or.jp/~kubota/unicode-symbols.html for details.
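The vendor-table divergence described above can be demonstrated with two tables that ship with Python today (a sketch I have added; it is not from the original mail). The same Shift JIS byte pair 0x81 0x60 — the "wave dash" — maps to different Unicode codepoints depending on whose table you pick:

```python
# Same bytes, two vendors, two Unicode characters: this is exactly the
# "conversion tables differ between vendors" problem.
wave = b"\x81\x60"
assert wave.decode("shift_jis") == "\u301c"   # JIS-style table: WAVE DASH
assert wave.decode("cp932") == "\uff5e"       # Microsoft table: FULLWIDTH TILDE
print("0x8160 round-trips through different codepoints per vendor")
```

A file converted to UTF-8 on one system and converted back on another can therefore change bytes, which is the interchange failure the Ogg-tag suggestion tries to work around.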
Re: Unicode, character ambiguities
Hi,

At 10 Jan 2002 10:02:21 -0800, H. Peter Anvin [EMAIL PROTECTED] wrote:

> > I think there are a few more examples. It is difficult to show "all" examples, like it is difficult for a native English speaker to show "all" verbs (s)he knows. It is also difficult even for me to list "all" irregular English verbs (like go-went-gone and come-came-come). However, I feel the number of examples would be very small.
>
> If so, that would imply the number of code points would also be very small, and that it wouldn't be a major loss to assign code points to them. Would you agree?

No. I should have added that one example may mean hundreds of characters, because one radical may be shared by hundreds of characters.
Re: Unicode, character ambiguities
Hi,

At Wed, 9 Jan 2002 11:59:12 -0500 (EST), Henry Spencer wrote:

> Indeed so. But you are also an insider with strong opinions on the matter, and that will influence your reporting, no matter how hard you try to be impartial. (Even experimenters systematically recording data tend to make errors favoring their own beliefs, perhaps because they are more careful when recording favorable results. This is why medical experiments nowadays always use "double blind" procedures, in which the experimenter himself does not know which patients are getting which treatment until afterward.)

So, do you mean that I am not free from such a bias while you are? Did the Japanese scholar who prepared Han unification say that Japanese people can read Chinese or Korean glyphs? Did (s)he say that his/her theory is widely accepted by common Japanese people?

Yes, I think my opinion is not that of the average Japanese person. I am rather more of a Unicode lover than average Japanese people are.
Re: Unicode, character ambiguities
Hi,

At Wed, 9 Jan 2002 17:26:47 -0500 (EST), Henry Spencer wrote:

> I have no bias on the subject, mostly because I have no opinion on the subject. :-) I don't claim to know what the general opinion in Japan about Unicode or Han unification is (or would be).

Sophism. For example, you may be interested in Unicode and hope it becomes popular as soon as possible, while not caring about native Japanese speakers' interests. What do you suspect about my opinion? I said that I hope Unicode will be usable for native Japanese people. I sometimes criticize Unicode because I hope it will become more useful; I don't criticize it out of hate. What is wrong with this position? What bias do you suspect?

As for myself, I graduated from a university, which may mean my knowledge of Kanji characters is above that of the average Japanese person; thus, I may be biased in that I know somewhat more Kanji characters than average. However, my job is related to neither computers, publication, typesetting, nor literature, so my knowledge of Kanji may be lower than that of people with such jobs. I have disclosed everything which may bias my opinions or feelings. And you?
Re: Unicode, character ambiguities
Hi,

At Wed, 9 Jan 2002 16:00:27 +0100, Pablo Saratxaga wrote:

> > Not true. I am a native Japanese speaker. There are some characters whose Japanese version is very basic (such that an elementary school student can read it) while I cannot read the Chinese version.
>
> But are those unified? Have you an example of a unified one in such a case?

Yes, unified. The most famous example is U+76F4. I'd like to show an image, but images are not available at: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=76f4

There are many characters which have the "walk" radical. The Japanese "walk" radical (usually) has one dot (note that "dot" is a term of Japanese calligraphy for an element of Kanji characters, like "vertical stroke", "horizontal stroke", "right brush", "left brush", and "leap", though my English translations may be wrong), while the traditional-Chinese and Korean "walk" radical has two dots. Since only a small number of Japanese characters use the "walk" radical with two dots (i.e., in Japan, some characters have two dots and many others have one dot), the number of dots is important for Japanese. However, they are unified. (In this case, Japanese people kan reed iT, just giv!ng a phunNy empResion.)

(U+2ECC is the "walk" radical with one dot; U+2ECD is the "walk" radical with two dots. There is another variant, U+2ECE, which is not used in Japan. Though these radicals themselves are not unified, characters containing these "walk" variants are unified.)

There are many radicals which have similar problems. Many older Japanese people can read the traditional-Chinese style, because Japanese people used that style until about 1950.

> But those are not unified. Those that have two different codepoints in Japanese encodings are two different ones in Unicode too.

Since the usage of t-C style characters is exceptional in modern Japanese (in the case of people's family names, place names, and a few others), not many t-C style characters are encoded in Japanese character sets.
> Maybe I'm missing something and there are indeed some characters that are problematic; however, I haven't encountered any. On the other side, I agree that my knowledge of kanji must be far below yours, so maybe I just happen to not know the ones that are problematic (among others I don't know either, of course).

Believe me. I read tens of ads every day (on TV and in newspapers) because I live in Japan. (Sometimes Japanese ads use a very difficult character which nobody can read; the purpose is just to give an authoritative or intelligent impression.)

> I once saw a picture of an ad that had the kanji for "buy" (I think, I don't recall exactly) with its shell radical replaced by a picture of a real shell; if the text below the image hadn't said what it was supposed to be, I wouldn't have discovered it, for sure.

A real shell image for the shell radical? That is just art, not a character, at least for any level of computer text processing. Of course it is not traditional-Chinese style.
Unicode 3.2 Beta
Hi, Unicode 3.2 Beta is now in its public comment period. http://www.unicode.org/versions/beta.html It has Variation Selectors from U+FE00 to U+FE0F. However, the list of variations, i.e., StandardizedVariants.html, is not available now. Does anyone know the details? I'd like to know whether Variation Selectors can be used for CJK Han variants. (I sent a mail to [EMAIL PROTECTED] a few days ago but have not received a reply yet.)
Re: Unicode, character ambiguities
Hi, I am a native Japanese speaker and I think I am somewhat of a Unicode lover compared with the Japanese average. At Tue, 8 Jan 2002 23:03:35 -0500, Glenn Maynard wrote: What, exactly, needs to be done by an application (or rather, its data formats) to accommodate CJK in Unicode (and other languages with similar ambiguities)? The most well-known criticism against Unicode is that it unified Chinese, Japanese, and Korean Han ideograms (Kanji) with similar shape and origin, though they are different characters. Even native CJK speakers and CJK scholars can have different opinions on whether a given pair of Kanji are different characters or the same character with different shapes. Since Unicode takes a position different from that of most common Japanese people, Japanese people have come to generally dislike Unicode. It is natural that scholars have a wider variety of opinions than common people, and the Unicode Consortium did find a native Japanese scholar who supports Unicode's position; but that position differs from that of common Japanese people. Thus, Japanese people think Unicode cannot distinguish different characters from China, Japan, and Korea. Unicode's view is that these are the same characters with different shapes (glyphs), so they should share one codepoint, because Unicode is a _character_ code, not a _glyph_ code. This is Han Unification. Now nobody can stand against the political and commercial power of Unicode, and Japanese people feel helpless. Note that I have heard that Chinese and Korean people have a different opinion on Kanji than Japanese people: they think Kanji from China, Japan, and Korea are the same characters with different shapes, and they accept Unicode. If your software supports only one language at a time, you can use Unicode, and the problem is only to choose a proper font. Here, "Japanese font" means a font which has the Japanese glyph (in Unicode's view) for Han Unification codepoints. 
Now, the problem is to use a Japanese font for Japanese, a Chinese font for Chinese, and a Korean font for Korean. However, if your software supports multilingual text, the problem can be difficult. Japanese people want to distinguish unified Kanji. However, many (even Japanese) people are satisfied if Japanese text is written in a Japanese font. Thus, an easy compromise is to use a Japanese font for all Han Unification characters. (Chinese and Korean people will accept it.) I think the Han Unification problem can be ignored for daily usage, by using the compromise I wrote above. Is knowing the language enough? (For example, is it enough in HTML to write UTF-8 and use the LANG tag?) Is it generally important or useful to be able to change language mid-sentence? (It's much simpler to store a single language for a whole data element, and it's much easier to render.) Of course, if your software can carry language information, that is great. Mid-sentence language support is excellent! Using a Japanese font everywhere (as I wrote above) is a _compromise_, so anything that avoids the compromise is always welcome. However, I would rather see a growing percentage of the world's software become able to handle CJK characters as soon as possible than wait for perfect CJK support. There are a few ways to store language information: Plane 14 language tags, mark-up languages like XML, and so on. I wonder whether the Variation Selectors in Unicode 3.2 Beta http://www.unicode.org/versions/beta.html can be used for this purpose or not. Does anyone have information? Regarding round-trip compatibility: yes, round-trip compatibility for EUC-JP, EUC-KR, Big5, GB2312, and GBK is guaranteed, i.e., Unicode is a superset of these encodings (character sets). However, (1) there are no authoritative mapping tables between these encodings and Unicode, and there are various private mapping tables. This can cause portability problems around round-trip compatibility. 
(2) Unicode is _not_ a superset of the combination of these encodings, i.e., Unicode is _not_ a superset of ISO-2022-JP-2 and so on. For (1), I am now trying to persuade the Unicode Consortium to adopt some solution, or to write an attention note or technical report about this problem. I hear that the Unicode Technical Committee is now discussing this problem. For (2), no solution can exist, because Unicode and ISO-2022 have different opinions about what constitutes the identity of a character. However, usage of language tags or variation selectors(?) can partly solve this problem. An authoritative way to express the distinction between CJK Kanji must be determined, though, and everyone must follow it to keep portability. So far I hear nobody is wrestling with this problem... "authoritative" is a political problem rather than a technical one. Note that the internal encoding may be Unicode, but the stream I/O encoding has to be specified by the LC_CTYPE locale. This is mandatory for internationalized software.
Re: A nl_langinfo(CODESET) emulator for FreeBSD and other legacy platforms
Hi, At Wed, 26 Dec 2001 19:29:48 +, Markus Kuhn wrote: Simply ship your software with a little nl_langinfo() emulation that fixes that problem until the FreeBSD people get their act together and finally implement it. It can't take that much longer any more. Good work. Bruno's libcharset is also available for this purpose. It is a good idea to write the function as an emulation. http://www.cl.cam.ac.uk/~mgk25/ucs/langinfo.c http://www.cl.cam.ac.uk/~mgk25/ucs/langinfo.h The Debian GNU/Linux locales package includes various locale/encoding pairs. You may want to include them. In particular, TIS-620 for th would be needed. (If you want, I can send you the file.) And a suggestion: if LANG (or LC_CTYPE or LC_ALL) has an .encoding part, it should be checked first. (Currently langinfo.c checks only "utf" and "8859-".) Chinese users may use GBK or GB18030, and Hong Kong people may use Big5-HKSCS. Some people may use locale alias names, such as "german" for de_DE and "french" for fr_FR. Is there any way to manage these cases? I have heard that the default encoding for the Japanese locale on some proprietary Unix systems is Shift_JIS, not EUC-JP. However, I don't know the details and cannot suggest a concrete sample implementation.
Re: Emacs and UTF-8 locale
Hi, At Mon, 17 Dec 2001 21:02:03 +0100 (MET), Oliver Doepner wrote: Also, what exactly does Emacs do to use it? It sets the language environment to utf-8, and sets the default and preferred coding systems to utf-8. It also sets the default input method. Sorry for replying to an old discussion. I think UTF-8 mode should not determine the default input method. UTF-8 mode should only mean that the default input encoding is UTF-8 (since Emacs has an encoding-guessing and fallback mechanism, Emacs can fall back to other encodings if the encoding of the input file cannot be UTF-8; the fallback encodings can be locale-dependent) and that the default output encoding is UTF-8. The input method depends on the language, not the encoding.
Re: Two questions about console utf8 support
Hi, At Sat, 22 Dec 2001 21:50:31 -0800 (PST), James Simmons wrote: http://linuxconsole.sourceforge.net Hi folks. I'm that person that is rewriting the console system. Interesting. Though there were a few projects such as KON for Japanese, HAN for Korean, and JFBTERM for ISO-2022-based i18n, none of them was planned to be integrated into the Linux source code. (Offtopic: the reason for this is sometimes that skilled Japanese developers are not good at the English language.) I have one request, though I am not very familiar with this area. You know, East Asian languages use thousands of characters, and we need a conversion engine to input our languages. For the X Window System, we have a standard protocol called XIM. However, there is no such standard for the console. Is your project planning to supply some API or interface for this purpose? East Asian people will be much happier if the API is standardized and we can use the same conversion engine on Linux, BSD, and other UNIX-like systems.
Re: Emacs and UTF-8 locale
Hi, At Tue, 18 Dec 2001 15:38:19 +0200 (IST), Eli Zaretskii wrote: utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0); Thanks. This is something that should be added to Emacs. For now, Emacs implements the backup procedure, which is the Lisp equivalent of the following:

  char *s;
  int utf8_mode = 0;

  if ((s = getenv("LC_ALL")) || (s = getenv("LC_CTYPE")) || (s = getenv("LANG"))) {
      if (strstr(s, "UTF-8"))
          utf8_mode = 1;
  }

It is important that you do not test only LANG, but the first variable in the sequence LC_ALL, LC_CTYPE, LANG that has a value. That is what Emacs does. Why limit this to UTF-8? Since the LC_CTYPE locale is widely used not only for UTF-8 but also for various other encodings, and since GNU Emacs supports such encodings, I think it is a good idea to use the LC_CTYPE locale not only for detecting UTF-8 mode but also for detecting other encodings such as ISO-8859-*, KOI8-*, EUC-*, TIS-620, Big5, and so on.
Re: Yudit and XIM
Hi, At Fri, 14 Dec 2001 18:45:13 +0100, Juliusz Chroboczek wrote: XIM is intrinsically a locale-dependent protocol, as the set of available input methods is locale-dependent. Thus, the IM must be opened in the input method's locale. Right. On the other hand, once the IM has been opened, its usage is fully locale-independent; conversion from the IM's codeset to UTF-8 is done internally by Xlib. If you use Xutf8LookupString(), usage is locale-independent. (Instead, it depends on one specific encoding, UTF-8.) If you use XmbLookupString(), it will still give strings in the encoding of the locale in which you opened the IM. In practice, what this means is that the user must set her locale according to the IM she wishes to use. The programmer, on the other hand, does not need to bother with locale issues. For standard software, this is right. ("Standard" means that the software supplies a standard way for users to choose input methods.) On the other hand, Yudit doesn't follow the standard way (users must configure Yudit), and it should choose the IM from its menu. That is Yudit's design. To keep consistency with that design, Yudit should be able to choose input methods from the menu. Concretely speaking, Yudit should have skkinput, kinput2, xcin (traditional Chinese), xcin (simplified Chinese), ami, htt, and so on. Of course, I think the _initial_ input method can be chosen by following the standard configuration for all XIM-supporting software.
Re: Yudit and XIM
Hi, At Thu, 13 Dec 2001 19:39:15 +0100, Juliusz Chroboczek wrote: Bruno has added full support for locale-independent use of XIM in XFree86 4.1.0 Xlib. The 4.1.0 version has some bugs; for reliable support you will want to use the Debian patched version, or 4.1.99.2 or later (current CVS should be fine). For more information, man Xutf8LookupString(3) http://www.xfree86.org/current/Xutf8LookupString.3.html or see the function Input() in input.c in a reasonably recent version of XTerm. I think the introduction of Xutf8LookupString() is not sufficient for XIM to be locale-independent. For the OverTheSpot preedit type, the XIM client has to prepare an XFontSet so that the XIM server uses it for displaying preedit strings. This font (fontset) _must_ be prepared by the client side to keep consistent proportions between already-input text and the preedit string. (The aim of the OverTheSpot preedit type is to make users feel as if the preedit string is displayed seamlessly.) I suggest one solution. I am very, very sure that people who want to use XIM know about locales and have a proper locale for using the XIM. (I don't understand why Gaspar doesn't want to introduce locale-dependent features. Introducing such features does not reduce usability for people who use OSes without locale support.) Thus, just as you prepared kinput2 as a menu item for input, how about preparing menu items for popular XIM servers? The database of XIM servers (inside Yudit) would also hold the proper locale for each XIM server, and setlocale(LC_CTYPE, proper_locale) would be called when a user chooses an XIM for input. The list should be customizable by users, because we can never know the complete list of all XIM servers in the world. Please test mlterm (http://mlterm.sourceforge.net), which can dynamically change XIM servers using this method. 
Re: StarWars
Hi, At Thu, 13 Dec 2001 15:24:09 +, Markus Kuhn wrote: If you like VT100 terminals, I'm sure you will enjoy this: telnet towel.blinkenlights.nl I was refused a connection to this site... Anyone up to make a UTF-8 version of this? :) http://www.asciimation.co.nz/ Interesting. However, this is already UTF-8, though only a small subset of U+0020 - U+007E seems to be used. :-)
Re: diacritics in xterm
Hi, At Tue, 11 Dec 2001 21:47:49 +0100, Radovan Garabik wrote: : Thank you for the hint. So does this mean the problem hasn't been : fixed for two years and you recommend the dangerous fix of replacing the : xterm binary? It seems so. I have been running the dangerous binary for about 4 months, in both UTF-8 and ISO-8859-2 locales, and so far have had no problems at all. Please make sure that the fix does not disable multibyte character input.
Re: input method for Japanese
Hi, At Sun, 9 Dec 2001 12:27:44 +0100 (CET), Gernot Jander wrote: I have some applications for reading, editing and learning Japanese which are until now based on the kinput2/canna input method. As far as I can see, this method is bound to the EUC encoding. Is there any other input method known that uses UTF-8 and works with the ja_JP.utf-8 locale? Or is any work in progress for Japanese input with UTF-8 which I can join? kinput2 supports both the kinput2 and XIM protocols. (It also supports a few other protocols.) Note that the kinput(2) protocol was developed before the standardization of X Window System internationalization and is now obsolete. When you use the XIM protocol (X11R6's standard), you can input Japanese characters using kinput2 into software running under the ja_JP.UTF-8 locale. For example, you can input Japanese into xterm under a UTF-8 locale using kinput2. Thus, we don't need to develop UTF-8-based Japanese input methods. Offtopic: I also want Yudit to adopt the XIM protocol instead of the kinput2 protocol. There are a few Japanese input method programs, such as kinput2, skkinput, xwnmo, and so on. (Note that I am not talking about the backend conversion engines.) All of them support the XIM protocol, while only kinput2 supports the kinput2 protocol. Moreover, there are Korean and Chinese XIM servers such as Ami and XCIN.
mlterm mailing list is now opened
Hi, everyone. mlterm (MultiLingual TERMinal emulator), which I introduced a few days ago on the i18n@xfree86 and debian-i18n lists, has obtained SourceForge hosting. http://www.sourceforge.net/projects/mlterm/ http://mlterm.sourceforge.net/ mlterm is a terminal emulator with the following unique features: - various encodings are supported (multilingual) - combining characters (TIS-620, TCVN5712, JIS X 0213, and UTF-8) - anti-aliased fonts with Xft and TrueType fonts - multiple windows in one process - the XIM can be changed dynamically at run time, and you can input multiple complex languages such as Japanese and Chinese - scrolling with a wheel mouse - background image (in other words, wallpaper) - transparent background - scrollbar plugin API (unstable) Two mailing lists are now available, one for discussion in English, the other for discussion in Japanese. I imagine some of you will be interested in joining the English mailing list.
Re: /efont/ and xterm (Re: UTF-8 Terminals)
Hi, At Wed, 14 Nov 2001 02:12:19 +0100 (CET), Markus Kuhn wrote: xterm is not suited for proportional or bi-width fonts. Split the font up into an 8x16/16x16 pair, and there will be no problems. Just like you have to do with Unifont. I'd like to know XTerm's policy. What is the reason for the (non-)support of biwidth fonts like GNU Unifont and /efont/? Is it a policy of XTerm? Or will they be supported in the future? Otherwise, are you willing to accept patches to support them? I have no strong opinion on how biwidth (or doublewidth) fonts should be assembled. XFree86's doublewidth fonts don't contain singlewidth glyphs and are exactly fixed width, while GNU Unifont and /efont/ contain both singlewidth and doublewidth glyphs. I don't know which is better. I don't even know whether they should follow one united policy or not. However, it will benefit users if XTerm supports GNU Unifont and /efont/ as is. If a patch of tens of lines for XTerm can save the time of millions of users, it is absolutely worth doing. If nobody is working on XTerm support for GNU Unifont and /efont/, I'd like to research it. Can anyone tell me where I should start reading XTerm's code?
Re: UTF-8 Terminals
Hi, At Sat, 10 Nov 2001 16:19:21 +, Markus Kuhn wrote: Hardly anyone needs full Unicode. If all you are interested in are European scripts and symbols, for instance, then the 3 kilocharacters of the Unicode subset MES-3 are more than good enough for your needs, and the XFree86 standard xterm fonts 6x13, 8x13, 9x15, 9x18, 10x20 have covered MES-3 for over a year now and are widely used. It is true that hardly anyone needs full Unicode. However, which subset of Unicode people need differs from person to person. For example, as you said, MES-3 would be a good subset for European people. People from other countries need other subsets. Since XFree86 is a single distribution for the whole world, it should satisfy the needs of people all over the world. People who can read CJK glyphs have used larger font sizes so far and will continue to do so in the future. True. Japanese people like 7x14 + 14x14 fonts, and Korean and Chinese people like 8x16 + 16x16 fonts. XTerm has used the 6x13 font as its default (because the fixed font was 6x13). Thus, it is reasonable to have a 12x13 font so that XTerm with the default settings can display as many characters as possible (including CJK scripts). I think that is not too small for CJK glyphs, because there are small (of course not so beautiful) fonts for Japanese, for example 10x10 and 12x12. BTW, did you know about the /efont/ project http://openlab.ring.gr.jp/efont/index.html http://openlab.ring.gr.jp/efont/unicode/index.html which has 10-, 12-, 14-, 16-, and 24-pixel Unicode fonts? The web page has a table of the subsets these fonts cover. Though I am not taking part in the project, I hope these fonts will be used as widely as the ETL intlfonts.
Re: [I18n]Call for testers: luit in XFree86 CVS
Hi, At Tue, 13 Nov 2001 13:28:42 +1100 (EST), Jim Breen wrote: I think we can get into serious hair-splitting here. My copy of JIS X 0213 describes itself as "拡張漢字集合" (extended kanji set), and the text inside makes it pretty clear that it is in addition to JIS X 0208. I noted that the new "JIS Kanji Dictionary", of which I saw some proofs in Tokyo earlier this year, is described as covering JIS X 0208 and JIS X 0213. (Poor old JIS X 0212 is forgotten.) It is clear that JIS X 0213 includes JIS X 0208 (except for "dis-unified" characters). http://www.asahi-net.or.jp/~wq6k-yn/code/enc-x0213.html http://www.watch.impress.co.jp/internet/www/column/ogata/index.htm http://www.jca.apc.org/~earthian/aozora/0213.html http://www.itscj.ipsj.or.jp/ISO-IR/index.html I think there were a total of 56 kanji "dis-unified" in this way. Sorry, "kuchi-taka" and "hashigo-taka" are not "dis-unified". Certainly if you set out to use JIS X 0213 you really have to run with a single set combining the characters defined in both JIS X 0208 and JIS X 0213, which is what the existing font files do. No. Though JIS X 0213 is an extension of JIS X 0208, JIS X 0213 itself includes all JIS X 0208 characters. Thus, JIS X 0213 is intended to be a replacement for JIS X 0208. Please check the literature above for details.
Re: [I18n]Call for testers: luit in XFree86 CVS
Hi, At Tue, 13 Nov 2001 15:59:14 +1100 (EST), Jim Breen wrote: Where it says: JISX 0213: Japanese national standard. Released recently. Intended to be used in addition to JISX 0208. Shares many characters with JISX 0212. And the author? 12 November 2001, Tomohiro KUBOTA [EMAIL PROTECTED] Oh, sorry! This is a mistake. (The last modification on 12 November 2001 was related to the change of the Unicode charts site.)
Re: [I18n]Call for testers: luit in XFree86 CVS
Hi, At 12 Nov 2001 18:56:10 +, Juliusz Chroboczek wrote: I don't want to extend luit for 4.2.0; bug fixes only in this version. Much of what you're proposing will go into future releases of luit. I see. Let's discuss these points after the release of 4.2.0. BTW, I now have trouble compiling luit. charset.c includes X11/fonts/fontenc.h and I could not find it. I found it in xc/lib/font/include/fontenc.h. Is that the right file? When I proceeded with the compilation, I got the following errors:

  charset.o: In function `FontencCharsetRecode':
  charset.o(.text+0x146): undefined reference to `FontEncRecode'
  charset.o: In function `getFontencCharset':
  charset.o(.text+0x2f0): undefined reference to `FontEncMapFind'
  charset.o(.text+0x302): undefined reference to `FontMapReverse'

I think I need some libraries from the XFree86 CVS tree... TK: How about Johab? Don't know. We'll see. Johab is a Korean encoding which covers the full hangul set and the symbols and ideographs in KS X 1001. However, its codepoints are not compatible with EUC-KR. As I've already mentioned, I strongly dislike the complexity of Markus' proposal. I want to use single shifts only. Thus I said CSI 1 w for each character. TK: but I am afraid this solution can be too heavy, because luit will have to issue CSI 1 w for each doublewidth character and XTerm will have to parse it. I don't think that will be much of a problem. If it is, we'll see what can be done. Sure. If we use single shifts only, we can have a simpler sequence.
Re: UTF8 Terminal Detection
Hi, At Mon, 12 Nov 2001 23:24:20 +0100 (CET), Markus Kuhn wrote: I don't think this is feasible or useful. Environment variables can only be set by a parent process for its children. In the case of a pty terminal emulator that starts applications as child processes (e.g., xterm), we already have the locale variables providing the encoding information to both the terminal emulator (e.g., xterm) and its children (shell, applications). In other connections, terminal and applications are just connected by some byte-serial communications channel that doesn't transmit environment variables. Modifying all communications channels to do that is further off than using UTF-8 everywhere, so why bother? I have been using a ~/.bashrc including the following lines for a long time:

  if [ "$TERM" = linux -o "${TERM%-*}" = xterm ]; then
      LANG=C
  else
      LANG=ja_JP.eucJP
  fi

This works for the terminals which I usually use: - terminals without Japanese (EUC-JP) support: Linux console, Linux framebuffer console, and xterm - terminals with Japanese support: kon console, jfbterm console, rxvt compiled with Kanji support, kterm, Tera Term Pro, and shell mode in Emacs on X11. For terminals which support Japanese, I'd like to set LANG=ja_JP.eucJP so that I can use Japanese. However, using LANG=ja_JP.eucJP in other terminals will cause mojibake. For example: http://www.debian.or.jp/~kubota/mojibake/xterm.png Such mojibake can be avoided by setting LANG=C (English messages will be displayed, which I can read using an English-Japanese dictionary). Because it is really bothersome to set LANG or to invoke screen manually each time I start a new terminal, I am now almost happy with the above setting. However, looking at TERM does not work well in every case, nor is it the right way. Also, this approach does not work for non-Japanese languages as well as it does for Japanese, because TERM=kterm is available and widely used for Japanese-capable terminals, while there is no equivalent for Korean, Chinese, Thai, or other languages. 
For example, Hanterm sets TERM=xterm. You may wonder why I have to use terminals without Japanese support; setting LANG=ja_JP.eucJP and using only Japanese-capable terminals would make me happy. However, everyone has occasion to use the Linux (or BSD, ...) console, and many programs invoke xterm directly. Anyway, using the TERM variable for this purpose is not reliable, though this has been a real daily need for us for many years.
Re: Locales and Emacs 21
Hi, At Tue, 23 Oct 2001 12:25:00 +0100 (BST), Markus Kuhn wrote: Unfortunately, that doesn't work right out-of-the-box yet. Elisp has at the moment no direct way of accessing the output of nl_langinfo(CODESET); therefore Emacs doesn't know about the current locale's character set and can't consider this information when deciding on the character set of a loaded file. Gerd Moellmann [EMAIL PROTECTED] said that fixing this would already be on the post-21 todo list. Emacs 20 already had a mechanism to guess the encoding of a file using an ordered candidate list of encodings. The problem is that we have to configure the list. (set-language-environment sets this list.) For example, in a Japanese environment, the encoding guesser will check the encoding of the file against the candidates EUC-JP, Shift_JIS, and ISO-2022-JP. (UTF-8 should be added to this list.) Thus, what is configured using the LC_CTYPE variable should be the top candidate for the guesser, not the only candidate. I think terminal-coding-system should also be set from LC_CTYPE. I heard a few months ago that Emacs 21 would be able to do this. Now Emacs 21 has been released. Has someone tested it? Also, Emacs 20 on the X Window System could not display non-ISO-8859-1 characters without some settings in ~/.emacs or ~/.Xresources (these characters were displayed as white boxes). This is caused by an improper default font configuration. Is this problem fixed in Emacs 21?
Re: Vim 6.0 has been released (debian info)
Hi, At Wed, 3 Oct 2001 23:24:11 -0400, Jimmy Kaplowitz [EMAIL PROTECTED] wrote: That's not true on my up-to-date Debian system, running sid/unstable. The current release, 6.0.011-2 (which corresponds to upstream vim 6.0.11), is compiled with multi_byte disabled. The alpha and beta packages had it enabled, and I hereby put in my vote for it to be re-enabled. Wichert, a number of us think UTF-8 support is essential to the system of the future. If you want a minimalist version of vim without UTF-8, reintroduce vim-tiny. I confirmed I was wrong and you are right. This is a terrible situation. Now multibyte-language speakers cannot use vim at all, neither in legacy encodings nor in UTF-8. Even my bug report with a patch (#107856) cannot fix this situation, though Wichert closed the bug when he packaged Vim 6.0! Bug#107856: http://bugs.debian.org/107856 In short, Vim 6.0 without locale support is completely useless for CJK people, while for 8-bit-language people it merely means they cannot use UTF-8 mode; they can still use legacy encodings.
Re: Vim 6.0 has been released (debian info)
Hi, At Thu, 4 Oct 2001 06:34:50 -0400 (EDT), Thomas E. Dickey [EMAIL PROTECTED] wrote: "want" isn't the same as "need". Right: 8-bit-language people want UTF-8 support, but CJK people need either EUC support or UTF-8 support. (Of course we want both.) Fortunately, since Vim 6.0 supports the LC_CTYPE locale, it supports both EUC and UTF-8. On the other hand, Vim 6.0 without UTF-8 support does not support locales either, which makes Vim 6.0 without locale (including UTF-8) support completely useless for CJK people. I imagine that speakers of RTL languages and of languages with combining characters also cannot live with Vim 6.0 without locale support.
Re: Unicode support under Linux
Hi, At Wed, 03 Oct 2001 15:45:31 -0400, Richard, Francois M [EMAIL PROTECTED] wrote: But, is it also true to say that under Linux utf-8 Locales, all C functions handle properly char data representing utf-8 character encoded data? Do strlen, strchr, strcmp, strcpy, toupper process char data correctly when the Locale character encoding is utf-8? OR I need to use the wide character functions after specific conversion from char to wchar_t of my character data? Not perfectly.
* strlen counts the *number of bytes* of the given string, not the *number of characters*. Since UTF-8 is a multibyte encoding, the two do not coincide.
* strcpy works well.
* strchr does not work at all, because a UTF-8 character cannot be expressed in a single 'char'.
I think the simplest way to substitute for all these functions is to use wide characters. The standard C library has wchar_t counterparts of the above functions, and there are conversion functions between multibyte characters and wide characters. Note that "multibyte character" does not mean the character always occupies multiple bytes; it is the locale-dependent encoding. This means that in an ISO-8859-1 locale the multibyte encoding is ISO-8859-1, and in a Big5 locale it is Big5. I.e., if you write your software using multibyte and wide characters, it will support not only UTF-8 but also all major encodings in the world, such as ISO-8859-*, EUC-*, KOI8-*, and so on. An explanation of the wchar_t functions is available in my document, linked from my signature at the bottom of this mail. Note that wchar_t is not always UTF-32, though this is always true in GNU libc. If you have to write portable software, you must not assume wchar_t is UTF-32.
EastAsianWidth revised
Hi, As you know, Unicode 3.1.1 has been released. It revised the East Asian Width property for 15 characters. Markus, could you please update your wcwidth() implementation? And all software that adopts Markus' wcwidth(), or a private wcwidth(), should be updated. Read http://www.unicode.org/ for details.
Cross Mapping Tables (Re: EastAsianWidth revised)
Hi, At Sat, 8 Sep 2001 20:54:37 +0100 (BST), Markus Kuhn [EMAIL PROTECTED] wrote: The following 15 characters went from neutral to ambiguous, probably someone discovered them in some CJK character set where they are displayed double-width: I imagine so, though these characters are not related to my report http://www.debian.or.jp/~kubota/unicode-symbols.html . However, there is another problem: the Unicode Consortium has abolished all East Asian cross mapping tables. I once pointed out that there are many cross mapping tables between Japanese Shift_JIS / JIS X 0208 and Unicode. I said that this causes a problem: an identical document in JIS X 0208 can become different when converted into Unicode in different environments. Now we have lost these mapping tables, so the situation I pointed out has become even worse: anyone can implement arbitrary mapping tables, because there are no standards. I will request that the Unicode Consortium supply one authorized, reliable reference mapping table between Unicode and JIS X 0208. This problem also affects the EastAsianWidth property. We have lost a way to discuss which Unicode characters are double-width in East Asia, except for characters used only in CJK (such as Han ideograms, Hiragana, Katakana, Hangul, and CJK-only punctuation). The normal wcwidth() did not change as a result of Unicode 3.1.1, because both neutral and ambiguous characters result there in the same width: 1 I just updated the still somewhat experimental wcwidth_cjk(), in case people found that so far actually useful. It contains a new table of EastAsianWidth Ambiguous characters. http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c Thanks.
Re: How to use Unicode
Hi, At Fri, 31 Aug 2001 23:15:20 +0100 (BST), Markus Kuhn [EMAIL PROTECTED] wrote: The -u8 was a temporary hack needed 2 years ago before glibc 2.2 with UTF-8 locale support was around. It is obsolete now, except on other operating systems (namely: FreeBSD) that still didn't have UTF-8 locales last time I checked. If you set the locale, then not only xterm but also all processes started inside will be informed that you want UTF-8. That's much neater as it replaces zillions of command line options to activate a separate UTF-8 mode for each single tool. True. Once a user sets the LANG variable, he/she should not need to specify language and encoding anywhere else. This is (part of) the idea of locales. XTerm has started to support locales partially - only UTF-8 locales. Further improvement will be discussed.
Re: Backspace problem in Xterm/rxvt
Hi, At Tue, 14 Aug 2001 10:13:56 +1000 (EST), Jim Breen [EMAIL PROTECTED] wrote: kterm's long-term practice notwithstanding, a BS should backspace over a whole character, and not a fragment of one. If by "BS" you mean the BS key on the keyboard, I agree. If you mean the BS code (0x08) output to the tty by software, I don't agree. Such a change of the de-facto standard is simply impossible. This is not a discussion of which is technically better. It is not only kterm's practice but the practice of every Japanese terminal and every Japanese-enabled program. I remember you are living in Japan now, aren't you? Then you can try the Japanese version of MS-DOS, Tera Term, the telnet included in MS-Windows, NCSA telnet, rxvt, eterm, aterm, wterm, and so on. I think you cannot find any column-oriented terminal which moves the cursor two columns for a single 0x08 code.
Re: Vim 6.0 now in beta test
Hi, At Mon, 06 Aug 2001 13:08:55 +0200, Bram Moolenaar [EMAIL PROTECTED] wrote: I don't see this problem. Are you using Vim in the GUI version or in a terminal? Does the cursor move to the right position after a delay or when typing another character like f? I am using the terminal version on xterm patch 157 with -u8. I input a double-column character using the 'a' command and so on:
  []    - a double-column character
  ~     - cursor position
Then I hit the ESC key:
  [{}   - {} means the dotted-box character and [ means a garbage half
   ~
I found that the following occurs after about one second:
  []}   - } means a garbage half of the dotted-box character; [] is the right character
  ~
Re: Vim 6.0 now in beta test
Hi, At Sun, 05 Aug 2001 12:42:03 +0100, Markus Kuhn [EMAIL PROTECTED] wrote: Vim 6.0, Bram Moolenaar's vi editor with full UTF-8 support, has now moved from alpha to beta test stage, so it's supposed to be stable and just needs wide and thorough testing now before it gets burned on millions of CD-ROMs: I tried it and found a bug. In a UTF-8 locale, when I input a double-width character (for example, hiragana) at the end of a line and hit the ESC key, the cursor moves left by only one column. It should be two columns. This bug does not occur in the EUC-JP locale.
Re: Backspace problem in Xterm/rxvt
Hi, At Mon, 06 Aug 2001 09:58:28 +0500 (IST), [EMAIL PROTECTED] wrote: While using Backspace or Delete to erase a character in Xterm with UTF-8 support, it does not work properly. It takes two or more Backspaces for a single character. This is not the responsibility of terminals but of shells. Terminals are responsible for erasing one column per 0x08 code. It is the shells' responsibility to issue the proper number of 0x08 codes for one press of the BS key, and to erase the proper number of bytes from the internal buffer. Many shells are designed on the assumption that the numbers of characters, bytes, and columns are identical. This assumption is only true for encodings without multibyte characters, double-width characters, combining characters, and other complex features. Try the patches for bash and so on which are available at: http://oss.software.ibm.com/developer/opensource/linux/patches/i18n.php
Re: New Unifont release
Hi, At Wed, 11 Jul 2001 14:18:31 +0200 (CEST), Bruno Haible [EMAIL PROTECTED] wrote: Can't b) be solved with the help of fontsets instead of redundantly doubling the number of fonts? Not in the current state of affairs. Xlib doesn't do anything meaningful when an XFontSet has two fonts with the same encoding (here: ISO10646-1). The fontset only helps when all you have are fonts in different character sets (ISO8859-x, JISX0208, JISX0212, etc.); then the DrawString algorithm will cut the string into segments, based on the character sets. Other information from the fonts (e.g. width) is not used during this segmentation. Is there any possibility of a future extension of the X11R5 XFontSet or X11R6 XOM to support it? Internationalized software which uses XFontSet or XOM should also run under UTF-8 locales... I think both ways (a single Unifont and separate fonts) should work, because both exist. Practically, the separate-fonts way is important because few fonts include large sub-charactersets such as ideograms. And for new code, we use Xft instead of XFontSet. There also, it is helpful to have the entire Unicode repertoire in a single font. IMO, introducing another scheme as a recommended default is not a good idea. The more new knowledge that is needed, the less software will be internationalized.
Re: Arabic (was Re: [I18n]Syriac)
Hi, At Fri, 6 Jul 2001 04:30:04 +0430 (IRDT), Roozbeh Pournader [EMAIL PROTECTED] wrote: We have to choose some way: go the OpenType way, or come to some assignment of glyph numbers somewhere (Private use area? After U+10?) for the missing presentation forms. Why not submit a proposal to include them in Unicode?
Re: Luit and screen [was: anti-luit]
Hi, At Wed, 4 Jul 2001 20:39:30 +0100 (BST), Robert de Bath robert$@mayday.cix.co.uk wrote: Oops, I just went back to the GNU site; wrong licence. The _X11_ licence is compatible with the GPL ... so what's the problem Juliusz? You won't be using GPL code from outside in luit so there's no 'infection'. The X11 license is compatible with the GPL. This means X11-licensed software can be used as a basis for GPL-ed software. However, software of the GNU Project must have its copyright assigned to the FSF. (Note the difference between merely GPL-ed software and GNU Project software.) This is the FSF's way of guarding itself legally. A dual license will not help this situation. OTOH, GPL-ed software cannot be included in the XFree86 source tree, as Juliusz said. Thus, I think Juliusz's way (luit under the X11 license) is reasonable.
Re: Locking Linux console to UTF-8
Hi, At Sat, 30 Jun 2001 09:05:15 +0100 (BST), Markus Kuhn [EMAIL PROTECTED] wrote: Do HAN, HAN2, KON, etc. already all work in UTF-8 locales? No. I have never heard of any development effort toward it, either. Nobody seems to feel the need or to be interested in developing it so far, at least for kon and jfbterm.
Re: Emacs and nl_langinfo(CODESET)
Hi, At Sat, 30 Jun 2001 09:00:50 +0100 (BST), Markus Kuhn [EMAIL PROTECTED] wrote: If you press ^C in an application that spits out BIG5 in an unfortunate moment, or truncate a string by counting bytes, then you will lose BIG5 synchronization, and the terminal has to skip characters in the input stream until it finds two G0 characters in a row to be sure again where the next character starts. BIG5 is an example of a rather messy encoding, not only in that respect. ISO 2022 is far worse. I don't understand why the current implementation of luit can avoid this problem while the iconv() approach cannot.
Re: Emacs and nl_langinfo(CODESET)
Hi, At Sat, 30 Jun 2001 08:48:15 +0100 (BST), Markus Kuhn [EMAIL PROTECTED] wrote: I added to xterm and less long ago code that searches for the substring UTF-8 in LC_ALL || LC_CTYPE || LANG, long before glibc had any UTF-8 locale and I knew about either nl_langinfo() or even libcharset. It is now obvious that nl_langinfo or libcharset is the proper solution to find out whether we should activate UTF-8 mode or not. My only agenda here is that I want to get rid of the necessity to remember application-specific command line switches such as -u8. I consider -u8 deprecated and would appreciate it if people wouldn't mention it any more. Yes. I strongly agree that we should not introduce application-specific command line switches such as -u8. (In Japan, there are some books which describe how to configure such software. For example: you need an "*international: yes" line in your ~/.Xresources to use xterm with Japanese; you need kterm instead of xterm; use jless instead of less; some internationalized X programs have a "multibyte" option to enable it; be careful not to specify a -*-helvetica-* font for Japanese. I also bought a few books to establish a Japanese environment when I started to use Linux. That is a mess! Setting LANG alone should be enough - who needs a book simply to set the LANG variable!) Using nl_langinfo() and libcharset _only_ to detect UTF-8 locales is, I think, too heavy. They can also be used to detect other encodings, including ISO-8859-*, EUC-*, KOI8-*, and so on. Such information can be used to enable the encoding by calling iconv(), or by calling luit from XTerm. Please don't try to read my mind remotely. Please use the continuously updated core dump of my mind at http://www.cl.cam.ac.uk/~mgk25/unicode.html instead. :-) I also read your intentions from your mails to the mailing lists.
Re: Locking Linux console to UTF-8
Hi, At Fri, 29 Jun 2001 15:58:00 +0200 (CEST), Bruno Haible [EMAIL PROTECTED] wrote: Personally I would suggest making this kind of user-space console software the default These consoles rely on the framebuffer console. Though jfbterm relies on the framebuffer (and requires Linux 2.2 or later), kon does not (and works with older Linux kernels). [According to the changelog file of kon, the first test release was 1992-10-13, obviously before the framebuffer was available.] However, I don't know whether Unicode can be implemented without the framebuffer. Just for information.
Re: file name encoding
Hi, At Wed, 27 Jun 2001 20:51:31 +0200 (CEST), Bruno Haible [EMAIL PROTECTED] wrote: I agree that in _some_ places programs exchange text in locale (snip all the following) This is just what I'd like to insist on. Just one addition. Since Juliusz's "filenames in UTF-8 without conversion" way works only under UTF-8 locales, it is a subset of the "filenames in locale encoding" way (i.e., the present state). (Note that if you follow the "filenames in locale encoding" way, you will use UTF-8 filenames in UTF-8 locales.) Thus, this way does not bring any technical improvement; it is just pressure on people who don't use UTF-8 locales.
Re: file name encoding
Hi, At Tue, 26 Jun 2001 22:11:06 +0200 (CEST), Bruno Haible [EMAIL PROTECTED] wrote: - Newbies should have only a single variable to set in their $HOME/.profile, not dozens. Yes. This is the point. When users set the LANG variable, they expect all software to obey the variable. - We want to make it easy for everyone to use an UTF-8 locale. Users shouldn't be bothered to change various $HOME/.* files, set .Xdefault resources etc. Yes. However, not only for UTF-8 but also for all other encodings. - All X programs which set their default font to *-iso8859-1 independently of the locale. This includes nedit. Of course such programs are buggy. However, software which uses XDraw{Image}String() is also buggy. (Software before X11R4 should use both XDraw{Image}String() and XDraw{Image}String16(). Modern software after X11R5 should use X{mb,wc,(utf8?)}Draw{Image}String().) Moreover, a default font of -adobe-helvetica-* is buggy enough: it excludes most non-Latin fonts. "-adobe-helvetica-*,*" is good. Or, adding a ",*" mechanism before XCreateFontSet() is better, like I modified twm. In xc/programs/twm/util.c:

    basename2 = (char *)malloc(strlen(font->name) + 3);
    if (basename2)
        sprintf(basename2, "%s,*", font->name);
    else
        basename2 = font->name;
    if ((font->fontset = XCreateFontSet(dpy, basename2,
                                        &missing_charset_list_return,
                                        &missing_charset_count_return,
                                        &def_string_return)) == NULL) {

Of course we can implement a better font-guessing mechanism, like I implemented for IceWM, Blackbox, and Sawfish. (I didn't use that mechanism for twm because I thought it was too heavy for twm.)
Re: file name encoding
Hi, At 26 Jun 2001 13:49:10 -0700, H. Peter Anvin [EMAIL PROTECTED] wrote: Incidentally, I believe there needs to be an easy way to set the default character set in use on a system. This may of course be overridden by the user (possibly at their own peril), but it is nevertheless a useful concept. This mechanism has been implemented since X11R5: XFontSet. Why is XFontSet not very popular? I imagine some reasons. - People imagine from its name that it is only for CJK people who need multiple fonts. - People are accustomed to using the system without setting a locale. The XFontSet-related functions assume ASCII when no locale is set. Thus, when using XFontSet, I check the locale and fall back to the conventional non-internationalized XFontStruct-related functions when the check fails. This avoids complaints from people who don't know how to set a locale. See the source code of twm I wrote for details. In xc/programs/twm/twm.c:

    loc = setlocale(LC_ALL, "");
    if (!loc || !strcmp(loc, "C") || !strcmp(loc, "POSIX")
        || !XSupportsLocale()) {
        use_fontset = False;
    } else {
        use_fontset = True;
    }
Re: file name encoding
Hi, At 26 Jun 2001 16:37:05 -0700, H. Peter Anvin [EMAIL PROTECTED] wrote: The issue is, however, what does that mean? In particular, strings in the filesystem are usually in the system-wide encoding scheme, not what that particular user happens to be processing at the time. Ah, I understand. We were discussing different topics. My point is not about the byte sequence for filenames in the filesystem. It can be UTF-8 or not; I don't care much, because users have little chance to access the raw byte sequence on the filesystem. My point is that user-level commands must obey the locale when they communicate with users. For example, 'ls' must display file names in the locale encoding.