Switching to UTF-8 and Gnome 1.2.x
Hi,

In my transition to UTF-8, I found that Gnome 1.2.x has a lot of files in
mixed encodings. All *.desktop files and .directory files are in mixed
encodings: entries for [ja] are in EUC-JP, entries for [de] are in
ISO-8859-1/15, entries for [ru] are in KOI8-R, and so on. On the other hand,
the corresponding KDE files are all in UTF-8, so I didn't need to change
anything there.

Anyway, thanks to the Encode module (to be included in the upcoming Perl 5.8
by default), I was able to write a simple script that adds ko_KR.UTF-8
entries for all the [ko] entries in EUC-KR in *.desktop files and .directory
files. Below is the list of directories I had to run my script on:

  /usr/share/apps
  /usr/share/applets
  /usr/share/applnk
  /etc/X11/applnk
  /usr/share/mc
  $HOME/.gnome

Still, I got gibberish in the Gnome tip of the day. It turned out that the
Gnome hint files (usually installed in /usr/share/gnome/hints) are XML files
in mixed encodings. I don't think they're compliant with the XML standard;
I've never heard of XML files in mixed encodings. So I also had to add
ko_KR.UTF-8 entries for all the [ko] entries there. Even with this, for some
reason unknown to me, whenever I cross the 'boundary' (i.e. go from the last
hint to the first, or the other way around), I get gibberish.

Two other places where languages are tied to encodings are the Gnome help
files (usually in /usr/share/gnome/help) and the Gimp tips
(/usr/(local/)share/gimp/$version/tips/gimp_tips.[lang].txt). I also had to
make UTF-8 versions of them.

I believe all these problems have been addressed in Gnome 2.0 (RC?/beta),
but Gnome 1.x is still widely used. I thought my experience would help
others who want to move to UTF-8, as well as distribution builders.

  Jungshik Shin

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
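The script described above was written in Perl with the Encode module; as a
rough illustration of the same idea, here is a hypothetical Python sketch
(the function name and key pattern are my own, not taken from the original
script):

```python
import re

def add_utf8_entries(raw):
    """raw: contents of a .desktop/.directory file as bytes (the files are
    in mixed encodings, so they cannot be decoded as a whole). For every
    Key[ko]= entry in EUC-KR, append a Key[ko_KR.UTF-8]= twin entry."""
    out = []
    for line in raw.split(b'\n'):
        out.append(line)
        m = re.match(rb'^(\w+)\[ko\]=(.*)$', line)
        if m:
            key, value = m.groups()
            # Decode the EUC-KR value and emit a UTF-8 copy of the entry.
            out.append(key + b'[ko_KR.UTF-8]=' +
                       value.decode('euc-kr').encode('utf-8'))
    return b'\n'.join(out)
```

Running this over each file in the directories listed above (and writing
the result back) reproduces the effect the post describes.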
Re: Switching to UTF-8
On Thu, May 02, 2002 at 09:51:44AM +0900, Gaspar Sinai wrote:

> I am not much of an Emacs guy but if I were I would probably use QEmacs,
> which looks pretty decent to me: http://fabrice.bellard.free.fr/qemacs/

I had a quick look at qemacs a couple of weeks ago, for other reasons
(namely docbook support), and found out that this is a project in its early
phases of development, nowhere near a full-blown editor.

--
Yann Dirson [EMAIL PROTECTED]      | Why make M$-Bill richer & richer?
Debian-related: [EMAIL PROTECTED] | Support Debian GNU/Linux:
Pro: [EMAIL PROTECTED]            | Freedom, Power, Stability, Gratuity
http://ydirson.free.fr/           | Check http://www.debian.org/
Re: Switching to UTF-8
Hi,

At Mon, 6 May 2002 07:46:33 +0200, Pablo Saratxaga wrote:

> > In Hiragana/Katakana, processing of n is complex (though it may be
> > less complex than Hangul).
>
> No. The N is just a kana like any other; no complexity at all is
> involved. Complexity only happens when typing in Latin letters. That is
> why transliteration typing will always require an input method anyway;
> it cannot be handled with just Xkb.

In my sentence above, "n" is a Latin letter. It may correspond to
HIRAGANA/KATAKANA LETTER N *or* to the first keystroke of n-a, n-i, n-u,
n-e, n-o, n-y-a, n-y-u, or n-y-o. (The keystrokes n-y-a should give
HIRAGANA/KATAKANA LETTER NI followed by HIRAGANA/KATAKANA LETTER SMALL YA.)

Anyway, I understand your point that Latin-to-Hiragana/Katakana conversion
cannot be implemented with xkb.

---
Tomohiro KUBOTA [EMAIL PROTECTED] http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
Re: Switching to UTF-8
On Mon, 6 May 2002, Pablo Saratxaga wrote:

> On Mon, May 06, 2002 at 10:11:34AM +0900, Tomohiro KUBOTA wrote:
>
> > Note for xkb experts who don't know Hiragana/Katakana/Hangul: input
> > methods of these scripts need backtracking. For example, in Hangul,
> > imagine I hit keys in the c-v-c-v (c: consonant, v: vowel) sequence.
> > When I hit c-v-c, it should represent one Hangul syllable, c-v-c.
> > However, when I hit the next v, it should be two Hangul syllables,
> > c-v c-v.
>
> That is only the case with a 2-mode keyboard; with a 3-mode keyboard
> there is no ambiguity, as there are three groups of keys V, C1, C2,
> allowing for all the possible combinations: V-C2, C1-V-C2.

'V-C2 and C1-V-C2' should be 'C1-V and C1-V-C2' :-)

To go all the way with Xkb, even the three-set keyboard array has to be
modified a little, because some clusters of vowels and consonants are not
assigned separate keys but have to be entered by a sequence of keys
assigned to basic/simple vowels and consonants. Alternatively, programs
have to be modified to truly support the 'L+V+T*' model of Hangul syllables
as stipulated in TUS 3.0, p. 53.

> Eg: there are two keys for each consonant: one for the leading syllable
> consonant, and one for the ending syllable consonant. (I think the small
> round glyph that fills an empty place in a syllable is always at place
> C2; that is, c-v is always written C1-V-C2 with a special C2 that is not
> written in Latin transliteration.)

You almost got it right, except that IEung ('ㅇ') is NULL at the syllable
onset position (i.e. it's a placeholder for syllables that begin with a
vowel, and it does not appear in Latin transliteration). IEung is not NULL
at the syllable coda position, but corresponds to [ng] (IPA: [ŋ]) as in
'young'. To put it your way, a V-C2 syllable is always written as
IEung-V-C2, with IEung having no phonetic value. Here I assume we're not
talking about the orthography of the 15th century ;-)

  Jungshik Shin
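For readers unfamiliar with the L+V+T model mentioned above: in Unicode,
the precomposed Hangul syllables are arranged so that a (leading, vowel,
trailing) jamo triple maps to a single codepoint by pure arithmetic. A
small illustrative sketch of that composition formula (the constants are
from the Unicode Hangul composition algorithm):

```python
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7  # jamo block bases
S_BASE = 0xAC00                                  # first precomposed syllable
V_COUNT, T_COUNT = 21, 28                        # vowels, trailings (+none)

def compose(l, v, t=T_BASE):
    """Compose a leading jamo l, vowel v, and optional trailing t
    (conjoining-jamo codepoints) into one precomposed Hangul syllable."""
    l_idx = l - L_BASE
    v_idx = v - V_BASE
    t_idx = t - T_BASE        # 0 means "no trailing consonant"
    return chr(S_BASE + (l_idx * V_COUNT + v_idx) * T_COUNT + t_idx)
```

For example, composing HIEUH (U+1112), A (U+1161) and NIEUN (U+11AB) gives
the syllable 한 (U+D55C).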
Re: Switching to UTF-8
Hi,

At 02 May 2002 23:54:37 +1000, Roger So wrote:

> Note that the source from Li18nux will try to use its own encoding
> conversion mechanisms on Linux, which is broken. You need to tell it to
> use iconv instead.

I didn't know that, because I am not a user of IIIMF or other Li18nux
products. How is it broken?

> Maybe I should attempt to package it for Debian again, now that woody is
> almost out of the way. (I have the full IIIMF stuff working well on my
> development machine.)

I found that Debian has an iiimecf package. Do you know what it is?

> I don't think xkb is sufficient because (1) there's a large number of
> different Chinese input methods out there, and (2) most of the input
> methods require the user to choose from a list of candidates after
> preedit. I _do_ think xkb is sufficient for Japanese though, if you
> limit Japanese to only hiragana and katakana. ;)

I believe you are kidding about such a limitation. The Japanese language
has far fewer vowels and consonants than Korean, which results in many more
homonyms than in Korean. Thus, I don't think native Japanese speakers will
decide to abolish Kanji. (Please don't joke on an international mailing
list; people who don't know about Japanese may think you are serious.)

Even if we limit input to hiragana/katakana, xkb may not be sufficient. For
a one-key-one-hiragana/katakana method, I think xkb can be used. However,
more than half of Japanese computer users use Romaji-kana conversion, a
two-keys-one-hiragana/katakana method. The complexity of that algorithm is
like the two- or three-key input methods for Hangul, I think. Do you think
such an algorithm can be implemented with xkb? If yes, I think Romaji-kana
conversion (whose complexity is like that of a Hangul input method) can be
implemented with xkb.
---
Tomohiro KUBOTA [EMAIL PROTECTED] http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
Re: Switching to UTF-8
On Sun, 2002-05-05 at 21:00, Tomohiro KUBOTA wrote:

> At 02 May 2002 23:54:37 +1000, Roger So wrote:
> > Note that the source from Li18nux will try to use its own encoding
> > conversion mechanisms on Linux, which is broken. You need to tell it
> > to use iconv instead.
>
> I didn't know that, because I am not a user of IIIMF or other Li18nux
> products. How is it broken?

The csconv library that IIIMF comes with doesn't work properly (at least I
didn't get it to work), possibly because of endianness issues. csconv is
meant to be a cross-platform replacement for iconv.

> > Maybe I should attempt to package it for Debian again, now that woody
> > is almost out of the way. (I have the full IIIMF stuff working well on
> > my development machine.)
>
> I found that Debian has an iiimecf package. Do you know what it is?

It's the IIIM Emacs Client Framework. As the name implies, it's an
implementation of an IIIM client in Emacs. I've never tried it out, as I
don't use Emacs. :) Is it used by anyone? Last time I checked,
popularity-contest said nobody was using it...

> > I _do_ think xkb is sufficient for Japanese though, if you limit
> > Japanese to only hiragana and katakana. ;)
>
> I believe you are kidding about such a limitation. The Japanese language
> has far fewer vowels and consonants than Korean, which results in many
> more homonyms than in Korean. Thus, I don't think native Japanese
> speakers will decide to abolish Kanji. (Please don't joke on an
> international mailing list; people who don't know about Japanese may
> think you are serious.)

Sorry, it wasn't meant to be a serious comment. :)

Cheers
Roger

--
Roger So                                  Debian Developer
Sun Wah Linux Limited                     i18n/L10n Project Leader
Tel: +852 2250 0230                       [EMAIL PROTECTED]
Fax: +852 2259 9112                       http://www.sw-linux.com/
Re: Switching to UTF-8
On Sun, 5 May 2002, Tomohiro KUBOTA wrote:

> At 02 May 2002 23:54:37 +1000, Roger So wrote:
> > I _do_ think xkb is sufficient for Japanese though, if you limit
> > Japanese to only hiragana and katakana. ;)
>
> I believe you are kidding about such a limitation. The Japanese language
> has far fewer vowels and consonants than Korean, which results in many
> more homonyms than in Korean. Thus, I think

Well, actually it's not so much the difference in the number of consonants
and vowels as the fact that Korean has both closed and open syllables,
while Japanese has only open syllables, that makes Japanese have a lot more
homonyms than Korean.

> native Japanese speakers won't decide to abolish Kanji.

I don't think the Japanese ever will, either. However, I'm afraid having
too many homonyms is a little too 'feeble' a 'rationale' for not being able
to convert to all-phonetic scripts like Hiragana and Katakana. The easiest
counter-argument is to ask how Japanese speakers can tell which homonym is
meant in oral communication, if Kanji is so important for disambiguating
homonyms. They don't have any Kanji to help them (well, sometimes you may
have to write down Kanji to break an ambiguity in the middle of a
conversation, but I guess that's mostly limited to proper nouns). I heard
that they don't have much trouble, because context helps a listener a lot
in figuring out which of many homonyms the speaker means. This is true in
any language. Arguably, the same thing could help readers in written
communication. Of course, using logographic/ideographic characters like
Kanji certainly helps readers very much, and that should be a very good
reason for Japanese to keep Kanji in their writing system.

The English writing system is also 'logographic' in a sense (so is modern
Korean orthography in pure Hangul, as it departs from strict agreement
between pronunciation and spelling), and a spelling reform (to give English
a degree of agreement between spelling and pronunciation similar to that of
Spanish) would deprive written English of its 'logographic' nature, making
it harder to read. On the other hand, it would help learners and writers.
It has always been a struggle between readers and writers, and between
listeners and speakers.

> xkb can be used. However, more than half of Japanese computer users use
> Romaji-kana conversion, a two-keys-one-hiragana/katakana method. The
> complexity of that algorithm is like the two- or three-key input methods
> for Hangul, I think. Do you think such an algorithm can be implemented
> with xkb? If yes, I think Romaji-kana conversion (whose complexity is
> like that of a Hangul input method) can be implemented with xkb.

I would also like to know whether it's possible with Xkb. BTW, if we use
three-set keyboards (where leading consonants and trailing consonants are
assigned separate keys) and use the U+1100 Hangul Conjoining Jamos, Korean
Hangul input is entirely possible with Xkb alone.

  Jungshik Shin
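The U+1100 conjoining jamo mentioned above compose into the precomposed
syllables under Unicode normalization, which is what would let an Xkb map
that merely emits jamo keystrokes still yield standard Hangul text. A
quick sketch:

```python
import unicodedata

# A three-set keyboard could emit conjoining jamo directly, one per key:
# HANGUL CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN.
jamo = '\u1112\u1161\u11AB'

# NFC normalization composes the L-V-T sequence into the single
# precomposed syllable U+D55C (한).
syllable = unicodedata.normalize('NFC', jamo)
```

The composition step could live in the application or a rendering layer;
the keyboard map itself then needs no backtracking state.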
Re: Switching to UTF-8
Hi,

At Sun, 5 May 2002 19:12:31 -0400 (EDT), Jungshik Shin wrote:

> > I believe that you are kidding about such a limitation. The Japanese
> > language has far fewer vowels and consonants than Korean, which
> > results in many more homonyms than in Korean. Thus, I think
>
> Well, actually it's not so much the difference in the number of
> consonants and vowels as the fact that Korean has both closed and open
> syllables, while Japanese has only open syllables, that makes Japanese
> have a lot more homonyms than Korean.

You may be right. Anyway, the real reason is that Japanese has a lot of
words from old Chinese. Words which are not homonyms in Chinese become
homonyms in Japanese. (They may or may not be homonyms in Korean; I believe
Korean also has a lot of Chinese-origin words.) Since the way new words are
coined is based on the Kanji system, the Japanese language would lose its
vitality without Kanji.

> I don't think the Japanese ever will, either. However, I'm afraid having
> too many homonyms is a little too 'feeble' a 'rationale' for not being
> able to convert to all-phonetic scripts like Hiragana and Katakana. ...

Since I don't represent the Japanese people, I won't say whether it is a
good idea or not to have many homonyms. You are right: there are many other
reasons for and against using Kanji, and I cannot explain everything.

Japanese pronunciation does cause trouble, though it is widely helped by
accent and rhythm. However, in some cases neither accent nor context helps.
For example, both "science" and "chemistry" are "kagaku" in Japanese, so we
sometimes call chemistry "bakegaku", where "bake" is another reading of the
"ka" of chemistry. Another famously confusing pair is "private
(organization)" and "municipal (organization)", both called "shiritu".
Thus, "private" is sometimes called "watakushiritu" and "municipal" is
called "ichiritu"; again, these alias names come from different readings of
the Kanji. If you listen to Japanese news programs every day, you will come
across such examples some day.

These days, more and more Japanese people want to learn more Kanji to use
its abundant power of expression, though I am not one of these Kanji
learners.

> I would also like to know whether it's possible with Xkb. BTW, if we use
> three-set keyboards (where leading consonants and trailing consonants
> are assigned separate keys) and use the U+1100 Hangul Conjoining Jamos,
> Korean Hangul input is entirely possible with Xkb alone.

A note for xkb experts who don't know Hiragana/Katakana/Hangul: input
methods for these scripts need backtracking. For example, in Hangul,
imagine I hit keys in the sequence c-v-c-v (c: consonant, v: vowel). When I
have hit c-v-c, it should represent one Hangul syllable, c-v-c. However,
when I hit the next v, it should become two Hangul syllables, c-v c-v. In
Hiragana/Katakana, the processing of "n" is complex (though it may be less
complex than Hangul).

---
Tomohiro KUBOTA [EMAIL PROTECTED] http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
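The backtracking described above (for a two-set keyboard, where one key
serves both leading and trailing consonants) can be sketched as a
one-symbol lookahead over the typed jamo stream. This is a toy model I
wrote to illustrate the resegmentation, not a real input method:

```python
def syllabify(jamo):
    """jamo: list of ('C', key) / ('V', key) pairs as typed on a two-set
    keyboard. A consonant already attached to a syllable is moved to the
    *next* syllable when a vowel follows it -- the backtracking case."""
    syllables, cur = [], []
    for i, (kind, key) in enumerate(jamo):
        if kind == 'C' and cur:
            following = jamo[i + 1][0] if i + 1 < len(jamo) else None
            if following == 'V':       # this consonant leads a new syllable
                syllables.append(cur)
                cur = []
        cur.append(key)
    if cur:
        syllables.append(cur)
    return syllables
```

With input c-v-c the result is one syllable [c, v, c]; typing one more v
resegments it into [c, v] [c, v], exactly the ambiguity a stateless Xkb
map cannot resolve.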
Re: Switching to UTF-8
Hello!

On Mon, May 06, 2002 at 10:11:34AM +0900, Tomohiro KUBOTA wrote:

> Note for xkb experts who don't know Hiragana/Katakana/Hangul: input
> methods of these scripts need backtracking. For example, in Hangul,
> imagine I hit keys in the c-v-c-v (c: consonant, v: vowel) sequence.
> When I hit c-v-c, it should represent one Hangul syllable, c-v-c.
> However, when I hit the next v, it should be two Hangul syllables,
> c-v c-v.

That is only the case with a 2-mode keyboard; with a 3-mode keyboard there
is no ambiguity, as there are three groups of keys, V, C1, C2, allowing for
all the possible combinations: V-C2, C1-V-C2. Eg: there are two keys for
each consonant: one for the leading syllable consonant, and one for the
ending syllable consonant. (I think the small round glyph that fills an
empty place in a syllable is always at place C2; that is, c-v is always
written C1-V-C2 with a special C2 that is not written in Latin
transliteration.)

> In Hiragana/Katakana, processing of n is complex (though it may be less
> complex than Hangul).

No. The N is just a kana like any other; no complexity at all is involved.
Complexity only happens when typing in Latin letters. That is why
transliteration typing will always require an input method anyway; it
cannot be handled with just Xkb.

--
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/     PGP Key available, key ID: 0x8F0E4975
Re: Switching to UTF-8
On Thu, May 02, 2002 at 02:03:06AM -0400, Jungshik Shin wrote:

> I know very little about Win32 APIs, but according to what little I
> learned from the Mozilla source code, it doesn't seem to be as simple as
> you wrote on Windows, either. Actually, my impression is that Windows
> IME APIs are almost parallel (concept-wise) to the XIM APIs. (BTW, MS
> Windows XP introduced an enhanced set of IM-related APIs called TSF?.)
> In both cases, you have to determine what type of preediting support (in
> XIM terms: over-the-spot, on-the-spot, off-the-spot, and none?) is
> shared by the clients and the IM server. Depending on the preediting
> type, the amount of work to be done by clients varies. I'm afraid your
> impression that Windows IME clients have very little to do to get
> keyboard input comes from your not having written programs that can
> accept input from CJK IMEs (input method editors), as appears to be
> confirmed by what I'm quoting below.

I wrote the patch for PuTTY to accept input from Win2K's IME, and some
fixes for Vim's. What I said is all that's necessary for simple support,
and the vast majority of applications don't need any more than that.

Of course, what you do with this input is up to the application, and if you
have no support for storing anything but text in the system codepage, there
might be a lot of work to do. That's a different topic entirely, of course.

> It just occurred to me that Mozilla.org has an excellent summary of
> input method support on three major platforms (Unix/X11, MacOS,
> MS-Windows). See
> http://www.mozilla.org/projects/intl/input-method-spec.html.

I've never seen any application do anything other than what this describes
as Over-The-Spot composition. This includes system dialogs, Word, Notepad
and IE. This document incorrectly says:

> Windows does not use the off-the-spot or over-the-spot styles of input.

As far as I know, Windows uses *only* over-the-spot input. Perhaps
on-the-spot could be implemented (and most people would probably agree that
it's cosmetically better), but it would probably take a lot more work. Ex:

  http://zewt.org/~glenn/over1.jpg
  http://zewt.org/~glenn/over2.jpg

(The rest of the first half of the document describes input styles that
most programs don't use.) The document states "Last modified May 18, 1999",
so the information in it is probably out of date.

The only other thing you have to handle is described in "Platform
Protocols": WM_IME_COMPOSITION. The other two messages can be ignored. The
only API function listed there that's often needed is SetCaretPosition, to
set the cursor position.

> > It's little enough to add it easily to programs, but the fact that it
> > exists at all means that I can't enter CJK into most programs. Since
> > the regular 8-bit character message is in the system codepage, it's
> > impossible to send CJK through.
>
> Even in English or any SBCS-based Windows 9x/ME, you can write programs
> that can accept CJK characters from CJK (global) IMEs. Mozilla, MS IE,
> MS Word, and MS OE are good examples.

Yes, you're agreeing with what you quoted.

--
Glenn Maynard
Re: Switching to UTF-8
Hi,

At Thu, 2 May 2002 02:14:29 -0400 (EDT), Jungshik Shin wrote:

> You mean IIIMF, didn't you? If there's any actual implementation, I'd
> love to try it out. We need a Windows 2k/XP or MacOS 9/X style
> keyboard/IM switching mechanism/UI, so that keyboard/IM modules targeted
> at or customized for each language can coexist and be brought up as
> necessary. It appears that IIIMF is the only way, unless somebody writes
> a gigantic one-size-fits-all XIM server for UTF-8 locale(s).

I heard that IIIMF has some security problems from the Project HEKE people
(http://www.kmc.gr.jp/proj/heke/). I don't know whether that is true, nor
whether the problem (if any) has been solved.

There _is_ already an implementation of IIIMF. You can download it from the
Li18nux site. However, I did not succeed in trying it out. Since I have
heard several reports from IIIMF users, it is simply my fault.

There seem to be some XIM-based implementations which can input multiple
complex languages. One is the ximswitch software in the Kondara Linux
distribution (http://www.kondara.org). I downloaded it but haven't tested
it yet. Another is mlterm (http://mlterm.sourceforge.net/), which is an
entirely client-side solution for switching between multiple XIM servers.
Though I don't think it is a good idea to require clients to have such
mechanisms, it is so far the only practical way to realize input in
multiple languages.

> How about just running your favorite XIM under ja_JP.EUC-JP while all
> other applications are launched under ja_JP.UTF-8? As you know well, it
> just works fine, although the character repertoire you can enter is
> limited to that of EUC-JP. Of course, this is not full-blown UTF-8
> support, but at least it should give you the same degree of Japanese
> input support under ja_JP.UTF-8 as under ja_JP.EUC-JP. Well, then you
> would ask what the point of moving to UTF-8 is. You can at least display
> more characters under UTF-8 than under EUC-JP, can't you? :-)

There is, so far, no conversion engine which requires a character set
beyond EUC-JP. Thus, EUC-JP is enough for now. If someone wants to develop
an input engine which supports more characters, he/she will want to use
UTF-8. However, I think nobody in Japan feels a strong necessity for it,
beyond pure technical interest in Unicode itself.

> BTW, Xkb may work for Korean Hangul too, and we don't need XIM if we use
> a 'three-set keyboard' instead of a 'two-set keyboard' and can live
> without Hanja. I have to know more about Xkb to be certain, though.

I see. This is not true for Japanese: Japanese people do need grammar and
context analysis software to get Kanji text. How about Chinese?

---
Tomohiro KUBOTA [EMAIL PROTECTED] http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
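The mixed-locale workaround quoted above (XIM server in the EUC-JP locale,
applications in UTF-8) would look roughly like this as a session snippet.
This is a sketch of the standard XIM environment setup, not a tested
recipe: kinput2 and mlterm are only example programs, and exact locale
names vary by system:

```shell
# Start the XIM server under the EUC-JP locale (kinput2 as an example).
LC_CTYPE=ja_JP.eucJP kinput2 &

# Point X clients at that input method server...
export XMODIFIERS='@im=kinput2'

# ...and run the applications themselves in the UTF-8 locale.
LC_ALL=ja_JP.UTF-8 mlterm &
```

The repertoire you can enter stays limited to EUC-JP, as noted above, but
display is UTF-8 throughout.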
Re: readline (was: Switching to UTF-8)
Markus Kuhn writes:

> There is also bash/readline

SuSE 8.0 ships with a bash/readline that works fine with (at least) width-1
characters in a UTF-8 locale.

There is also an alpha release of a readline version that attempts to
handle single-width, double-width and zero-width characters in all
multibyte locales. But it's alpha (read: it doesn't work for me yet).

Bruno
Re: Switching to UTF-8
On Thu, 2002-05-02 at 17:11, Tomohiro KUBOTA wrote:

> There _is_ already an implementation of IIIMF. You can download it from
> the Li18nux site. However, I did not succeed in trying it out. Since I
> have heard several reports from IIIMF users, it is simply my fault.

Note that the source from Li18nux will try to use its own encoding
conversion mechanisms on Linux, which is broken. You need to tell it to use
iconv instead.

Maybe I should attempt to package it for Debian again, now that woody is
almost out of the way. (I have the full IIIMF stuff working well on my
development machine.)

> > BTW, Xkb may work for Korean Hangul too, and we don't need XIM if we
> > use a 'three-set keyboard' instead of a 'two-set keyboard' and can
> > live without Hanja. I have to know more about Xkb to be certain,
> > though.
>
> I see. This is not true for Japanese: Japanese people do need grammar
> and context analysis software to get Kanji text. How about Chinese?

I don't think xkb is sufficient, because (1) there's a large number of
different Chinese input methods out there, and (2) most of the input
methods require the user to choose from a list of candidates after preedit.
I _do_ think xkb is sufficient for Japanese though, if you limit Japanese
to only hiragana and katakana. ;)

Regards
Roger

--
Roger So                                  Debian Developer
Sun Wah Linux Limited                     i18n/L10n Project Leader
Tel: +852 2250 0230                       [EMAIL PROTECTED]
Fax: +852 2259 9112                       http://www.sw-linux.com/
Re: readline (was: Switching to UTF-8)
Bruno Haible wrote on 2002-05-02 12:23 UTC:

> There is also an alpha release of a readline version that attempts to
> handle single-width, double-width and zero-width characters in all
> multibyte locales. But it's alpha (read: it doesn't work for me yet).

Yes, it seems the train is rolling now for UTF-8 support in bash/readline
as well, which is excellent news.

  ftp://ftp.cwru.edu/hidden/bash-2.05b-alpha1.tar.gz
  ftp://ftp.cwru.edu/hidden/readline-4.3-alpha1.tar.gz

Anyone interested in joining the bash-testers list to help iron out any
problems with UTF-8 support in bash/readline should contact Chet Ramey
[EMAIL PROTECTED].

  http://cnswww.cns.cwru.edu/~chet/readline/rltop.html

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: http://www.cl.cam.ac.uk/~mgk25/
Re: Switching to UTF-8
Markus Kuhn [EMAIL PROTECTED] writes:

> c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
>    for my comfort. In particular, I don't like that the UTF-8 mode is
>    not binary transparent. Work on turning Emacs completely into a
>    UTF-8 editor is under way, and I'd be very curious to hear about the
>    current status and whether there is anything to test already. Anyone?

AFAIK, there is some activity on the Emacs 22 branch. XEmacs is in the
process of switching to UCS for its internal character set, too.
Re: Switching to UTF-8
On Wed, 1 May 2002, Florian Weimer wrote:

> Markus Kuhn [EMAIL PROTECTED] writes:
> > c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
> >    for my comfort. In particular, I don't like that the UTF-8 mode is
> >    not binary transparent. Work on turning Emacs completely into a
> >    UTF-8 editor is under way, and I'd be very curious to hear about
> >    the current status and whether there is anything to test already.
> >    Anyone?
>
> AFAIK, there is some activity on the Emacs 22 branch. XEmacs is in the
> process of switching to UCS for its internal character set, too.

I am not much of an Emacs guy, but if I were, I would probably use QEmacs,
which looks pretty decent to me: http://fabrice.bellard.free.fr/qemacs/

As I don't use Emacs, I can't really tell the difference; it might not have
all the functionality that Emacs has. But I have a feeling that the
functionality you would expect from a text editor is there. I like that
QEmacs has a much smaller memory footprint and binary size than
"mainstream" Emacs.

Open Source is funny: you probably will never hear Microsoft praising
Java ☺

Gáspár・ガーシュパール・Гашьпар・갓팔・Γασπαρ
ᏱᎦᏊ ᎣᏌᏂᏳ ᎠᏓᏅᏙ ᎠᏓᏙᎵᎩ ᏂᎪᎯᎸᎢ ᎾᏍᏋ ᎤᏠᏯᏍᏗ ᏂᎯ.
Re: Switching to UTF-8
Hi,

At Wed, 01 May 2002 20:02:57 +0100, Markus Kuhn wrote:

> I have for some time now been using UTF-8 more frequently than ISO
> 8859-1. The three critical milestones that still keep me from moving
> entirely to UTF-8 are

How about bash? Do you know of any improvement there? Please note that tcsh
already supports East Asian EUC-like multibyte encodings; I don't know
whether it also supports UTF-8. How about zsh?

For Japanese, character width problems and mapping table problems must be
solved before migration to UTF-8 can even _start_. (This is why several
Japanese localization patches are available for UTF-8-based software such
as Mutt. We should find ways to make such localization patches
unnecessary.)

Also, I want people who develop UTF-8-based software to adopt the custom of
specifying the range of their UTF-8 support. For example:

 * range of codepoints:
   U+0000 - U+2FFF? All of the BMP? The SMP/SIP?
 * special processing:
   combining characters? bidi? Arabic shaping? Indic scripts? Mongolian
   (which needs vertical writing)? How about wcwidth()?
 * input methods:
   any way to input complex languages which cannot be supported by the
   xkb mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?) Or any
   software-specific input methods (as in Emacs or Yudit)?
 * font availability:
   though no single piece of software is responsible for this, "this
   software is designed to require the Times font" means that it cannot
   use non-Latin/Greek/Cyrillic characters.

Though people in the ISO-8859-1/2/15 regions don't have to care about these
points, other people can easily believe that a piece of software "supports
UTF-8" and then be disappointed when they use it. They will then come to
distrust UTF-8-supporting software. We should avoid letting many people end
up that way.

---
Tomohiro KUBOTA [EMAIL PROTECTED] http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
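The wcwidth() point in the list above can be illustrated: terminal-oriented
software has to know how many screen cells each character occupies, which
is roughly derivable from the Unicode East Asian Width property. A
simplified sketch (real wcwidth() implementations, such as glibc's or
Markus Kuhn's wcwidth.c, also special-case control characters and more):

```python
import unicodedata

def char_width(ch):
    """Approximate terminal cell width of one character."""
    if unicodedata.combining(ch):
        return 0                  # combining marks occupy no extra cell
    if unicodedata.east_asian_width(ch) in ('W', 'F'):
        return 2                  # Wide and Fullwidth take two cells
    return 1

def string_width(s):
    return sum(char_width(c) for c in s)
```

Under this sketch an ASCII string is one cell per character, a CJK string
two cells per character, and a base letter plus combining accent still one
cell; getting this wrong is exactly what garbles cursor positioning in
readline-style programs.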
Re: Switching to UTF-8
On Thu, May 02, 2002 at 11:38:38AM +0900, Tomohiro KUBOTA wrote:

> * input methods:
>   any way to input complex languages which cannot be supported by the
>   xkb mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?) Or any
>   software-specific input methods (as in Emacs or Yudit)?

How much extra work do X apps currently need to do to support input
methods?

In Windows, you do need to do a little--there's a small API to tell the
input method the cursor position (for when it opens a character selection
box) and to receive characters. (The former can be omitted and it'll still
be usable, if annoying--the dialog will be at 0x0. The latter can be
omitted for Unicode-based programs, or if the system codepage happens to
match the characters.)

It's little enough to add it easily to programs, but the fact that it
exists at all means that I can't enter CJK into most programs. Since the
regular 8-bit character message is in the system codepage, it's impossible
to send CJK through.

How does this compare with the situation in X?

> * font availability:
>   though no single piece of software is responsible for this, "this
>   software is designed to require the Times font" means that it cannot
>   use non-Latin/Greek/Cyrillic characters.

I can't think of ever using an (untranslated, English) X program and having
it display anything but Latin characters. When is this actually a problem?

--
Glenn Maynard
Re: Switching to UTF-8
Hi,

At Thu, 2 May 2002 00:16:25 -0400, Glenn Maynard wrote:

> > * input methods
> > Any way to input complex languages which cannot be supported by the
> > xkb mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?) Or, any
> > software-specific input methods (like Emacs or Yudit)?
>
> How much extra work do X apps currently need to do to support input
> methods?

Much work. I think this is one problematic point of XIM: it has led to a
situation where only very few programs (those written by the very few
developers who know XIM) can accept CJK input. The X.org distribution (and
the XFree86 distribution) includes a specification of the XIM protocol.
However, it is difficult; at least I could not understand it. For
practical use by developers, http://www.ainet.or.jp/~inoue/im/index-e.html
is useful for developing XIM clients. I have not yet read a good
introductory article on developing XIM servers.

I think that the low-level API should integrate XIM (or other input
method protocol) support, so that XIM-innocent developers (well, almost
all developers in the world) can use it without annoying CJK people.
Gnome2 seems to take this approach. However, I wonder why Xlib doesn't
have wrapper functions that hide the troublesome parts of XIM programming.

> It's little enough to add it easily to programs, but the fact that it
> exists at all means that I can't enter CJK into most programs. Since
> the regular 8-bit character message is in the system codepage, it's
> impossible to send CJK through.

Well, I am talking about Unicode-based software. More and more developers
in the world are starting to understand that 8 bits are not enough,
because Unicode is a universal fact. I am optimistic in this field; many
developers will come to regard 8-bit characters as a bad idea in the near
future. However, it is unlikely that many developers will recognize the
need for XIM (or other input method) support any time soon, because it is
needed only for CJK languages. My concern is how to get these XIM-innocent
people to develop CJK-supporting software.

> How does this compare with the situation in X?

Though I don't know about Windows programming, I often use Windows for my
work. Imported software usually cannot handle Japanese because of font
problems. However, the input method (IME?) seems to be invoked even in
this imported software.

> > * fonts availability
> > Though each software is not responsible for this, "This software is
> > designed to require the Times font" means that it cannot use
> > non-Latin/Greek/Cyrillic characters.
>
> I can't think of ever using an (untranslated, English) X program and
> having it display anything but Latin characters. When is this actually
> a problem?

For example, XCreateFontSet("-*-times-*") cannot display Japanese,
because no Japanese fonts match that name. (Instead, mincho and gothic
are the popular Japanese typefaces.) This kind of implementation is often
seen in window managers and their theme files.

---
Tomohiro KUBOTA [EMAIL PROTECTED] http://www.debian.or.jp/~kubota/
Introduction to I18N http://www.debian.org/doc/manuals/intro-i18n/
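[Editor's note: the XLFD wildcard problem described above can be sketched with ordinary glob matching (the font names below are illustrative examples, not an inventory of any real X server). A pattern hard-coded to the Times family matches a Latin-1 Times font but no Japanese font, whose family names are things like mincho or gothic.]

```python
# Sketch: why a font set built from "-*-times-*" has nothing to
# render Japanese with. XLFD names encode family and charset;
# these example names are hypothetical.
from fnmatch import fnmatchcase

installed_fonts = [
    "-adobe-times-medium-r-normal--14-140-75-75-p-74-iso8859-1",
    "-jis-fixed-medium-r-normal--16-150-75-75-c-160-jisx0208.1983-0",
    "-misc-mincho-medium-r-normal--16-160-75-75-c-160-jisx0208.1983-0",
]

pattern = "-*-times-*"
matches = [f for f in installed_fonts if fnmatchcase(f, pattern)]

# Only the Latin-1 Times font matches; the jisx0208 fonts (families
# "fixed" and "mincho") are excluded, so Japanese text cannot be drawn.
assert matches == [installed_fonts[0]]
```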
Re: Switching to UTF-8
On Thu, 2 May 2002, Glenn Maynard wrote:

> On Thu, May 02, 2002 at 11:38:38AM +0900, Tomohiro KUBOTA wrote:
> > * input methods
> > Any way to input complex languages which cannot be supported by the
> > xkb mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?) Or, any
> > software-specific input methods (like Emacs or Yudit)?
>
> How much extra work do X apps currently need to do to support input
> methods?
>
> In Windows, you do need to do a little--there's a small API to tell the
> input method the cursor position (for when it opens a character selection
> ...
> How does this compare with the situation in X?

I know very little about Win32 APIs, but according to what little I
learned from the Mozilla source code, it doesn't seem to be as simple in
Windows as you wrote, either. Actually, my impression is that the Windows
IME APIs are almost parallel (concept-wise) to the XIM APIs. (BTW, MS
Windows XP introduced an enhanced set of IM-related APIs called TSF?.) In
both cases, you have to determine what type of preediting support (in XIM
terms: over-the-spot, on-the-spot, off-the-spot, and none?) is shared by
the client and the IM server. Depending on the preediting type, the amount
of work to be done by the client varies. I'm afraid your impression that
Windows IME clients have very little to do to get keyboard input comes
from your not having written programs that accept input from CJK IMEs
(input method editors), as seems to be confirmed by what I'm quoting
below.

It just occurred to me that Mozilla.org has an excellent summary of input
method support on the three major platforms (Unix/X11, MacOS, MS-Windows).
See http://www.mozilla.org/projects/intl/input-method-spec.html.

> It's little enough to add it easily to programs, but the fact that it
> exists at all means that I can't enter CJK into most programs. Since
> the regular 8-bit character message is in the system codepage, it's
> impossible to send CJK through.

Even in English or any SBCS-based Windows 9x/ME, you can write programs
that accept CJK characters from CJK (global) IMEs. Mozilla, MS IE, MS
Word, and MS OE are good examples.

Jungshik Shin
Re: Switching to UTF-8
On Thu, 2 May 2002, Tomohiro KUBOTA wrote:

> At Wed, 01 May 2002 20:02:57 +0100, Markus Kuhn wrote:
> > I have for some time now been using UTF-8 more frequently than ISO
> > 8859-1. The three critical milestones that still keep me from moving
> > entirely to UTF-8 are
>
> How about bash? Do you know any improvement? Please note that tcsh has
> already supported east Asian EUC-like multibyte encodings. I don't know
> whether it also supports UTF-8.

It doesn't seem to support the UTF-8 locale as of tcsh 6.10.0
(2000-11-19). I can't find anything about UTF-8 at http://www.tcsh.org.
The newest release is 6.11.0. The same is true of zsh
(http://www.zsh.org).

> > combining characters? bidi? Arab shaping? Indic scripts?
>
> and Hangul :-) Mongol (which needs vertical direction)? How about
> wcwidth()?

Pango and ST should certainly help here.

> * input methods
> Any way to input complex languages which cannot be supported by the xkb
> mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?)

You mean IIIMF, didn't you? If there's any actual implementation, I'd
love to try it out. We need a Windows 2k/XP or MacOS 9/X style
keyboard/IM switching mechanism/UI so that keyboard/IM modules targeted
at/customized for each language can coexist and be brought up as
necessary. It appears that IIIMF is the only way unless somebody writes a
gigantic one-size-fits-all XIM server for UTF-8 locale(s).

How about just running your favorite XIM under ja_JP.EUC-JP while all
other applications are launched under ja_JP.UTF-8? As you know well, it
works fine, although the character repertoire you can enter is limited to
that of EUC-JP. Of course, this is not full-blown UTF-8 support, but at
least it should give you the same degree of Japanese input support under
ja_JP.UTF-8 as under ja_JP.EUC-JP. Well, then you would ask what the
point of moving to UTF-8 is. You can at least display more characters
under UTF-8 than under EUC-JP, can't you? :-)

In the Korean case, as I wrote a couple of days ago, I had to modify Ami
(a popular Korean XIM) to make it run under ko_KR.UTF-8, because
otherwise, even though my applications were running under and fully aware
of UTF-8 (e.g. vim under a UTF-8 xterm), I couldn't enter the over 8,000
Hangul syllables that are in UTF-8 but not in EUC-KR. Moreover, under
ko_KR.UTF-8, xterm-16x and Vim 6.1 with a one-line patch work almost
flawlessly with the U+1100 Hangul Jamo. Markus, can you update your UTF-8
FAQ on this issue? Xterm has been supporting the Thai script, and that
almost automagically brought in Middle Korean support as a by-product.

BTW, Xkb may work for Korean Hangul, too, and we don't need XIM if we use
a 'three-set keyboard' instead of a 'two-set keyboard' and can live
without Hanja. I would have to know more about Xkb to be certain, though.

> Or, any software-specific input methods (like Emacs or Yudit)?

Yudit supports Indic, Thai, and Arabic pretty well as far as I know. And,
judging from what Gaspar wrote to me, Middle Korean support with U+1100
jamo is not so far away. Most of what's necessary is firmly in place,
because Gaspar has written very generic complex-script support routines
which can hopefully be used for Middle Korean without much effort.

Jungshik Shin