Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1
Tomohiro KUBOTA wrote:

From: Markus Kuhn [EMAIL PROTECTED]

to the left, not one *cell*. I know that this is not what backspace does in some EUC terminal emulators, but I believe a strong case can be made

A correction. Not *some* EUC terminal emulators, but *every* EUC terminal emulator. Do you know *any* example which is popular in the CJK world and on which a 0x08 moves two columns on a doublewidth character?

Sure, every one of the Korean emulators (for EUC-KR and Johab) I have used moves two column widths (a single Korean character) for 'backspace'. I was rather surprised to learn that Japanese terminal emulators don't.

Jungshik

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: Canonical Mode Input Processing with multi-byte character sets
On Tue, 24 Feb 2004, Derek Martin wrote:

Hi Derek,

On Tue, Feb 24, 2004 at 08:43:09PM +0900, Jungshik Shin wrote:

Please, read what I wrote more carefully. I did write that deleting the last letter is more useful when you're in the middle of typing a sequence of letters to form a syllable.

I think we're talking past each other here... I noted that and I agree with it. It's specifically the fact that once I type the third character of a hangeul glyph, I can't backspace and change ONLY that last character, that annoys me. You say that most Koreans prefer that behavior, and I believe you. But I can't for the life of me understand why... ;-) To me, it seems unnatural and inefficient.

Sorry for my misunderstanding. As you may know by now, the Korean script has several different facets. It's alphabetic, syllabic and featural all at the same time. Therefore, different implementations at different times on different platforms take different approaches when it comes to representing and processing the Korean script on computers. Because you live in Korea now, you must have seen the keypad of Korean mobile phones and may have learned how to type Korean on it. It uses three keys for vowels and 6 keys for consonants. See how the consonants are grouped and you may understand why the Korean script is featural.

Almost invariably once I've committed an erroneous syllable, it's not the whole syllable I need to replace, but only the last character which I flubbed. Otherwise, if I made a mistake before the syllable

Anyway, I understand where you're coming from. Your complaint is perfectly valid. What you want can and must be implemented. Actually, Nabi may already have implemented it because its input automaton is based on U+1100 Hangul Jamos. In addition, I have the same complaint about the most popular Korean mobile phone keypad. It takes a lot more key strokes to enter a single syllable and it's annoying to find 'backspace' delete the whole syllable instead of the last letter typed.

However, 9th graders on the street don't seem to have a problem at all because they can type Korean so fast with the keypad that having to enter a syllable from the beginning doesn't appear to matter to them. So, I guess your problem would go away as you get more familiar with your Korean keyboard and input method.

However, incremental search needs to be done with individual letters as the unit instead of syllables. I think Indian people have similar needs. Incremental search with letters as units was implemented in only one program (Korean Emacs: Hanemacs by KIM Kang-hee) as far as I know. It would be great if it were implemented in Mozilla's 'find as you type'.

    LANG=en_US.UTF-8 (or en_GB.UTF-8, en_CA.UTF-8)
    LC_CTYPE=ko_KR.UTF-8
    LC_MESSAGES=en_US.UTF-8  # not necessary unless LC_ALL is set, but
    LC_TIME=en_US.UTF-8      # just to be sure.

---

    # .profile (or whatever)
    LANG=en_US.UTF-8
    LC_COLLATE=C  # I like ASCII sorting for most applications...
    ...
    export LANG LC_COLLATE ...

Then, when I start up an application where I want to type Korean, I originally tried starting it like this:

    $ LANG=ko_KR.UTF-8 LC_COLLATE=ko_KR.UTF-8 LC_MESSAGES=en_US.UTF-8 gedit

2. Hangeul input via ami simply didn't work.

There's one missing piece here. Sorry I forgot to tell you. You have to set XMODIFIERS to '@im=Ami'. If you log on with the Korean locale selected in KDM/GDM, this variable is automatically set for you on most Linux distributions. However, apparently you don't, so you have to set it manually.

1. Menus were in Korean

Really? Hmm, you may have 'LINGUA' or something like that (a non-standard GNU extension) set to Korean. Make sure it's unset.

As it happens, until recently the most common case where I wanted to do this was with mozilla. It wasn't a major problem then, because my installation of Mozilla had no Korean. But as my Korean improves, I have more and more cases where I want to do this. Of course, I'm also better able to navigate the menus, but that's beside the point... :)

Actually, Mozilla language packs work independently of the locale. No matter what your locale is, you can have Mozilla's menus in any language for which you have installed the language pack. However, Ami works with Mozilla only if Mozilla is launched with LC_CTYPE (or equivalent) set to ko_KR.UTF-8/ko_KR.EUC-KR. BTW, it should be fixed to work with any UTF-8 locale. Hmm, I'm gonna add it to the TODO list.

Jungshik
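The letter-wise backspace discussed in this message can be sketched with the standard Hangul syllable arithmetic from the Unicode Standard. This is only an illustration, not how Nabi or Ami work; the function name backspace_last_letter is hypothetical, and compound finals and compound vowels are removed as a single unit here, which a real input method might handle more finely.

```python
S_BASE = 0xAC00          # first precomposed Hangul syllable
L_BASE = 0x1100          # first leading-consonant jamo
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables

def backspace_last_letter(text: str) -> str:
    """Delete only the last letter of the final character.

    For a precomposed Hangul syllable, drop the final consonant if there
    is one, otherwise drop the vowel and keep the leading consonant jamo.
    Any other character is deleted as a whole.
    """
    if not text:
        return text
    head, last = text[:-1], text[-1]
    index = ord(last) - S_BASE
    if not 0 <= index < S_COUNT:
        return head
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, final = divmod(rest, T_COUNT)
    if final:
        # Same leading consonant and vowel, no final consonant.
        return head + chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT)
    # No final consonant: only the leading consonant jamo remains.
    return head + chr(L_BASE + lead)

print(backspace_last_letter("한"))  # U+D55C loses its final consonant
```

A syllable-wise backspace, the behaviour described as the common default, would simply return text[:-1].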
Re: Does Hotmail support UTF-8 emails properly?
Richard Jones wrote : On Sun, Feb 01, 2004 at 05:35:04AM +0900, Jungshik Shin wrote: ASCII are compatible). For your mail-sending web form, why don't you send an email to yourself and view it with mail clients that are well I18Nized such as Mozilla-Mail, Mozilla Thunderbird and MS Outlook Express? Unfortunately Hotmail is what the majority of the target audience use. I've now changed the script so that it uses iconv to convert everything to ISO-2022-JP before sending, and now it works OK in Hotmail. That's unfortunate, indeed. However, it's not that bad if your recipients are all Japanese and they don't need to receive non-Japanese emails. BTW, I mentioned Mozilla/MS OE as a way to make sure that your mail-sending form works correctly because you were not sure that it worked correctly. Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Linux console UTF-8 by default
Edward H. Trager wrote:

On Saturday 2004.01.10 20:48:31 +0330, Roozbeh Pournader wrote:

On Sat, 2004-01-10 at 20:36, Edward H. Trager wrote:

Is there any good reason why implementors would not support the full range of Unicode -- i.e., UTF-8 up to six serialized bytes?

UTF-8 up to four bytes, you mean. See http://www.faqs.org/rfcs/rfc3629.html.

I guess I was recalling (from http://www.cl.cam.ac.uk/~mgk25/unicode.html) that six bytes allows encoding all possible 2^31 UCS code points, although I suppose nothing above plane 1 has been defined. - Ed Trager

Plane 2 has tens of thousands of Chinese characters and Plane 14 has variation selectors and language tags. However, nothing will ever be defined above Plane 16. JTC1/SC2/WG2 made a firm commitment to that.

Jungshik
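The four-byte versus six-byte point can be checked with a short Python sketch. The function name utf8_length and the allow_31bit switch are made up for illustration; the byte boundaries are those of RFC 3629 and of the older 31-bit scheme it replaced. Surrogate code points are not treated specially here.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of Plane 16

def utf8_length(cp: int, allow_31bit: bool = False) -> int:
    """Number of bytes needed to serialize code point cp in UTF-8.

    With allow_31bit=True the obsolete 5- and 6-byte forms of the
    original 31-bit scheme are accepted as well.
    """
    limits = [(0x7F, 1), (0x7FF, 2), (0xFFFF, 3), (0x1FFFFF, 4)]
    if allow_31bit:
        limits += [(0x3FFFFFF, 5), (0x7FFFFFFF, 6)]
    elif cp > MAX_CODE_POINT:
        raise ValueError("beyond Plane 16; not encodable under RFC 3629")
    if cp < 0:
        raise ValueError("negative code point")
    for limit, nbytes in limits:
        if cp <= limit:
            return nbytes
    raise ValueError("code point does not fit in 31 bits")

# The highest code point that will ever be assigned needs four bytes.
print(utf8_length(MAX_CODE_POINT), len(chr(MAX_CODE_POINT).encode("utf-8")))
```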
Re: devanagari question
On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

If you yearn for the old days

You seem to have a very slow mind.

I don't know whose mind is slow. I gave all the necessary information and you still couldn't make it work. Here's one more try with step-by-step instructions (actually, there's not much to tell you because you must have taken most of these steps):

1. Download the Sun Indic fonts, which you already did.

2. Put them (there are two of them) into a directory of your choice (say, /usr/local/share/fonts), which you must have done already.

3. Edit /etc/fonts/local.conf or $HOME/.fonts.conf and add the directory above to the font search path. You can skip this step if you put the fonts into one of the directories (or a subdirectory of one) already listed in /etc/fonts/fonts.conf, /etc/fonts/local.conf and $HOME/.fonts.conf, such as /usr/share/fonts or /usr/share/fonts/indic.

3b. Although not necessary (because fontconfig scans font directories regularly), run the following if you want to make sure:

    fc-cache -v -f directory_name

4. Launch Mozilla (built with CTL and Xft) and enjoy. Your web page was written in such a way that no further configuration is necessary on Mozilla's side.

5. _Optionally_, go to the font pref. panel of Mozilla and set the Devanagari fonts to Sun's fonts. Also make sure 'allow documents to use other fonts' is NOT checked. This is necessary for viewing other Hindi pages. Because most other Hindi sites don't specify 'lang=hi' [1], you have to launch Mozilla under the hi_IN locale (i.e. 'LC_ALL=hi_IN.UTF-8 mozilla') [2].

For an X11core build (with CTL but NOT with Xft), you have to follow the steps (which can be simplified slightly with chkfontpath, available on FC1/RH/Mandrake) described at http://bugzilla.mozilla.org/show_bug.cgi?id=176315#c14 or equivalent. (The last two fields of the XLFD for the Sun Indic fonts should be 'sun.unicode.india-0' instead of 'hykoreanjamo-1'.) See also http://bugs.xfree86.org/show_bug.cgi?id=939. With the encoding file for the Sun Indic fonts, you don't need to make aliases.

If you want to use 'standard' opentype fonts for Devanagari, you can try the latest (but still old/outdated) patch at http://bugzilla.mozilla.org/show_bug.cgi?id=215219

[1] The BBC Hindi site will begin to use 'lang=hi' in a couple of weeks.
[2] You don't have to once Mozilla bug 208479 is fixed.
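As a rough illustration of step 3 above, the following Python sketch prints a minimal user-level fontconfig file that adds one directory to the font search path. The function name fonts_conf_with_dir is hypothetical; the dir element inside a fontconfig root is standard fontconfig syntax, but an existing $HOME/.fonts.conf should be edited by hand rather than overwritten with this output.

```python
from xml.sax.saxutils import escape

def fonts_conf_with_dir(font_dir: str) -> str:
    """Return the text of a minimal fontconfig file listing one extra font directory."""
    return (
        '<?xml version="1.0"?>\n'
        '<!DOCTYPE fontconfig SYSTEM "fonts.dtd">\n'
        "<fontconfig>\n"
        f"  <dir>{escape(font_dir)}</dir>\n"
        "</fontconfig>\n"
    )

# /usr/local/share/fonts is the example directory from step 2.
print(fonts_conf_with_dir("/usr/local/share/fonts"))
```

After saving such a file, running fc-cache as in step 3b and then fc-list should show whether the fonts were picked up.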
Re: devanagari question
On Sat, 3 Jan 2004, Jungshik Shin wrote: On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote: If you yearn for the old days You seem to have a very slow mind. I don't know whose mind is slow. I gave all the necessary information and you couldn't still make it work. Here's one more try with a I'm sorry I forgot that I always had built Mozilla with a patch that went into the trunk only a few days ago. That patch was made so long time ago (and it's only necessary for Devanagari but not for Tamil) that it was taken for granted by me, but it was not in the tree until a few days ago. The patch to apply (you only need to apply the patch if you download 1.6b release source instead of the CVS trunk source) is available at http://bugzilla.mozilla.org/show_bug.cgi?id=203406 (the last patch uploaded there). BTW, X11core build doesn't need this patch to work although with the patch, it works better. Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: devanagari question
On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

Good. So no need to worry about the html page.

Actually, there is. By 'sun_devanagari_font', I didn't mean that you use that verbatim but that you have to replace that name with the actual name of Sun's font. Besides, it's always good practice to put one of the five CSS generic font families (serif, sans-serif, etc) at the end of your font list as I wrote.

Remains to worry about Mozilla and/or the X server and/or fontconfig.

The X server plays only a small part in the equation as long as it supports the Render extension. Did you put your Sun Saraswati fonts (two of them) in one of the directories searched by fontconfig?

things work. Am quite prepared to use cryptic names like -altsys-saraswati5-medium-r-normal--0-0-0-0-p-0-iso10646-1

Well, with that XLFD name, Mozilla (X11core build) wouldn't recognize it as a SunIndic font, so Devanagari wouldn't get rendered as it should. You have to alias it so that the last two fields of the XLFD are sun.unicode.india-0 (or something like that) by editing the fonts.alias file, along with some other chores involved in X11 font installation. That's one of the reasons I told you to use an Xft build.

but you seem to imply that life is simpler today. Not yet for me.

If you yearn for the old days of XLFD, X11core fonts and mkfontdir/mkfontscale/xset fp/chkfont/xfs/fonts.dir/fonts.alias/fonts.scale etc, you can stay there by continuing to use a non-Xft (X11core) build of Mozilla. However, for the increasing number of programs in modern Linux distributions, you won't have a choice soon when gtk2 stops honoring GDK_USE_XFT=0.

[Answering my own question from yesterday night - the new Mozilla build shows as possible font choices things in the output of fc-list on the client.]

Where have you been during the client-side font revolution? On Mars ;-) ?
Re: devanagari question
On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote: [Installed Fedora 1 on a spare machine - compiled Mozilla 1.6b after ./configure --enable-ctl --enable-xft . It runs fine (*), but doesnt show what I expect to see.] Let me repeat my question, this time referring to http://homepages.cwi.nl/~aeb/moz/test.html It works fine on my machine with SunIndic truetype fonts installed. The string there is rendered exactly like the image below. [Apart from the obvious Mozilla bugs, there is a change in behaviour. The old build showed in Edit/preferences/appearance/fonts actual font names, the new build shows font family names. The font names were very recognizable: just the output of xlsfonts. These font family names have an origin unclear to me. Mozilla does not run on the X server, but the X server has the fonts, maybe there is a problem there?] Not at all. As I explained at least two times on this list, there are two flavors of Mozilla-builds, X11core build and Xft (client-side font) build. The latter does NOT use 20-year old (broken) XLFD based font selection scheme any more. The font selection in Xft build works more like that on Windows and MacOS (and more in line with CSS). You don't think end-users have to care for seeing all those (cryptic to them) 'iso8859-1', 'iso10646-1', 'jis0208.1980-0' and things like that, do you? Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: devanagari question
On Sun, 28 Dec 2003 [EMAIL PROTECTED] wrote: but I tried compiling on a Debian (Woody) and on a RedHat (7.2) machine. In both cases Mozilla-1.6b. For Debian the compiled binary does not run. Errors are like reported: ./mozilla-bin: relocation error: mozilla/dist/bin/components/libgfx_gtk.so: undefined symbol: GetContent__C8nsIFrame Obviously, I can't possibly know what's wrong with your Debian build environment (linker, compiler, etc) :-) Why don't you post to netscape.public.mozilla.unix newsgroup at news.mozilla.org with details including the output of 'nm'? For RedHat the version compiled with --enable-ctl runs, but still does not handle devanagari. Did you install Sun's fonts? It only works with Sun's fonts I mentioned if it's not clear from my post and i18n rel. notes. Although there's a way to make it work with a non-Xft build (I wouldn't explain it to you), I'd recommend you build with 'enable-xft'. [On the other hand, adding --enable-xft fails (on Debian): checking for xft... Package xft was not found in the pkg-config search path. Your Debian seems pretty much outdated as far as Xft/fontconfig is concerned. Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: devanagari question
On Sun, 28 Dec 2003 [EMAIL PROTECTED] wrote:

[A week or so ago I wrote a multilingual text, and several languages failed under default Mozilla. If we succeed in getting a version that handles devanagari then a next point

You have to make sure to tag the Devanagari part with 'lang=hi-IN' for html and 'xml:lang=hi-IN lang=hi-IN' for xhtml (if it's Hindi). That is, you have to do something like one of these for XHTML:

    <p lang="hi-IN" xml:lang="hi-IN"> ... </p>
    <div lang="hi-IN" xml:lang="hi-IN"> ... </div>
    <html lang="hi-IN" xml:lang="hi-IN"> ... </html>
    <body lang="hi-IN" xml:lang="hi-IN"> ... </body>

You may also 'style' Devanagari parts with the following style:

    font-family: sun_devanagari_font, default_devanagari_font_on_Windows, default_devanagari_font_on_Mac, some_free_devanagari_opentype_fonts, generic_css_family

The reason you have to put 'sun_devanagari_font' at the beginning is that 'sun_devanagari_font' is not likely to be installed on most Windows/Mac OS X systems, so it doesn't do any harm there, while for Mozilla-Linux it's essential that it's picked up _before_ other Devanagari fonts likely to be installed on Linux. Certainly, things should be easier than this, but that's where Mozilla stands at the moment.

for discussion will be vocalized Hebrew. For now the first

It's not likely to work yet because vocalized Hebrew involves combining marks (right?).

Jungshik
Re: devanagari question
[EMAIL PROTECTED] wrote:

Jungshik wrote: lots of good advice

Thanks !

You're welcome.

However, I will not pursue this further. Have no time. For the time being it seems this is something where Internet Explorer works, and Mozilla still requires a nontrivial amount of work.

There are certainly a lot of things to do, but that doesn't mean that it doesn't work. On Windows 2k/XP, the _default_ Mozilla build works almost as well as MS IE for complex scripts (except for rendering justified text and cursor movement/selection). On Unix/Linux and Win 9x/ME, you need a CTL-enabled build and the right font.

(Posted to mozilla-build or so. Awaiting moderator approval.

If you had used the news server (news.mozilla.org) instead of the mailing list, it would have been posted without approval.
Re: devanagari question
On Tue, 23 Dec 2003 [EMAIL PROTECTED] wrote: Recently I noticed that for me the sequence U+092C U+093F (b i) is rendered by Mozilla as b followed by i, while in fact the i glyph should precede the b glyph. Is something wrong in my expectations? or in Mozilla? or in my Mozilla 1.5 setup? Devanagari is not supported by the default Mozilla build on Linux (as noted in the international known issues page.) On Windows 2k/XP, Devanagari, Thai, Tamil, Korean and other complex scripts supported by Uniscribe are supported (although somewhat limited) if you install any of complex script support packages (go to Control panel / International or something like that) and reboot. On Windows 9x/ME, only Tamil and Korean are supported with 'special' fonts. Thai is supported only on Thai version of Win 9x/ME. If you want to make Mozilla support Devanagari on Linux, you have to download the trunk source from the CVS, build with 'enable-ctl', and 'gtk' (for gtk2 + ctl, see mozilla bug 189433) If you like 'Xft' (as many others do and I strongly recommend), turn on 'enable-xft' as well. Then, install SunIndic font (truetype version for 'Xft') available at http://developer.sun.com/techtopics/global/index.html (follow the link for free Indian font). (Funny setup, to be broken by default, but even the release page http://www.mozilla.org/releases/mozilla1.6b/known-issues-int.html mentions this. See also http://bugzilla.mozilla.org/show_bug.cgi?id=201746 .) Nothing funny. Complex script support is not that simple especially when you have to retrofit it. I'd love to turn it on by default, but the cursor movement issue has to be resolved before turning it on (see bug 203406 as well). And, eventually, we have to use Pango (see bug 215219). 
that source was so dirty - the produced binary failed with errors like

    ./mozilla-bin: relocation error: mozilla/dist/bin/components/libeditor.so: undefined symbol: GetViewExternal__C8nsIFrameP14nsIPresContext

In the mozilla binary directory, you have to run

    $ sh run-mozilla.sh ./mozilla-bin

By directly running 'mozilla-bin', you made it pick up symbols from some other places (probably, system-wide nspr/xpcom/* shared libraries installed on your system.) BTW, see also http://sila.mozdev.org

Jungshik
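The reordering asked about at the top of this message (U+092C U+093F, stored as b then i but displayed with the i glyph first) can be illustrated with a deliberately simplified Python sketch. It is not Mozilla's CTL code: real shaping engines work on glyphs and move the vowel sign in front of a whole consonant cluster, while this toy version only moves it in front of the single preceding character. The name visual_order is hypothetical.

```python
VOWEL_SIGN_I = "\u093f"  # DEVANAGARI VOWEL SIGN I, written to the left of its consonant

def visual_order(logical: str) -> str:
    """Return the characters in left-to-right display order.

    Unicode stores the vowel sign i after its consonant (logical order);
    for display it has to be placed before that consonant.
    """
    out = []
    for ch in logical:
        if ch == VOWEL_SIGN_I and out:
            out.insert(len(out) - 1, ch)  # step back over the preceding consonant
        else:
            out.append(ch)
    return "".join(out)

logical = "\u092c\u093f"  # b + i, as stored in the document
print([f"U+{ord(c):04X}" for c in visual_order(logical)])
```

A build without complex-script support skips this step and draws the glyphs in storage order, which matches the symptom reported.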
Re: devanagari question
On Wed, 24 Dec 2003, Jan Willem Stumpel wrote:

It would be nice if solutions to common problems (in this case 'how to put an UTF-8 string on to the screen', solved, e.g., by Openoffice) were shared between different open-source projects.

OpenOffice uses ICU's layout engine, which supports some complex scripts but not all of them. In the case of AbiWord, I don't know anything about its internals, but ICU and Pango (http://www.pango.org) are two obvious choices (both are open-sourced) if its developers want to support complex scripts (Brahmi-derived scripts such as Devanagari, Tamil, Telugu, Thai, Lao, Khmer and Tibetan, as well as Korean Hangul and Mongolian). Does it support scripts that require BIDI/RTL (Hebrew, Syriac and Arabic among others)? Also, note that even the Latin, Greek and Cyrillic alphabets are complex once you go beyond the basics, because some languages need a base letter + combining diacritic marks for which there's no precomposed form in Unicode.

Jungshik
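The last point, that some base letter + combining mark sequences have no precomposed form, can be seen with Python's standard unicodedata module. This is only a small demonstration; the helper name composed_length is made up.

```python
import unicodedata

def composed_length(base: str, mark: str) -> int:
    """Number of code points left after NFC normalization of base + mark."""
    return len(unicodedata.normalize("NFC", base + mark))

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# 'e' + acute has a precomposed form (U+00E9), so NFC yields one code point.
print(composed_length("e", ACUTE))
# 'q' + acute has no precomposed form, so the combining mark remains and
# the renderer has to position it over the base letter itself.
print(composed_length("q", ACUTE))
```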
Re: Unicode fonts on Debian
On Sat, 20 Dec 2003, Edward H. Trager wrote: On Saturday 2003.12.20 15:06:11 +0100, Jan Willem Stumpel wrote: Actually, no. I think I already explained this. Yes, you did (on 15 December). Sorry. I stand corrected. So: the default language group is determined by the UTF locale (which s/UTF// :-) incidentally also determines Mozillas GUI font). On Linux, the default language group determines the fonts which Mozilla tries to use (by preference) for displaying all Unicode characters. On Yes, unless there are other pieces of information that are more relevant. Windows, the preferred font is determined by the code range, which seems more sensible, and in your bug report you propose to have the same mechanism on Linux also. I second that: Regardless of what mechanisms are used, it would be very nice if Mozilla worked identically on Linux and on Windows. (moved below) Also, I assume that it would lead to some slight simplification of the Mozilla code base, Nobody would ever disagree with you. Do you seriously believe Mozilla developers would make their tasks more difficult not doing what you wrote? However, the reality is not that simple. Note that on Linux/Unix alone, we have a few different toolkits/font technologies to support that are very different in their characteristics (XLFD vs fontconfig). Aside from Linux, gecko-based browsers run not only on Win 9x/ME and Win2k/XP (they're different OS' in many aspects) but also on several Unix', OS2, Mac OS X, Qnx, and VMS (and an unknown number of embedded devices). There might (or might not) be a way to abstract away all these platform/toolkit dependencies, but the current level of the abstraction in Mozilla is not there yet. If we could use 'fontconfig' (+ pango or ICU) _everywhere_, it'd be easy to do that. However, we'd not want to ask Mozilla users on Windows or Mac OS X to install fontconfig + pango or ICU. Including them into Mozilla is obviously out of question because Mozilla without them is already too 'fat'. 
That makes it much easier for developers who have to test whether web pages look the same on different platforms. Well, the platform-dependent font availability is another important factor that makes the platform parity hard to achieve. Probably not :-( , because when I try it on Win98 with Mozilla 1.5, accessing a page with span lang=ru /span , Putin is in the Cyrillic preferred font, while Yeltsin is in the Western font. Exactly the same as in Linux. There's another factor I didn't mention that affects when/whether 'Unicode char. to script' mapping kicks in. Mozilla-Win tries to stay in the currently selected font as much as possible to avoid 'ransom note' style (which looks horrible in some cases) rendering. Therefore, as long as the current font can cover Cyrillic letters, I believe it wouldn't switch. However, I guess 'lang=ru, xml:lang=ru' is regarded as a strong indication of the authorial intent that warrants the font switching. (it's been a while since the last time I looked at that part of the code so that I'm just writing from memory.) BTW, Mozilla doesn't do any 'global optimization' [1] in the font selection as might be done by some word processors or other rendering engines/libraries (e.g. Pango or ATSUI on Mac OS X). That is, its text drawing/measuring routines can take only a small text chunk (sometimes just a single character) at a time and doesn't know anything beyond that. So I _still_ dont understand it (including your bug report). Apologies in advance if I have overlooked something obvious.. You don't have to apologize. It's complicated and the only way to understand it fully is to read the code and work on it. Although I worked on Windows and Gtk (Linux/Unix) ports of Mozilla's text drawing/measuring routines for a while, I don't claim to know every gory detail. What's certain is that Mozilla developers try to match what's stipulated in the CSS specification (http://www.w3.org/TR/CSS2) [2]. 
Whether they're successful or not is another matter, though. Jungshik [1] http://www.ifi.unizh.ch/groups/mml/people/mduerst/papers/PS/FontComposition.ps.gz [2] See, for instance, http://bugzilla.mozilla.org/show_bug.cgi?id=227889 -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode fonts on Debian
On Wed, 17 Dec 2003, Jan Willem Stumpel wrote: [EMAIL PROTECTED] wrote: http://ken2403king.kir.jp/form.htm Thats a funny one, indeed. When I opened it in Mozilla it was displayed as .For a moment I thought it was Chinese (which I do not know) but it is gibberish. Mozilla thought it was Chinese Simplified GB 18030. The source says html LANG=ja. It is Japanese with shift-jis encoding, in reality it says . (Isnt Unicode fun, allowing to put both variants in a mail message, just by copying from the Mozilla screen like this..) So, isnt the LANG attribute *more* irrelevant, because it did not help Mozilla (1.5a) to display the text correctly? It's impossible to infer the document encoding from 'lang' tag. With NCRs, any document encoding can be used to represent any Unicode characters. Even if that's not the case, how could you know if it's Shift_JIS, EUC-JP or ISO-2022-JP or EUC-JP (with JIS X 0213) _purely_ based on the value of 'lang' (suppose we don't have UTF-8, UTF-16, UTF-32, for the sake of argument). The value of 'lang' plays a role ONLY after the identity of characters in documents are determined. See below. A META tag attribute charset=shift-jis added to (a copy of) the page did. Doesnt that mean that encoding is more relevant than language? Internally, Mozilla works in terms of Unicode. That is, it has to determine the document encoding correctly (to convert a 'byte stream' in the document to render) to a Unicode character 'stream' before doing any font selection. If it mistakes Shift_JIS for GB18030, what the character drawing routine receives doesn't make sense and the 'langGroup' inferred from the document encoding is in conflict with (with NCRs to represent any Unicode characters, whether they're covered by the current document encoding, this could happen all the time) the language specified in the document(a part thereof). Which one is given a higher priority? IIRC, it's the latter. 
So Mozilla tries to render what it regards as 'a document in GB18030' (which is actually in Shift_JIS) with Japanese fonts if possible.

BTW, as you know, GB18030 is another UTF, so even without resorting to NCRs (&#xhhhh; or &#dddd;) it can cover the full range of Unicode.

Another BTW: which character encoding Mozilla comes up with for unlabelled documents depends on your setting in View | Character coding | Autodetect. If it's set to Chinese, it'll come up with one of the Chinese encodings for a Shift_JIS document. Therefore, properly labelling html/xhtml/css documents is very important. Try the document in question with the html/xhtml validator at http://validator.w3.org and see what it says.

Jungshik
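The two stages described in this message (first bytes to Unicode via the document encoding, only then language and font selection) can be illustrated in Python. This is a sketch, not Mozilla's detector; the sample string is an arbitrary choice, and the codec names are those of Python's standard library.

```python
import html

text = "日本語"                      # sample Japanese text (arbitrary choice)
raw = text.encode("shift_jis")      # what an unlabelled Shift_JIS page sends

right = raw.decode("shift_jis")                     # encoding identified correctly
wrong = raw.decode("gb18030", errors="replace")     # encoding misdetected as GB18030

print(right == text)   # correct charset recovers the text
print(wrong == text)   # wrong charset gives different (meaningless) characters

# A numeric character reference names a Unicode character directly,
# whatever the document encoding is.
print(html.unescape("&#xC11C;") == "\uc11c")
```

No value of 'lang' could repair the second case, because the characters themselves have already been misidentified before any font is chosen.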
Re: Unicode fonts on Debian
On Fri, 19 Dec 2003, Eric Streit wrote:

I have a small question ... The pages are perfectly rendered on the screen, but when it comes to printing, only one encoding is done and all the other glyphs are converted to missing-characters. Why not Mozilla ?

That's partly because Mozilla's printing on Unix has a lot of things to improve and partly because you didn't configure it properly. Well, the latter is also partly due to the former (it should be easier and more intuitive to configure). In my posting in this thread, I explained three different printing 'modules' and gave some references. If you're interested in printing Latin letters and Cyrillic letters, all three methods should work, but Xprint and Freetype printing should give you better results than the default PS module (which is always the case for any script).

How to use Xprint with Mozilla is well documented at http://xprint.mozdev.org.

As for freetype printing, you have to edit either the global (system-wide) unix.js (found in places like /usr/lib/mozilla-1.5/defaults/prefs/unix.js; from this, you may guess where it's actually placed on your system) or the per-profile configuration file prefs.js in $HOME/.mozilla/profile_name/salted_name/prefs.js (where salted_name is like 'k9xkxtyu.slt') to add the following:

    pref("font.FreeType2.enable", true);
    pref("font.FreeType2.printing", true); // on by default in mozilla.org builds.
    pref("font.freetype2.shared-library", "libfreetype.so.6");
    pref("font.directory.truetype.1", "/true/type/dir/1st");
    pref("font.directory.truetype.2", "/true/type/dir/2nd");
    pref("font.directory.truetype.n", "/true/type/dir/nth");

where '/true/type/dir/1st' through '.../nth' are directories with truetype fonts. If you edit the latter (the per-profile user configuration), you have to use 'user_pref' in place of 'pref'. The latter should be edited while Mozilla is NOT running. Alternatively, you can edit them by typing 'about:config' in the location bar.
In the 'filter' box at the top of the page, type 'freetype' and you can change the value as you wish by right-clicking with a pref. entry you want to edit selected. If you want to add a new entry, you can choose 'New | Entry type' in the pop-up menu that comes up. Hope this helps, Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode fonts on Debian
Edward H. Trager wrote: On Saturday 2003.12.13 15:23:30 +0100, Jan Willem Stumpel wrote: Does anyone have a step-by-step description of how to install Bitstream Cyberbit in Debian Sid? And similarly for (MS) Arialuni? I am still puzzled on when exactly what font is used for display and for printing in the various Mozilla versions. Each time I think 'I got it' it turns out that 'I didnt get it'... I don't know whether the following page will answer your question or not: http://eyegene.ophthy.med.umich.edu/unicode/#fonts In Edit|Preferences|Appearance|Fonts, Mozilla provides options for specifying fonts for various script encodings, so you should be able to fine tune exactly which fonts get used. I wouldn't use 'fine-tune' and 'exactly'. As I wrote in my previous messages, Mozilla's font selection algorithm is complex and Mozilla contributors (including myself) have put a lot of time and efforts, but still there are issues. Besides, Mozilla's font selection menu is NOT per 'font encoding' BUT per 'langGroup' (which had better be called 'script group'). Only in Mozilla-X11core build, the loose mapping between 'font encodings' (XLFD-based) and 'langGroups' exists. There is also a checkbox to Allow documents to use other fonts which I assume means that if the right glyph isn't found in the specified Unicode font, a glyph will get picked from whatever remaining installed font has that glyph. No, that doesn't mean that. That checkbox controls whether or not author-specified fonts (via font-family in CSS and font-face in old style html) should be given a higher priority than fonts configured in Mozilla's font selection menu. If it's not checked, author-specified fonts are ignored. I see this happen when I view Chinese pages with unusual characters in them. Whether the above option is turned on or not, Mozilla does its best to render every character. If it fails, it falls back to transliteration on Windows and Linux (if X11core-build is used). 
With Mozilla-Xft, it instead draws the character's 4-digit (BMP) or 6-digit (non-BMP) hexadecimal code point inside a rectangle. Jungshik
Re: Unicode fonts on Debian
On Tue, 16 Dec 2003, Edward H. Trager wrote: On Wednesday 2003.12.17 00:24:54 +0900, Jungshik Shin wrote: Edward H. Trager wrote: In Edit|Preferences|Appearance|Fonts, Mozilla provides options for specifying fonts for various script encodings, so you should be able to fine tune exactly which fonts get used. Mozilla's font selection menu is NOT per 'font encoding' BUT per 'langGroup' (which had better be called 'script group'). Only in Mozilla-X11core build, the loose mapping between 'font encodings' (XLFD-based) and 'langGroups' exists. I wish I understood this better! What exactly does langGroup or scriptGroup mean in Mozilla? Can you point me to 'scriptGroup' is just a term coined by me that I believe is better than 'langGroup' because it's not languages but scripts that are relevant here. 'langGroup's in Mozilla include 'Western', 'Central European', 'Japanese', 'Cyrillic', 'Arabic', 'Hebrew', 'Tamil', 'Devanagari', and so forth (just what you see in the font-selection dialog). a URL that explains exactly how Mozilla does these things, and how that might be different from, say, the xft/fontconfig way of doing things? I tried to explain it in my long email you quoted in your previous email apparently without reading it. Maybe not very clearly, but my two emails (before your first email in this thread) answered most of your questions. Clearly, from a user's perspective I was led to believe something possibly quite different about these dialogs in Mozilla. What did you believe was the case? Then, I'll go from there if necessary. Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode fonts on Debian
On Sun, 14 Dec 2003, Jan Willem Stumpel wrote: In the Mozilla font preferences you can set font preferences for Unicode, as well as for specific languages like Western, Japanese, etc. Am I then correct in assuming that the language-specific preferences always take priority over the Unicode preferences? Even when displaying a Web page which has charset=utf-8 in the headers? Yes, it's confusing. I think we should get rid of the font preference entry for Unicode because that's just confusing (there is some use for it at the moment, though). The font selection in Mozilla is strongly influenced by 'langGroup' (had better be 'script' or 'script group'). How is it determined? If there's an explicit specification of the language with 'lang' in html and 'xml:lang' in xml/xhtml in the document [1], it's honored. If not, it's inferred from the document encoding. Obviously, this inference doesn't work at all for utf-8. Currently, Mozilla uses the 'langGroup' corresponding to the current locale for UTF-8 documents. That is, if you run Mozilla under zh_TW.(UTF-8|big5|EUC-TW) locale, the langGroup of utf-8 document is regarded as zh-TW. This doesn't work well and totally breaks down when you have an iso-8859-1 (or any other non-Unicode encoding) documents with a lot of characters outside the repertoire of ISO-8859-1 represented in NCRs. (see http://bugzilla.mozilla.org/show_bug.cgi?id=208479 and http://bugzilla.mozilla.org/show_bug.cgi?id=91190). To work around this problem, Mozilla on Windows maps Unicode code blocks to Mozilla's 'langGroups', which achieves what you asked below. In other words is there a mechanism (inside Mozilla) that says - hmm... I have to display the character with number 49436 (hex C11C) here. - this character is in the range of Korean syllables. - now has a language-specific Korean font been specified? If so Ill use it. - If not, I use the Unicode font (Bitstream Cyberbit, or whatever). 
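The per-character lookup described in the list above (find the script of a code point, use the font configured for that script if there is one, otherwise fall back to the 'Unicode' font) can be sketched roughly as follows. This is only an illustration of the idea, not Mozilla's code: the range table is deliberately tiny, the group names are simply the labels from the font dialog, and `lang_group_for` and `pick_font` are hypothetical helper names.

```python
import bisect

# Illustrative (start, end, group) ranges; a real table would cover far
# more blocks. Group labels are assumptions, not Mozilla's internal names.
RANGES = [
    (0x0000, 0x00FF, "Western"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x3040, 0x30FF, "Japanese"),   # Hiragana and Katakana only
    (0xAC00, 0xD7A3, "Korean"),     # Hangul syllables
]
STARTS = [start for start, _end, _group in RANGES]


def lang_group_for(cp, fallback="Unicode"):
    """Return the script group for code point cp, or the fallback group."""
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]:
        return RANGES[i][2]
    return fallback


def pick_font(cp, fonts_by_group, unicode_font):
    """Prefer the font configured for the script group, else the Unicode font."""
    return fonts_by_group.get(lang_group_for(cp), unicode_font)


if __name__ == "__main__":
    # U+C11C is the Hangul syllable from the question above.
    print(lang_group_for(0xC11C))
    print(pick_font(0xC11C, {}, "Bitstream Cyberbit"))
```

Keying the decision on the code point rather than on the document encoding is the point of the sketch; the table contents themselves are placeholders.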
As I wrote above, on Windows, Mozilla does more or less what you wrote above. Mozilla-X11core and Mozilla-Xft have different font selection mechanisms. Mozilla-Xft is strongly dependent on fontconfig, which usually gives much better results than the font selection mechanism of Mozilla-X11core, but that also makes it hard to fix bug 208479 mentioned above. In other words, are huge complete Unicode fonts like Bitstream Cyberbit or Arialuni (which I promise not to try to use again..) only used for filling in the gaps where there are no language-specific fonts available? There does not seem to be much point in having them, then? You can also configure Mozilla to use those pan-Unicode fonts (or fonts whose coverage is broad enough) for all langGroups you're interested in. Another question: does Mozilla consider 'Latin Extended A' characters like (o with macron) to be 'Western'? As I explained above, Mozilla-Win does, but in Mozilla-X11core and Mozilla-Xft, which character belongs to which langGroup is not a function of the Unicode code point (as it should be) but a function of the current document encoding and the value of 'lang/xml:lang'. Many Western fonts (like Times New Roman) have them and display them fine. But for instance Bitstream Vera Serif does not have them, and some other font (I don't know which) is substituted. Which rules are used for this substitution? Does mozilla look for them in *another* Western font, or does it look in the 'Unicode' font? Mozilla's font selection mechanism is so complex that I can't explain it in a few words (and it's also platform/toolkit dependent). In Mozilla-Xft, fonts for the 'Unicode' langGroup are mostly immaterial, IIRC (I have to look up the code). Mozilla-Xft searches for a font to render a character in the prioritized list of fonts returned by fontconfig. 
Therefore, what fontconfig returns in response to Mozilla's query (which usually specifies 'lang' and 'font family name' but NOT the characters to render) determines which font is used to render which character. Mozilla-X11core is a different story. Using the 20-year-old XLFD makes it very hard to do things right (if you take a look at nsFontMetricsGTK.cpp at http://lxr.mozilla.org, you'll see what I mean) and I guess the fonts specified for the 'Unicode' langGroup are referred to at a certain stage. Mozilla's international release notes is your friend although we didn't give gory details in the document. In Mozilla, goto ... Thanks very much for pointing this out. I had found out about the You're welcome :-) As regards to printing: I have (and have had for years) just 'lprng' and 'magicfilter' to print on my old Laserjet IIP. Also xprint works with that (as far as it works). Is there any point for me (or in general for users wanting a 100 % Unicode system) in switching to CUPS? I guess magicfilter should be fine, especially considering that you have a non-PS printer. CUPS is handy when you have a PS printer that's not quite up-to-date. Mozilla's FT2 printing
Re: Unicode fonts on Debian
On Sat, 13 Dec 2003, Jan Willem Stumpel wrote: Does anyone have a step-by-step description of how to install Bitstream Cyberbit in Debian Sid? And similarly for (MS) Arialuni? Well, you're not supposed to install MS Arial Unicode on Linux, at least in some countries. If you want to install a pan-Unicode font, you'd better install James Kass' Code2000 (BMP) and Code2001 (non-BMP). They're available at http://home.att.net/~jameskass. It'd be nice of you to pay him $5. He's done a great service by making his fonts available and deserves some monetary compensation, IMHO. Note that for good-quality rendering, you'd better get fonts made specifically for a subset of the Unicode repertoire instead of pan-Unicode fonts. Google 'alan wood unicode fonts' and you'll get Alan Wood's Unicode font site. For Latin, you definitely need to install the Bitstream Vera series (donated by Bitstream). If you're also interested in Greek and Cyrillic, the set of fonts made available by SIL (Gentium) is good to have. I am still puzzled on when exactly what font is used for display and for printing in the various Mozilla versions. Each time I think 'I got it' it turns out that 'I didn't get it'... Mozilla's international release notes are your friend although we didn't give gory details in the document. In Mozilla, go to 'Help' and 'Release Notes'. In the release notes web page, follow the link to 'international known issues'. Basically, there are two different versions of Mozilla for Linux and three different ways of printing. 1. X11core font build (with gtk or gtk2 widget): This is what's available by default at www.mozilla.org. It renders text using server-side X11core fonts, which can be bitmap (bdf), Speedo, type1, truetype, CID-keyed fonts, etc. However, all of them are 'presented' to clients (in this case, Mozilla) as a set of glyphs with a certain character-to-glyph mapping and metrics expressed in XLFD. 
1' The X11core font build can also take advantage of truetype fonts available on the client side if freetype is enabled (font.FreeType2.enable has to be set to 'true' in prefs.js). By default, it's enabled. You have to add directories with truetype fonts by editing prefs.js in your profile directory (usually, ~/.mozilla/${PROFILE_NAME}/${SALTED_NAME}/prefs.js). The preference entries for truetype fonts are font.directory.truetype.1, font.directory.truetype.2, and so forth (Mozilla looks only at the directories explicitly specified and does not look inside subdirectories.) Alternatively, you can add them in 'about:config' (type 'about:config' in the location bar). In addition, you have to specify the location of your freetype2 shared library. 2. Xft-based build (with gtk or gtk2 widget). This build takes advantage of the new client-side font libraries, Xft and fontconfig, which in turn rely on the freetype2 library. RedHat rpms available at ftp.mozilla.org are Xft + gtk2 builds. I guess you can install one of them on debian with alien or similar tools. Usually, this build gives faster and better rendering results, especially if you're interested in viewing non-Western European web pages. Now for printing. 1. Postscript printing module: this is the oldest. Some people regard this as totally broken and have demanded that it be removed. Western European users may not have much trouble, but if you go beyond that, it begins to show its limitations. Even for Western European text, its PS output is far from 'WYSIWYG'. That is, fonts used for screen rendering have nothing to do with fonts used in the print-out. It can be used with both builds listed above. 2. PS + freetype2: You have to enable both freetype (mentioned above) and freetype printing. This can be used with both kinds of builds. However, old rpms (Xft+gtk2 build) used to come with freetype disabled, but recent Xft+gtk2 builds at mozilla.org seem to have been built with freetype enabled. 
This gives a reasonable (not very faithful) WYSIWYG. It's not faithful because the font selection mechanism is different for printing and screen rendering. Combined with CUPS and other modern Linux print servers, this works rather well. 3. Xprint (http://xprint.mozdev.org). With this, Mozilla is an Xprint client (X11) to an Xprint server. You need to have an Xprint server running for Mozilla to talk to. The font selection mechanism is XLFD-based. Xprint (client-side) is enabled in the X11core build at mozilla.org, but is disabled in the Xft+gtk2 build. The Xprint server is available at http://xprint.mozdev.org. More can be found at the aforementioned international known issues page and links therein. Hope this helps, Jungshik
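For the truetype directory preferences described in item 1' above, the lines that end up in prefs.js can be generated mechanically. The sketch below only formats text. It assumes the usual user_pref("name", value); line syntax of prefs.js and the preference names given in this message; the directory paths and the helper name `truetype_prefs` are made up for illustration.

```python
def truetype_prefs(directories):
    """Build prefs.js lines enabling freetype and listing font directories.

    Each directory gets its own numbered font.directory.truetype.N entry,
    because subdirectories are not searched.
    """
    lines = ['user_pref("font.FreeType2.enable", true);']
    for n, directory in enumerate(directories, start=1):
        lines.append(
            'user_pref("font.directory.truetype.%d", "%s");' % (n, directory)
        )
    return lines


if __name__ == "__main__":
    # Hypothetical directories; substitute wherever your fonts actually live.
    for line in truetype_prefs(["/usr/share/fonts/truetype",
                                "/usr/local/share/fonts/ttf"]):
        print(line)
```

Edit prefs.js only while Mozilla is not running, since it rewrites the file on exit.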
file system conversion tool
Hi, I thought some of you might be interested in 'convmv', a file system encoding conversion utility I just came across. Most of you on this list are likely to have switched over to UTF-8 and written a script or two for the job. Nonetheless, it may be handy to have tools like this nearby so that you can help other 'skeptics' around you to 'convert' to UTF-8. http://osx.freshmeat.net/releases/144059/ convmv converts filenames (not file content), directories, and even whole filesystems to a different encoding. This comes in very handy if, for example, one switches from an 8-bit locale to a UTF-8 locale. It has some smart features: it automagically recognises if a file is already UTF-8 encoded (thus partly converted filesystems can be fully moved to UTF-8) and it also takes care of symlinks. Additionally, it is able to convert from normalization form C (UTF-8 NFC) to NFD and vice-versa. This is important for interoperability with Mac OS X, for example, which uses NFD, while Linux and most other Unixes use NFC. Though it is primarily written to convert from/to UTF-8, it can also be used with almost any other charset encoding. Note that this is a command line tool which requires at least Perl version 5.8.0. Jungshik
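The two ideas in that description (leave names alone if they are already valid UTF-8, and optionally switch between NFC and NFD) are easy to sketch. The following is not convmv's algorithm, just a minimal illustration of the same steps; `to_utf8_name` and its default legacy encoding are assumptions for the example, and it only transforms a byte string rather than renaming anything on disk.

```python
import unicodedata


def to_utf8_name(raw, legacy_encoding="iso-8859-1", form="NFC"):
    """Return raw (a filename as bytes) as UTF-8 in the given normalization form."""
    try:
        # Names that already decode as UTF-8 are taken as converted.
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the old 8-bit (or other legacy) encoding.
        name = raw.decode(legacy_encoding)
    return unicodedata.normalize(form, name).encode("utf-8")


if __name__ == "__main__":
    print(to_utf8_name(b"caf\xe9"))                   # Latin-1 input
    print(to_utf8_name(b"caf\xc3\xa9"))               # already UTF-8
    print(to_utf8_name(b"caf\xc3\xa9", form="NFD"))   # decomposed, as on Mac OS X
```

The 'already UTF-8' test is a heuristic: a short legacy-encoded name can happen to be valid UTF-8, so a real tool should ask for confirmation before renaming.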
Re: FYI: Some links about UTF-16
On Fri, 11 Jul 2003, Wu Yongwei wrote: S***, it seems I made a mistake. The font selection in Windows 2000 is not at all as flexible as Java; it's more like Linux. Just that the default font in the Simplified Chinese version is still Tahoma instead of Song Ti. Thanks for checking that out. You saved me some tinkering :-) Jungshik must be right that I could change the default font in locale zh_CN to make ASCII characters appear nicer. With Gtk2 and fontconfig, I don't have to tinker with the font configuration as much as before because it looks all right to me. As for CSS-style font list specification, the infrastructure is already in place (fontconfig), but the 'UI' part has some catching up to do. For instance, most GUI programs and window managers don't have a UI to let multiple fonts (an ordered list) be specified (although it's possible to do so by editing configuration files manually in _some_ cases.) The only problem is that the standard locale for Simplified Chinese in Red Hat 8.0 (which I use) is zh_CN.GB18030. I was told that it was possible to change that to zh_CN.UTF-8, but I did not find the motive/time to do that. It's rather easy. See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829. Regarding the 'A' APIs in Windows. Do you mean that there should be some API to change the interpretation of strings in 'A' APIs (esp. regarding file names, etc.)? If that were the case, the OS must speak Unicode in some form internally. Yes, that's what I meant. Beni already gave some details: win2k does have the option of switching the encoding used in the 'A' APIs; it's just global and requires a reboot. Yup, I frequently do that to test Mozilla under different locales. Having to reboot is really painful. On POSIX systems, we can just run a program under any supported locale at the command line. Under Win2k/XP, 'chcp' works inside a 'command prompt' (even setlocale() works), but I haven't checked whether there's a 'SetACP' (the opposite of 'GetACP'). 
remount the partition in an appropriate encoding; if it is on an EXT2/3 As you found out, there's a tool, or you can easily make one as many others have done. Once you switch to a UTF-8 locale, there's no need to look back. partition or on a CD-ROM, then I am out of luck. Maybe the mount tool should do something to handle this? :-) In the case of a CD-ROM, it's not much of an issue. See the mount(8) man page and the other man pages referred to there. Jungshik P.S. A word of caution. A lot of _text-mode_ programs still assume that a single octet takes a single screen 'cell', which holds for most legacy single byte and double byte encodings. This assumption breaks down for UTF-8 and three byte sequences of EUC-JP and four byte sequences of GB18030 (and eight byte sequences of EUC-KR). Some of them are modified to cope with two-byte UTF-8 sequences (U+0080 - U+07FF), but don't work with U+0800 and beyond. Needless to say, combining characters are not handled in those programs.
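The mismatch in the P.S. between bytes and screen cells is easy to see in a few lines. This is a rough sketch, not a replacement for a proper wcwidth(): `cell_width` is a hypothetical helper that treats combining marks as zero-width and East Asian Wide/Fullwidth characters as two cells, and ignores every other special case.

```python
import unicodedata


def utf8_len(ch):
    """Number of octets the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def cell_width(ch):
    """Very rough count of terminal cells a single character occupies."""
    if unicodedata.combining(ch):
        return 0  # combining marks stack on the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide CJK characters take two cells
    return 1


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u07ff", "\u0800", "\ud55c", "\u0301"):
        print("U+%04X bytes=%d cells=%d" % (ord(ch), utf8_len(ch), cell_width(ch)))
```

A program that advances the cursor by `utf8_len` instead of `cell_width` gets every line containing such characters wrong, which is the failure described above.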
Re: FYI: Some links about UTF-16
On Thu, 10 Jul 2003, Wu Yongwei wrote: Jungshik Shin wrote: I think it's not so much due to defects in programs as due to the lack of high-quality fonts. These days, most Linux distributions come with free truetype fonts for zh, ja, ko, th and other Asian scripts. However, the number and the quality of fonts for Linux desktop are still inferior to those for Windows. The problem is mainly not font itself, but font combination. I really cannot bear the display of ASCII characters in Song Ti, which is simply ugly (and fixed width). Why don't you specify a variable-width font as the system default? I understand you still don't like Latin glyphs in Chinese fonts. I hate Latin glyphs in Korean fonts, too. locale Linux seems to be able to do so, but in the Chinese locale all is in the Chinese font, which is not suitable at all for Latin characters. I don't think there's any difference between English and Chinese locales provided that you meant en_*.UTF-8 and zh_*.UTF-8. You may get an impression that it seems to work under en_US.UTF-8 because the 'system default font' for en_US.UTF-8 does not cover Chinese characters and the automatic font selection mechanism picks up a Chinese font for Chinese characters while using the default font for Latin letters. On the other hand, in zh*.UTF-8, the system default font covers Latin letters as well as Chinese characters so that both Latin/Chinese are rendered with the default font. A way to work around is to specify your favorite Latin font ahead of your Chinese font if CSS-style font list can be used. Beginning with Windows 2000, Windows could choose the font to use based on the Unicode range (Java does this too). In the English This is a good feature to have although CSS-style font list works most of time. Almost everything we need for this is already in place (fontconfig, pango). BTW, I haven't seen this available in Win2k. How can I do that? 
(not that I don't believe you but that I'm curious) I used a Windows Gtk application, which used Tahoma (a good sans serif font) at first. But after an upgrade it automatically chose to use the system default font, which is the Chinese Song Ti. It took me several hours to correct the ugly and corrupt (yes, because dialogue dimensions are different) display. Again, I haven't run Gtk programs under Win32, so I don't know how they select fonts. Do they use fontconfig? fontconfig can make a big difference. There seems little sense now arguing the virtues of UTF-8 and UTF-16. Technically they both have advantages and disadvantages. I suppose we If MS had decided to use UTF-8 (instead of coming up with a whole new set of APIs for UTF-16) with 'A' APIs, Mozilla developers' headache(and UTF-8/'A' APIs vs UTF-16/'W' APIs and there are many other things to consider in case of Win32. It seems impossible because there are so many legacy applications. On the Simplified Chinese versions of Windows, 'A' always implies GB2312/GBK. Switching ALL to UTF-8 seems too radical an idea about 1994. At the time Using 'A' APIs and UTF-8 does not mean that 'A' APIs are made to work ONLY with UTF-8. As you know well, 'A' APIs are basically APIs that deal with 'char *'. As such, in theory, they can be used for any single or multibyte encoding including Windows 932, 936, 949, 950 and 6(I forgot the codepage designation for UTF-8). As Unix (e.g. Solaris and AIX and to a lesser degree Linux) demonstrated, a single application (written to support multibyte encodings) can work well both under legacy-encoding-based locales and under UTF-8 locales. Microsoft adopted Unicode, people might truly believe UCS-2 is enough for most application, and Microsoft had not the file name compatibility burden in Unix Well, this is an orthogonal issue. The POSIX file system is so 'simple' (which is a virtue in some respects) that it doesn't have an inherent notion of 'codeset/encoding/charset'. 
However, Windows doesn't use POSIX file system and using 'A' APIs does NOT mean that they couldn't use VFAT or NTFS where filenames are in a form of Unicode. (I suppose you all know that the long file names in Windows are in UTF-16). Actually, VFAT documentation is so hard to come by that we can just speculate that it's UTF-16 (it could well be just UCS-2 in Windows 95) I would not blame Microsoft for this. I wouldn't either and I didn't mean to. I believe they weighted all pros and cons of different options and decided to go with their two-tiered API approach. In my previous message, I just gave a downside to that approach aggregating all other arguments into a single phrase 'there are many other things to consider.' Also consider the following fact: Windows 95 emerged at a time when many people had only 8MB of RAM. Yah, I don't think AT THAT TIME we could tolerate a 50% growth in memory occupation. Windows 95/98/ME are not Unicode-enabled in many senses while
Re: FYI: Some links about UTF-16
On Tue, 8 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote: On Tue, 8 July 2003 05:22, Wu Yongwei wrote: Is it true that Almost all modern software that supports Unicode, especially software that supports it well, does so using 16-bit Unicode internally: Windows and all Microsoft applications (Office etc.), Java, MacOS X and its applications, ECMAScript/JavaScript/JScript, Python, Rosette, ICU, C#, XML DOM, KDE/Qt, Opera, Mozilla/NetScape, OpenOffice/StarOffice, ... ? Do they support characters above U+FFFF as fully as others? For Python I know Yes. At least, I know for sure that Mozilla, MS IE and MS Office XP do. That does not make me a fan of UTF-16. You shouldn't assume that others don't do what you're not happy to deal with. The reason they use UTF-16 is NOT because it's inherently better than other UTFs (UTF-8, UTF-32) BUT because they (not all) began with UCS-2 and have a lot of baggage (written in UCS-2) to carry on. The prime example of this is the Win32 'W' APIs. The same is true of Java, ECMAScript (the transition is not yet complete in case of ECMAScript), and Mozilla. (see http://bugzilla.mozilla.org/show_bug.cgi?id=183156, for instance) As for applications written with UTF-8 as the internal string representation (asked for in another posting), there are lots of them. Basically, most gnome/gtk applications are, because glib and pango are based on UTF-8. Moreover, there's a well-known programming language whose internal character representation is UTF-8. It's Perl. Besides, judging from the fact that Sun's iconv(3) implementation uses UTF-8 as a hub (instead of UTF-32 as is the case of glibc's iconv(3)), many programs in Solaris must be heavy users of UTF-8. Jungshik
Re: FYI: Some links about UTF-16
On Tue, 8 Jul 2003, srintuar26 wrote: Is it true that Almost all modern software that supports Unicode, especially software that supports it well, does so using 16-bit Unicode internally: Windows and all Microsoft applications (Office etc.), Java, These decisions seem designed mostly to ease compatibility with Microsoft's OS. I agree. Or, for the lack of foresight... The Asian-language argument for UTF-16 seems mostly vacuous, and even if it were true it would be the lone Here again I agree. The worst case for UTF-8 versus UTF-16 (text made entirely of characters between U+0800 and U+FFFF) is 3:2. With characters below U+0800 (especially the US-ASCII range) mixed in, the ratio is even lower. For CJK Ext. B and C, UTF-8, UTF-16 and UTF-32 are all even at four bytes per character. Jungshik
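The size comparison above is easy to check by simply encoding some text. A small sketch using only Python's built-in codecs (`sizes` is just a helper name for this example; the little-endian variants are used so that no byte order mark is counted):

```python
def sizes(text):
    """Encoded length in bytes of text under UTF-8, UTF-16 and UTF-32."""
    return {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


if __name__ == "__main__":
    print(sizes("\ud55c\uae00"))   # two Hangul syllables, both in U+0800..U+FFFF
    print(sizes("abc"))            # US-ASCII only
    print(sizes("\U00020000"))     # first code point of CJK Extension B
```

The Hangul sample comes out at 6 bytes in UTF-8 against 4 in UTF-16, which is the 3:2 worst case; for ASCII the advantage flips to 1:2 in UTF-8's favour.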
Re: FYI: Some links about UTF-16
On Wed, 9 Jul 2003, Wu Yongwei wrote: (excluding the desktop, which I prefer KDE). But I did have some bad experience with Windows Gtk applications running on Chinese versions of Windows. Not for functionality, but for UI. You are right that they do care about Asian languages, but the problem seems that they may not have the hands to test on Asian language platforms. At least not on Simplified Chinese Windows. Not their fault, I must add. Ah, I cannot bear setting I have no experience with Windows Gtk, but it could well be due to the fact that Win32 APIs come in two flavors, 'A'(NSI) APIs and 'W' APIs. MS recommended a few different paths to support both pre-Unicode (ANSI-based) Windows (Win 9x/ME) and Unicode-based Windows (Win2k/XP). One of them is to use 'MSLU'(Microsoft Layer for Unicode?) with pure 'W' APIs (not using 'A' APIs at all). Mozilla developers once considered this approach, but gave it up because it led to a dilemma. To make Mozilla run under Win 9x/ME, Mozilla developers would have to tell Mozilla users to install MS IE 5.x or later (or MS Office or other programs that have a license to bundle the MSLU dll with themselves). Obviously, it doesn't make much sense to ask users to install its competitor before using it (needless to say, the reality is that virtually all MS Windows users have MS IE installed so that we don't have to worry...). There may be other reasons that the MSLU path was not taken that I don't know of. What Mozilla ended up doing is to write our own wrappers and function pointers for two dozen or so Win32 APIs that get pointed to either the A APIs or the W APIs according to the run-time detection of the OS (Win9x/ME vs Win2k/XP). Mozilla's transition to this is not yet complete (see http://bugzilla.mozilla.org/show_bug.cgi?id=162361 and http://www.mozilla.org/releases/mozilla1.4/known-issues-int.html) It's likely that Win32 Gtk is still dependent on 'A'NSI APIs. However, this is pure speculation and could well be completely wrong. 
Linux locale to Chinese, which makes the desktop too ugly to me. Rationale: The good intent of Open Source developers may not result in understanding the requirements of Asian users owing to lack of native developers/testers/users. That's a bit strange. My desktop under ko_KR.UTF-8 locale is not so bad. Anyway, it's not yet as pretty as that of Win32. I think it's not so much due to defects in programs as due to the lack of high-quality fonts. These days, most Linux distributions come with free truetype fonts for zh, ja, ko, th and other Asian scripts. However, the number and the quality of fonts for Linux desktop are still inferior to those for Windows. There seems little sense now arguing the virtues of UTF-8 and UTF-16. Technically they both have advantages and disadvantages. I suppose we have presented enough of them in this discussion. Let me just add my last comment... If MS had decided to use UTF-8 (instead of coming up with a whole new set of APIs for UTF-16) with 'A' APIs, Mozilla developers' headache(and that of other opensource developers) mentioned above would have been a lot easier to cure :-) Of course, this is just one aspect of UTF-8/'A' APIs vs UTF-16/'W' APIs and there are many other things to consider in case of Win32. Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Strings in a programming language
On Mon, 7 Jul 2003, Wu Yongwei wrote: I wonder, how many people really want to use Unicode codepoints beyond U+FFFF? I don't want to make it incorrect by design just because cases it doesn't handle are rare. It's unnecessary to handle ALL cases. You could address only issues encountered/expected by your end users. IMHO, it is more important to make an application be light-weight and run in 99% cases. Or, you may find your language used by, say, 1 people, and none uses the extra features that you spend 40% of your development labour. And it is As you wrote, one can do what one believes in. Anyway, correctly handling non-BMP characters is not that difficult (40% of your development time for a 1% constituency seems to me too big an exaggeration :-) I know you're just making your case clear...). Moreover, with Math characters in plane 1 and MathML more widely used, it'd not be so rare to find people who want to use non-BMP characters. Jungshik
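To give a feel for how little is involved, here is a sketch of the core step a UTF-16 based string type needs in order to see non-BMP characters: combining a high/low surrogate pair into one code point. `code_points` is a hypothetical helper, and passing unpaired surrogates through unchanged is just one possible policy.

```python
def code_points(units):
    """Convert a sequence of UTF-16 code units into a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # A valid pair: 10 bits from each surrogate, offset by 0x10000.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through as-is.
            out.append(u)
            i += 1
    return out


if __name__ == "__main__":
    # 'A' followed by U+1D400 MATHEMATICAL BOLD CAPITAL A (a plane 1 character).
    print([hex(cp) for cp in code_points([0x0041, 0xD835, 0xDC00])])
```

Everything that indexes, counts or truncates strings has to go through a step like this, which is where the real (but modest) cost lies.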
Re: diacritic marks for Latin alphabet (Re: supporting XIM)
Edward Cherlin wrote: On Monday 31 March 2003 10:05 pm, Jungshik Shin wrote: Let's try some more. aeiounx I'm pleased that the accents are still there after four levels of replies. That's because all three of us (Gaspar, you and I) do what we preach, namely, using UTF-8 in our everyday computing :-) Not too bad, except that only the first three accents on each letter are actually displayed, and the dot on the i isn't removed. Hmm, I can see only two diacritics in Kwrite with Code2000 Yes, I get only two visible diacritics with Code2000. I think Code2000 has some (maybe not so comprehensive) OT layout tables for Latin letters. I'm copying this to its author, James Kass. font. I found that you appended as many as five of them to each character in your sample. What font did you use? Nonetheless, it's a pleasant surprise that Kwrite does more than simple overstriking. kwrite 4.0 kde 3.0.3 Arial Unicode MS (licensed copy) shows 3 diacritics Can you check your font with VOLT (www.microsoft.com/typography) as to whether it has OT layout tables for Latin letters? You need to apply to join the OT developer group to get a copy. It seems to be the only tool available for editing OT layout tables. I hope pfaedit will offer the feature soon. kmail 1.4.3 Courier [Adobe] 3 diacritics displayed Courier? Hmm. How about 'Courier' in kwrite? So, are multiple diacritics stacked over each other taking *disjoint* spaces instead of overlapping one another at the same spot? Anyway, now I'm wondering what Qt/KDE use for rendering. Do they use pango (it couldn't be, because Pango doesn't support OT layout tables for Latin yet, although simple overstriking is supported) or do they have their own complex script rendering library? Jungshik
Re: Mozilla Rendering (was Re: gtk2 + japanese; gnome2 and keyboardlayouts)
Edward Cherlin wrote: On Tuesday 01 April 2003 08:02 am, Edward H Trager wrote: Can Jungshik or someone else please clarify for me what Mozilla 1.3 currently uses for complex script rendering? I'm seeing differences in rendering of Thai on Linux (horrible) vs. in Windows (OK) in Mozilla 1.3. Uniscribe on Windows. It supports Thai. Well, I guess even on Windows, Mozilla does not make use of Uniscribe (at least it doesn't explicitly as far as I know) and intelligent fonts with opentype layout tables. Actually, I'm not sure. I asked about this a couple of times, but got no answer. I don't know what it uses on Linux, but it uses something that doesn't support Thai properly, It sorta does if you compile it with the CTL (complex text layout) feature turned on. Mozilla source code includes a 'miniature version' of Pango for rendering a couple of Indic scripts and Thai (contributed by Sun). However, that's only for the 'plain gtk' build of Mozilla (not using Xft but old X11 core fonts). A similar 'hack' (but not depending on Pango) should be possible for the Xft build of Mozilla when bug 176290 is resolved (http://bugzilla.mozilla.org/show_bug.cgi?id=176290) This is the point about building text rendering into the system. Applications cannot have their own rendering engines in general. So whatever the system renderer supports is the best you can expect in most software (if that). I fully agree with you. The problem with the current Mozilla is that it seems rather hard to write a bridge to Pango (although I have a couple of 'vague' ideas as to how to do it and I'm sure genuine gurus of Mozilla have their own better ideas as well.) Besides, I believe the Mozilla-Graphite 'marriage' should serve as a good model for a Mozilla-Pango couple. Jungshik P.S. BTW, Thai can get rendered 'automagically' (well, not as well as Thai people would expect) if you have fonts designed for simple overstriking, with zero/negative advance for combining characters. 
Re: gtk2 + japanese; gnome2 and keyboard layouts
Edward Cherlin wrote: On Monday 31 March 2003 10:40 pm, Jungshik Shin wrote: Edward Cherlin wrote: Have you looked at SILA? It uses SIL Graphite as the renderer for Mozilla. http://sila.mozdev.org/ Yup. I'm aware of it. At least for now it's only for Windows, though. However, we may get some valuable insights from the project that can be applicable to a 'Mozilla-pango' marriage. I mean the part of the project that says they want to do a Linux port of Graphite, and thus of SILA, but not much is going on with it. A couple of issues: I guess OpenGraphite for Linux is not yet ready for prime time while Pango is mature. SILA currently uses MS COM instead of xpcom. To make SILA for Linux, MS COM needs to be replaced by xpcom. We'll see which one gets there first, OpenGraphite or Pango. Jungshik
Re: diacritic marks for Latin alphabet (Re: supporting XIM)
Pablo Saratxaga wrote: The only latin-script based languages I know that use some accentuated letters not existing in precomposed form in unicode are Guarani (it uses g with tilde) and Chechen (it uses several letters with a dot above, some exist in precomposed, but others don't). There may be others, but I only know about those two. I think orthographies of some African languages also need Latin letters with diacritics for which Unicode/ISO 10646 has never assigned and will never assign precomposed forms. And, if we consider Old and Middle European languages, there are even more. Needless to say, IPA (although not a language) is a very 'fertile' source of a number of accented letters. (I believe there are some IPA letters linguists want to use that are not given separate codepoints.) I didn't and wouldn't count math symbols here although there are a lot of them with a Latin letter as base char. Jungshik
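To make the point about missing precomposed forms concrete, here is a minimal Python sketch. It assumes only the standard `unicodedata` module; the helper name `has_precomposed_form` is made up for illustration. It checks whether NFC can fold a base letter plus combining marks into a single code point.

```python
import unicodedata

def has_precomposed_form(base: str, marks: str) -> bool:
    """Return True if NFC folds base + combining marks into one code point."""
    return len(unicodedata.normalize("NFC", base + marks)) == 1

# e + COMBINING ACUTE ACCENT (U+0301) composes to a single code point.
print(has_precomposed_form("e", "\u0301"))
# g + COMBINING TILDE (U+0303), as used in Guarani, stays a two-code-point sequence.
print(has_precomposed_form("g", "\u0303"))
```

Any text that needs the second kind of letter depends on the renderer handling combining sequences properly, which is the subject of this thread.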
Re: opentype
srintuar26 wrote: (For the sake of argument, if all precomposed glyphs were abolished, leaving NFC==NFD, then how would we store composition specializations inside fonts...) You have to distinguish between characters and glyphs here. The number of Unicode characters representable with a font is different from the number of glyphs in the font. Because, as you wrote, diacritic marks for Latin/Greek/Cyrillic and other combining characters take different shapes and different positions depending on where they're used. The same is true of base characters. The shape of a base char. differs depending on whether it's used alone or combined with combining characters, and on how many and which combining characters it combines with. In modern intelligent fonts like OpenType fonts, char to glyph mapping is not 1 to 1 but m to n where m and n are both >= 1. The way this m to n mapping is stored in fonts and accessed by rendering/layout engines varies. (There's even a proposal to add this intelligence to old X11 BDF.) OpenType fonts have layout tables like GSUB and GPOS that have to be accessed and activated by rendering engines like Uniscribe and Pango. The amount of intelligence embedded in OpenType fonts is smaller than that in AAT (Apple's intelligent font format), in that with the former Uniscribe and Pango have to do more work than is necessary with AAT fonts. Graphite is another font format (? it uses the OpenType format, but its layout tables are different from GSUB/GPOS and so forth used by Pango/Uniscribe) and rendering library pair. For details, see http://www.microsoft.com/typography http://developers.apple.com/fonts http://www.pango.org http://graphite.sil.org and Adobe's page. Jungshik
a patch for vim to add 'cjkw' option for CJK users with CJK monospacefonts
Hi, Attached is my patch to add a 'cjkw(idth)' option to toggle CJK width handling. When turned on, characters with East Asian width class of 'A'(mbiguous) (see UTR #1? 'East Asian Width') are treated as having the cell width of 2 instead of 1. The default is off (because the characters affected had better be treated as having the cell width of '1' 'typography-wise') and it's only effective when the fileencoding is UTF-8. This option is necessary because in the GUI mode (and in a terminal where a CJK font is used or a similar option is turned on, e.g. xterm with the 'cjk-width' option), many East Asian people (CJK) use CJK fonts which have fullwidth (cell width of 2) glyphs for characters with EA Width class 'A'. With this patch and 'cjkw' turned on, there's no more inconsistency between the width of glyphs for characters like Euro, registered sign, copyright sign in those fonts and that perceived by vim. FYI, xterm has a similar option 'cjk-width'. Like xterm, my patch uses Markus Kuhn's EA width 'A' character table automatically generated from Unicode 3.2. When Unicode 4.0 is finalized, the table has to be updated. It'd be nice if the patch can get in soon. Jungshik
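For readers who want to see what the option changes, here is a rough Python sketch of the width rule described above. It is not the patch itself (the patch is C code inside vim using Markus Kuhn's table); it assumes Python's `unicodedata` module as the source of East Asian Width classes, and the name `cell_width` and its `cjk_width` flag are illustrative only.

```python
import unicodedata

def cell_width(ch: str, cjk_width: bool = False) -> int:
    """Terminal cells taken by one character (simplified wcwidth-style rule)."""
    if unicodedata.combining(ch):
        return 0                      # combining marks occupy no cell of their own
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2                      # wide and fullwidth characters
    if eaw == "A" and cjk_width:
        return 2                      # ambiguous class: wide only when the option is on
    return 1

for ch in ("a", "\u00ae", "\u20ac", "\ud55c"):
    print(f"U+{ord(ch):04X}", cell_width(ch), cell_width(ch, cjk_width=True))
```

With the flag off, the registered sign and the Euro sign take one cell; with it on they take two, which matches what a CJK font draws.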
Re: alias in fontconfig (Re: supporting XIM)
Tomohiro KUBOTA wrote: - Xmms cannot display non-8bit languages (music titles and so on). Are you sure? It CAN display Chinese/Japanese/ Korean id3 v1 tag as long as the codeset of the current locale is the codeset used in ID3 v1 tag. I'll test this further. However, please note I won't be satisfied by i18n which require specific configuration other than setting LANG variable (and installing required softwares and resources). xmms does NOT take anything more than setting LANG. The reason I used LC_ALL in my example is because that's the only sure way to set the locale. If I use LANG, it can get shadowed by LC_ALL and LC_*. LC_ALL overrides LC_* and LANG. Other complications are not the fault of xmms but that of ID3 v1 tag that does not have any mechanism for specifying the encoding. ID3 v2 should solve this problem by using Unicode, but not many programs support it. (I doubt many portable mp3 players support it) I want such alias to be automated. If I have one Korean font installed, it is obvious that renderer must use the font for all Korean texts. It is not a good idea that the renderer fail to display Korean when the user doesn't configure the alias. fontconfig always returns a font if there's a font on the system with the character requested. So, it's possible now. - There are no lightweight web browser like dillo which is i18n-ed. I think that w3m-m17n is an excellent lightweight browser that supports I18N well. Well, I meant a lightweight GUI browser. Though I haven't checked, It's sorta gui browser. It supports image rendering and mouse. You can also compile it with JS interpreter .BTW, how about Phoenix(www.mozilla.org/projects/phoenix) and Galeon ? There is another i18n extension of w3m: w3mmee. I don't know which is better. I'm aware of that. I just wish either of them (or a combination of two) to be included in w3m. - FreeType mode of XFree86 Xterm doesn't support doublewidth characters. Well, it sort of does. 
Anyway, I submitted a patch to Thomas and I expect he'll apply it sooner or later. After that, I'll add '-faw' option (similar to '-fw' option). Fantastic! May I want more? Xterm can automatically search a good (corresponding) doublewidth font in non-FreeType mode. How about your patch? I'm not sure whether I can. We'll see.
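The precedence mentioned in this message (LC_ALL overrides LC_* and LANG) can be sketched in a few lines of Python. This is only an illustration of the lookup order; the function name `effective_locale` is invented here, and the fallback to "C" when nothing is set is an assumption about the usual default.

```python
def effective_locale(env: dict, category: str = "LC_CTYPE") -> str:
    """Pick the locale name a program would use for one category."""
    # Lookup order: LC_ALL first, then the specific category, then LANG.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:                     # unset or empty variables are skipped
            return value
    return "C"                        # assumed default when nothing is set

print(effective_locale({"LANG": "ko_KR"}))
print(effective_locale({"LANG": "ko_KR", "LC_CTYPE": "ja_JP"}))
print(effective_locale({"LANG": "ko_KR", "LC_CTYPE": "ja_JP", "LC_ALL": "ru_RU"}))
```

That is why LC_ALL was used in the xmms example: a stray LC_CTYPE or LC_ALL in the environment would silently shadow LANG.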
Re: supporting XIM
Jungshik Shin wrote: Edward Cherlin wrote: The starting point of this discussion was the inability to use Chinese, Korean, and Japanese IMEs in the same locale. I write documents in all three languages, and I would do it more often if it were actually convenient. This is becoming rather frustrating. How many times do I have to write that it IS possible right now to install all of them and switch between them in a *single* application (session) running under any UTF-8 locale of your choice? Why don't you try installing I'm sorry I somehow didn't realize (how couldn't I? I don't know...) that you wrote the above probably because I had written that everything you need for CJK input comes by default with modern Linux distros, which is not true, and that you don't need a HOWTO. Certainly, it's not well known that it's possible to switch between multiple gtk2 input modules (including those for CJK) and it'd be nice to have a well-written summary of the issue with pointers to various gtk2 input modules. It also would be nice for major Linux distributions to include them. Jungshik
Re: alias in fontconfig (Re: supporting XIM)
On Mon, 31 Mar 2003, Edward Cherlin wrote: On Monday 31 March 2003 04:31 pm, Jungshik Shin wrote: Tomohiro KUBOTA wrote: I want such alias to be automated. If I have one Korean font installed, it is obvious that renderer must use the font for all Korean texts. It is not a good idea that the renderer fail to display Korean when the user doesn't configure the alias. fontconfig always returns a font if there's a font on the system with the character requested. So, it's possible now. Doing it one character at a time is guaranteed to give hideous results. I have had the unfortunate experience of viewing a display in mixed CJK fonts, and I have had many similar Well, it depends on what kinds of fonts you have on your system and the way you specify fonts you want to use. I'm well aware of 'ransom note'-like results when you mix up fonts of many *different* styles and design principles in a single run of text. This problem can be minimized if you are careful in putting together fonts of similar styles and design principles. Anyway, if someone finds it difficult to edit the fonts.conf file and doesn't want to install a minimal set of well-populated fonts (sans, serif, monospace, etc), but still wants as many characters as possible to be rendered, a ransom note is what she deserves to get. unfortunate experiences of viewing APL code rendered in random math fonts. It is extremely important to a lot of people that they be able to specify a font *per language*, without regard to Well, *per-language* is not a cure-all although on many occasions, it's sufficient. the definition of Unicode blocks or old-time code pages or ISO-8859-* or any other 8-bit font hack. But we want to do it We don't live in that world any more largely thanks to fontconfig, Xft and Pango. The age of X11 core fonts and the XLFD hack has gone for good. 
There is, of course, the question of defining the character repertoire and rendering rules for a language (which may differ substantially from the rules for another language written in the same script). To get started, it will suffice if I can say that the set of characters in one font that I designate defines the repertoire for my use of the language. When we have adequate support for more intelligent fonts, we can build in some of the rendering rules, also, but in the end language-specific document creation will be the job of applications well above the text In case of html, 'lang' does the job and Mozilla supports it pretty well. Unfortunately, 'xml:lang' is not yet supported. editor level. At some point, explicit repertoire lists will be needed, I suppose. Or something else we haven't thought of yet. Care to take a look at http://fontconfig.org ? It includes lang-dependent repertoire lists for most, if not all, of the languages listed in ISO 639 (or is it ISO 30xx?)? Jungshik
fontconfig, alias/pseudo-fonts, Xft (was...Re: supporting XIM)
Mike FABIAN wrote:

Pablo Saratxaga [EMAIL PROTECTED] wrote:

Also, Xft allows to define "virtual fonts" created from a list of other fonts; "Sans", "Serif" and "Monospace" come in standard.

~/.fonts.conf

I guess Pablo meant something like the following, but this doesn't work the way he (and I) wrote it would if only Xft APIs are used (see below). For instance, 'monospace' is a 'virtual' font defined as

  <alias>
    <family>monospace</family>
    <prefer>
      <family>Luxi Mono</family>
      <family>Nimbus Mono L</family>
      <family>Kochi Gothic</family>
      <family>ZYSong18030</family>
      <family>AR PL SungtiL GB</family>
      <family>AR PL Mingti2L Big5</family>
      <family>Gulimche</family>
      <family>Andale Mono</family>
      <family>Courier New</family>
    </prefer>
  </alias>

and define some pseudo-fonts you want.

How does that work? I didn't know that it is possible to define "virtual fonts" from a list of other fonts using fontconfig/Xft2.

But I don't yet know a *simple* way to achieve that by using only Xft2.

When using something like

  xft_font = XftFontOpenPattern(dpy, pattern);

I guess you have to call fontconfig APIs (e.g. FcFontSort) directly and do a manual break-up of your input text into multiple pieces, each to be rendered by one of the fonts returned (by FcFontSort) depending on their coverage. And, you know this *complex* way, don't you?

I always got exactly one font. Are you saying that it is possible to use more than one font with a single call to XftFontOpenPattern() by doing some setup in ~/.fonts.conf?

I think Pablo mistook what fontconfig does for what Xft does unless I'm missing something Pablo knows. I also plead guilty of making a similar mistake when I wrote about working around a hard-coded font name in a window manager and a theme (e.g. Courier).

Jungshik
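The 'complex way' described above (take a sorted font list, then break the text into pieces by coverage) can be sketched as follows. This is a toy model in Python, not fontconfig or Xft code: the list of fonts stands in for whatever FcFontSort would return, and the coverage sets are invented for illustration; a real program would query each font's charset instead.

```python
from typing import List, Optional, Set, Tuple

def split_into_font_runs(
    text: str, fonts: List[Tuple[str, Set[str]]]
) -> List[Tuple[Optional[str], str]]:
    """Break text into runs, each drawn with the first font that covers it."""
    runs: List[Tuple[Optional[str], str]] = []
    for ch in text:
        # First font in preference order whose coverage contains the character.
        font = next((name for name, coverage in fonts if ch in coverage), None)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)   # extend the current run
        else:
            runs.append((font, ch))               # start a new run
    return runs

# Hypothetical coverage data, in the preference order of the alias above.
fonts = [
    ("Luxi Mono", set("abcdefghijklmnopqrstuvwxyz ")),
    ("Gulimche", set("\ud55c\uae00")),
]
print(split_into_font_runs("ab \ud55c\uae00c", fonts))
```

Each run would then be drawn with its own XftFont, which is the part a single XftFontOpenPattern() call does not do for you.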
diacritic marks for Latin alphabet (Re: supporting XIM)
Edward Cherlin wrote: On Monday 31 March 2003 06:38 am, Gaspar Sinai wrote: On Sun, 30 Mar 2003, Edward Cherlin wrote: Let's try some more. aeiounx Not too bad, except that only the first three accents on each letter are actually displayed, and the dot on the i isn't removed. Hmm, I can see only two diacritics in Kwrite with the Code2000 font. I found that you appended as many as five of them to each character in your sample. What font did you use? Nonetheless, it's a pleasant surprise that Kwrite does more than simple overstriking. What do you see in your mail? Yudit currently supports Mark-To-Base and Mark-To-Mark (2.7.5.beta10) OpenType GPOS and it uses GSUB only for Indic scripts, ligatures and shaping. Reasonable Tibetan (almost ready) also needs all of these complexities. If there is an urgent need for this in other scripts I can take a look at it. Not in Latin-alphabet text generally. Writing systems that have such needs include Vietnamese, IPA, Math, Polytonic Greek, Does Vietnamese need diacritic marks? Sure, it does, but I think all the letters it needs are encoded as precomposed so that they don't need any special treatment other than the conversion between NFC and NFD. Indic and South Asian are much higher priority than multiply accented Latin for mathematicians. That's why Indic scripts are rather well supported in Yudit now :-) Is it possible to define all the combinations in GPOS and GSUB tables in the font at all? It seems like this is where AAT fonts with a state machine are superior to OpenType fonts.
Re: gtk2 + japanese; gnome2 and keyboard layouts
Edward Cherlin wrote: On Sunday 30 March 2003 11:25 pm, Jungshik Shin wrote: I'm also gonna explore if it's easier to wed 'pango' with Mozilla if gtk2 instead of gtk is used. That would dramatically improve complex script handling of Mozilla if possible. Have you looked at SILA? It uses SIL Graphite as the renderer for Mozilla. http://sila.mozdev.org/ Yup. I'm aware of it. At least for now it's only for Windows, though. However, we may get some valuable insights from the project that can be applicable to a 'Mozilla-Pango' marriage. Jungshik
Re: supporting XIM
On Sat, 29 Mar 2003, Edward Cherlin wrote: applications explicitly at present, and automatic support for Cyrillic, Greek, Armenian, or Hindi doesn't help Japanese users much. Automatic support for Hindi? Hmm, do I live in a world different from yours? It's NOT CJ(K) BUT Hindi, Tibetan, Arabic, Hebrew, Bengali, pre-1933 Korean, Polytonic Greek (and Latin/Cyrillic with diacritic marks for which combining characters are necessary) and other complex scripts that have the largest wish list. Pango has support for some Indic scripts and the Thai script, but it doesn't yet support layout of Greek/Cyrillic/Latin with OpenType layout tables. out a way to funnel IME input through the normal character input calls, we might well achieve CJK support in the majority of apps. Well, right now, the majority of programs in modern Linux distros DO work well with CJK IMEs. In the case of gtk2 applications, they also work well with any gtk2 input modules including those for CJK. Of course, this doesn't mean that there's very little to do when it comes to CJ(K) support, but I don't share Kubota-san's concern. Jungshik
Re: supporting XIM
Tomohiro KUBOTA wrote: - a word processor whose menus and messages are translated into your native language but cannot input/display text in your native language - a word processor whose menus and messages are in English but can input/display/print text in your native language Which is better? The first one is completely unusable and the second one is inconvenient but usable. I agree with you on this point. That's why I compared the status of KDE in 1999-2000 with that in 2003. Back in 1999-2000, KDE/Qt people thought that translating messages is I18N, but they don't any more and KDE/Qt supports 'genuine I18N' much better now. Now brief list of examples. - Xmms cannot display non-8bit languages (music titles and so on). Are you sure? It CAN display Chinese/Japanese/Korean id3 v1 tags as long as the codeset of the current locale is the codeset used in the ID3 v1 tag. The problem with mp3 and the id3 v1 tag is that the id3 v1 tag doesn't have any means of labelling the codeset used in the tag. Therefore, you can't view Russian id3 v1 tags (in KOI8-R) and Korean id3 v1 tags in EUC-KR in a *single* xmms session. To work around this, there are three ways (we discussed this issue a couple of months ago on this list):

1. Convert all id3 v1 tags in your mp3 collection to UTF-8.
2. Give up the idea and launch two separate xmms under two different locales:

   % LC_ALL=ru_RU xmms
   % LC_ALL=ko_KR xmms

- Xft/Xft2-based softwares cannot display Japanese and Korean at the same time while Xft and Xft2 are UTF-8-based, because there are no fonts which contain both of Japanese and Korean. This should not be regarded as a font-side problem, because (1) font-style principle is different among scripts (there are no courier font for Japanese) You can use 'alias' in fontconfig if some programs use 'Courier' or 'Arial' instead of generic font names like 'monospace', 'serif', 'sans-serif', and so forth. and (2) such fonts need developers who can design letters all over the world. 
Pango's approach (changing font according to script) is needed. Well, if Xft2 is used along with fontconfig, there's no such problem. - There are many window managers which support themes. Even if the window manager itself is already i18n-ed, some themes cannot display non-Latin-1 languages. This occurs in two cases: (1) when the theme specifies a font name (it is very likely) or (2) when the theme supplies an original font. In the first case, you can work around the problem rather easily with the 'alias' mechanism in fontconfig. - There are no lightweight web browser like dillo which is i18n-ed. I think that w3m-m17n is an excellent lightweight browser that supports I18N well. - FreeType mode of XFree86 Xterm doesn't support doublewidth characters. Well, it sort of does. Anyway, I submitted a patch to Thomas and I expect he'll apply it sooner or later. After that, I'll add a '-faw' option (similar to the '-fw' option). - Ghostscript. It is known that it can handle Japanese by some trick (by localized version?) but it is too complex and difficult for me. It's not that hard. Most changes made by the gs-cjk project have been folded back to the upstream gs. Moreover, modern Linux distros now come with ghostscript with all the 'hard' jobs (configurations) already done for you and you don't have much to do. - Even OpenOffice.org 1.0 cannot display Japanese even with Japanese add-on package. I have to configure some font substitution. Note that this can be done only after installation, thus I cannot read (translated) messages during installation at all. OpenOffice seems to have a serious problem when run under a UTF-8 locale. Under locales with legacy codesets, it more or less works, but the Unix/X11 version appears to need an overhaul with a new client-side font framework (fontconfig, Xft, pango). Its use of the old server-side font technology makes it slow and ugly. - Curses-based softwares. They must not assume number of bytes is same as number of columns or number of characters. 
Doublewidth and combining character support is needed. As I mentioned already, this is where we need a lot of work. There are a few programs that work well, though, when linked against ncursesw. One prominent example is mutt. - Perl doesn't have wcwidth(). Well, there are a couple of Perl packages that let you query various Unicode character properties so that it should be trivial to write your own wcwidth() if somebody hasn't done it already. - Text line wrapping. Chinese and Japanese (not Korean) don't use whitespace between words. I already mentioned this issue. Programs like 'fmt' have to be modified, but there's already an alternative to 'fmt' that supports the Unicode linebreaking algorithm. I feel that CJK people everytime have to keep a watch on softwares which are already i18n-ed, because i18n support of such softwares
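To illustrate the curses point above (bytes, characters and columns are three different counts), here is a small Python sketch. It assumes the standard `unicodedata` module and a simplified width rule (combining marks 0, East Asian wide/fullwidth 2, everything else 1); `display_columns` is a name chosen for this example, not a library function.

```python
import unicodedata

def display_columns(text: str) -> int:
    """Count terminal columns under a simplified wcwidth-like rule."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                  # combining marks add no column
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        columns += 2 if wide else 1
    return columns

# One Hangul syllable, then 'e' followed by COMBINING ACUTE ACCENT.
sample = "\ud55ce\u0301"
print(len(sample.encode("utf-8")))   # number of bytes in UTF-8
print(len(sample))                   # number of characters (code points)
print(display_columns(sample))       # number of screen columns
```

A program that pads or truncates by byte count or by character count will misalign this string; only the column count matches what the terminal shows.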
Re: supporting XIM
Tomohiro KUBOTA wrote: Perhaps not double-width, but there are plenty of non-ASCII, non-ISO-8859-1 characters in the Unicode set that should be interesting to U.S. programmers. This is a good information. I hope such people will hard-code UTF-8 support up to two bytes. Though I didn't find such softwares, I heard there are such softwares. We have to continue keeping watch on i18n implementation of softwares How about em-dash or ligatures such as fi or ffl? Are they doublewidth? Em-dash is a valid example, but 'fi/ffl' are NOT. Ligatures should not be 'hardcoded' by those who edit documents, but have to be automatically 'summoned' at the rendering layer. Anyway, other examples include the Euro sign, genuine opening quotation marks and many more that have been mentioned several times by Markus Kuhn on this list before.
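As a quick check on the examples named above, this Python sketch prints how many bytes each character takes in UTF-8. The helper name `utf8_length` is made up; the byte counts follow from the UTF-8 encoding ranges (U+0080 to U+07FF take two bytes, U+0800 to U+FFFF take three).

```python
def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

examples = [
    ("e with acute", "\u00e9"),
    ("em dash", "\u2014"),
    ("Euro sign", "\u20ac"),
    ("left double quotation mark", "\u201c"),
]
for name, ch in examples:
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} bytes")
```

So software that handles UTF-8 only up to two bytes would already break on the em-dash, the Euro sign and typographic quotation marks, all of which matter to English-only users too.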
Re: I18nized apps (was Re: supporting XIM)
Edward Cherlin wrote: Nadine Kano wrote one, published by Microsoft, which is unfortunately very much out of date and out of print. I know of Well, the book is not just outdated but has some critical errors/mistakes and Microsoft-centrism(that doesn't work well for POSIX system) along with useful information. BTW, I believe MS press released an update to the book recently. Perhaps some of us should get together and pitch the idea to O'Reilly. Certainly a HOWTO is in order. Although it's not exactly the kind you're looking for, CJKV Information Processing would be a useful reference for I18N engineers.
Re: Pango tutorial? (Re: supporting XIM)
Tomohiro KUBOTA wrote: Unfortunately, there are no tutorials for Pango. A developer of Xplanet and I sent mails to a Pango developers (Evan Martin and Noah Levitt) to ask that but they think Pango is not intended to be used from applications Owen Taylor is 'the' Pango developer, isn't he?
Re: supporting XIM
Glenn Maynard wrote: programmers in X care more about X support than Windows support (which is very annoying to Windows users, who often end up with old, buggy ports of X software when they get them at all). off-topic: This is one of many reasons the scientific community (astronomy/astrophysics for instance) was one of the earliest groups that quickly embraced Linux. Their main toolsets are all written for X11 and their Windows/MacOS ports were buggy and outdated, but porting them to Linux is a lot easier. This is actually one advantage of NFD: it makes combining support much more important. (At least, it's an advantage from this perspective; those who would have to implement combining who wouldn't otherwise probably wouldn't see it that way.) Another advantage of NFD is the consistency. In NFC, some characters with diacritic marks are represented as precomposed while others are represented with base character + diacritics. In NFD, all characters are represented the same way except for some Korean Hangul Jamos due to 'the' very stupid mistake of the South Korean standards body that requested the removal of the decomposition of cluster Jamos into sequences of simple/basic Jamos. (Overall, Korean script handling in Unicode/10646 is among the worst.) By the way, I just gave lv a try: apt-get installed it, used it on a UTF-8 textfile containing Japanese, and I'm seeing garbage. It looks like it's stripping off the high bits of each byte and printing it as ASCII. I had to play around with switches to get it to display; apparently it ignores the locale. Very poor. Less, on the other hand, displays it without having to play games. It has some problems with double-width characters, unfortunately. Actually, with Owen Taylor's patch posted here about a year and a half ago(?), 'less' works pretty well in UTF-8 under UTF-8 xterm. Jungshik Shin
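The Hangul exception mentioned above can be observed directly. The following Python sketch assumes only the standard `unicodedata` module; `code_points` is a helper invented here for printing. It shows a precomposed syllable decomposing into three conjoining Jamos under NFD, while a cluster Jamo is left as it is.

```python
import unicodedata

def code_points(text: str) -> list:
    """Format each character of text as a U+XXXX string."""
    return [f"U+{ord(ch):04X}" for ch in text]

# A precomposed Hangul syllable (U+D55C) splits into three simple Jamos.
syllable = "\ud55c"
print(code_points(unicodedata.normalize("NFD", syllable)))

# A cluster Jamo (U+1101 HANGUL CHOSEONG SSANGKIYEOK) has no decomposition
# into two U+1100 Jamos, so NFD returns it unchanged.
cluster = "\u1101"
print(unicodedata.normalize("NFD", cluster) == cluster)
```

The first result is the consistent part of NFD; the second is the inconsistency complained about in the message.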
Re: gtk2 + japanese; gnome2 and keyboard layouts
Evan Martin wrote: (Following the earlier discussion about XIM...) http://im-ja.sourceforge.net/ is a pretty effective input module for Japanese input in GTK2. And you can install *alongside* it http://sourceforge.net/projects/wenju/ (includes gtk2 input module(s) for Chinese: table-based), http://kldp.net/projects/imhangul (Korean gtk2 input module suite) and other gtk2 input modules for other scripts. You can also switch around various Xkb supported key layouts, as you and others wrote, with the help of the KDE keyboard switcher or the Gnome2 keyboard switcher. Besides, if you want, you can still use one of the XIM servers you like to use. I'd rather use the built-in XIM server (Compose for UTF-8 locale) by resetting the XMODIFIERS env. variable (or equivalents in Xresources). As far as input methods are concerned, this thread is almost a replica of the thread last December and all this information was given then (except for the KDE/Gnome2 Xkb kbd switcher and im-ja, in whose place a less advanced gtk2 input module for Japanese was mentioned by Owen). Is there anything wrong with the collective memory of this list? ;-) Jungshik
Re: gtk2 + japanese; gnome2 and keyboard layouts
srintuar26 wrote: As far as input methods are concerned, this thread is almost a replica of the thread last December and all this information was given then (except for the KDE/Gnome2 Xkb kbd switcher and im-ja, in whose place a less advanced gtk2 input module for Japanese was mentioned by Owen). Is there anything wrong with the collective memory of this list? ;-) Well I for one have been placated for now by im-ja. It's precisely what I've been looking for, and extensive googling didn't root it out. im-ja may not have turned up in google, but the archive of this list includes all the necessary information we went over again the last week except for the KDE/Gnome2 kbd switcher. Actually, I'm not sure of my own memory and that may also have been mentioned in the past. XIM has been a disappointment for me, and I got tired of using iconv, rom2hira scripts, a trivial console based canna interface, and kanjipad for my input needs. (rh8 uses euc-jp for its Japanese locale, and I refuse to use non-utf-8 locales, but XIM won't work correctly or stably outside of the euc-jp locale...) Well, you must not have been on this list long enough. Last Nov/December, I posted how to make RH8 support ja_JP.UTF-8 and ko_KR.UTF-8. Most of my changes have been fed back to XFree86 and are included in XF86 4.3. Hopefully, RedHat 9.0 will turn on UTF-8 locales for CJK by default as I urged them to on several occasions. BTW, I've been using ko_KR.UTF-8 for about a year now. Now if only more apps were gtk2 based... Mozilla and gvim come to mind. The gtk2 patch for vim works very well. Just search for 'vim gtk2 patch' and you'll get http://regexxer.sourceforge.net/vim. If you're adventurous, you can try building the gtk2 port of Mozilla yourself. It's being worked on. I'm gonna give it a shot myself soonish. I'm also gonna explore if it's easier to wed 'pango' with Mozilla if gtk2 instead of gtk is used. That would dramatically improve complex script handling of Mozilla if possible. 
Jungshik
Re: supporting XIM
Edward Cherlin wrote: On Sunday 30 March 2003 06:29 pm, Jungshik Shin wrote: Edward Cherlin wrote: On Sunday 30 March 2003 03:26 am, Jungshik Shin wrote: I can't test some of the others myself, and haven't heard any detailed information on them. I have not found any problems with diacritics in Latin and Cyrillic. Well, you do have problems with characters with diacritics in Latin,Greek and Cyrillic for which Unicode does NOT have assigned and will NEVER assign separate codepoints. That's what I was talking about. There are tens , if not hundreds, thousands, if not tens of thousands. I'm a mathematician. I know how to multiply, too. It doesn't take a mathematician to multiply, does it? :-) The reason I wrote tens/ hundreds instead of thousands/tens of thousands was that I like to give the number of combinations that have turned up in existing documents rather than the number of all possible combinations. of combinations (base character + one or more diacritic mark(s)) that can ONLY be represented by combining character sequences. Like this? a It's an a with two accents, and it composes and displays correctly in kwrite and kmail, with one accent above the other. Let's try some more. aeiounx Not too bad, except that only the first three accents on each letter are actually displayed, and the dot on the i isn't removed. Curiously, Yudit doesn't handle multiple accents as well as these simple-minded apps do. Yudit needs the same change as I proposed for Pango in this mail and a couple of others. Yudit supports opentype layout table for several Indic scripts and it needs to do the same for Latin/Greek/Cyrillic alphabets. SIL has one such font. Unfortunately, the last time I downloaded it, there's something wrong with zip and I couldn't try it. (http://www.sil.org/~gaultney/gentium/index.html) What do you see in your mail? I can't tell without knowing what I'm supposed to see. 
Anyway, what I see is two diacritics overlapped over each other instead of taking disjoint 'spaces' alongside or on top of/below each other. See http://www.columbia.edu/kermit/st-erkenwald.html for a real life example. Didn't I specifically write that Pango does not support diacritic marks combined with base characters while Uniscribe does (although it didn't until very recently)? I know that xterm and vim support up to two combining characters and that's how the pre-1933 Korean script and Latin/Greek/Cyrillic diacritic marks are supported by xterm/vim. I guess kmail/kwrite do likewise. However, that's a kind of last resort when you don't have a better way to do it properly. Eventually, what we need is support in Pango and that's filed as bug 101079 (see http://bugzilla.gnome.org/show_bug.cgi?id=101079). Other pango bugs I filed (excluding Korean-specific ones) include: http://bugzilla.gnome.org/show_bug.cgi?id=101081 http://bugzilla.gnome.org/show_bug.cgi?id=106624 The starting point of this discussion was the inability to use Chinese, Korean, and Japanese IMEs in the same locale. I write documents in all three languages, and I would do it more often if it were actually convenient. This is becoming rather frustrating. How many times do I have to write that it IS possible right now to install all of them and switch between them in a *single* application (session) running under any UTF-8 locale of your choice? Why don't you try installing all three of them (im-ja, imhangul and wenju) and fire up gedit and right-click on the text input area to see what you have? The very same information was given last December and this thread doesn't add any new information except for im-ja in place of other less advanced Japanese gtk2 input modules. Jungshik
Re: supporting XIM
Tomohiro KUBOTA wrote: Hi, From: Jungshik Shin [EMAIL PROTECTED] Subject: Re: supporting XIM Date: Thu, 27 Mar 2003 18:38:51 -0500 (EST) That's not a problem at all because there are Korean, Japanese and Chinese input modules that can coexist with other input modules and be switched to and from each other. With them, you don't need to use XIM. ... One point: Many Japanese texts include Alphabets, so Japanese people want to input not only Hiragana, Katakana, Kanji, and Numerics but also Alphabets. I imagine Korean people want, too. In such a case, switching between Alphabet (no conversion mode) and conversion mode has to be achieved by simple key typing like Shift + Space. There are two kinds of switching involved here. One is intra-module mode/level switching and the other is inter-module switching. What you want for Japanese (and, as you correctly guessed, Koreans need it too) can easily be achieved by the intra-module mode switching of a single gtk2 input module. For instance, all 5 modules included in the imhangul Korean gtk2 input module suite interpret 'shift-space' as the toggle between Korean and English input modes and 'F9' as Hangul-to-Hanja conversion. I don't see any reason the same cannot be done for Japanese gtk2 input modules. I believe there's nothing in the gtk2 input module framework that prevents a single input module from supporting multiple 'modes' (or levels) that can be switched between as necessary. As for inter-module switching, I guess some more work is necessary. It seems the only way to switch to another input module is through the pop-up menu that is 'summoned' by right-clicking. However, combined with the KDE keyboard switcher (I have learned that gnome2 has a similar utility), which appears to be a simple wrapper over setxkbmap, you don't have to right-click very often, I believe. Another point: I want to purge all non-internationalized softwares. Today, internationalization (such as Japanese character support) is regarded as a special feature. 
However, I think that non-supporting of internationalization should be regarded as a bug which is as severe I agree and think most, if not all, people on this list agree, too. Thanks to a lot of smart people from all over the world including a lot of contributors like you from Japan, free/open source communitiy has taken several, if not a lot more, huge steps forward in terms of I18N during the last few years. Back in 1998, when I read Drepper's paper on I18N in glibc, the problem appeared to be overwhelming. As lately as 1999/2000, KDE team mixed up L10N and I18N and claimed that KDE 1 supports CJK while all it actually had was translated messages in CJK. Now look what we have. gtk2/gnome 2/pango, KDE3/qt, glibc2, XFree86, Xft/fontconfig, freetype, _NET_WM extension, ICU, Perl 5.8, xterm/mlterm, vim, yudit, Omega/Lambda, many others I forgot to mention means users have freedom to choose. Such a freedom of choice must not be a priviledge of English-speaking (or European-languages-speaking) people. Do you have any idea to solve this problem? No question about that. What do we have to do? Well, just as we have done so far, I think we have to keep working as well and as hard as we can. I think I18N-awareness and I18N-mind are now widespread among developers worldwide and I'm not worried as much about CJ(K) as you're. However, we still need to go a long way to (fully) support complex scripts of South Asia, SouthEast Asia, SouthWest Asia (Middle East) , Korea(Hangul is a complex script) and Europe/Africa/North America(yes, Europe ! Latin/Greek/Cyrillic alphabets are complex, too !!) Of course several Japanese companies are competing in Input Method area on Windows. These companies are researching for better input methods -- larger and better-tuned dictionaries with newly coined words and phrases, better grammartical and semantic analyzers, and so on so on. 
I imagine this area is one of areas where Open Source people cannot compete with commercial softwares by full-time developer teams. As some linguists observed, Japanese writing system seems to offer a number of fascinating opportunities for linguists/computer programmers to put their mature and immature ideas to test. How about Korean? In case of Korean, conversion to Hanja(Chinese characters) is not such a important issue as in Japan. Simple dictionary based word and character look-up appears to be sufficient for most Korean users because they rarely use Hanja. As for Hangul input(putting aside pre-1933 orthography Korean for the moment), there are two major keyboard layouts (like qwerty vs dvorak) with a few variants, but the situation has been stable for more than a decade. In other words, there doesn't seem to be much room for innovation because Korean input is not much more complex than input of Latin/Greek/Cyrillic alphabet-based scripts. Cheers, Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive
Re: supporting XIM
On Sat, 29 Mar 2003, Pablo Saratxaga wrote: On Sun, Mar 30, 2003 at 12:37:49AM +0900, Tomohiro KUBOTA wrote: However, I am often annoyed by people who think supporting European languages is more important than supporting Asian languages I don't think you meant it that way, but I find it very annoying that some people and software use 'Asia' to mean only CJK. One prominent example is Sun's StarOffice and OpenOffice. That's almost an insult to the people of the Indian subcontinent, Southeast Asia, Central Asia, and Southwest Asia. Are there such people? There might be some, but as I wrote in my response to Kubota-san, the I18N mindset is much more widespread than 5 years ago, and I agree with your assessment of I18N in Linux below. Note also that, currently, I don't agree with you that i18n of programs is low; to the contrary, the majority of programs have good to very good i18n support. How should I call such people? I know they are never racists in its original meaning. ethno-centrist is the word you are looking for I suppose. If they're from Western Europe, 'Western-Eurocentric' :-) Tell me about one single current major program/project that doesn't have i18n support (maybe there are, and I'm just not aware of it (probably because a modern software without i18n support is not worth it in my eyes). One example is mkisofs in cdrtools. It's 'single-byte-centric' and the project maintainer has yet to accept a patch for multibyte support (including UTF-8). Sooner or later, I'll send him a new patch in such a form that he finds it hard to leave it aside. Other examples include fmt and other textutils, mc (it sorta works, but needs a lot of work to be fully I18Nized and UTF-8 friendly), lynx (one MIME charset at a time is well supported, but it needs multilingual ability as found in w3m-m17n. I hope major Linux distros include w3m-m17n instead of plain w3m) and Pine (it works fine for a single MIME charset, but is not yet multilingual and its screen handling is single-byte centric. 
My UTF-8 patch solves only a small subset of these problems). 'less' still needs more work (Owen's patch is better than my patch that went into less 37x). Some terminal emulators and terminal-based/-like programs need to pay more attention to East Asian Width (UTR #11). xterm has an option '-cjk_width' and other programs need a similar option/feature. Vim needs this. Its current column width calculation routine is not based on wcwidth(). (I plan to fix this soon. It's very easy, and Markus's wcwidth and wcwidth_cjk come in very handy. It's better to use them than wcwidth from glibc, which is locale-dependent.) The gtk2 font selection widget should optionally offer a way to designate a *separate* 'monospace' font for 'double width' characters. So should Qt's font selection widget. It's naive to believe that fontconfig and pango can do the magic for this case, as evidenced by the fact that MS Word under MS Windows, even with equivalents of fontconfig and pango, lets users select an East Asian font separately. Full-screen text-based programs need to be linked against ncursesw rather than ncurses or slang (how good is slang's UTF-8 and multibyte support?) and delegate as many screen-manipulating tasks to ncursesw as possible. When used with mutt, ncursesw appears to work well under a UTF-8 locale. Jungshik
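The column-width point above can be sketched in a few lines. This is only a rough illustration, not Markus's wcwidth code: it leans on Python's `unicodedata` tables, the function names `char_width` and `string_width` are made up for this example, and the `cjk` flag is merely assumed to behave like a "treat ambiguous-width characters as double width" switch. Control characters and conjoining Hangul Jamo are not handled.

```python
import unicodedata

def char_width(ch: str, cjk: bool = False) -> int:
    """Approximate number of terminal columns for one character.

    Simplified sketch: combining marks and format characters take 0 columns,
    East Asian Wide/Fullwidth take 2, and Ambiguous takes 2 only when the
    hypothetical `cjk` flag is set.
    """
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A" and cjk:
        return 2
    return 1

def string_width(text: str, cjk: bool = False) -> int:
    """Sum of per-character widths; this is what a screen routine would need."""
    return sum(char_width(ch, cjk) for ch in text)

if __name__ == "__main__":
    for sample in ("abc", "한글", "e\u0301", "α"):
        print(repr(sample), string_width(sample), string_width(sample, cjk=True))
```

A program that counts one column per character (or per byte) gets the cursor position wrong as soon as a wide character appears, which is the failure described above for programs not linked against ncursesw.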
Re: supporting XIM
On Wed, 26 Mar 2003, Edward Cherlin wrote: KDE has a decent keyboard and IME switcher in the KDE Control Module. You can install it on the toolbar and choose your hot key combinations from a drop-down menu. Thanks for the info. I didn't know KDE had this feature. However, does it work for switching XIMs as well? It lets me switch among as many keyboard layouts as I want, but it doesn't look like it supports switching between XIMs. Hmm, is it time to upgrade my KDE? Anyway, I found gtk2 input module switching very nice and hope many more gtk2 input modules come standard with popular Linux distros. Jungshik
Re: supporting XIM
On Thu, 27 Mar 2003, Pablo Saratxaga wrote: [I Cc: to gnome-i18n as it concerns mainly the gtk2 input] On Thu, Mar 27, 2003 at 04:17:58AM -0500, Jungshik Shin wrote: As mentioned before, this is possible in GTK2 applications. Fire up gnome-terminal and right-click in any text input area and you'll get a pop-up menu from which you can choose a gtk2 input module a la Windows. But you are limited to only one X input method... That is the big problem; it would be much better if it would be possible to have *seceral* X input methods, like in yudit. That's not a problem at all because there are Korean, Japanese and Chinese input modules that can coexist with other input modules and be switched to and from each other. With them, you don't need to use XIM. For instance, imhangul gtk2 input module for Korean(http://kldp.net/projects/imhangul) is much more powerful than Ami. I haven't tried Japanese or Chinese gtk2 input module, but judging from the way imhangul works, it should be possible to write Japanese and Chinese input modules as powerful as, if not more powerful than, Japanese and Chinese XIM servers. BTW, this also works *along* with Xkb. So, if you have KDE 'keyboard switcher'(which appears to be a simple wrapper over setxkbmap and of which feature can be done by setxkbmap in non-KDE environment.), you can switch between all gtk2 input modules, XIM (either Compose or one of XIM servers ) and as many Xkb layouts as you want. me (I can only type some accented letters, while with an UTF-8 locale and xkb keyboard (trough X input method) I can type much more. You meant 'Compose'(the built-in XIM server) by 'xkb keyboard', didn't you? I never use the built-in input of gtk2, as it is too deficient for In particular esperanto accented letters, azeri schwa, and others. You can just Xkb for what it's easier to type with Xkb than with gtk2 input modules. You wrote as if there's an inherent limit in gtk2 input modules, but obviously there isn't. 
It only depends on how well any given module is written and designed. But then, I cannot type in japanese... There is at least one Japanese gtk2 input module as I wrote above. You just have to install it because it doesn't come default with gnome 2.x. Well, I don't always use all of them, as I don't speak all those languages; but a lot of people may have needs that cover several input methods, for example Korean and Japanese, or Japanese and French (something almost impossible to do properly right now, if you have Japanese input you lost some accents), or Chinese and accented pinyin... With gtk2 input modules, you can have all of them. gtk2 input methods for translitering cyrillic or other scripts are useful, but not required. more useful are the methods to type in transliteration for scripts that use sillabaries with a wide range of combination (korean, geez, inuit-cree, etc.), Well, Korean script is not usually classified as a syllabary although it could be many different things depending on how you look at it :-). Anyway, if there's a need for them(transliterating input methods for Ethiopic, Inuit, Korean, etc), somebody has to write input modules for them. Perhaps, taking advantage of what's done in yudit would be a good idea when writing such a input module. But there is still missing the ability to use various XIM input methods and switch between them. It'd be nice to have that feature, but it's not necessary because scripts that usually require XIM servers can be and are supported by gtk2 input modules. Jungshik Shin -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Perl script to hunt for malformed/overlong UTF-8 sequences
Markus Kuhn wrote: The attached Perl script prints cuts from all lines in a plaintext file that contain non-ASCII bytes. With option -m, it looks for malformed and overlong UTF-8 sequences instead. Useful for reviewing files with unknown encoding manually. It may be a good idea to filter out the 'UTF-8' representation of surrogate code points (0xd800 - 0xdfff) as well. That is, the following can be added to $utf8malformed: \xed[\xa0-\xbf][\x80-\xbf] Jungshik
Re: Perl script to hunt for malformed/overlong UTF-8 sequences
Jungshik Shin wrote: Markus Kuhn wrote: The attached Perl script prints cuts from all lines in a plaintext file that contain non-ASCII bytes. With option -m, it looks for malformed and overlong UTF-8 sequences instead. Useful for reviewing files with unknown encoding manually. It may be a good idea to filter out the 'UTF-8' representation of surrogate code points (0xd800 - 0xdfff) as well. That is, the following can be added to $utf8malformed: \xed[\xa0-\xbf][\x80-\xbf] In addition, non-characters (0xfffe and 0xffff in all planes) may as well be filtered out: \xef\xbf[\xbe-\xbf]|[\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf] (and 5- and 6-byte ones if you want)
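For anyone who wants to try the two byte patterns above outside the Perl script, here is a minimal sketch in Python. It is not part of Markus's script; the names `SUSPECT` and `find_suspect_sequences` are made up for this example, and the lead-byte range for the supplementary planes is narrowed to F0..F4 on the assumption that only code points up to U+10FFFF matter.

```python
import re

# Surrogates U+D800..U+DFFF written out as 3-byte sequences: ED A0..BF 80..BF
SURROGATE = rb"\xed[\xa0-\xbf][\x80-\xbf]"
# Non-characters U+FFFE / U+FFFF in the BMP: EF BF BE / EF BF BF
NONCHAR_BMP = rb"\xef\xbf[\xbe\xbf]"
# U+nFFFE / U+nFFFF in planes 1..16: lead byte F0..F4 (assumed upper bound),
# second byte 8F/9F/AF/BF, third byte BF, last byte BE or BF
NONCHAR_SUPP = rb"[\xf0-\xf4][\x8f\x9f\xaf\xbf]\xbf[\xbe\xbf]"

SUSPECT = re.compile(b"|".join([SURROGATE, NONCHAR_BMP, NONCHAR_SUPP]))

def find_suspect_sequences(data: bytes):
    """Return (offset, bytes) pairs for every surrogate or non-character found."""
    return [(m.start(), m.group()) for m in SUSPECT.finditer(data)]

if __name__ == "__main__":
    sample = b"a\xed\xa0\x80" + "é".encode("utf-8") + b"\xef\xbf\xbe"
    for offset, seq in find_suspect_sequences(sample):
        print(offset, seq.hex())
```

Note that the character class lists the second bytes without commas; inside a class a comma is a literal and would match the byte 0x2C.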
Re: UTF-8 and LaTeX
Markus Kuhn wrote: Frank Mittelbach ([EMAIL PROTECTED]) has posted on 2003-01-07 on [EMAIL PROTECTED] the beginnings of a far more lightweight UTF-8 support for LaTeX within the inputenc framework, which will hopefully find its way into the next release: http://www.latex-project.org/cgi-bin/ltxbugs2html?pr=latex%2F3480 I'm not sure how far LaTeX can be stretched to support Unicode. It appears that Lambda, based on Omega (http://omega.cse.unsw.edu.au:8080), is one of the better ways, if not the way, along with truetype/opentype fonts and dvi drivers like dvipdfmx (http://project.ktug.or.kr/dvipdfmx), to get Unicode fully supported. Jungshik Shin
Re: UTF-8 Editors? (Was XML and tags)
On Sat, 22 Feb 2003, Roozbeh Pournader wrote: On Sat, 22 Feb 2003, Edward H Trager wrote: It turns out that the version of vim that I have does indeed work under xterm for an assortment of LTR languages (Indian languages not tested), It wouldn't work for Indic scripts because xterm does not support Indic scripts (although it supports Thai). It's not even clear what VT100/220 terminal emulators should do for them. but not Arabic (the only RTL language tested) Arabic is not in vim yet. They are putting it in now that we're talking, and there have been a lot of discussions on something called 'cream' that is a vim distribution that has included the Arabic patch. You meant a standalone GUI vim (e.g. gvim) as opposed to vim running inside a terminal emulator, didn't you? Without RTL script support in the terminal emulator it's running under, I presume that it'd be very hard to support Arabic in vim. BTW, there's a port of GUI-based vim to gtk2 (and pango) which reportedly supports RTL scripts. See http://www.opensky.ca/gnome-vim/todo.html. The latest patch is not the one linked there, but you should be able to get it at http://regexxer.sourceforge.net/vim. Jungshik Shin
mutt and ncursesw
On Tue, 18 Feb 2003, Nikolai Prokoschenko wrote: On Tue, Feb 18, 2003 at 03:57:30AM -0500, Glenn Maynard wrote: mutt from Debian doesn't have any problems at all! Debian has a mutt-utf8 package that's compiled against ncursesw. Not quite - it's some kind of additional packages - maybe it includes just the updated binary, I don't really know or care - it works! Last time I checked, mutt compiled against the ordinary ncurses (as opposed to ncursesw) does NOT work for characters with East Asian width of 'full'. You may get an impression that it works because you use it only for chars. with East Asian width of 'half'. For CJK, compiling mutt against 'ncursesw' is a must. Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mp3-tags, zip-archives, tool to convert filenames to UTF
On Fri, 14 Feb 2003, Jungshik Shin wrote: On Fri, 14 Feb 2003, Nikolai Prokoschenko wrote: On Fri, Feb 14, 2003 at 07:01:56PM +0100, Helge Hielscher wrote: 1) I have some mp3 files with ID3 tags; most of these files use the ISO-8859-1 encoding, but some use a Russian encoding. Which programs can display the Russian ID3 tags? I have tried XMMS, but with no success. If you have a mix of mp3 files with id3v1 tags in ISO-8859-1 and other mp3 files with id3v1 tags in KOI8-R, the only way to display both kinds of tags correctly *simultaneously* (in a single xmms session) is to convert both to UTF-8 and run xmms under a UTF-8 locale. One problem with this is that most portable mp3 players on the market can't handle UTF-8 although they support a dozen or more languages. Consequently, you may have to reconvert the id3v1 tags in your mp3 files if you need to store them on portable mp3 players. They support multiple languages by assuming that there's a one-to-one correspondence between languages and encodings. This is plainly wrong, but there's not much they can do, given that the id3v1 tag has no means of indicating which encoding is used and that the aforementioned one-to-one mapping holds for the vast majority of mp3 files made and circulated on the net. BTW, id3v2 tags don't have this problem. We can only hope that id3v2 will be widely used soon and that a new generation of portable mp3 players will support it. BTW, a number of PDAs, mobile phones and other devices might share the problem arising from the misguided assumption that languages/scripts and encodings are tightly bound to each other (the same is true of stupid web mail services like Hotmail, Yahoo mail, etc). Hopefully, wider use of Linux in those devices and better UTF-8 support in Linux will change the situation. Jungshik
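To make the id3v1 problem above concrete: the tag is a fixed 128-byte block at the end of the file with no encoding field, so a reader has to be told (or guess) the encoding. A minimal sketch follows, assuming the usual id3v1 layout ("TAG", then 30-byte title, artist and album, 4-byte year, 30-byte comment, 1-byte genre). The function name `read_id3v1` and the idea of passing the encoding per file are illustrative assumptions, not how xmms does it.

```python
def read_id3v1(mp3_bytes: bytes, encoding: str):
    """Decode the text fields of a trailing id3v1 tag using a caller-supplied encoding.

    Returns None when the last 128 bytes do not start with b"TAG".
    The encoding must come from the caller because id3v1 does not record it.
    """
    tag = mp3_bytes[-128:]
    if len(tag) != 128 or tag[:3] != b"TAG":
        return None

    def field(start: int, length: int) -> str:
        # Fields are padded with NUL bytes (sometimes spaces); cut at the first NUL.
        raw = tag[start:start + length].split(b"\x00", 1)[0].rstrip(b" ")
        return raw.decode(encoding, errors="replace")

    return {
        "title": field(3, 30),
        "artist": field(33, 30),
        "album": field(63, 30),
        "year": field(93, 4),
        "comment": field(97, 30),
    }

if __name__ == "__main__":
    title = "Песня".encode("koi8-r")
    tag = (b"TAG" + title.ljust(30, b"\x00") + b"Artist".ljust(30, b"\x00")
           + bytes(30) + b"2003" + bytes(30) + b"\xff")
    print(read_id3v1(b"...audio..." + tag, "koi8-r"))
    print(read_id3v1(b"...audio..." + tag, "iso-8859-1"))  # same bytes, wrong guess
```

Once decoded with the right encoding, the strings can be re-encoded as UTF-8, which is the conversion suggested above for displaying both kinds of tags in one session.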
Re: dos2unix and UTF-8 BOM
On Sun, 16 Feb 2003, Roozbeh Pournader wrote: I was thinking about the annoying BOM-like sequence that Windows 2000's and XP's Notepads are putting at the beginning of UTF-8 files. The byte sequence EF BB BF that's invalid as a header/signature in Unix UTF-8. Shouldn't 'dos2unix' be patched to also remove this sequence? That would be useful. However, that doesn't work very well if multiple files are fed to it (e.g. 'cat a b c | dos2unix'). And that's why we all hate the UTF-8 BOM ;-). How about these? Incidentally, it just occurred to me that ftp/ssh clients may offer a user-configurable option for the automatic removal of the 'UTF-8 BOM' at the beginning of a UTF-8 text file when moving files from Windows to non-Windows platforms (Unix/Unix-like OS and MacOS). The same is true of Kermit (Frank, are you here?). All those tools can be configured to translate between three (and nowadays even more?) EOL conventions, CR/LF/CR+LF, for text files. Then, the automatic removal (and addition, if that's regarded as necessary) of the UTF-8 BOM at platform boundaries would be just as useful. As for web servers, a configurable option can be added to remove the UTF-8 BOM at the beginning of the text/* files they serve. For instance, it's easy to write a simple module for Apache (used at the Unicode.org web site) to do that. VFAT, NTFS and FAT for Linux can be modified in a similar way. And editors like Vim (which automatically detects the EOL convention used in text files) can do the same. Jungshik
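The 'cat a b c | dos2unix' caveat can be shown with a small sketch. This is not the real dos2unix; `dos2unix_utf8` and `convert_files` are hypothetical names, and the only assumptions are that the signature is the byte sequence EF BB BF and that line ends are CR LF.

```python
import codecs

def dos2unix_utf8(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF), then turn CR LF into LF."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.replace(b"\r\n", b"\n")

def convert_files(chunks):
    """Convert each file's bytes separately, then concatenate the results."""
    return b"".join(dos2unix_utf8(chunk) for chunk in chunks)

if __name__ == "__main__":
    a = codecs.BOM_UTF8 + b"one\r\n"
    b = codecs.BOM_UTF8 + b"two\r\n"
    # Filtering the concatenation only sees the first BOM; the second survives mid-stream.
    print(dos2unix_utf8(a + b))
    # Filtering per file removes both.
    print(convert_files([a, b]))
```

A filter at the end of a pipe cannot know where one file ended and the next began, so a BOM in the middle of the stream is indistinguishable from data. Doing the removal where the file boundary is still known (file transfer client, web server, editor) avoids that.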
Re: redhat 8.0 - using locales
On Fri, 10 Jan 2003, Markus Kuhn wrote: strongly preferred that locale names do not use a country name at all, unless it is necessary to distinguish between countries. The only excuse to do so is usually the currency field, which nobody uses anyway and LC_COLLATE is sometimes region/country dependent. For instance, ko_KP and ko_KR have different collation rules (although I wish there were a common set of rules shared by ko_KR and ko_KP). In addition, differences between zh_* in LC_MESSAGES are not trivial. Jungshik
Re: hanzi vs kanji
On Fri, 3 Jan 2003, Maiorana, Jason wrote: Can we please maintain the distinctions between 1. language, 2. script, and 3. typeface 'category' or other typeface differences. That's really the question: Is the difference between Hanzi and Kanji more one of typeface or of script. I would argue that it is a real script difference, I strongly disagree with you on this point. Most people on the Unicode list would agree with me. If they're different scripts, CJK Unification should be overthrown right away. but it is typically implemented as a typeface difference. A character in these scripts does have a precise set of radicals, stroke order, and proportion. This is only the case if you regard anything other than what the Japanese MoES (Min. of Education and Science) standardized as 'non-Japanese'. My grandfather, my father and I (Koreans) could write a single Chinese character with different stroke counts and sometimes even different-looking radicals, but all of us know what we mean. (Stylization is something applied afterwards, deviating from the script norm.) Who has the final say on the script norm? I don't want the Korean MoE (Min. of Education) to tell me to change the way I write some Chinese characters. My grandfather would be enraged if some ignorant bureaucrats in Seoul wanted him to change the way he writes. It is certainly possible for some to overcome this difference, and read their own language despite its being in another script, but that does not prove that they are identical scripts. Neither does it prove that they're different scripts. The difference between Fraktur and Arial however, is purely one of typeface, and seems relatively trivial. If it's trivial, the difference across CJK glyph variants is far, far more trivial. Jungshik Shin
Re: Japanese Input under RH8
On Fri, 13 Dec 2002, Mike FABIAN wrote: Jim Z [EMAIL PROTECTED] wrote: I tried your tip to bring up kinput2 I.e. you tried
    export XMODIFIERS=@im=kinput2
    LANG=ja_JP LC_ALL=ja_JP kinput2 -xim -kinput -canna
    LANG=en_US.UTF-8 LC_CTYPE=ja_JP.UTF-8 program...
I thought you had written that the following also works with a new kinput2 (supposing LC_CTYPE/LC_ALL is not defined) and that might have been what Jim tried.
    export XMODIFIERS=@im=kinput2
    LANG=ja_JP.UTF-8 kinput2 -xim -kinput -canna
    LANG=ja_JP.UTF-8 program-where...
Actually, I've just tried it and it worked. Jungshik
Re: [Fonts]Re: Xprint
On 11 Dec 2002, Juliusz Chroboczek wrote: Sorry for mis-reading your mail, then. No problem :-) JS As for complex script rendering, it's possible... You'll doubtless agree with me that what you're describing are a ... for decades now -- it's high time to move on. Yes, I agree with you, but somebody needs to do the work. Actually, the most difficult part may not be programming but may be getting/making some intelligent fonts (opentype or AAT) for complex scripts. For Indic scripts, things are going pretty well and the number of freely available opentype fonts for Indic scripts are increasing. For Korean, it's not so good as I wrote before. I have yet to see a single free opentype font. BTW, you'll be surprised to read comments made by some people at http://bugzilla.mozilla.org/show_bug.cgi?id=144663. They want to kill PS module in mozilla in favor of Xprint. JC I'm a little bit suspicious about their choice to use Type 42 CIDFonts JS Given that truetype fonts are much easier to come by than genuine JS CID-keyed fonts for CJK (which is also true of truetype fonts vs PS JS type 1 fonts for European scripts although to a lesser degree), I guess JS the choice is all but inevitable... I may have misunderstood something, but last time I checked the approach was to use Type 42 CIDFonts *only*. These are currently a fairly rare beast (only supported since version 3012, if memory serves). I also thought that's the case. However, Brian Stell changed the plan (see http://bugzilla.mozilla.org/show_bug.cgi?id=144663. ) and he's now gonna use type 8 (neither type 11=what you're calling type42 CIDFont = CIDFont type2 nor type 42). What's type 8 font, btw? JC [using Type 42 CIDFonts] will require many users to rasterise JC everything with ghostscript on the host, with all the ensuing JC performance and printing quality issues. 
Because you wrote the above, I thought that you had reservations about doing everything on the host side and treating printers as dumb devices, which may sacrifice printing quality. I also thought that you prefer to leave as much as possible for PS printers to take care of. That's why I didn't even mention the most certain way to produce portable PS output (type3 bitmap) and I wrote about the percentage of end-users owning PS printers. Conversion to Type 1 fonts works everywhere, gives excellent results, and the code is readily available (ttf2pt1). Finally, if everything Does this conversion code also work for large CJK ttf fonts (with more than 256 glyphs)? Or, does it also support conversion to a composite font (OCF?)? As you see, I am not arguing against support for CIDFonts; I'm merely stating that making Type 42 CIDFonts the only download format for TTFs makes me er... suspicious. I'm not against producing portable PS, either :-). However, I think the portability of PS output doesn't matter much considering the way printing is handled these days in Unix/Linux. Jungshik
Re: cxterm cut/paste: COMPOUND_TEXT, UTF8_STRING?
On Mon, 9 Dec 2002, Tony Laszlo wrote: Hi, I found this 1999 post in the mozilla-i18n archives from Jungshik. http://www.geocrawler.com/archives/3/113/1999/7/150/2441628/ I seem to be having a similar issue, at the moment, with Chinese copied from cxterm and pasted into Mozilla (or yudit, or an mlterm window). RH7.1, latest Mozilla, latest yudit, kde. As I wrote there, cxterm and hanterm are to blame because they violate the X11 ICCCM. Mozilla, yudit, mlterm and kde are doing just what they're supposed to do. (I mentioned a work-around that may be implemented by 'programs on the receiving end' in my posting, but I think that's not a good idea.) Mozilla has since implemented UTF8_STRING. 'The' way to solve this problem is to fix cxterm and hanterm to support UTF8_STRING and COMPOUND_TEXT. kterm (Kanji term) and rxvt (cjk) support COMPOUND_TEXT, and mlterm and xterm (XFree86) support both. Jungshik
Re: Input under RH8
On Fri, 6 Dec 2002, Maiorana, Jason wrote: First, thanks to Jungshik Shin and Mike FABIAN for your replies. You're welcome :-) I surmise that the current state of RH8 is that it is not yet suitable for entry of all languages simultaneously. (flaws in XIM itself being part of the problem) You're right. You can't do MS Windows/MacOS style IME switching, yet, in all applications. I can probably setup some scripts to pop up a gedit in a given mode, but, with the exception of VIQR and Korean, I cannot yet graphically switch around to any input method with the version of gtk2 that comes with rh8. Gtk2 as shipped in RH8 has Thai (broken?), Tamil, Cyrillic (transliterated), Inuktitut, IPA, Tigrigna-Ethiopian, Tigrigna-Eritrean, and Amharic input modules in addition to the XIM, Vietnamese and *broken* Korean (KSC5601) input modules. For Korean, you'd better install the 'imhangul' input module from http://imhangul.kldp.net. You can download the source by clicking 'download' in red and install it by following the instructions in the gray box below the download link. If this is the first time you install 'imhangul', you have to run 'make install' twice (it's due to a bug to be fixed). You can also make use of Xkb. With its support of multiple levels, you can add yet another 'input method' to your repertoire of input methods accessible in gedit (a gtk2 application). As for Xkb, refer to the XFree86 I18N archive. Hopefully, in the near future, RH will ship all utf-8 locales by default, and gtk2 will have a XIM wrapper that allows access to any input method on the system from any language locale. Alternatively, a 'meta XIM server' (as implemented at the client level by Yudit and mlterm) that lets users switch between multiple XIMs would be handy. Then it could be used for non-gtk2 applications as well as gtk2 applications. BTW, has anybody heard of gtk2 input modules for Chinese and Japanese? A quick googling didn't turn up anything. 
Jungshik
RE: UTF-8 wakeup call
On Sat, 7 Dec 2002, Kent Karlsson wrote: The mappings used are at least also from the RFC 1345 (recode uses that) or the IS 15897 which uses many if the same names and mappings. Specifically I have seen that Linux is *not* using the Unicode data because of copyright issues. Hmmm. From http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html: Limitations on Rights to Redistribute This Data Recipient is granted the right to make copies in any form for internal distribution and to freely use the I don't see this as restrictive for use in Linux. I'm sure Unicode consortium would like to see its data being used also in open source glibc 2.x may not use them, yet. However, glib(and other libraries built on top of it) indeed makes an extensive use of Unicode data files. So do Perl, Yudit, Mozilla and other free/opensource programs/projects that run on Linux. Jungshik -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
RE: Japanese Input under RH8
On Fri, 6 Dec 2002, Maiorana, Jason wrote: thanks for the tips, but what I really wanted was use japanese/other languages input methods, but not be in a ja_JP locale. (just the default local en_US.UTF-8) (Also I was hoping it could be done in an application that was already running, for example I would start off in VIQR, then maybe do some korean input, then switch to XIM/kinput2/canna, all in the original gedit window...) You're talking about two different things here. One is XIM and the other is gtk2 input modules. The gtk2 input module mechanism (which you bring up by 'right-clicking' in a gtk2 input widget area) lets you do what you want. It also supports XIM as one of the supported 'modules'. Under the en_US.UTF-8 locale, the XIM selected is (unless XMODIFIERS is set to @im..) the default built-in XIM, which is the Compose mechanism. The Compose mechanism is pretty powerful for alphabetic scripts although it's not so useful for Japanese and Chinese. I'm curious why I would set the LC_CTYPE to ja_JP.UTF-8, why would that be any different than en_US.UTF-8 when the LANG is en_US.UTF-8. I'm not worried about japanese collation i'd prefer to use a default unicode collation. Unfortunately, most XIM servers are written in such a way that they can only be launched under a certain locale. However, the gtk2 input module mechanism can be used to achieve what you want (switching between any number of different input modules in any UTF-8 locale). Somebody has to write one or more gtk2 input modules for Japanese (if they haven't been written yet; there is a very powerful set of Korean input modules for gtk2, all based on U+1100 Hangul Jamos alone). Then you can use them regardless of the locale you're in. This is great as long as you use gtk2 applications. It doesn't work for non-gtk2 applications, though, and there's still a need to write a 'wrapper XIM' server that lets users invoke multiple XIM servers at will. There are a couple of projects going on in that direction. 
There's also a 'next generation input protocol' for X11 and other platforms (look around http://www.li18nux.org). You can find more details in the XFree86 I18N mailing list archive. I'm curious, why do you suggest that kinput2 should be run with eucJP as its startup encoding? Does it have bugs if that is not the case? I guess kinput2 was written that way. That was also the case with the Korean input method Ami before my patch. Because, when launched under ko_KR.EUC-KR, it can't be used to input the full repertoire of Hangul syllables in Unicode, I patched it to be launchable under the ko_KR.UTF-8 locale. Jungshik
Re: Why doesn't Linux display Japanese file names encoded in UTF-8?
On Fri, 6 Dec 2002, Jim Z wrote: Jim, However, there are issues. After those changes when I logged into Japanese EUC locale, everything is displayed in English. :( So was for Japanese UTF-8 locale. Is that because the system couldn't find the resources? Have you checked what's in /etc/sysconfig/i18n and ~/.i18n? Why don't you make both of them clean and see what you get? Also, have you made sure that you installed the kde-i18n-Japanese package for KDE? In my case, both Gnome and KDE came up nicely in Japanese. I didn't check and made sure that the locale.dir was modified (I'll check again). Also, in UTF-8 for Japanese mode, there is no Japanese input (Shift-space bar). As already noted by others, kinput2 has to be launched under ja_JP.EUC-JP. Certainly, this has to be fixed. In general, looks like UTF-8 works on Linux for CJK; There are still some issues (input methods, as you found, and localized man pages). Localized man pages are mostly in legacy encodings and it's hard to figure out how to make them work in a UTF-8 locale (if at all possible). 'man', 'less' and 'groff' all do things differently (when it comes to interpreting the LC_* and LANG environment variables) and they interact with each other in an intricate way. At least, I think 'man' has to be fixed to either call setlocale(LC_MESSAGES,...) directly or to use the SUS-specified order of resolving the LC_*/LANG environment variables (i.e. 1. LC_ALL 2. LC_* 3. LANG). At the moment, even 'LC_ALL=C man xyz' doesn't give me man pages in English, let alone 'LC_MESSAGES=C', when LANG is set to ko_KR.UTF-8. Note that LANG should be given the lowest precedence in the locale resolution and LC_ALL should be at the top. Certainly, man doesn't honor that order. A couple of years ago, we discussed how to tag (if we decide to tag them) the encoding used in man pages, but it got nowhere. A reasonable approach appears to be to convert them all to UTF-8 (assuming groff UTF-8 support will come along soon).
however, there is no way for general users to do what they intend to do. According to what I heard on this list, SuSE 9.1 offers UTF-8 locales for all languages as an alternative to traditional encodings, so SuSE users should have no problem there. Mandrake 9.0 seems to do it, but it doesn't work out of the box (I have to make some modifications) as far as I can tell. Your help is appreciated and I would like to see your fixes get into near future builds so all can benefit. My changes to XFree86 have gotten into the XFree86 CVS, so I guess they'll be included in the upcoming 4.3.0 release. With increasing use of Xft/fontconfig and client-side fonts, the importance of my patch (to the X11 locale) will diminish. Jungshik Shin
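The precedence described above (LC_ALL first, then the category-specific LC_* variable, then LANG) can be sketched in a few lines of Python. This is only an illustration of the lookup order, not how 'man' or glibc implement it; the function name resolve_locale and the fallback to "C" when nothing is set are assumptions made for the example.

```python
def resolve_locale(category, env):
    """Pick the locale name for one category, e.g. "LC_MESSAGES".

    Lookup order: LC_ALL, then the category variable itself, then LANG.
    Unset or empty variables are skipped. Falling back to "C" is an
    assumption of this sketch (the real default is implementation-defined).
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


if __name__ == "__main__":
    # The situation from the message: LANG is Korean, messages forced to C.
    env = {"LANG": "ko_KR.UTF-8", "LC_MESSAGES": "C"}
    print(resolve_locale("LC_MESSAGES", env))
    print(resolve_locale("LC_CTYPE", env))
```

With this order, 'LC_MESSAGES=C' wins over LANG for messages only, while LC_CTYPE still falls through to LANG; a program that consults LANG first would get this wrong.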
RE: Japanese Input under RH8
On Fri, 6 Dec 2002, Jungshik Shin wrote: On Fri, 6 Dec 2002, Maiorana, Jason wrote: I'm curious why I would set LC_CTYPE to ja_JP.UTF-8; why would that be any different than en_US.UTF-8 when LANG is en_US.UTF-8? I'm not worried about Japanese collation Unfortunately, most XIM servers are written in such a way that they can only be launched under a certain locale. However, BTW, I didn't mean that kinput2, Xcin and Ami cannot be modified to work under the en_US.UTF-8 locale. They can, but their dependency on fontsets makes them work less well than under their 'native' locales. I guess we have to give up 'stretching' the old XIM protocol and had better focus on the new IIIMF (Internet Intranet Input Method Framework: http://www.openi18n.org/subgroups/im/IIIMF. Li18Nux.org changed its name to OpenI18N.org), on gtk2 input modules, or on similar mechanisms. MS Windows has something called TSF (Text Service Framework) which appears to be very flexible. IMHO, XIM is too old to be on par with the likes of TSF. IIIMF is in a far better position for that than XIM. Jungshik
Re: filename and normalization (was gcc identifiers)
On Wed, 4 Dec 2002, seer26 wrote: is to insist that 11,172 modern precomposed syllables be encoded in Unicode/10646. Next biggest blunder they made is to encode tens of totally unnecessary cluster-Jamos when only 17+11+17+ a few more would have been more than sufficient. Next stupid thing they did is Would Chinese be in a similar situation if the radicals were combining characters, and any combination of them could in theory be a valid character? Possibly. However, radicals are only a small subset of the 'components' used in Chinese characters. You need to have a lot more 'components' than the radicals listed in any Chinese character dictionary. In practice, of course, a normal person would use far fewer than 10,000 distinct characters. Do you think anybody wants a character set standard (like Unicode) to specify the list of sequences of Latin/Greek/Cyrillic letters that are allowed? Imagine that you can use 'ab, eb, ob, se, ce' but cannot use 'sce, gh, ph'. That's what encoding a fixed set of precomposed syllables does for the Korean alphabet. Have you ever needed a character that wasn't among the 11,172 precomposed ones? Sure! See http://jshin.net/i18n/korean/hunmin.html or http://jshin.net/i18n/uyeo.html. The 11,172 precomposed syllables don't include any pre-1933 orthography syllables. The set doesn't include modern incomplete syllables (which high school Korean teachers need to teach Korean grammar), either. Basically, it was a very stupid idea (and a vast waste of codespace) to enumerate possible combinations of alphabetic letters. Just encoding the alphabetic letters should be more than enough. I wish the Korean national standards body had been half as competent as its counterpart in India. ISCII (which ISO 10646/Unicode copied almost verbatim) did a great job of encoding only what's absolutely necessary for Indic scripts. And that was in the early 1990's, when no intelligent modern rendering engine or font was in sight.
They, however, had the foresight that encoding hundreds or thousands of 'presentation forms' for each of the Indic scripts was not the way to go and that eventually intelligent and advanced fonts and rendering engines would come out. They were right, and nowadays Indic scripts are pretty well supported by Pango, Uniscribe, ATSUI, and Graphite. It may take a little while longer to have OpenType fonts in the public domain for all Indic scripts, but they're coming... Jungshik Shin
Re: filename and normalization
On Wed, 4 Dec 2002, Werner LEMBERG wrote: the manpage was not using a regular ascii '-', but instead one of the HYPHEN, or EM_DASH things (Which is why i HATE them). you can configure the way your 'man' works in man.config. You can set NROFF to use '-Tascii -man' and you get 'ASCII approximation' of real em_dash, hyphen etc so that you can copy and paste and search A better temporary solution is to add the following to man.local:

    .if '\*[.T]'utf8' \
    . char \- \N'45'

Thanks. It worked great. Neither Mandrake 9 nor RH 8 has this in man.local. I guess they should. Jungshik
Re: gcc identifiers
On Wed, 4 Dec 2002, Keld Jørn Simonsen wrote: On Tue, Dec 03, 2002 at 10:33:19PM -0800, H. Peter Anvin wrote: Maybe a --normalize-utf option to the linker might be a good idea, but it should be an option, IMO. First of all, the standard does not refer to Unicode, but to 10646. And the C standard does not use Unicode normalization. There is a list in the ISO C standard of 10646 characters that are allowed in identifiers, and these do not have alternate representations. Thank you for the note. I found the FCD of ISO/IEC 9899:1999 (N2794 at http://wwwold.dkuug.dk/jtc1/sc22/open/n2794). It dates from Aug. 1998. In Annex I, 'Universal Character names for identifiers' (page 487; if you use Acroread to view the PDF version, it's page 499), the set of allowed characters is listed. (A more or less identical list is found at http://std.dkuug.dk/TC1/SC22/WG20/docs/standards#10176) Basically, ISO C99 seems to avoid problems arising from multiple representations by allowing only precomposed characters in identifiers (is there any change in this regard in the finally approved ISO/IEC 9899:1999?). Keld's statement that they do not have alternate representations is not right. If that were the case, characters like 'Latin Small Letter with Macron' or 'Hangul Syllable Gga', for which there are alternate representations, should not be present in the list, but they are listed as allowed. What ISO C99 seems to do is to shift the burden of normalization from compilers and linkers to editors or whatever tools programmers use to edit source files. That's fine (editors can do that) and is perhaps a wise decision (preventing potential troubles from propagating through a compiler-linker chain at the earliest stage by issuing an error and stopping compilation), but there's a little trouble with allowing only precomposed characters. Both ISO/IEC JTC1/SC2/WG2 and UTC will not encode any more precomposed characters that can be represented with existing base characters followed by one or more combining characters.
However, 'combining diacritical marks' (e.g. \u0300 - \u0362) are not allowed in identifiers, so 'any character' that's not encoded in a precomposed form can't be used in identifiers. Some people would resent not being able to use 'their characters' in identifiers and may use it to make a case for encoding precomposed forms of theirs in ISO 10646. How about references to filenames (as in the '#include' directive) with combining diacritical marks that are parts of characters NOT encoded in precomposed form? Aha, they can use '\u' or '\U'... Jungshik
Re: filename and normalization (was gcc identifiers)
On 4 Dec 2002, H. Peter Anvin wrote: By author:Jungshik Shin [EMAIL PROTECTED] All right. That's what the *current* SUS/POSIX says. However, that is hardly a solace to a user who'd be puzzled that two visually identical and canonically equivalent filenames are treated differently. There *is* no way to solve this problem. You have the same kind of problem with U+0041 LATIN CAPITAL LETTER A versus U+0391 GREEK CAPITAL LETTER ALPHA. However, if you attempt normalizations you *will* U+0041, U+0391, and U+0410 are NOT equivalent in any Unicode normalization form. They're not even equivalent in NFK*. Note that I didn't just say visually (almost) identical but also qualified it with 'canonically equivalent'. introduce security holes in the system (as have been amply shown by Windows, even though *its* normalizations are even much simpler.) Therefore, your example cannot be used to show that there's a security hole (unless you're talking about applying a normalization not specified in Unicode), although it can be used to demonstrate that even after normalization there could still be user confusion, because there are some visually (almost) identical characters that would be treated differently. A better example for your case would be U+00C5 (Latin capital letter A with ring above) and U+212B (Angstrom sign), or U+004B and U+212A (Kelvin sign). They're canonically equivalent. available to the user (ls -b or somesuch.) Attempting canonicalization is doomed to failure, if nothing else when the next version of Unicode comes out, and you already have files that are encoded with a different set of normalizations. Now your files cannot be accessed! Oops! I might agree that normalization is not necessarily a good thing. However, your cited reason is not so solid. The Unicode normalization forms are **permanently frozen** for existing characters. And UTC and JTC1/SC2/WG2 have committed themselves not to encode any more precomposed characters that can be represented with existing base characters
and combining characters. If you're not sure of their commitment, perhaps using NFD is safer than using NFC. Hmm.. that may be one of the reasons why Apple chose NFD in Mac OS X. BTW, without changing anything in the Unix APIs and Unix filesystems (which is not desirable anyway), shells 'might' be a good place to 'add' some normalization (as a user-configurable option at the time of invocation and with environment variables). Jungshik
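The distinction drawn above between look-alike characters and canonically equivalent ones can be checked with Python's standard unicodedata module. This is a small sketch; the two helper names are made up for the example.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a, b):
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


if __name__ == "__main__":
    # U+00C5 vs U+212B (Angstrom sign), U+004B vs U+212A (Kelvin sign)
    print(canonically_equivalent("\u00c5", "\u212b"))
    print(canonically_equivalent("\u004b", "\u212a"))
    # Latin A, Greek Alpha, Cyrillic A: look alike, but are not equivalent
    print(canonically_equivalent("\u0041", "\u0391"))
    print(compatibility_equivalent("\u0041", "\u0410"))
```

The first two comparisons report equivalence and the last two do not, even under the NFK* forms, which is the point made in the reply.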
RE: filename and normalization
On Wed, 4 Dec 2002, Maiorana, Jason wrote: As a side-note, I copy/pasted a command line flag from a RH8.0 manpage back into the console, and tried to execute the command. It failed, and gave me usage. The reason, I discovered, is that the manpage was not using a regular ascii '-', but instead one of the HYPHEN, or EM_DASH things (Which is why i HATE them). I discovered that a long time ago and gave up copy'n'pasting from man pages. I began to write that those characters should not be used in man pages, but then I came up with a couple of arguments against my own position and didn't send a message here. One of them was that you can configure the way your 'man' works in man.config. You can set NROFF to use '-Tascii -man' and you get an 'ASCII approximation' of the real em dash, hyphen etc., so that you can copy and paste and search backward/forward for command line options. Another was that a man page is not only for screen viewing but also for printing. When printed out, a genuine hyphen and em dash certainly look better than their ASCII approximations. Jungshik
RE: filename and normalization (was gcc identifiers)
On Wed, 4 Dec 2002, Maiorana, Jason wrote: Normalization for D has some serious drawbacks: if you were to try to implement, say vietnamese using only composing characters, it would look horrible. The appearance, position, shape, and size of the combining accents depends on which letter they are being combined with, as well as which other diacritics are being combined with that same letter. What's your point here? NFD or NFC, they should be rendered identically by 'modern' rendering engines. You're making an assumption that the way characters are rendered depends on which NF they're stored/represented in. At least in principle, that should not be the case. Even a not-so-capable renderer (e.g. xterm with a bitmap font or the Linux console) can do an internal normalization to fit its needs and capabilities. NF-C is most appropriate for some scripts, and NF-D may be desirable for others. What are your criteria? Again, rendering? As I wrote above, that has nothing to do with the NFs used. It would be better, IMO, if unicode would get rid of both forms, and simply support one representation of each possible glyph. (No combining characters unless they are the ONLY way to represent a particular glyph) 'glyphs'? A coded character set is not about glyphs but about characters. (Actually, no combining chars at all would be best, because it's simplest. Why not just assign more code space to the langs that need it?) Do you want to give 1.5 million (and more) code points to the Korean script? Why don't you propose your idea to UTC and ISO/IEC JTC1/SC2/WG2? Either your mailbox will be bombarded with a lot of emails or you will be greeted with 'dead silence'. If you have a filesystem that forces NF-D, then I would say it's a poorly designed filesystem that makes such choices, because it's way too low level to care about things like that. Filenames should be strings of bytes, and the UI-conventions should allow one to distinguish.
If you are on a NF-C==canonical system, and you mount such a filesystem, you should see bakemoji, and not any translated normalization form. Why bakemoji? No matter what NF is used in filenames, they should just be rendered as they should be by any Unicode-compliant rendering engine. This behavior is more consistent with your view that filenames are strings of bytes than showing 'bakemoji'. Jungshik Shin
RE: filename and normalization (was gcc identifiers)
On Wed, 4 Dec 2002, Maiorana, Jason wrote: If characters are ever introduced which have no precomposed codepoint, then it will be difficult for a font to normalize them to one glyph which has the appropriate internal layout. The font file itself would then have to know about composition rules, such as when X is composed with Y then Z, then use this glyph XYZ which has no single codepoint in unicode. Have you ever heard of OpenType and AAT fonts? Modern font technologies and modern rendering engines (Pango, AAT, Uniscribe, Graphite) can all do that. Otherwise, how would Indic scripts be used at all? What you describe above is done every day by Pango, Uniscribe, AAT/ATSUI and Graphite. For that reason, I don't like form D at all. I wonder how much space it would take to represent every possible Jamo-combination, then just do away with combining characters altogether... No way!! The biggest blunder ever made by the Korean national standards body is to insist that 11,172 modern precomposed syllables be encoded in Unicode/10646. The next biggest blunder they made is to encode tens of totally unnecessary cluster-Jamos when only 17+11+17+ a few more would have been more than sufficient. The next stupid thing they did is to remove the compatibility decomposition between cluster Jamos and basic Jamo sequences, although they should be canonically (not just compatibly) equivalent. Now you're saying that all possible combinations of them should be encoded. How many? It's __infinite__ in theory. In practice, it could be around 1.5 million. That's more than the total number of codepoints available in the 20.1-bit coded character set which is ISO 10646/Unicode. Jungshik Shin
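The '20.1 bit' figure and the comparison with 1.5 million can be reproduced with a little arithmetic. The sketch below assumes the usual layout of 17 planes of 65,536 code points each (U+0000 through U+10FFFF); the 1.5 million value is simply the estimate quoted in the message, not something derived here.

```python
import math

PLANES = 17                      # plane 0 (the BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536
CODE_SPACE = PLANES * CODE_POINTS_PER_PLANE

# Estimate quoted in the message for 'all possible' Hangul combinations.
ESTIMATED_COMBINATIONS = 1_500_000


def bits_needed(n):
    """Number of bits (as a real number) needed to index n code points."""
    return math.log2(n)


if __name__ == "__main__":
    print(CODE_SPACE)
    print(round(bits_needed(CODE_SPACE), 1))
    print(ESTIMATED_COMBINATIONS > CODE_SPACE)
```

Under these assumptions the code space holds 1,114,112 code points, which is where '20.1 bits' comes from, and the quoted estimate would not fit into it.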
Re: filename and normalization (was gcc identifiers)
On 4 Dec 2002, H. Peter Anvin wrote: By author:Jungshik Shin [EMAIL PROTECTED] How many? It's __infinite__ in theory. In practice, it could be around 1.5 million. That's more than the total number of codepoints available in 20.1 bit coded character set which is ISO 10646/Unicode. And people give me funny looks when I tell them not to trust the 20.1 bits forever statement from Unicode, just as I didn't trust the earlier 16 bits forever statement... Whether you're convinced or not, it's not only in Unicode but also inscribed in ISO 10646. Jungshik Shin
RE: filename and normalization (was gcc identifiers)
On Wed, 4 Dec 2002, Maiorana, Jason wrote: For that reason, I dont like form D at all. I wonder how much space it would take to represent every possible Jamo-combination, then just do away with combining characters alltogether... No way!! The biggest blunder ever made by Korean nat'l standard body is to insist that 11,172 modern precomposed syllables be encoded in Unicode/10646. Next biggest blunder they made is to encode tens .. available in 20.1 bit coded character set which is ISO 10646/Unicode. Wow, ok, I guess that idea won't work for Korean. Also, since glyph swapping has to be done for merely adjacent characters, doing it for combining ones must be a relatively minor concern. Out of curiosity, how many of those Korean letters are actually made use of by the language? 1.5 million sounds higher than any number of phonemes that a human can produce Needless to say, modern Korean speakers can pronounce only a very, very small fraction, and chances are that the number will decrease as time goes by because, as in most other languages, speakers are on the winning side of the battle between listeners and speakers. You have to understand that Korean Hangul is alphabetic, and the number of possible syllables that can be made out of a finite set of alphabetic letters is infinite whether it's Latin, Greek, Cyrillic, Indic or Korean. (what if the cluster jamo's were dropped?) It doesn't make any difference at all. Cluster Jamos can be represented just as well by a sequence of basic Jamos. Please note that the most generic form of a Hangul sequence is given as L+V+T*M? where L, V, T, and M denote leading consonant, vowel, trailing consonant and combining mark (for Hangul, it's most likely to be one of the two tone marks), and '+', '*', '?' have their usual meanings in regular expressions. That's why I wrote that cluster Jamos shouldn't have been encoded at all. The same is true of all those 11,172 precomposed syllables. For Korean Hangul, all we need is a few dozen basic Jamos.
I feel 'guilty' (although I haven't been involved in any way in forcing them through) that Korean Hangul took about a fifth of the BMP codespace when about a two-hundredth of that is enough. Are we heading for a long-run scenario, where Form-D becomes canonical, and all the old pre-composed codepoints are deprecated? NF-C seems to be getting more and more entrenched from what I can tell... Well, from the very beginning, UTC didn't want to have precomposed forms in Unicode. Precomposed characters are not there because they wanted to encode them but because they had to maintain 'compatibility' with legacy coded character sets in which they're encoded as separate entities. If they had been able to start afresh without any concern for legacy character sets, there would have been NO precomposed characters that can be represented by sequences of base characters and combining characters. Jungshik Shin
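The L+V+T*M? pattern mentioned above can be written down directly as a regular expression over conjoining jamos. The following Python sketch is illustrative only: the character ranges are taken from the layout of the Hangul Jamo block in current Unicode (including the fillers) and are an assumption of the example, as is the helper name is_jamo_syllable. Precomposed syllables are decomposed with NFD first, which also shows that each of the 11,172 precomposed syllables is just a jamo sequence.

```python
import re
import unicodedata

# Approximate ranges (an assumption of this sketch, see the note above).
L = "[\u1100-\u115f]"  # leading consonants, including the choseong filler
V = "[\u1160-\u11a7]"  # vowels, including the jungseong filler
T = "[\u11a8-\u11ff]"  # trailing consonants
M = "[\u302e\u302f]"   # the two Hangul tone marks

SYLLABLE = re.compile(f"{L}+{V}+{T}*{M}?")


def is_jamo_syllable(text):
    """True if text, once decomposed to jamos, has the form L+V+T*M?."""
    return SYLLABLE.fullmatch(unicodedata.normalize("NFD", text)) is not None


if __name__ == "__main__":
    # U+D55C decomposes into three jamos: U+1112 U+1161 U+11AB
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\ud55c")])
    # A cluster written as two basic leading consonants plus a vowel
    print(is_jamo_syllable("\u1100\u1100\u1161"))
    # 19 leading x 21 vowel x 28 trailing (incl. none) precomposed syllables
    print(19 * 21 * 28)
```

Because the pattern allows repeated L, V and T elements, clusters and pre-modern syllables fit without any extra code points, which is the argument being made in the message.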
Re: Why doesn't Linux display Japanese file names encoded in UTF-8?
On Wed, 4 Dec 2002, Jim Z wrote: Jim, This time, I hope my answer will solve your problem :-) From: Jungshik Shin [EMAIL PROTECTED] On Tue, 3 Dec 2002, Jim Z wrote: You can easily add 'Japanese (UTF-8)' to your gdm/kdm language selection menu. See https://bugzilla.mozilla.org/bugzilla/show_bug.cgi?id=75829 I couldn't get into here and is it a typo? PLEASE help - I really want to I'm sorry, it's https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829 I did a 'showmount -e 10.xxx.xxx.xxx' but I got scrambled Japanese characters for those entries that are encoded in UTF-8. Then I switched the locale to ja_JP.UTF-8, but the same stuff was returned. What's wrong with this picture? It's an UNIX (Linux) to UNIX (NetBSD) mount. The UTF-8 Japanese file names are in my NetBSD:/etc/exports. I can only mount those entries that are ASCII equivalent. I also tried it from Solaris 8 (logged in as 'Japanese UTF-8 (Unicode)') and it worked fine. I am sure if I can turn on UTF8 mode I should be able to do so. NFS should be encoding-neutral just like the rest of the Unix FS is (except for cases like exporting to and from non-Unix systems where different file systems are used). Why don't you begin with a simpler case? Before using UTF-8 for directory names to export via NFS, you can begin by making sure UTF-8 filenames under an NFS-exported directory come out all right on the client side. BTW, I've just experimented with UTF-8 directory names in the export list (/etc/exports), and it worked fine between Mandrake 9.0 (server) and RedHat 8.0 (client). Judging from this and the fact that Solaris and NetBSD worked fine, it should also work between NetBSD and RH 7.3. Needless to say, you have to run your shell in UTF-8 terminal (e.g. xterm 16x or mlterm) to view UTF-8 characters. I can't get it to work. 'xterm -u8' doesn't work. the locale never changes. From Solaris you can do a LANG=ja_JP.UTF-8 dtterm and the new dtterm has You have to do the same for xterm as you do for dtterm.
'LANG=ja_JP.UTF-8 xterm'. The '-u8' option is not necessary for recent xterm. Or you can do it in the opposite order. That is, run 'xterm -u8' and then set LANG to ja_JP.UTF-8 inside that (UTF-8) xterm. Actually, you have to do it the latter way if your /etc/sysconfig/i18n or ~/.i18n sets $LANG to a value other than ja_JP.UTF-8, because the shell initialization script in RedHat *overrides* the value set before the shell invocation with the value in /etc/sysconfig/i18n or ~/.i18n (see /etc/profile.d/lang.(sh|csh)). what is mlterm? Couldn't find it on Linux 7.3. I'm not sure if it's in RH 7.3. You can get it at http://mlterm.sourceforge.net Jungshik
Re: Why doesn't Linux display Japanese file names encoded in UTF-8?
On Tue, 3 Dec 2002, Jim Z wrote: I created a few Japanese file and directory names in UTF-8 in Windows. Then How could you make file and directory names in UTF-8 in Windows? Windows (both NTFS and VFAT) uses UTF-16 for filenames. I logged in from Linux (7.3) that is configured to run Japanese. From the login 'language' I can only select 'Japanese (eucJP)' (there is no Japanese (UTF-8)). You can easily add 'Japanese (UTF-8)' to your gdm/kdm language selection menu. See https://bugzilla.mozilla.org/bugzilla/show_bug.cgi?id=75829 Or you can just set it in ~/.i18n. I did a 'showmount -e 10.xxx.xxx.xxx' but I got scrambled Japanese characters for those entries that are encoded in UTF-8. Then I switched the locale to ja_JP.UTF-8, but the same stuff was returned. What's wrong with this picture? How did you mount the Windows filesystem? With smbmount or NFS? If it's NTFS that is mounted via samba, you have to specify 'iocharset=utf-8'. If it's VFAT exported over the net, you also have to specify the codepage (for Japanese, it's 932). For local filesystems, specifying the 'utf8' option (and 'codepage=932' for VFAT) to the mount command would be sufficient (see the man pages of mount(8) and fstab). Needless to say, you have to run your shell in a UTF-8 terminal (e.g. xterm 16x or mlterm) to view UTF-8 characters. Now in the case of NFS, I have no idea how a 'Windows NFS server' translates the UTF-16 used in NTFS and VFAT to multibyte encodings. There must be a server configuration option for that (the default might be the 'ANSI' codepage of the current locale; for Japanese, it's Windows-932/Shift_JIS). For a Unix NFS server and a Unix client, there's little need for encoding translation, although having one would be nice in some cases (e.g. EUC-JP on the server and UTF-8 on the client side). Jungshik
Re: going past the bmp
Thank you for the note, Owen and Bruno. On Thu, 28 Nov 2002, Owen Taylor wrote: The path to adding full beyond-the-BMP support to Pango is pretty straightforward. (I'm a little surprised that it doesn't sort of work now for TrueType fonts, but I haven't tested it at all.) So, what I wrote about 'UTF-32 cleanness' was not the case. There are some libraries that support the BMP only for the moment. As for Pango, I had the same thought as yours. I mean, for TrueType fonts, I thought it would work as it is. On Thu, 28 Nov 2002, Bruno Haible wrote: Jungshik Shin writes: kwin(in KDE 3.x) can't handle non-BMP characters in the title bar of windows. The cause is probably that Qt's internal string representation is based on UCS-2. Aha. They fear to switch to UCS-4 because of the memory consumption. They don't have to, as Win32 and Java showed. If they're worried about the memory consumption, they can just use UTF-16 instead of UCS-4/UTF-32. Win32 and Java showed that it's relatively easy (at least much less complicated than supporting traditional variable-length encodings) to modify APIs to support UTF-16 (UCS-2 plus surrogate pairs to represent non-BMP characters) instead of UCS-2. Jungshik
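For readers unfamiliar with how UTF-16 extends UCS-2, the surrogate-pair arithmetic is short enough to show in full. This Python sketch uses a made-up helper name, to_surrogate_pair, and cross-checks the result against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(cp):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000                # 20 bits remain after subtracting
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10400)
    print(hex(high), hex(low))
    # Python's UTF-16 encoder produces the same two 16-bit units.
    print(chr(0x10400).encode("utf-16-be").hex())
```

A BMP character still occupies one 16-bit unit, so code written for UCS-2 mostly keeps working; only code that assumes one unit per character has to be revisited.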
Re: ISO9660 UTF-8
On Mon, 14 Oct 2002, Markus Kuhn wrote: Jungshik Shin wrote on 2002-10-14 06:37 UTC: When I made a patch, I wrote to the maintainer of cdrtools, but his response was not so positive. At first, he asked me whether Try again, he's just busy. (I interacted on this with him as well) Ilya did (I sent him an email on his behalf because his ISP is blacklisted). In his reply, he wrote that mkisofs is currently frozen for an imminent major release. Perhaps, in the next cycle of development, iconv() will be considered. One of his concerns was how to detect the availability of iconv(3) with autoconf. I pointed out that iconv.m4 for autoconf had been written by Bruno, so this should not be a problem. However, I had to tell him that there's another hurdle to overcome. My patch hard-coded 'UTF-16LE' as the codeset name for 'UTF-16 Little Endian', but that's not very portable. There should be a way to detect the codeset name to use with iconv(3) for UTF-16LE on a given platform. Is there any autoconf macro written for this? One way I can think of is to first detect the codeset name for UTF-8 (utf-8, utf8, utf_8 and uppercase variants) by calling iconv_open with two identical codesets, and then try iconv_open with a set of candidate names for UTF-16LE and the detected UTF-8 name. Then, invoke iconv() with a known UTF-8 string and check the result for endianness. An alternative is to just make it user-configurable at run-time. This is easier for programmers, but not so user-friendly... He means ISO 13346 and its profile UDF 2.01. info. on 13346/UDF. snipped Thank you for the info. on ISO 13346 and UDF. Jungshik
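The probing idea described above (try candidate names, then convert a known string and check the byte order) can be illustrated in Python, with codecs.lookup standing in for iconv_open. This is an analogue under that assumption, not an autoconf macro and not the C code from the patch; the helper name pick_codeset and the candidate names are invented for the example.

```python
import codecs


def pick_codeset(candidates, probe="A", expected=b"A\x00"):
    """Return the first candidate that exists and encodes probe as expected.

    codecs.lookup() plays the role of iconv_open(): unknown names are
    skipped. The byte comparison then rejects codesets that exist but
    use the wrong byte order or prepend a byte-order mark.
    """
    for name in candidates:
        try:
            codecs.lookup(name)
        except LookupError:
            continue
        if probe.encode(name) == expected:
            return name
    return None


if __name__ == "__main__":
    names = ["x-no-such-codeset", "UTF-16BE", "UTF-16LE"]
    print(pick_codeset(names))
```

In C the same loop would call iconv_open() for each candidate and run iconv() on a known UTF-8 string, as the message suggests; making the name configurable at run time remains the simpler fallback.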
Re: Please do not use en_US.UTF-8 outside the US
On Thu, 17 Oct 2002 [EMAIL PROTECTED] wrote: It would be yet simpler to eliminate all non-utf-8 locales. This is what RedHat 8.0 does, except for CJK, for which legacy encodings are still used (well, for zh_CN, GB 18030 is used, which is just another UTF in a sense). The exclusion of CJK from the switch-over to UTF-8 is very unfortunate (I've been using ko_KR.UTF-8 for over half a year and I really like it) and I hope it'll change soon (see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829). As I wrote many times before, Korean desperately needs UTF-8, and that's why ko_KR.UTF-8 was among the very few UTF-8 locales offered for Solaris and AIX in the mid-1990's (see Ienup's message). It would be simpler, but since the vast majority of the world is still using legacy locales, it's irrelevant. Come back in 5-10 years, maybe; I'm talking about things that can be done today. They could still be available, but they would not be the default (legacy encodings) When you setup a new machine, it's not front-loaded with scads of text file docs you care about; you will add things as you go. If you receive new messages (email,documents,etc) they would all be converted to something you can read normally. All you care about is that it is well integrated and it works. I totally agree with you. Jungshik
Re: Please do not use en_US.UTF-8 outside the US
On Thu, 17 Oct 2002, Thomas Wolff wrote: wolfffscce14:~ uname -a ; locale -a | grep UTF-8 SunOS fscce14 5.8 Generic_108528-12 sun4us sparc FJSV,GPUSK en_US.UTF-8 sv.UTF-8 sv_SE.UTF-8 sv_SE.UTF-8euro In principle, you could set LANG=de LC_CTYPE=en_US.UTF-8 OK, I get: wolfffscce14:~ LANG=de LC_CTYPE=en_US.UTF-8 /bin/sh couldn't set locale correctly couldn't set locale correctly That's probably because you don't have the 'de' locale installed. Have you tried 'LANG=sv_SE.UTF-8', if Swedish is all right with you? If that's the case, you don't have to set LC_CTYPE to en_US.UTF-8. Or you can unset LANG and set the other LC_* variables as you wish: LC_CTYPE=en_US.UTF-8 or sv_SE.UTF-8 (character classification, collation and so forth would behave differently), LC_MESSAGES=C (if just plain English is better for you than localized messages), LC_TIME=C (again, if you just want plain old Unix/POSIX behavior). I want an LC_* setting that tells my applications to use UTF-8 and doesn't affect the system inappropriately otherwise, and that works with SunOS and doesn't let /bin/sh choke! I don't know why Sun doesn't ship Solaris with all the locales supported by Solaris. Perhaps a marketing ploy :-) DEC's (now Compaq's, or should it be HP's by now?) Digital Unix 4.x (now Tru64) came with all the locales on the OS CD-ROM. It's up to the system administrator which locales are installed. Jungshik Shin
Re: ISO9660 UTF-8
On Sun, 13 Oct 2002, Ilya Konstantinov wrote:

Dear Ilya, Thanks a lot for 'fixing my patch' :-)

I'm attaching a patch which complements Jungshik's original patch ( http://mail.nl.linux.org/linux-utf8/2002-03/msg00022.html ) which made mkisofs use iconv instead of internal Unicode conversion tables. Jungshik's patch already worked well for 8-bit encodings, but it didn't account for UTF-8, which is a variable-length encoding. The attached patch modifies joliet_strlen so that it returns the correct target UCS-2 length. Without this patch, UTF-8 filenames containing non-Latin characters won't work on Windows. They would show in directory listings and be accessible by 8.3 names, but not by their long filenames. This patch remedies this problem.

Ahah, that's the cause. With my patch, I was able to burn a CD with Korean filenames (in CP949 or EUC-KR, which is also a multibyte encoding like UTF-8) that Linux has no problem accessing (I mounted it as a Joliet CD-ROM instead of ISO9660). However, under MS Windows, it has the very problem you mentioned and your patch solves.

How do we go about merging this into the cdrtools package?

When I made my patch, I wrote to the maintainer of cdrtools, but his response was not so positive. At first, he asked me whether iconv(3) is available on any platform other than Solaris. After I replied that iconv(3) is a standard API specified in the Single Unix Specification, that Glibc 2.2.x has had it for a few years, and that Bruno's implementation of iconv(3) in libiconv is widely available and has been ported to virtually all platforms, he didn't reply. He eventually wanted to move on to a more generic format (for DVD and similar media) whose name is currently escaping me. Anyway, I guess it's not a bad idea to give it another try and make a case for your patch to him. Why don't you write to him with a detailed explanation of what your patch does and of the wide availability of iconv(3) on a multitude of platforms?
The address should be available in the cdrtools documentation and on its web page. Although it's desirable to fix things as far upstream as possible, we may try to go around a bit and persuade the various Linux distribution builders to apply our patches to the cdrtools shipped in their distros. Engineers from RH, Mandrake, PLD, SuSE and perhaps other distros are here on the Linux-UTF8 list. Could you pick up our patches and apply them to cdrtools?

Jungshik
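The length problem discussed in this thread can be illustrated outside mkisofs. The following Python sketch is not the joliet_strlen code from either patch; it only shows why the Joliet (UCS-2) length has to be computed from the decoded characters rather than from the byte count of the source encoding. The function name `ucs2_length` and the sample file names are made up for the example.

```python
def ucs2_length(name_bytes, charset):
    """Byte length of a file name after conversion to big-endian UCS-2.

    Joliet stores names as UCS-2, i.e. two bytes per BMP character, so the
    length has to be counted in characters of the decoded name, not in bytes
    of the input encoding.
    """
    text = name_bytes.decode(charset)
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the BMP cannot be stored as UCS-2")
    return 2 * len(text)


latin = "été.txt".encode("iso-8859-1")   # 7 bytes, 7 characters
korean = "한글.txt".encode("utf-8")       # 10 bytes, 6 characters

for raw, charset in ((latin, "iso-8859-1"), (korean, "utf-8")):
    print(charset, "input bytes:", len(raw), "UCS-2 bytes:", ucs2_length(raw, charset))
```

For an 8-bit encoding, doubling the input byte count happens to give the right answer, which is consistent with the earlier patch working for 8-bit encodings. For a multibyte encoding such as UTF-8 or EUC-KR the two numbers differ.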
Re: [Devel] Re: Linux Console in UTF-8 - current state
On Wed, 9 Oct 2002, Antoine Leca wrote:

En Vadim Plessky va escriure:

|And presumably FreeType2 will have, or acquire, the smarts for
|rendering the Arabic and Indic scripts properly.

I am wondering *how important* those Arabic and Indic scripts are? While there is a certain number of people living in those countries, I doubt that they have a lot of computers, and the number of *Linux* users among them is questionable, too.

To add another complexity, there is no current agreement about the way to encode Indic fonts. Besides proprietary glyph-based encodings (that clearly do not scale up), the Apple scheme looks like a dead way, so the only

You may or may not be right about AAT and ATSUI (http://developer.apple.com/intl). As long as the Mac is alive, they'll live on. BTW, there's a third contender, Graphite, developed at SIL.

solution I see is the OpenType scheme, which fits more or less with Unicode (but lags about 6-8 years later), and is initiated (and as I see

It seems that support for Indic scripts in OTF has been emerging rapidly, and MS Windows 2k/XP has pretty good support for a few Indic scripts using OTFs and Uniscribe. More and more Indic scripts will be supported as time goes by. I heard that there are lots of talented programmers/font developers on MS's typography list(?) interested in OT fonts for Indic scripts. Besides, I don't think Pango is much behind Uniscribe in supporting Indic scripts with OTF.

things, still currently owned) by Microsoft, something that is not really welcome in the Linux community ;-).

I'm not sure what you mean by 'owned'. The OpenType standard has been developed jointly by MS and Adobe (http://www.microsoft.com/typography). I don't think Linux developers/users are so stubborn as to reject anything invented by MS. Pango developers certainly are not, because they've been working to support Indic scripts with OTF (as you know well: Pango 1.1.1 now supports Indic scripts with code ported from ICU).
Neither are the developers of XFree86 and the Freetype library, or the author of Yudit.

As a result, I do not believe that efforts for the Indic scripts are likely to be successful for the very next years: this is probably more of a long-term project; consequently, I believe that Indians will continue to use English when speaking with computers for a few years...

As far as the Linux console is concerned, I agree with you. However, on the GUI front, I'm not as pessimistic as you are, because we already have some tangible results. IMHO, Linux can't afford to lose hundreds of millions of potential users in South Asia when competing OSes like MS Windows 2k/XP and MacOS X are moving forward on that front.

Jungshik
Re: Linux Console in UTF-8 - current state
On 11 Sep 2002, H. Peter Anvin wrote:

On 10 Sep 2002, H. Peter Anvin wrote:

The only sane way to deal with this is to do a console daemon in userspace...

Reinventing Xterm is more like it.

One of the ideas that has come up is to write such a console daemon so that it could also run in an X window, which would give us something we right now sorely lack -- a consistent terminal in a window and on the console.

Did you mean 'iterm', briefly mentioned by Redovan in this thread? On the xfree86-i18n list, Hideki Hiura gave the details at http://www.xfree86.org/pipermail/i18n/2002-August/003405.html

Jungshik
Re: a basic question
On Mon, 2 Sep 2002, Markus Kuhn wrote:

M.P.N. Peters wrote on 2002-09-02 12:10 UTC:

Recently I found out about Unicode and UTF-8. Unfortunately, it raises a lot of questions. My first question is, how can I, with a limited (= qwerty) keyboard that can generate only about 100 scancodes (I think), produce all the keycodes needed to reach for example the phon- ..

- For more rarely required symbols (e.g., mathematical notation, for many people typically also phonetic alphabet), it might be a sufficient entry method to choose these with a mouse click from an on-screen menu. Xterm allows you to do this already today via the cut-and-paste mechanism. Just keep a short file that contains, neatly arranged, the Unicode characters that you need to enter most frequently in your work, and cut and paste from there.

That's the technique I find myself using most frequently. One can also use 'ucm' (http://www.pps.jussieu.fr/~jch/software/files/ucm-0.3.tar.gz) by Juliusz for this purpose.

- Have in the keyboard driver a key combination that initiates hexadecimal entry of a Unicode character, as a fallback mechanism for expert users

As you know well, it's implemented by some application programs (e.g. Yudit and Vim). Having this in the keyboard driver may be a good idea. Some MS Windows applications using the 'richtext edit' control (or sth. like that) have this, where 'Alt-X' followed by 4 hex digits produces a Unicode character. There's even an ISO standard for this. It's very generic, and the Yudit, Vim and MS Windows methods are all compliant with the standard.

Jungshik Shin
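What a hexadecimal entry method has to do once the digits are typed is small enough to sketch. The Python below is only an illustration of that last step (digits to character to UTF-8 octets), not code from Yudit, Vim or any keyboard driver; the function name `char_from_hex` is made up for this example.

```python
def char_from_hex(hex_digits):
    """Turn the hex digits typed by the user into one Unicode character.

    Surrogate code points and values above U+10FFFF are rejected because
    they are not characters that can be encoded in UTF-8.
    """
    code_point = int(hex_digits, 16)
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: U+%04X" % code_point)
    return chr(code_point)


# A phonetic symbol, the euro sign and a Hangul syllable.
for digits in ("0254", "20AC", "AC00"):
    ch = char_from_hex(digits)
    print("U+" + digits, ch, ch.encode("utf-8").hex())
```

The interesting part of a real implementation is the key handling around this (how entry mode is started and ended), which is exactly what would live in the keyboard driver or the application.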
Re: world of utf-8
On Tue, 20 Aug 2002, Markus Kuhn wrote:

[EMAIL PROTECTED] wrote on 2002-08-20 00:29 UTC:

Does anyone know offhand what other barriers remain to sending email as raw utf-8?

My experience with ESMTP has actually been rather good. The problem is less the email system itself, but more outdated auxiliary tools, such as programs that convert a mailing list archive into HTML and have been written without any appreciation for non-ASCII messages. Many of such

Most of those tools also have little notion of RFC 2047/RFC 2231. (Some are pretty good, but not yet perfect.) The situation is a little better with stupid web mail services (hotmail, yahoo mail, lycos mail and a bunch of others geared for local users all over the world), but they're still far from multilingual. Most of these services work more or less (even with RFC 2047/RFC 2231 encoded headers and RFC 2045-encoded, i.e. quoted-printable/base64, message bodies) in *one* legacy encoding (or UTF-8 in a few cases) at a time, per user or per account. However, they break down if messages in different encodings are present in a single mailbox. Besides, most of them set the MIME charset in the HTTP header field to the legacy encoding for the language chosen by their users (e.g. Shift_JIS for Japanese in hotmail/yahoo mail, ISO-8859-1 for West European languages, EUC-KR for Korean, Big5 for TC, GB2312 for SC, ISO-8859-7 for Greek, KOI8-R for Russian etc.) regardless of the actual MIME charset specified in messages, so that readers have to manually override the encoding in their web browsers to read UTF-8 messages. Therefore, when I write to my (not-so-computer-savvy) correspondents (including my father) who use those 'parochial' web mail services in a language requiring characters beyond US-ASCII, I have to use the preferred legacy encoding of speakers of that language.
tools have been written in Perl, and thanks to the excellent UTF-8 support of the new Perl 5.8, perhaps it is now time for the authors of these to have a look at the issue, because all the conversion and UTF-8 handling infrastructure is now readily there in Perl.

You're right. Perl 5.8 also has excellent support for handling legacy encodings (the Encode module), so that those tools can be truly multilingual by working primarily in UTF-8 (i.e. converting all incoming messages in various legacy encodings to UTF-8 before presenting them in HTML).

Jungshik
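The "convert everything to UTF-8 first" approach for an archive tool can be sketched as follows. The thread is about Perl; this example uses Python's standard `email.header` module instead, purely as an illustration, and the helper names `header_to_unicode` and `body_to_unicode` are made up. Once headers and bodies are Unicode strings, the HTML page can be written out in UTF-8 regardless of what each message used.

```python
from email.header import decode_header, make_header


def header_to_unicode(raw_header):
    """Decode RFC 2047 encoded-words in a header into a Unicode string."""
    return str(make_header(decode_header(raw_header)))


def body_to_unicode(raw_body, mime_charset):
    """Decode a message body using the charset named in its Content-Type.

    Undecodable bytes become U+FFFD instead of aborting the whole archive run.
    """
    return raw_body.decode(mime_charset, errors="replace")


print(header_to_unicode("=?ISO-8859-1?Q?Caf=E9?="))
print(header_to_unicode("=?UTF-8?B?4oKs?="))
print(body_to_unicode("한글".encode("euc-kr"), "EUC-KR"))
```

Messages that arrive in different legacy encodings end up in the same internal form, so they can sit side by side on one page.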
Re: world of utf-8
On Mon, 19 Aug 2002 [EMAIL PROTECTED] wrote:

If you only have UTF-8 files you don't need to do anything. If you communicate with other planets (and this message indicates you do :-) your message was sent: Content-Type: text/plain; charset=US-ASCII which could be considered a utf-8 sub-set. Admittedly, sendmail's hangups with the eighth bit make sending clear utf-8 documents somewhat unreliable.

What's wrong with sendmail? Is your machine one of the few remaining machines running antique sendmail 5.x or sendmail 4.x? Sendmail has been 8bit-clean since 8.6.x. Sendmail 8.7.x or higher is strictly compliant with STD 10/RFC 821, RFC 1652 (ESMTP extension) and RFC 2045. If correctly configured, it sends out 8BITMIME messages only if it's certain that the other end of the communication can receive 8BITMIME. In compliance with RFC 1652, it asks whether or not the other side of the link can understand 8BITMIME and sends out 8BITMIME if the answer is positive. Otherwise (i.e. the other side is either 8bit-clean but not compliant with RFC 1652, or not 8bit-clean, like the totally outdated sendmail 4.x/5.x), again in compliance with RFC 1652, it falls back to quoted-printable or base64. It's stupid and/or non-standard-compliant MUAs/MTAs like Outlook Express and qmail/smail (qmail/smail violated RFC 1652 a few years ago when I checked; my apologies if they've changed their behavior since) that are to blame for always sending base64 or blindly sending 8bitmime without checking the other side's ability.

Email is one embarrassing case where it may take a while for the infrastructure to catch up. (putting all text email in a base-64 mime attachment can be said to suck)

It doesn't have to be a MIME _attachment_. The C-T-E of an RFC 822 body can be Base64/QP as well as 7bit/8bit.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: base64

or

MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

is as good as

MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Does anyone know offhand what other barriers remain to sending email as raw utf-8?

Why would you care whether the C-T-E is base64/quoted-printable or 8bit? If your MUA (mail user agent) can't cope with MIME, it's time for you to consider switching to a *modern* MIME-compliant MUA. Besides, sendmail can convert qp/base64-encoded _single_part messages back to 8bitmime messages before delivering them to local mailboxes.

Jungshik
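The point that the three Content-Transfer-Encodings carry the same text can be checked with a short Python sketch. It uses only the standard `base64` and `quopri` modules; the helper name `decode_body` and the sample text are made up for this example, and a real MUA does considerably more (multipart handling, header parsing and so on).

```python
import base64
import quopri


def decode_body(payload, cte, charset="utf-8"):
    """Undo the Content-Transfer-Encoding, then decode the charset."""
    cte = cte.lower()
    if cte == "base64":
        raw = base64.b64decode(payload)
    elif cte == "quoted-printable":
        raw = quopri.decodestring(payload)
    elif cte in ("7bit", "8bit", "binary"):
        raw = payload
    else:
        raise ValueError("unknown Content-Transfer-Encoding: " + cte)
    return raw.decode(charset)


text = "UTF-8 mail: 한글 Grüße"
raw = text.encode("utf-8")
wire_forms = {
    "8bit": raw,
    "base64": base64.b64encode(raw),
    "quoted-printable": quopri.encodestring(raw),
}
for cte, payload in wire_forms.items():
    print(cte, decode_body(payload, cte) == text)
```

Only the bytes on the wire differ; after decoding, a MIME-aware reader sees the same UTF-8 text in all three cases.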
extended X11 input methods (was Perl 5.8....)
On Tue, 23 Jul 2002, Tomohiro KUBOTA wrote:

Extended input method is also needed. For example, I cannot input both of Japanese and Korean in one xterm session, because there are no XIM servers which support both of Japanese and Korean while

Well, Korean XIMs (such as Ami) do support Korean and Japanese input, although entering the latter is rather inconvenient :-). A better example would have been, as you gave later, Japanese and French. Even in this case, in theory, a single XIM can be extended to support as many input methods/keyboard layouts as it wants to. Obviously, we don't want to do that, because it means that developers of every single XIM have to repeat what others have done for other XIMs.

xterm cannot switch XIM connection. (mlterm can do this, but I

Seriously, I can't agree with you more that we need an input method framework under which users of every compliant X11 client can easily switch among multiple input methods/keyboards (as is possible under MS Windows and MacOS 9 or X). I think IIIMF (Internet/Intranet Input Method Framework) and its Xlib client IIIMXCF (IIIM X Client Framework) is a (if not the) way to go. See http://www.li18nux.org/subgroups/im. Until it's widely distributed (I heard it works well right now), 'ucm' can be used for sporadic input of Unicode characters not supported by the active XIM/keyboard. Also, yudit (which also lets users switch input methods/kbd), vim, and openoffice offer their own ways of doing this. As a last resort, we always have cut'n'paste :-)

Jungshik
a patch to pine4.44 for a better UTF-8(I18N) support
-1252 -t UTF-8,
_CHARSET(ISO-8859-15)_ /usr/bin/iconv -c -f ISO8859-15 -t UTF-8,
_CHARSET(ISO-2022-JP)_ /usr/bin/iconv -c -f ISO-2022-JP -t UTF-8,
_CHARSET(GB2312)_ /usr/bin/iconv -c -f GB2312 -t UTF-8,
_CHARSET(BIG5)_ /usr/bin/iconv -c -f BIG5 -t UTF-8,
_CHARSET(Windows-1251)_ /usr/bin/iconv -c -f WINDOWS-1251 -t UTF-8,
_CHARSET(Windows-1252)_ /usr/bin/iconv -c -f WINDOWS-1252 -t UTF-8,
_CHARSET(Windows-1253)_ /usr/bin/iconv -c -f WINDOWS-1253 -t UTF-8,
_CHARSET(ISO-8859-2)_ /usr/bin/iconv -c -f ISO8859-2 -t UTF-8,
_CHARSET(ISO-8859-3)_ /usr/bin/iconv -c -f ISO8859-3 -t UTF-8,
_CHARSET(ISO-8859-4)_ /usr/bin/iconv -c -f ISO8859-4 -t UTF-8,
_CHARSET(ISO-8859-5)_ /usr/bin/iconv -c -f ISO8859-5 -t UTF-8,
_CHARSET(ISO-8859-6)_ /usr/bin/iconv -c -f ISO8859-6 -t UTF-8,
_CHARSET(ISO-8859-7)_ /usr/bin/iconv -c -f ISO8859-7 -t UTF-8,
_CHARSET(ISO-8859-8)_ /usr/bin/iconv -c -f ISO8859-8 -t UTF-8,
_CHARSET(ISO-8859-9)_ /usr/bin/iconv -c -f ISO8859-9 -t UTF-8,
_CHARSET(ISO-8859-10)_ /usr/bin/iconv -c -f ISO8859-10 -t UTF-8,
_CHARSET(ISO-8859-11)_ /usr/bin/iconv -c -f ISO8859-11 -t UTF-8,
_CHARSET(ISO-8859-13)_ /usr/bin/iconv -c -f ISO8859-13 -t UTF-8,
_CHARSET(ISO-8859-14)_ /usr/bin/iconv -c -f ISO8859-14 -t UTF-8,
_CHARSET(ISO-8859-16)_ /usr/bin/iconv -c -f ISO8859-16 -t UTF-8,
_CHARSET(KOI8-R)_ /usr/bin/iconv -c -f KOI8-R -t UTF-8,
_CHARSET(KOI8-U)_ /usr/bin/iconv -c -f KOI8-U -t UTF-8,
_CHARSET(Windows-874)_ /usr/bin/iconv -c -f CP874 -t UTF-8,
_CHARSET(UTF-7)_ /usr/bin/iconv -c -f UTF-7 -t UTF-8

There are a couple of problems with my patch. One of them is that I haven't done anything to fix the 'one octet - one column width' model. In UTF-8, this false assumption completely breaks down except for characters in US-ASCII (U+0020 - U+007E), as you are well aware. Therefore, in the message display screen, lines are wrapped prematurely, and in the message index screen, headers (subject, recipient, etc.) are truncated prematurely.
The other is that somehow the link to 'email list management information' at the end of a message with a 'list management information' header does not work. I guess it's easy to fix, but I haven't gotten around to looking into it yet. There may be other problems as well. I'll be glad to hear about them, although I may not be able to fix them as quickly as I wish to.

BTW, Pine 4.44 with my patch can also be run under a non-UTF-8 terminal. In that case, you have to set 'character-set' to the encoding of your terminal (say, EUC-JP) and define your display filters accordingly.

My goal was to make Pine a text-terminal version of MS OE or Mozilla-mail in terms of I18N support. With my patch, Pine got closer to that goal, but is still far from it. Some of the features I want to see include:

- The encoding (MIME charset) for outgoing emails should be decoupled from the encoding of the terminal under which Pine is launched.
- It should be possible to change the encoding (MIME charset) of outgoing messages _at the time of_ composition (as is possible with MS OE and Mozilla-Mail). Although going all the way to UTF-8 is desirable, the reality is that some of my correspondents cannot deal with UTF-8 messages. For them, I have to write in legacy encodings. Currently, I have to launch another Pine with a separate pinerc to compose my email in a legacy encoding.
- Internal encoding conversion with iconv (as opposed to relying on users setting display filters correctly in pinerc).
- 'assumed-charset' should be settable on a per-folder basis as well as globally.

I hope a lot of people find my patch useful,

Jungshik Shin
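What each display filter does, and what the wished-for internal conversion would do, is a charset-to-UTF-8 conversion keyed on the MIME charset of the message. The Python sketch below is an analogue of that step, not Pine code; the function name `to_utf8` is made up. Using `errors="ignore"` is chosen here as a rough counterpart of iconv's `-c` option: input that cannot be converted is silently dropped.

```python
def to_utf8(body, mime_charset):
    """Convert a message body from its MIME charset to UTF-8.

    Bytes that are invalid in the source charset are discarded, roughly
    like `iconv -c -f <charset> -t UTF-8` in the display filters.
    """
    text = body.decode(mime_charset, errors="ignore")
    return text.encode("utf-8")


samples = {
    "KOI8-R": "тест".encode("koi8-r"),
    "ISO-8859-15": "10 €".encode("iso-8859-15"),
    "EUC-KR": "한글".encode("euc-kr"),
    "ISO-2022-JP": "日本語".encode("iso-2022-jp"),
}
for charset, body in samples.items():
    print(charset, len(body), "octets ->", to_utf8(body, charset).decode("utf-8"))
```

Doing this inside the mail reader, driven by the charset parameter of each message, removes the need for one filter line per charset in the configuration.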
Re: mk_wcwidth
On Thu, 20 Jun 2002 [EMAIL PROTECTED] wrote:

You do realize that people in CJK locales expect some characters to be double width that people in European/American locales expect to be single width.

Doublewidth roman letters are in the unicode range FF00-FFFE, so when converting from a legacy encoding that assumes the ascii ranges are all doublewidth, you map to (ascii+FEE0). With

Well, legacy _encodings_ like EUC-JP/KR, Shift_JIS, Big5 and GB2312 (which should be called EUC-CN) include _two_ distinct sets of Latin letters: one set in US-ASCII (or its national counterpart) and the other set in JIS X 0208 (EUC-JP), KS X 1001 (EUC-KR), JIS X 0208 (Shift_JIS), Big5 (Big5) or GB2312-80 (EUC-CN). It's _only the latter_ that has to be mapped to the full width US-ASCII characters in Unicode. Most CJK input methods, whether in Unix/X11, MS-Windows or MacOS, offer a distinct way to input full width US-ASCII characters.

unicode you can even mix double and singlewidth ascii in a single document; many of the roman letters became kanji when in doublewidth form (for example doublewidth capital letter H can mean pornography) and have a different meaning than their single-width brethren. So a unicode char-cell width function should function identically for all locales.

Not true. Although I'm not among those who like to see Greek and Cyrillic letters rendered in full width (it's really ugly !!), there ARE _some_ (I wouldn't say many) CJK people who want to keep them that way. Moreover, it's not only Greek and Cyrillic letters but also line drawing characters that have locale-dependent width. You may as well read UTR #11/UAX #11 East Asian Width at http://www.unicode.org/reports/tr11/.

(I dont know of any unicode support for fullwidth greek or cyrillic, but should such a thing be needed, there is room north of the BMP)

There will never be such a thing in Unicode.
The only reason the full width Latin letters are encoded separately in Unicode is that they were present in legacy CJK character sets with code points distinct from their US-ASCII (half-width) counterparts. See above.

Jungshik Shin
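The locale dependence argued for here corresponds to the "ambiguous" class of UAX #11. The following Python sketch is not mk_wcwidth; it is a simplified width function built on the standard `unicodedata` module, with a made-up `ambiguous_wide` switch standing in for "running in a CJK legacy context". It treats combining marks as width 0 only when they have a non-zero combining class, which is a simplification. The `to_fullwidth` helper shows the ascii+0xFEE0 mapping mentioned in the quoted text.

```python
import unicodedata


def char_width(ch, ambiguous_wide=False):
    """Column width of one character in a character-cell terminal."""
    if unicodedata.combining(ch):
        return 0
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A" and ambiguous_wide:
        return 2
    return 1


def string_width(text, ambiguous_wide=False):
    return sum(char_width(ch, ambiguous_wide) for ch in text)


def to_fullwidth(text):
    """Map printable ASCII (U+0021..U+007E) to the fullwidth forms block."""
    return "".join(
        chr(ord(ch) + 0xFEE0) if 0x21 <= ord(ch) <= 0x7E else ch for ch in text
    )


for sample in ("H", to_fullwidth("H"), "가", "Б", "─"):
    print(repr(sample), string_width(sample), string_width(sample, ambiguous_wide=True))
```

Cyrillic and box-drawing characters come out as one column by default and two columns with the switch on, while ASCII and Hangul syllables are unaffected by it.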
less-374 patch (was Re: less 358)
On Mon, 20 May 2002, Zvi Har'El wrote:

I am using less on a UTF-8 Redhat Linux 7.3 machine. I am having trouble using man, because overstriking is not handled properly. I read the Unicode HOWTO and compiled less (358) with the patch suggested by http://mail.nl.linux.org/linux-utf8/2001-05/msg00023.html and the situation improved. However it is not completely OK, as you may easily

I'm afraid the patch you applied introduced the problem you described while solving the problem of overstriking in UTF-8 mode. BTW, the patch (as applied by the author of less in less 361) only works for two-octet-long UTF-8 characters.

at the beta version of less (377?), but it didn't address this bug at all

The patch you referred to seems to have been applied in less 361, according to the version.c file. Anyway, attached is a *simplistic* (not perfect) patch against less 374 (the newest at the less home page) that I've just made and that apparently solves both issues: overstriking of three-octet-long UTF-8 characters, and underlining and overstriking of two identical US-ASCII characters in a row ('ff' in 'troff', 'tt' in 'pattern'). It's not perfect because it only checks the first octet of a two- or three-octet-long UTF-8 char to see if it's identical to the char preceding the backspace. I tested it under a UTF-8 xterm and it worked fine with the attached test case, with 'nroff', U+0411, U+2010, U+AC00 and U+4E00 overstruck and 'pattern' underlined. Underlining doesn't work for UTF-8 characters (other than US-ASCII), though. However, this is also the case with less-374 without my patch.

Hope this helps,

Jungshik Shin

--- line.c.orig Mon May 20 11:56:34 2002
+++ line.c Mon May 20 12:53:36 2002
@@ -592,12 +592,19 @@
 * or just deletion of the character in the buffer.
 */
 overstrike--;
- if (utf_mode curr 1 (char)c == linebuf[curr-2])
+ if (utf_mode c 0x80 curr 2 (char)c == linebuf[curr-3])
 {
 backc();
 backc();
+ backc();
+ overstrike = 3;
+ } else if (utf_mode c 0x80 curr 1 (char)c ==
+linebuf[curr-2])
+ {
+ backc();
+ backc();
+ STORE_CHAR(linebuf[curr], AT_BOLD, pos);
 overstrike = 2;
- } else if (utf_mode curr 0 (char)c == linebuf[curr-1])
+ } else if (utf_mode curr 0 c 0x80 (char)c ==
+linebuf[curr-1])
 {
 backc();
 STORE_CHAR(linebuf[curr], AT_BOLD, pos);

1. nroff nnrrooffff nnrrooffffgg ABCD
2. UTF-8 chars : two octet or three octet long ББ ‐‐ 가가가abbc 一一一가
3. This does not work !! The first octet of a char. following backspace is the same as the first octet of a char. preceding backspace, but the subsequent octet is different so that backspace should erase the char. before it. 가각가abbc Бӡ
4. pattern : underlined _p_a_t_t_e_r_n
5. underlining does not work for UTF-8 chars. _‐ _Б _A_B
6. This is the reverse of the common convention(as used by nroff), but it works. ‐_ Б_ 가_
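The limitation described in the message (comparing only the first octet) goes away if the comparison is made on whole characters. The Python sketch below is not a port of the less patch; it is a small stand-alone illustration, with a made-up function name, of interpreting nroff-style backspace sequences after decoding UTF-8. It handles single overstrikes only (`X BS X` for bold, `_ BS X` or `X BS _` for underline) and treats any other `X BS Y` as Y replacing X.

```python
def parse_overstrike(text):
    """Return a list of (character, attribute) pairs.

    attribute is 'bold', 'underline' or 'normal'. Comparison is done on
    whole decoded characters, so two different characters that share the
    same leading UTF-8 octet are never mistaken for an overstrike.
    """
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if i + 2 < n and text[i + 1] == "\b":
            nxt = text[i + 2]
            if ch == nxt:
                out.append((ch, "bold"))
            elif ch == "_":
                out.append((nxt, "underline"))
            elif nxt == "_":
                out.append((ch, "underline"))
            else:
                # Different characters: the backspace erased the first one.
                out.append((nxt, "normal"))
            i += 3
            continue
        out.append((ch, "normal"))
        i += 1
    return out


line = b"n\x08nr\x08ro\x08of\x08ff\x08f".decode("utf-8")
print(parse_overstrike(line))
print(parse_overstrike("가\b가"))   # same syllable twice
print(parse_overstrike("가\b각"))   # same first octet, different character
print(parse_overstrike("_\bБ"))
```

The third call corresponds to case 3 of the test file: U+AC00 and U+AC01 share their first two UTF-8 octets, and a character-level comparison treats them as distinct.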
Re: utf8-utf16
On Mon, 13 May 2002, Tay, William wrote:

How/what can I use to convert utf8 to utf16 (Windows) ?

Check out http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp

WideChar in Windows is at least UCS-2 if not UTF-16. If what you're looking for is a command line tool, you can use iconv (under Cygwin and native) and native2ascii (which comes with the JDK).

Also is there any way I can input and store utf8 encoded strings in a Windows system?

Notepad (perhaps only under Win NT4 or up?), Vim, Yudit, Mozilla-composer, SC Unipad(?), Wordpad, MS-Word, etc...

Jungshik Shin
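For completeness, here is what the conversion itself amounts to, as a small Python sketch rather than any of the tools named above. The function name `utf8_to_utf16le` is made up; little-endian output with an optional byte order mark is assumed because that is the form commonly used for files on Windows. The last sample shows where UCS-2 and UTF-16 part ways: a character outside the BMP needs a surrogate pair.

```python
import codecs


def utf8_to_utf16le(data, bom=True):
    """Convert UTF-8 bytes to UTF-16 little-endian bytes."""
    text = data.decode("utf-8")
    encoded = text.encode("utf-16-le")
    return codecs.BOM_UTF16_LE + encoded if bom else encoded


for sample in ("A", "가", "\U0001D11E"):
    converted = utf8_to_utf16le(sample.encode("utf-8"), bom=False)
    print("U+%04X" % ord(sample), converted.hex())
```

On the command line, iconv performs the same conversion when given the corresponding source and target encoding names.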
Switching to UTF-8 and Gnome 1.2.x
Hi,

In my transition to UTF-8, I found that Gnome 1.2.x has a lot of files in mixed encodings. All *.desktop files and .directory files are in mixed encodings: entries for [ja] are in EUC-JP, entries for [de] are in ISO-8859-1/15, entries for [ru] are in KOI8-R, and so on. On the other hand, the corresponding KDE files are all in UTF-8, so I don't need to change anything there. Anyway, thanks to the Encode module (to be included in the upcoming Perl 5.8 by default), I was able to write a simple script that adds ko_KR.UTF-8 entries for all [ko] entries in EUC-KR in *.desktop and .directory files. Below is the list of directories I have to run my script on:

/usr/share/apps
/usr/share/applets
/usr/share/applnk
/etc/X11/applnk
/usr/share/mc
$HOME/.gnome

Still, I got gibberish in the Gnome tip of the day. It turned out that the Gnome hint files (usually installed in /usr/share/gnome/hints) are XML files in mixed encodings. I don't think they're compliant with the XML standard, because I've never heard of XML files in mixed encodings. So, I also had to add ko_KR.UTF-8 entries for all [ko] entries there. Even with this, for some reason unknown to me, whenever I cross the 'boundary' (i.e. go from the last tip to the first or the other way around), I get gibberish.

Two other places where languages are tied to encodings are Gnome help (usually in /usr/share/gnome/help) and Gimp tips (/usr/(local/)share/gimp/$version/tips/gimp_tips.[lang].txt). I also had to make UTF-8 versions of them.

I believe all these problems have been addressed in Gnome 2.0 (RC?/beta), but Gnome 1.x is still widely used. I thought my experience would help others who want to move on to UTF-8, as well as distribution builders.

Jungshik Shin
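The script mentioned above was written in Perl and is not shown in the message. As a rough idea of what such a script has to do, here is a Python sketch that works on the raw bytes of a .desktop file, because the file as a whole has no single encoding. The function name `add_utf8_entries` and the sample file content are made up, and the `[ko_KR.UTF-8]` key form simply follows the description in the message.

```python
import re


def add_utf8_entries(data, lang=b"ko", legacy="euc-kr", new_tag=b"ko_KR.UTF-8"):
    """After every Key[lang]=value line, add Key[new_tag]=value in UTF-8.

    Lines for other languages are passed through untouched, so their
    legacy encodings are preserved byte for byte.
    """
    pattern = re.compile(rb"^([A-Za-z][A-Za-z0-9-]*)\[" + re.escape(lang) + rb"\]=(.*)$")
    lines = data.split(b"\n")
    existing = set(lines)
    out = []
    for line in lines:
        out.append(line)
        match = pattern.match(line)
        if match:
            key, value = match.groups()
            converted = value.decode(legacy).encode("utf-8")
            new_line = key + b"[" + new_tag + b"]=" + converted
            if new_line not in existing:  # do not add the same entry twice
                out.append(new_line)
    return b"\n".join(out)


sample = (
    b"[Desktop Entry]\n"
    b"Name=Hangul\n"
    b"Name[ko]=" + "한글".encode("euc-kr") + b"\n"
    b"Name[de]=Gr\xfc\xdfe\n"
)
print(add_utf8_entries(sample))
```

Running the function a second time on its own output leaves the data unchanged, which matters when the same directories are processed again after an upgrade.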
Re: Switching to UTF-8
On Mon, 6 May 2002, Pablo Saratxaga wrote:

On Mon, May 06, 2002 at 10:11:34AM +0900, Tomohiro KUBOTA wrote:

Note for xkb experts who don't know Hiragana/Katakana/Hangul: input methods of these scripts need backtracking. For example, in Hangul, imagine I hit keys in the c-v-c-v (c: consonant, v: vowel) sequence. When I hit c-v-c, it should represent one Hangul syllable c-v-c. However, when I hit the next v, it should be two Hangul syllables of c-v c-v.

That is only the case with 2-mode keyboard; with 3-mode keyboard there is no ambiguity, as there are three groups of keys V, C1, C2; allowing for all the possible combinations: V-C2, C1-V-C2. Eg: there are two keys

'V-C2 and C1-V-C2' should be 'C1-V' and 'C1-V-C2' :-) To go all the way to Xkb, even the three-set keyboard array has to be modified a little, because some clusters of vowels and consonants are not assigned separate keys but have to be entered by a sequence of keys assigned to basic/simple vowels and consonants. Alternatively, programs have to be modified to truly support the 'L+V+T*' model of Hangul syllables as stipulated in TUS 3.0, p. 53.

for each consonant: one for the leading syllable consonant, and one for the ending syllable consonant. (I think the small round glyph to fill an empty place in a syllable is always at place C2, that is, c-v is always written C1-V-C2 with a special C2 that is not written in latin transliteration)

You almost got it right, except that IEung ('ㅇ') is NULL at the syllable onset position (i.e. it's a placeholder for syllables that begin with a vowel and does not appear in Latin transliteration). IEung is not NULL at the syllable coda position but corresponds to [ng] (IPA: [ŋ]) as in 'young'. To put it your way, a V-C2 syllable is always written as IEung-V-C2, with IEung having no phonetic value. Here I assumed we're not talking about the orthography of the 15th century ;-)

Jungshik Shin
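For the simple case of one leading consonant, one vowel and at most one trailing consonant, the L+V+T model maps to precomposed syllables by plain arithmetic, as defined in the Unicode Standard. The Python sketch below shows that arithmetic on conjoining jamo (the U+1100 block), which is what a three-set keyboard would emit key by key; the function name `compose` is made up, and the result is cross-checked against NFC normalization from the standard `unicodedata` module.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose(lead, vowel, trail=None):
    """Compose conjoining jamo L + V (+ T) into one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT and 0 <= t_index < T_COUNT):
        raise ValueError("not a modern L, V, T jamo sequence")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


# KIYEOK + A, HIEUH + A + NIEUN, and IEung in both positions:
# silent as the leading consonant, [ng] as the trailing one.
for jamo in ("\u1100\u1161", "\u1112\u1161\u11ab", "\u110b\u1161\u11bc"):
    syllable = compose(*jamo)
    print(syllable, "U+%04X" % ord(syllable), syllable == unicodedata.normalize("NFC", jamo))
```

Because leading and trailing consonants are distinct code points here, no backtracking is needed: each key contributes one jamo, and composition can be done after the fact.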
Re: Switching to UTF-8
On Sun, 5 May 2002, Tomohiro KUBOTA wrote:

At 02 May 2002 23:54:37 +1000, Roger So wrote:

I _do_ think xkb is sufficient for Japanese though, if you limit Japanese to only hiragana and katakana. ;)

I believe that you are kidding to say about such a limitation. Japanese language has much less vowels and consonants than Korean, which results in much more homonyms than Korean. Thus, I think

Well, actually what makes Japanese have a lot more homonyms than Korean is not so much the difference in the number of consonants and vowels as the fact that Korean has both closed and open syllables while Japanese has only open syllables.

native Japanese speakers won't decide to abolish Kanji.

I don't think the Japanese ever will, either. However, I'm afraid having too many homonyms is a little too 'feeble' a 'rationale' for not being able to convert to all-phonetic scripts like Hiragana and Katakana. The easiest counterargument is to ask how Japanese speakers can tell which homonym is meant in oral communication, if Kanji is so important for disambiguating homonyms. They don't have any Kanji to help them there (well, sometimes you may have to write down Kanji to break the ambiguity in the middle of a conversation, but I guess that's mostly limited to proper nouns). I heard that they don't have much trouble, because the context helps a listener a lot in figuring out which of many homonyms the speaker means. This is true in any language. Arguably, the same thing could help readers in written communication. Of course, using logographic/ideographic characters like Kanji certainly helps readers very much, and that should be a very good reason for the Japanese to keep Kanji in their writing system.
The English writing system is also 'logographic' in a sense (so is modern Korean orthography in pure Hangul, as it departs from strict agreement between pronunciation and spelling), and a spelling reform (to give English a degree of agreement between spelling and pronunciation similar to that of Spanish) would make written text harder to read, depriving English written text of its 'logographic' nature. On the other hand, it would help learners and writers. It's always been a struggle between readers and writers, and between listeners and speakers.

xkb can be used. However, more than half of Japanese computer users use Romaji-kana conversion, two-keys-one-hiragana/katakana method. The complexity of the algorithm is like two or three-key input method of Hangul, I think. Do you think such an algorithm can be implemented as xkb? If yes, I think Romaji-kana conversion (whose complexity is like Hangul input method) can be implemented as xkb.

I'd also like to know whether it's possible with Xkb. BTW, if we use three-set keyboards (where leading consonants and trailing consonants are assigned separate keys) and use the U+1100 Hangul Conjoining Jamos, Korean Hangul input is entirely possible with Xkb alone.

Jungshik Shin
LC_PAPER vs /etc/papersize (was..Re: Please do not use en_US.UTF-8..)
On Tue, 30 Apr 2002, David Starner wrote:

On Tue, Apr 30, 2002 at 11:09:55PM -0400, Jungshik Shin wrote:

However, to me overriding the default at the command line is a perfectly good solution.

Everytime you use a program? Stuff like that gets real tiring, real fast to me.

What are shell scripts/aliases for ;-) ? What if your site has multiple printers with different sizes of paper loaded by default? How about printers with multiple trays? Whichever method you use to set the default, you have to use a command line option or some other means to override the default.

However, I have to admit that you clearly have a point. It's not the most desirable thing for programs to derive the default paper size from the locale *name* assigned to LC_PAPER. It's certainly true that if programs rely on /etc/papersize instead of mapping the locale *name* to the default paper size, it's easier to change the default paper size. What has to be done is to use the actual *value* stored in LC_PAPER instead of 'guessing' the default paper size from the locale *name*, provided that LC_PAPER is a standard locale category. It's not, yet. I was wrong to say that LC_PAPER is defined in ISO 14652 (draft). It's not there. SUS V3 doesn't have it, either. So, it's not a standard locale category, but at least it's available wherever glibc 2.2.x is used (i.e. all Linux distributions including Debian). Even there, nl_langinfo(PAPER_HEIGHT) and nl_langinfo(PAPER_WIDTH) don't work yet. langinfo.h in glibc 2.2.x has _NL_PAPER_HEIGHT and _NL_PAPER_WIDTH. Therefore, programmers might use nl_langinfo(_NL_PAPER_WIDTH) and nl_langinfo(_NL_PAPER_HEIGHT). However, that's not very portable (both across platforms and over time), because I believe the '_' at the beginning of _NL_PAPER_* indicates their non-standard nature.

Now, what follows is based not on what is widely available (or standard) but on what may be in the future.

hypothetic situation

How often do you (think people) use a paper size other than US letter (or A4 outside the US)?
If the answer is most of time, you can build your own locale with LC_PAPER defined for the most frequently used papersize at your site (say, en_US.UTF-8@legal)? Then, you can have LC_PAPER=en_US.UTF-8@legal LANG=en_US.UTF-8 And a French living in the US may have LC_PAPER=en_US.UTF-8@legal LANG=fr_FR.UTF-8 What difference is there between setting /etc/papersize and building and installing a new locale for your favorite size? Sure, editing one-line is easier than building a new locale. However, it's not so flexible as you think. With en_US.UTF-8@legal built and installed, different users with different choices of the default paper size (because their offices have different printers with the primary tray for different papersize) can happily *share* a *single* system. They don't have to fight over which paper size goes into /etc/papersize. Those who mainly use US letter can just set LANG to en_US.UTF-8 and leave LC_PAPER alone (or they can specify that to en_US.UTF-8 if they want to). Others who mainly use legal paper can set LC_PAPER to en_US.UTF-8@legal with LANG set to en_US.UTF-8. /hypothetic situation Jungshik Shin (1) LC_PAPER definition for US letter goes like this (the unit is mm.) LC_PAPER height 279 width216 END LC_PAPER You can change height and width to whatever value you want. -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
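The idea of reading the *value* stored in LC_PAPER rather than guessing from the locale name could be sketched roughly as follows. This is only an illustration under several assumptions: it relies on the non-standard glibc items _NL_PAPER_HEIGHT and _NL_PAPER_WIDTH discussed in this message, it assumes glibc hands back these integer values through the char * result of nl_langinfo() (hence the union), and the names paper_kind, classify_paper and locale_paper_size are hypothetical helpers, not part of any library. The 356 mm figure for legal paper is a rounded value chosen for the example.

```c
#define _GNU_SOURCE            /* _NL_PAPER_* are glibc extensions */
#include <langinfo.h>

/* Hypothetical helper type for this sketch only. */
enum paper_kind { PAPER_UNKNOWN, PAPER_LETTER, PAPER_LEGAL, PAPER_A4 };

/* Map a height/width pair in millimetres to a well-known paper size. */
enum paper_kind classify_paper(int height_mm, int width_mm)
{
    if (height_mm == 279 && width_mm == 216) return PAPER_LETTER;
    if (height_mm == 356 && width_mm == 216) return PAPER_LEGAL; /* rounded */
    if (height_mm == 297 && width_mm == 210) return PAPER_A4;
    return PAPER_UNKNOWN;
}

/* Read the LC_PAPER values of the current locale.
   Returns 0 on success, -1 where the glibc items are not available. */
int locale_paper_size(int *height_mm, int *width_mm)
{
#if defined(_NL_PAPER_HEIGHT) && defined(_NL_PAPER_WIDTH)
    /* Assumption: glibc stores these items as integers and returns them
       in place of a string pointer, so they are read back via a union. */
    union { char *string; unsigned int word; } h, w;
    h.string = nl_langinfo(_NL_PAPER_HEIGHT);
    w.string = nl_langinfo(_NL_PAPER_WIDTH);
    *height_mm = (int)h.word;
    *width_mm  = (int)w.word;
    return 0;
#else
    (void)height_mm;
    (void)width_mm;
    return -1;                 /* caller falls back to another source */
#endif
}
```

A caller would first run setlocale(LC_ALL, "") (from locale.h) so that LC_PAPER is taken from the environment, then call locale_paper_size() and fall back to /etc/papersize or a command-line option when it returns -1.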
Re: Please do not use en_US.UTF-8 outside the US
On Thu, 2 May 2002, Keld Jørn Simonsen wrote:

> The nice thing about LC_PAPER is that it is set either on installation, or as part of the normal setup. I think most people know how to set the locale, while some, maybe many, would not know that there is an /etc/papersize file.

Yes, I've been bitten more than once by these 'hidden' files lurking around in /etc that affect the way programs work.

> LC_PAPER was in 14652 at some time but was taken out, because some people thought that it was not useful :-(

So, my memory was not telling me a lie. I was almost sure I had seen it in ISO 14652 when I wrote that LC_PAPER is in ISO 14652. When I checked later, it wasn't there, which led me to believe that my memory had failed me once more. Anyway, what's the plan of ISO/IEC JTC1/SC22/WG20 on this?

Jungshik Shin